
Unit 3: Data Modeling & Design

Data modeling is the process of discovering, analyzing, and scoping data requirements, and then representing and communicating these data requirements in a precise form called the data model. Data modeling is a critical component of data management. The modeling process requires that organizations discover and document how their data fits together; the modeling process itself designs how data fits together. Data models depict an organization's data assets and enable the organization to understand them.

1. Introduction

There are a number of different schemes used to represent data. The six most
commonly used schemes are: Relational, Dimensional, Object-Oriented, Fact-Based, Time-
Based, and NoSQL. Models of these schemes exist at three levels of detail: conceptual,
logical, and physical. Each model contains a set of components. Examples of components
are entities, relationships, facts, keys, and attributes. Once a model is built, it needs to be
reviewed and once approved, maintained.
Data models contain Metadata essential to data consumers. Much of the Metadata uncovered during the data modeling process is essential to other data management functions: for example, definitions for data governance, and lineage for data warehousing and analytics.

1.1 Business Drivers


Data models are critical to effective management of data. They:

Provide a common vocabulary around data
Capture and document explicit knowledge about an organization's data and systems
Serve as a primary communications tool during projects
Provide the starting point for customization, integration, or even replacement of an application

1.2 Goals and Principles


The goal of data modeling is to confirm and document understanding of different
perspectives, which leads to applications that more closely align with current and
future business requirements, and creates a foundation to successfully complete
broad-scoped initiatives such as Master Data Management and data governance
programs. Proper data modeling leads to lower support costs and increases the
reusability opportunities for future initiatives, thereby reducing the costs of
building new applications. Data models are an important form of Metadata.
Confirming and documenting understanding of different perspectives facilitates:

Formalization: Formal definition imposes a disciplined structure to data that reduces the possibility of data anomalies occurring when accessing and persisting data.
Scope definition: A data model can help explain the boundaries for data
context and implementation of purchased application packages, projects,
initiatives, or existing systems.
Knowledge retention/documentation: A data model can preserve
corporate memory regarding a system or project by capturing knowledge
in an explicit form. It serves as documentation for future projects to use as
the as-is version.
1.3 Essential Concepts
This section will explain the different types of data that can be modeled, the
component pieces of data models, the types of data models that can be developed,
and the reasons for choosing different types in different situations. This set of
definitions is extensive, in part, because data modeling itself is about the process of
definition. It is important to understand the vocabulary that supports the practice.

1.3.1 Data Modeling and Data Models


Data modeling is most frequently performed in the context of systems development
and maintenance efforts, known as the system development lifecycle (SDLC). Data
modeling can also be performed for broad-scoped initiatives (e.g., Business and
Data Architecture, Master Data Management, and data governance initiatives)
where the immediate end result is not a database but an understanding of
organizational data.

1.3.2 Types of Data that are Modeled


Four main types of data can be modeled. The types of data being modeled in any
given organization reflect the priorities of the organization or the project that
requires a data model:

Category information
Resource information
Business event information
Detail transaction information

1.3.3 Data Model Components


As will be discussed later in the chapter, different types of data models represent
data through different conventions. However, most data models contain the same
basic building blocks: entities, relationships, attributes, and domains.
1.3.3.1 Entity
Outside of data modeling, the definition of entity is a thing that exists separate
from other things. Within data modeling, an entity is a thing about which an
organization collects information. Entities are sometimes referred to as the nouns
of an organization. An entity can be thought of as the answer to a fundamental
question – who, what, when, where, why, or how – or to a combination of these
questions. Table 7 defines and gives examples of commonly used entity categories.
1.3.3.1.1 Entity Aliases
The generic term entity can go by other names. The most common is entity-type, since a type of something is being represented (e.g., Jane is of type Employee); in that usage, Jane is the entity and Employee is the entity-type. In widespread use today, however, the term entity is used for Employee and entity instance for Jane.
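To make the distinction concrete, here is a minimal illustrative sketch in Python (the class and values are hypothetical, not taken from the text): the class plays the role of the entity (or entity-type), and each object created from it is an entity instance.

    # Employee is the entity (entity-type): the definition of what is collected.
    class Employee:
        def __init__(self, name, hire_date):
            self.name = name
            self.hire_date = hire_date

    # Jane is an entity instance: one occurrence of the Employee entity.
    jane = Employee(name="Jane", hire_date="2023-04-01")
    print(type(jane).__name__, "instance:", jane.name)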

1.3.3.1.2 Graphic Representation of Entities


In data models, entities are generally depicted as rectangles (or rectangles with
rounded edges) with their names inside, such as in Figure 29, where there are three
entities: Student, Course, and Instructor.

1.3.3.2 Domain
In data modeling, a domain is the complete set of possible values that an attribute
can be assigned. A domain may be articulated in different ways (see points at the
end of this section). A domain provides a means of standardizing the characteristics
of the attributes. For example, the domain Date, which contains all possible valid
dates, can be assigned to any date attribute in a logical data model or date
columns/fields in a physical data model, such as:

EmployeeHireDate
OrderEntryDate
ClaimSubmitDate
CourseStartDate
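As an illustration only (the attribute names come from the list above; the validation logic is a hypothetical sketch in Python), a single Date domain can be defined once and reused for every date attribute, so that all of them share the same set of permissible values:

    from datetime import date

    # One shared domain: the complete set of valid calendar dates.
    def in_date_domain(value):
        """Return True if value is a valid ISO (yyyy-mm-dd) calendar date."""
        try:
            date.fromisoformat(value)
            return True
        except ValueError:
            return False

    # The same domain standardizes several attributes from the list above.
    record = {
        "EmployeeHireDate": "2024-02-29",   # valid leap-day date
        "OrderEntryDate": "2024-13-01",     # invalid month, outside the domain
    }
    for attribute, value in record.items():
        print(attribute, "valid:", in_date_domain(value))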
1.3.4 Data Modeling Schemes
The six most common schemes used to represent data are: Relational, Dimensional,
Object-Oriented, Fact-Based, Time-Based, and NoSQL. Each scheme uses specific
diagramming notations (see Table 9).

This section will briefly explain each of these schemes and notations. The use of
schemes depends in part on the database being built, as some are suited to
particular technologies, as shown in Table 10.

1.3.5 Data Model Levels of Detail


In 1975, the American National Standards Institute’s Standards Planning and
Requirements Committee (SPARC) published their three-schema approach to
database management. The three key components were:
Conceptual: This embodies the ‘real world’ view of the enterprise being
modeled in the database. It represents the current ‘best model’ or ‘way of
doing business’ for the enterprise.
External: The various users of the database management system operate
on sub-sets of the total enterprise model that are relevant to their
particular needs. These subsets are represented as ‘external schemas’.
Internal: The ‘machine view’ of the data is described by the internal schema. This schema describes the stored representation of the enterprise’s information.
1.3.5.1 Conceptual
A conceptual data model captures the high-level data requirements as a collection
of related concepts. It contains only the basic and critical business entities within a
given realm and function, with a description of each entity and the relationships
between entities.

1.3.5.2 Logical
A logical data model is a detailed representation of data requirements, usually in
support of a specific usage context, such as application requirements. Logical data
models are still independent of any technology or specific implementation
constraints. A logical data model often begins as an extension of a conceptual data
model.
1.3.5.3 Physical
A physical data model (PDM) represents a detailed technical solution, often using
the logical data model as a starting point and then adapted to work within a set of
hardware, software, and network tools. Physical data models are built for a
particular technology. Relational DBMSs, for example, should be designed with the
specific capabilities of a database management system in mind (e.g., IBM DB2,
UDB, Oracle, Teradata, Sybase, Microsoft SQL Server, or Microsoft Access).

1.3.6 Normalization
Normalization is the process of applying rules in order to organize business
complexity into stable data structures. The basic goal of normalization is to keep
each attribute in only one place to eliminate redundancy and the inconsistencies
that can result from redundancy. The process requires a deep understanding of
each attribute and each attribute’s relationship to its primary key.
Normalization rules sort attributes according to primary and foreign keys. Normalization rules sort into levels, with each level applying more granularity and specificity in search of the correct primary and foreign keys. Each level comprises a separate normal form, and each successive level builds on the previous levels. Normalization levels include:

First normal form (1NF): Ensures each entity has a valid primary key, and
every attribute depends on the primary key; removes repeating groups,
and ensures each attribute is atomic (not multi-valued). 1NF includes the
resolution of many-to-many relationships with an additional entity often
called an associative entity.
Second normal form (2NF): Ensures each entity has the minimal primary
key and that every attribute depends on the complete primary key.
Third normal form (3NF): Ensures each entity has no hidden primary keys
and that each attribute depends on no attributes outside the key (“the key,
the whole key and nothing but the key”).
Boyce / Codd normal form (BCNF): Resolves overlapping composite
candidate keys. A candidate key is either a primary or an alternate key.
‘Composite’ means more than one (i.e., two or more attributes in an
entity’s primary or alternate keys), and ‘overlapping’ means there are
hidden business rules between the keys.
Fourth normal form (4NF): Resolves all many-to-many-to-many
relationships (and beyond) in pairs until they cannot be broken down into
any smaller pieces.
Fifth normal form (5NF): Resolves inter-entity dependencies into basic
pairs, and all join dependencies use parts of primary keys.
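As a hedged worked example (the Order and Customer attributes below are hypothetical, chosen only to illustrate the rules above), the Python structures show a record with a repeating group being brought to first normal form, and then to third normal form by moving the attribute that depends on a non-key attribute into its own entity:

    # Unnormalized: a repeating group of order lines, and a customer name that
    # depends on CustomerID rather than on the order's primary key.
    order_unnormalized = {
        "OrderID": 1001,
        "CustomerID": "C01",
        "CustomerName": "Acme Ltd",
        "Lines": [("Widget", 2), ("Bolt", 10)],   # repeating, non-atomic group
    }

    # 1NF: the repeating group moves to its own entity with a composite key.
    order = {"OrderID": 1001, "CustomerID": "C01", "CustomerName": "Acme Ltd"}
    order_lines = [
        {"OrderID": 1001, "LineNumber": 1, "Product": "Widget", "Quantity": 2},
        {"OrderID": 1001, "LineNumber": 2, "Product": "Bolt", "Quantity": 10},
    ]

    # 3NF: CustomerName depends on CustomerID (not on the key OrderID), so it
    # moves to a Customer entity; Order keeps only the foreign key.
    order_3nf = {"OrderID": 1001, "CustomerID": "C01"}
    customer = {"CustomerID": "C01", "CustomerName": "Acme Ltd"}
    print(order_3nf, customer)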

1.3.7 Abstraction
Abstraction is the removal of details in such a way as to broaden applicability to a wide class of situations while preserving the important properties and essential nature of concepts or subjects. An example of abstraction is the Party/Role structure, which can be used to capture how people and organizations play certain roles. Not all modelers or developers are comfortable with, or have the ability to work with, abstraction. The modeler needs to weigh the cost of developing and maintaining an abstract structure versus the amount of rework required. Abstraction includes generalization and specialization. Generalization groups the common attributes and relationships of entities into supertype entities, while specialization separates distinguishing attributes within an entity into subtype entities.
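A minimal sketch of the Party/Role idea mentioned above, using Python classes as a stand-in for supertype and subtype entities (the attributes are hypothetical): generalization moves common attributes into the Party supertype, while specialization keeps distinguishing attributes in the subtypes.

    # Generalization: attributes common to people and organizations live in the
    # supertype entity Party.
    class Party:
        def __init__(self, party_id, name):
            self.party_id = party_id
            self.name = name

    # Specialization: distinguishing attributes stay in the subtype entities.
    class Person(Party):
        def __init__(self, party_id, name, birth_date):
            super().__init__(party_id, name)
            self.birth_date = birth_date

    class Organization(Party):
        def __init__(self, party_id, name, tax_id):
            super().__init__(party_id, name)
            self.tax_id = tax_id

    parties = [Person(1, "Jane", "1990-05-01"), Organization(2, "Acme Ltd", "TX-42")]
    for p in parties:
        print(type(p).__name__, p.name)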

2. Activities
This section will briefly cover the steps for building conceptual, logical, and
physical data models, as well as maintaining and reviewing data models. Both
forward engineering and reverse engineering will be discussed.

2.1 Plan for Data Modeling


A plan for data modeling contains tasks such as evaluating organizational
requirements, creating standards, and determining data model storage.
The deliverables of the data modeling process include:

Diagram: A data model contains one or more diagrams. The diagram is the visual that captures the requirements in a precise form.
Definitions: Definitions for entities, attributes, and relationships are essential to maintaining the precision of a data model.
Issues and outstanding questions: Frequently the data modeling process
raises issues and questions that may not be addressed during the data
modeling phase.
Lineage: For physical and sometimes logical data models, it is important
to know the data lineage, that is, where the data comes from.

2.2 Build the Data Model


To build the models, modelers often rely heavily on previous analysis and modeling
work. They may study existing data models and databases, refer to published
standards, and incorporate any data requirements. After studying these inputs,
they start building the model. Modeling is a very iterative process.

2.2.1 Forward Engineering


Forward engineering is the process of building a new application beginning with the requirements. The conceptual data model (CDM) is completed first to understand the scope of the initiative and the key terminology within that scope. Then the logical data model (LDM) is completed to document the business solution, followed by the physical data model (PDM) to document the technical solution.

2.2.2 Reverse Engineering

Reverse engineering is the process of documenting an existing database. The PDM
is completed first to understand the technical design of an existing system,
followed by an LDM to document the business solution that the existing system
meets, followed by the CDM to document the scope and key terminology within the
existing system. Most data modeling tools support reverse engineering from a
variety of databases; however, creating a readable layout of the model elements still
requires a modeler. There are several common layouts (orthogonal, dimensional,
and hierarchical) which can be selected to get the process started, but contextual
organization (grouping entities by subject area or function) is still largely a manual
process.

2.3 Review the Data Models


As with other areas of IT, models require quality control. Continuous improvement
practices should be employed. Techniques such as time-to-value, support costs, and
data model quality validators such as the Data Model Scorecard ® (Hoberman, 2009),
can all be used to evaluate the model for correctness, completeness, and
consistency. Once the CDM, LDM, and PDM are complete, they become very useful
tools for any roles that need to understand the model, ranging from business
analysts through developers.

2.4 Maintain the Data Models


Once the data models are built, they need to be kept current. Updates to the data
model need to be made when requirements change and frequently when business
processes change. Within a specific project, often when one model level needs to
change, a corresponding higher level of model needs to change. For example, if a
new column is added to a physical data model, that column frequently needs to be
added as an attribute to the corresponding logical data model. A good practice at
the end of each development iteration is to reverse engineer the latest physical data
model and make sure it is still consistent with its corresponding logical data model.
Many data modeling tools help automate this process of comparing physical with
logical.

3. Tools
There are many types of tools that can assist data modelers in completing their
work, including data modeling, lineage, data profiling tools, and Metadata
repositories.

3.1 Data Modeling Tools


Data modeling tools are software that automate many of the tasks the data modeler
performs. Entry-level data modeling tools provide basic drawing functionality
including a data modeling palette so that the user can easily create entities and
relationships. These entry-level tools also support rubber banding, which is the
automatic redrawing of relationship lines when entities are moved. More
sophisticated data modeling tools support forward engineering from conceptual to
logical to physical to database structures, allowing the generation of database data
definition language (DDL). Most will also support reverse engineering from
database up to conceptual data model. These more sophisticated tools often
support functionality such as naming standards validation, spellcheckers, a place to
store Metadata (e.g., definitions and lineage), and sharing features (such as
publishing to the Web).
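As a rough sketch of what the forward-engineering step produces (the entity and column names below are hypothetical, and real modeling tools generate far richer DDL), a tool takes logical entities and attributes and emits physical CREATE TABLE statements:

    # Hypothetical logical model fragment: entity name -> attribute/type pairs.
    logical_model = {
        "Student": [("StudentID", "INTEGER"), ("LastName", "VARCHAR(50)")],
        "Course": [("CourseID", "INTEGER"), ("CourseName", "VARCHAR(100)")],
    }

    def generate_ddl(model):
        """Emit simple CREATE TABLE statements from the logical model."""
        statements = []
        for entity, attributes in model.items():
            columns = ",\n  ".join(f"{name} {datatype}" for name, datatype in attributes)
            statements.append(f"CREATE TABLE {entity} (\n  {columns}\n);")
        return "\n\n".join(statements)

    print(generate_ddl(logical_model))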

3.2 Lineage Tools


A lineage tool is software that allows the capture and maintenance of the source
structures for each attribute on the data model. These tools enable impact analysis;
that is, one can use them to see if a change in one system or part of a system has
effects in another system. For example, the attribute Gross Sales Amount might be
sourced from several applications and require a calculation to populate – lineage
tools would store this information. Microsoft Excel® is a frequently-used lineage
tool. Although easy to use and relatively inexpensive, Excel does not enable real
impact analysis and leads to manually managing Metadata. Lineage is also
frequently captured in a data modeling tool, Metadata repository, or data
integration tool. (See Chapters 11 and 12.)
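A minimal sketch of what such a lineage record might hold for the Gross Sales Amount example above (the field names and source systems are hypothetical; a real lineage tool or Metadata repository stores and cross-references this information):

    # Hypothetical lineage entry: target attribute, its sources, and the rule
    # used to populate it. Impact analysis walks these links in reverse.
    lineage = {
        "target": "Sales.GrossSalesAmount",
        "sources": ["OrderSystem.OrderLine.ExtendedPrice",
                    "POSSystem.Ticket.LineAmount"],
        "transformation": "SUM of source amounts converted to reporting currency",
    }

    def impacted_targets(changed_source, lineage_entries):
        """Return the targets affected when a source structure changes."""
        return [e["target"] for e in lineage_entries
                if changed_source in e["sources"]]

    print(impacted_targets("POSSystem.Ticket.LineAmount", [lineage]))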
3.3 Data Profiling Tools
A data profiling tool can help explore the data content, validate it against existing
Metadata, and identify Data Quality gaps/deficiencies, as well as deficiencies in
existing data artifacts, such as logical and physical models, DDL, and model
descriptions. For example, if the business expects that an Employee can have only
one job position at a time, but the system shows Employees have more than one job
position in the same timeframe, this will be logged as a data anomaly. (See
Chapters 8 and 13.)
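To illustrate the job-position rule described above, here is a hedged sketch (hypothetical records and field names) of the kind of check a profiling tool runs: it flags any employee holding two positions whose date ranges overlap.

    from datetime import date

    # Hypothetical position assignments: (employee, start, end).
    positions = [
        ("E01", date(2023, 1, 1), date(2023, 6, 30)),
        ("E01", date(2023, 5, 1), date(2023, 12, 31)),   # overlaps the first row
        ("E02", date(2023, 1, 1), date(2023, 12, 31)),
    ]

    def overlapping_positions(rows):
        """Flag employees with more than one position in the same timeframe."""
        anomalies = []
        for i, (emp_a, start_a, end_a) in enumerate(rows):
            for emp_b, start_b, end_b in rows[i + 1:]:
                if emp_a == emp_b and start_a <= end_b and start_b <= end_a:
                    anomalies.append(emp_a)
        return anomalies

    print("Data anomalies:", overlapping_positions(positions))   # ['E01']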

3.4 Metadata Repositories


A Metadata repository is a software tool that stores descriptive information about
the data model, including the diagram and accompanying text such as definitions,
along with Metadata imported from other tools and processes (software
development and BPM tools, system catalogs, etc.). The repository itself should
enable Metadata integration and exchange. Even more important than storing the
Metadata is sharing the Metadata. Metadata repositories must have an easily
accessible way for people to view and navigate the contents of the repository. Data
modeling tools generally include a limited repository. (See Chapter 13.)

3.5 Data Model Patterns


Data model patterns are reusable modeling structures that can be applied to a wide
class of situations. There are elementary, assembly, and integration data model
patterns. Elementary patterns are the ‘nuts and bolts’ of data modeling. They
include ways to resolve many-to-many relationships, and to construct self-
referencing hierarchies. Assembly patterns represent the building blocks that span
the business and data modeler worlds. Business people can understand them –
assets, documents, people and organizations, and the like. Equally importantly, they
are often the subject of published data model patterns that can give the modeler
proven, robust, extensible, and implementable designs. Integration patterns
provide the framework for linking the assembly patterns in common ways (Giles,
2011).

3.6 Industry Data Models


Industry data models are data models pre-built for an entire industry, such as
healthcare, telecom, insurance, banking, or manufacturing. These models are often
both broad in scope and very detailed. Some industry data models contain
thousands of entities and attributes. Industry data models can be purchased
through vendors or obtained through industry groups such as ARTS (for retail),
SID (for communications), or ACORD (for insurance).
Any purchased data model will need to be customized to fit an organization, as it
will have been developed from multiple other organizations’ needs. The level of
customization required will depend on how close the model is to an organization’s
needs, and how detailed the most important parts are. In some cases, it can be a
reference for an organization’s in-progress efforts to help the modelers make
models that are more complete. In others, it can merely save the data modeler some
data entry effort for annotated common elements.

4. Best Practices

4.1 Best Practices in Naming Conventions


The ISO 11179 Metadata Registry, an international standard for representing
Metadata in an organization, contains several sections related to data standards,
including naming attributes and writing definitions.
Data modeling and database design standards serve as the guiding principles to
effectively meet business data needs, conform to Enterprise and Data Architecture
(see Chapter 4) and ensure the quality of data (see Chapter 14). Data architects,
data analysts, and database administrators must jointly develop these standards.
They must complement and not conflict with related IT standards.
Publish data model and database naming standards for each type of modeling
object and database object. Naming standards are particularly important for
entities, tables, attributes, keys, views, and indexes. Names should be unique and
as descriptive as possible.
Logical names should be meaningful to business users, using full words as much as
possible and avoiding all but the most familiar abbreviations. Physical names must
conform to the maximum length allowed by the DBMS, so use abbreviations where
necessary. While logical names use blank spaces as separators between words,
physical names typically use underscores as word separators.
Naming standards should minimize name changes across environments. Names
should not reflect their specific environment, such as test, QA, or production. Class
words, which are the last terms in attribute names such as Quantity, Name, and
Code, can be used to distinguish attributes from entities and column names from
table names. They can also show which attributes and columns are quantitative
rather than qualitative, which can be important when analyzing the contents of
those columns.
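As an illustrative sketch only (the abbreviation list and the 30-character limit are assumptions, not mandated by the text), converting a logical name to a physical name typically replaces spaces with underscores, applies approved abbreviations, and respects the DBMS length limit:

    # Hypothetical approved abbreviations; real standards publish their own list.
    ABBREVIATIONS = {"Quantity": "Qty", "Number": "Nbr", "Amount": "Amt"}
    MAX_PHYSICAL_LENGTH = 30   # assumed DBMS limit

    def to_physical_name(logical_name):
        """Derive a physical column name from a logical attribute name."""
        words = [ABBREVIATIONS.get(w, w) for w in logical_name.split()]
        physical = "_".join(w.upper() for w in words)
        return physical[:MAX_PHYSICAL_LENGTH]

    print(to_physical_name("Order Line Quantity"))   # ORDER_LINE_QTY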

4.2 Best Practices in Database Design


In designing and building the database, the DBA should keep the following design
principles in mind (remember the acronym PRISM):

Performance and ease of use: Ensure quick and easy access to data by
approved users in a usable and business-relevant form, maximizing the
business value of both applications and data.
Reusability: The database structure should ensure that, where
appropriate, multiple applications can use the data and that the data can
serve multiple purposes (e.g., business analysis, quality improvement,
strategic planning, customer relationship management, and process
improvement). Avoid coupling a database, data structure, or data object to
a single application.
Integrity: The data should always have a valid business meaning and
value, regardless of context, and should always reflect a valid state of the
business. Enforce data integrity constraints as close to the data as
possible, and immediately detect and report violations of data integrity
constraints.
Security: True and accurate data should always be immediately available to
authorized users, but only to authorized users. The privacy concerns of all
stakeholders, including customers, business partners, and government
regulators, must be met. Enforce data security, like data integrity, as close
to the data as possible, and immediately detect and report security
violations.
Maintainability: Perform all data work at a cost that yields value by
ensuring that the cost of creating, storing, maintaining, using, and
disposing of data does not exceed its value to the organization. Ensure the
fastest possible response to changes in business processes and new
business requirements.

5. Data Model Governance

5.1 Data Model and Design Quality Management


Data analysts and designers act as intermediaries between information consumers
(the people with business requirements for data) and the data producers who
capture the data in usable form. Data professionals must balance the data
requirements of the information consumers and the application requirements of
data producers.
Data professionals must also balance the short-term versus long-term business
interests. Information consumers need data in a timely fashion to meet short-term
business obligations and to take advantage of current business opportunities.
System-development project teams must meet time and budget constraints.
However, they must also meet the long-term interests of all stakeholders by
ensuring that an organization’s data resides in data structures that are secure,
recoverable, sharable, and reusable, and that this data is as correct, timely, relevant,
and usable as possible. Therefore, data models and database designs should be a
reasonable balance between the short-term needs and the long-term needs of the
enterprise.

5.1.1 Develop Data Modeling and Design Standards


As previously noted (in Section 4.1) data modeling and database design standards
provide guiding principles to meet business data requirements, conform to
Enterprise and Data Architecture standards, and ensure the quality of data. Data
modeling and database design standards should include the following:

A list and description of standard data modeling and database design deliverables
A list of standard names, acceptable abbreviations, and abbreviation rules
for uncommon words, that apply to all data model objects
A list of standard naming formats for all data model objects, including
attribute and column class words
A list and description of standard methods for creating and maintaining
these deliverables
A list and description of data modeling and database design roles and
responsibilities
A list and description of all Metadata properties captured in data
modeling and database design, including both business Metadata and
technical Metadata. For example, guidelines may set the expectation that
the data model captures lineage for each attribute.
Metadata quality expectations and requirements (see Chapter 13)
Guidelines for how to use data modeling tools
Guidelines for preparing for and leading design reviews
Guidelines for versioning of data models
Practices that are discouraged

5.1.2 Review Data Model and Database Design Quality


Project teams should conduct requirements reviews and design reviews of the
conceptual data model, logical data model, and physical database design. The
agenda for review meetings should include items for reviewing the starting model
(if any), the changes made to the model and any other options that were considered
and rejected, and how well the new model conforms to any modeling or
architecture standards in place.

5.1.3 Manage Data Model Versioning and Integration


Data models and other design specifications require careful change control, just
like requirements specifications and other SDLC deliverables. Note each change to
a data model to preserve the lineage of changes over time. If a change affects the
logical data model, such as a new or changed business data requirement, the data
analyst or architect must review and approve the change to the model.
Each change should note:

Why the project or situation required the change


What and How the object(s) changed, including which tables had columns
added, modified, or removed, etc.
When the change was approved and when the change was made to the
model (not necessarily when the change was implemented in a system)
Who made the change
Where the change was made (in which models)
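A minimal sketch of a change-log entry capturing the Why, What/How, When, Who, and Where items listed above (all values are hypothetical):

    model_change = {
        "why":   "New business requirement to track customer loyalty tier",
        "what":  "Added column LOYALTY_TIER to table CUSTOMER",
        "approved_on": "2024-03-10",
        "changed_on":  "2024-03-12",   # date the model changed, not deployment
        "who":   "j.smith (data architect)",
        "where": ["Logical data model v2.4", "Physical data model v2.4"],
    }
    for field, value in model_change.items():
        print(f"{field}: {value}")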

5.2 Data Modeling Metrics


There are several ways of measuring a data model’s quality, and all require a
standard for comparison. One method that will be used to provide an example of
data model validation is The Data Model Scorecard ®, which provides 11 data model
quality metrics: one for each of ten categories that make up the Scorecard and an
overall score across all ten categories. Table 11 contains the Scorecard template.

6. Works Cited / Recommended


Ambler, Scott. Agile Database Techniques: Effective Strategies for the Agile Software
Developer. Wiley and Sons, 2003. Print.
Avison, David and Christine Cuthbertson. A Management Approach to Database
Applications. McGraw-Hill Publishing Co., 2002. Print. Information systems ser.
Blaha, Michael. UML Database Modeling Workbook. Technics Publications, LLC, 2013.
Print.
Brackett, Michael H. Data Resource Design: Reality Beyond Illusion. Technics
Publications, LLC, 2012. Print.
Brackett, Michael H. Data Resource Integration: Understanding and Resolving a
Disparate Data Resource. Technics Publications, LLC, 2012. Print.
Brackett, Michael H. Data Resource Simplexity: How Organizations Choose Data
Daoust, Norman. UML Requirements Modeling for Business Analysts: Steps to
Modeling Success. Technics Publications, LLC, 2012. Print.
Date, C. J. An Introduction to Database Systems. 8th ed. Addison-Wesley, 2003. Print.
Date, C. J. and Hugh Darwen. Databases, Types and the Relational Model. 3d ed.
Addison Wesley, 2006. Print.
Date, Chris J. The Relational Database Dictionary: A Comprehensive Glossary of
Relational Terms and Concepts, with Illustrative Examples. O’Reilly Media, 2006. Print.
Dorsey, Paul. Enterprise Data Modeling Using UML. McGraw-Hill Osborne Media,
2009. Print.
CHAPTER 6
Data Storage and Operations
1. Introduction
Data Storage and Operations includes the design, implementation, and support of
stored data, to maximize its value throughout its lifecycle, from creation/acquisition
to disposal. Data Storage and Operations includes two sub-activities:

Database support focuses on activities related to the data lifecycle, from initial implementation of a database environment, through obtaining, backing up, and purging data. It also includes ensuring the database performs well. Monitoring and tuning are critical to database support.
Database technology support includes defining technical requirements
that will meet organizational needs, defining technical architecture,
installing and administering technology, and resolving issues related to
technology.

Database administrators (DBAs) play key roles in both aspects of data storage and
operations. The role of DBA is the most established and most widely adopted data
professional role, and database administration practices are perhaps the most
mature of all data management practices. DBAs also play dominant roles in data
operations and data security.
1.1 Business Drivers
Companies rely on their information systems to run their operations. Data Storage
and Operations activities are crucial to organizations that rely on data. Business
continuity is the primary driver of these activities. If a system becomes unavailable,
company operations may be impaired or stopped completely. A reliable data
storage infrastructure for IT operations minimizes the risk of disruption.

1.2 Goals and Principles


The goals of data storage and operations include:

Managing the availability of data throughout the data lifecycle


Ensuring the integrity of data assets
Managing the performance of data transactions

Data Storage and Operations represent a highly technical side of data management.
DBAs and others involved in this work can do their jobs better and help the overall
work of data management when they follow these guiding principles:

Identify and act on automation opportunities: Automate database development processes, developing tools, and processes that shorten each development cycle, reduce errors and rework, and minimize the impact on the development team.
Build with reuse in mind: Develop and promote the use of abstracted and
reusable data objects that prevent applications from being tightly coupled
to database schemas.
Understand and appropriately apply best practices: DBAs should promote
database standards and best practices as requirements, but be flexible
enough to deviate from them if given acceptable reasons for these
deviations.
Connect database standards to support requirements: For example, the
Service Level Agreement (SLA) can reflect DBA-recommended and
developer-accepted methods of ensuring data integrity and data security.
Set expectations for the DBA role in project work: Ensuring project
methodology includes onboarding the DBA in the project definition phase can
help throughout the SDLC.
1.3 Essential Concepts
1.3.1 Database Terms
Database terminology is specific and technical. In working as a DBA or with DBAs,
it is important to understand the specifics of this technical language:

Database: Any collection of stored data, regardless of structure or content.


Some large databases refer to instances and schema.
Instance: An execution of database software controlling access to a certain
area of storage.
Schema: A subset of database objects contained within the database or
an instance. Schemas are used to organize objects into more manageable
parts.
Node: An individual computer hosting either processing or data as part of
a distributed database.
Database abstraction means that a common application programming interface (API) is
used to call database functions, such that an application can connect to
multiple different databases without the programmer having to know all
function calls for all possible databases.

1.3.2 Data Lifecycle Management


DBAs maintain and assure the accuracy and consistency of data over its entire
lifecycle through the design, implementation, and usage of any system that stores,
processes, or retrieves data. The DBA is the custodian of all database changes.
While many parties may request changes, the DBA defines the precise changes to
make to the database, implements the changes, and controls the changes.
Data lifecycle management includes implementing policies and procedures for
acquisition, migration, retention, expiration, and disposition of data. It is prudent
to prepare checklists to ensure all tasks are performed at a high level of quality.

1.3.3 Administrators
The role of Database Administrator (DBA) is the most established and the most
widely adopted data professional role. DBAs play the dominant roles in Data
Storage and Operations, and critical roles in Data Security, the physical side of data
modeling, and database design.
DBAs do not exclusively perform all the activities of Data Storage and Operations.
Data stewards, data architects, network administrators, data analysts, and security
analysts participate in planning for performance, retention, and recovery. These
teams may also participate in obtaining and processing data from external sources.

1.3.3.1 Production DBA


Production DBAs take primary responsibility for data operations management,
including:

Ensuring the performance and reliability of the database, through performance tuning, monitoring, error reporting, and other activities
Implementing backup and recovery mechanisms to ensure data can be recovered if lost in any circumstance
Implementing mechanisms for clustering and failover of the database, if continual data availability is a requirement
Executing other database maintenance activities, such as implementing
mechanisms for archiving data

1.3.3.2 Application DBA


An application DBA is responsible for one or more databases in all environments
(development / test, QA, and production), as opposed to database systems
administration for any of these environments. Sometimes, application DBAs report
to the organizational units responsible for development and maintenance of the
applications supported by their databases. There are pros and cons to staffing
application DBAs.
1.3.3.3 Procedural and Development DBAs
Procedural DBAs lead the review and administration of procedural database
objects. A procedural DBA specializes in development and support of procedural
logic controlled and executed by the DBMS: stored procedures, triggers, and user-
defined functions (UDFs). The procedural DBA ensures this procedural logic is
planned, implemented, tested, and shared (reused).
Development DBAs focus on data design activities including creating and
managing special use databases, such as ‘sandbox’ or exploration areas.
1.3.3.4 NSA
Network Storage Administrators (NSAs) are concerned with the hardware and software supporting data storage arrays. Network storage array systems have different needs and monitoring requirements than simple database systems.
1.3.4 Database Architecture Types
A database can be classified as either centralized or distributed. A centralized
system manages a single database, while a distributed system manages multiple
databases on multiple systems. A distributed system’s components can be
classified depending on the autonomy of the component systems into two types:
federated (autonomous) or non-federated (non-autonomous). Figure 55 illustrates
the difference between centralized and distributed.

1.3.4.1 Centralized Databases


Centralized databases have all the data in one system in one place. All users come
to the one system to access the data. For certain restricted data, centralization can
be ideal, but for data that needs to be widely available, centralized databases have
risks. For example, if the centralized system is unavailable, there are no other
alternatives for accessing the data.
1.3.4.2 Distributed Databases
Distributed databases make possible quick access to data over a large number of
nodes. Popular distributed database technologies are based on using commodity
hardware servers. They are designed to scale out from single servers to thousands
of machines, each offering local computation and storage. Rather than rely on
hardware to deliver high-availability, the database management software itself is
designed to replicate data amongst the servers, thereby delivering a highly
available service on top of a cluster of computers. Database management software
is also designed to detect and handle failures. While any given computer may fail, the system overall is unlikely to.

Virtualization / Cloud Platforms
1.3.5 Database Processing Types
There are two basic types of database processing. ACID and BASE are on opposite
ends of a spectrum, so the coincidental names matching ends of a pH spectrum are
helpful. The CAP Theorem is used to define how closely a distributed system may
match either ACID or BASE.
1.3.5.1 ACID
The acronym ACID was coined in the early 1980s as the indispensable constraint
for achieving reliability within database transactions. For decades, it has provided
transaction processing with a reliable foundation on which to build.

Atomicity: All operations are performed, or none of them is, so that if one
part of the transaction fails, then the entire transaction fails.
Consistency: The transaction must meet all rules defined by the system at
all times and must void half-completed transactions.
Isolation: Each transaction is independent unto itself.
Durability: Once complete, the transaction cannot be undone.

Relational ACID technologies are the dominant tools in relational database storage;
most use SQL as the interface.
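A small sketch using Python's built-in sqlite3 module (the account table and transfer amount are hypothetical) showing atomicity: either both updates of a transfer are committed, or the whole transaction is rolled back.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
    conn.commit()

    try:
        # Both operations belong to one transaction.
        conn.execute("UPDATE account SET balance = balance - 150 WHERE id = 'A'")
        row = conn.execute("SELECT balance FROM account WHERE id = 'A'").fetchone()
        if row[0] < 0:
            raise ValueError("overdraft violates a consistency rule")
        conn.execute("UPDATE account SET balance = balance + 150 WHERE id = 'B'")
        conn.commit()          # durability: changes persist once committed
    except ValueError:
        conn.rollback()        # atomicity: the half-completed transfer is voided

    print(conn.execute("SELECT id, balance FROM account").fetchall())
    # [('A', 100), ('B', 0)] - the failed transfer left no partial effects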
1.3.5.2 BASE
The unprecedented increase in data volumes and variability, the need to document
and store unstructured data, the need for read-optimized data workloads, and
subsequent need for greater flexibility in scaling, design, processing, cost, and
disaster recovery gave rise to the diametric opposite of ACID, appropriately termed
BASE:

Basically Available: The system guarantees some level of availability to the data even when there are node failures. The data may be stale, but the system will still give and accept responses.
Soft State: The data is in a constant state of flux; while a response may be
given, the data is not guaranteed to be current.
Eventual Consistency: The data will eventually be consistent through all
nodes and in all databases, but not every transaction will be consistent at
every moment.
1.3.6 Data Storage Media
Data can be stored on a variety of media, including disks, volatile memory, and
flash drives. Some systems can combine multiple storage types. The most
commonly used are Disk and Storage Area Networks (SAN), In-Memory, Columnar Compression Solutions, Virtual Storage Area Networks (VSAN), Cloud-based storage solutions, Radio Frequency Identification (RFID), Digital wallets, Data centers, and Private, Public, and Hybrid Cloud Storage.
1.3.6.1 Disk and Storage Area Networks (SAN)
Disk storage is a very stable method of storing data persistently. Multiple types of
disk can exist in the same system. Data can be stored according to usage patterns,
with less-used data stored on slower-access disks, which are usually cheaper than
high performance disk systems.
1.3.6.2 In-Memory
In-Memory databases (IMDB) are loaded from permanent storage into volatile
memory when the system is turned on, and all processing occurs within the
memory array, providing faster response time than disk-based systems. Most in-
memory databases also have features to set and configure durability in case of
unexpected shutdown.
1.3.6.3 Columnar Compression Solutions
Columnar-based databases are designed to handle data sets in which data values
are repeated to a great extent. For example, in a table with 256 columns, a lookup
for a value that exists in a row will retrieve all the data in the row (and be somewhat
disk-bound). Columnar storage reduces this I/O bandwidth by storing column data
using compression – where the state (for example) is stored as a pointer to a table
of states, compressing the master table significantly.
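A toy sketch of the dictionary-compression idea described above (the data values are hypothetical): repeated state values are stored once in a lookup table, and the column itself stores only small pointers.

    # Row storage repeats the full state value on every row.
    states_column = ["California", "Texas", "California", "Texas", "California"]

    # Columnar dictionary encoding: store each distinct value once,
    # and keep only an integer pointer per row.
    dictionary = sorted(set(states_column))                  # ['California', 'Texas']
    encoded = [dictionary.index(v) for v in states_column]   # [0, 1, 0, 1, 0]

    # Decoding reverses the lookup when the column is read.
    decoded = [dictionary[i] for i in encoded]
    print(dictionary, encoded, decoded == states_column)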
1.3.6.4 Flash Memory
Recent advances in memory storage have made flash memory or solid state drives
(SSDs) an attractive alternative to disks. Flash memory combines the access speed
of memory-based storage with the persistence of disk-based storage.

1.3.7 Database Environments


Databases are used in a variety of environments during the systems development
lifecycle. When testing changes, DBAs should be involved in designing the data
structures in the Development environment. The DBA team should implement any
changes to the QA environment, and must be the only team implementing changes
to the Production environment. Production changes must adhere strictly to
standard processes and procedures.
While most data technology is software running on general purpose hardware,
occasionally specialized hardware is used to support unique data management
requirements. Types of specialized hardware include data appliances – servers built
specifically for data transformation and distribution. These servers integrate with
existing infrastructure either directly as a plug-in, or peripherally as a network
connection.

2. Activities
The two main activities in Data Operations and Storage are Database Technology
Support and Database Operations Support. Database Technology Support is
specific to selecting and maintaining the software that stores and manages the data.

2.1 Manage Database Technology


Managing database technology should follow the same principles and standards for
managing any technology.
The leading reference model for technology management is the Information
Technology Infrastructure Library (ITIL), a technology management process model
developed in the United Kingdom. ITIL principles apply to managing data
technology.

2.1.1 Understand Database Technology Characteristics


It is important to understand how technology works, and how it can provide value
in the context of a particular business. The DBA, along with the rest of the data
services teams, works closely with business users and managers to understand the
data and information needs of the business.

2.1.2 Evaluate Database Technology


Selecting strategic DBMS software is particularly important. DBMS software has a
major impact on data integration, application performance, and business
productivity. Some of the factors to consider when selecting DBMS software
include:

Product architecture and complexity


Volume and velocity limits, including streaming rate
Application profile, such as transaction processing, Business Intelligence,
and personal profiles
Specific functionality, such as temporal calculation support
Hardware platform and operating system support
Availability of supporting software tools
Performance benchmarks, including real-time statistics
Scalability
Software, memory, and storage requirements
Resiliency, including error handling and reporting

2.1.3 Manage and Monitor Database Technology


DBAs often serve as Level 2 technical support, working with help desks and
technology vendor support to understand, analyze, and resolve user problems. The
key to effective understanding and use of any technology is training. Organizations
should make sure they have training plans and budgets in place for everyone
involved in implementing, supporting, and using data and database technology.

2.2 Manage Databases


Database support, as provided by DBAs and Network Storage Administrators
(NSAs), is at the heart of data management. Databases reside on managed storage
areas. Managed storage can be as small as a disk drive on a personal computer
(managed by the OS), or as large as RAID arrays on a storage area network or SAN.
Backup media is also managed storage.
DBAs manage various data storage applications by assigning storage structures,
maintaining physical databases (including physical data models and physical
layouts of the data, such as assignments to specific files or disk areas), and
establishing DBMS environments on servers.

2.2.1 Understand Requirements


2.2.1.1 Define Storage Requirements
DBAs establish storage systems for DBMS applications and file storage systems to
support NoSQL. NSAs and DBAs together play a vital role in establishing file
storage systems. Data enters the storage media during normal business operations
and, depending on the requirements, can stay permanently or temporarily. It is
important to plan for adding additional space well in advance of when that space is
actually needed. Doing any sort of maintenance in an emergency is a risk.
2.2.1.2 Identify Usage Patterns
Databases have predictable usage patterns. Basic types of patterns include:

Transaction-based
Large data set write- or retrieval-based
Time-based (heavier at month end, lighter on weekends, etc.),
Location-based (more densely populated areas have more transactions,
etc.)
Priority-based (some departments or batch IDs have higher priority than
others)

2.2.1.3 Define Access Requirements


Data access includes activities related to storing, retrieving, or acting on data
housed in a database or other repository. Data Access is simply the authorization to
access different data files.
Various standard languages, methods, and formats exist for accessing data from
databases and other repositories: SQL, ODBC, JDBC, XQJ, ADO.NET, XML, X
Query, X Path, and Web Services for ACID-type systems. BASE-type access method
standards include C, C++, REST, XML, and Java 36. Some standards enable translation
of data from unstructured (such as HTML or free-text files) to structured (such as
XML or SOL).
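To make the idea of standardized access concrete, here is a small sketch using Python's DB-API with the built-in sqlite3 driver (the table and query are hypothetical); ODBC and JDBC play the analogous role for other languages and databases.

    import sqlite3

    # Connect through a standard interface; only the connection string is
    # driver-specific, while the SQL and cursor calls follow the same pattern.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO product (name) VALUES (?)", ("Widget",))

    cursor = conn.execute("SELECT id, name FROM product WHERE name = ?", ("Widget",))
    for row in cursor.fetchall():
        print(row)
    conn.close()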

2.2.2 Plan for Business Continuity


Organizations need to plan for business continuity in the event of disaster or
adverse event that impacts their systems and their ability to use their data. DBAs
must make sure a recovery plan exists for all databases and database servers,
covering scenarios that could result in loss or corruption of data, such as:

Loss of the physical database server


Loss of one or more disk storage devices
Loss of a database, including the DBMS Master Database, temporary
storage database, transaction log segment, etc.
Corruption of database index or data pages
Loss of the database or log segment filesystems
Loss of database or transaction log backup files
2.2.2.1 Make Backups
Make backups of databases and, if appropriate, the database transaction logs. The
system’s Service Level Agreement (SLA) should specify backup frequency. Balance
the importance of the data against the cost of protecting it. For large databases,
frequent backups can consume large amounts of disk storage and server resources.
In addition to incremental backups, periodically make a complete backup of each
database.
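A hedged sketch of a scheduled full-backup step, assuming a PostgreSQL environment and the standard pg_dump utility (the database name, path, and schedule are hypothetical and would normally come from the SLA):

    import subprocess
    from datetime import date

    DATABASE = "sales_db"                      # hypothetical database name
    BACKUP_FILE = f"/backups/{DATABASE}_{date.today():%Y%m%d}.dump"

    def full_backup():
        """Take a full backup with pg_dump in its custom (compressed) format."""
        subprocess.run(
            ["pg_dump", "--format=custom", f"--file={BACKUP_FILE}", DATABASE],
            check=True,   # raise if the backup fails so the job is flagged
        )

    if __name__ == "__main__":
        full_backup()
        print("Backup written to", BACKUP_FILE)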

2.2.2.2 Recover Data


Most backup software includes the option to read from the backup into the system.
The DBA works with the infrastructure team to re-mount the media containing the
backup and to execute the restoration. The specific utilities used to execute the
restoration of the data depend on the type of database.
Data in file system databases may be easier to restore than those in relational
database management systems, which may have catalog information that needs to
be updated during the data recovery, especially if the recovery is from logs instead
of a full backup.

2.2.3 Develop Database Instances


DBAs are responsible for the creation of database instances. Related activities
include:

Installing and updating DBMS software:


Maintaining multiple environment installations, including different
DBMS versions:
Installing and administering related data technology:

2.2.3.1 Manage the Physical Storage Environment


Storage environment management needs to follow traditional Software
Configuration Management (SCM) processes or Information Technology
Infrastructure Library (ITIL) methods to record modification to the database
configuration, structures, constraints, permissions, thresholds, etc. DBAs need to
update the physical data model to reflect the changes to the storage objects as part
of a standard configuration management process, which covers:
Configuration identification
Configuration change control
Configuration status accounting
Configuration audits

2.2.3.2 Manage Database Access Controls


DBAs are responsible for managing the controls that enable access to the data.
DBAs oversee the following functions to protect data assets and data integrity:

Controlled environment:
Physical security:
Monitoring:
Controls:
2.2.3.3 Create Storage Containers
All data must be stored on a physical drive and organized for ease of load, search,
and retrieval. Storage containers themselves may contain storage objects, and each
level must be maintained appropriate to the level of the object. For example,
relational databases have schemas that contain tables, and non-relational databases
have filesystems that contain files.
2.2.3.4 Implement Physical Data Models
DBAs are typically responsible for creating and managing the complete physical
data storage environment based on the physical data model. The physical data
model includes storage objects, indexing objects, and any encapsulated code
objects required to enforce data quality rules, connect database objects, and achieve
database performance.
2.2.3.5 Load Data
When first built, databases are empty. DBAs fill them. If the data to be loaded has
been exported using a database utility, it may not be necessary to use a data
integration tool to load it into the new database. Most database systems have bulk
load capabilities, requiring that the data be in a format that matches the target
database object, or having a simple mapping function to link data in the source to
the target object.
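A small sketch of a simple bulk load using Python's built-in sqlite3 and csv modules (the file layout and columns are hypothetical); real DBMS bulk loaders work on the same principle of mapping source fields to target columns.

    import csv
    import io
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id TEXT, customer_name TEXT)")

    # Stand-in for an exported file whose layout matches the target table.
    export = io.StringIO("customer_id,customer_name\nC01,Acme Ltd\nC02,Globex\n")
    rows = [(r["customer_id"], r["customer_name"]) for r in csv.DictReader(export)]

    # Bulk insert: one statement executed over the whole batch.
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    conn.commit()
    print(conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0], "rows loaded")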
2.2.3.6 Manage Data Replication
DBAs can influence decisions about the data replication process by advising on:

Active or passive replication


Distributed concurrency control from distributed data systems
The appropriate methods to identify updates to data through either
timestamp or version numbers under Change Data Control process

2.2.4 Manage Database Performance


Database performance depends on two interdependent facets: availability and
speed. Performance includes ensuring availability of space, query optimization, and
other factors that enable a database to return data in an efficient way. Performance
cannot be measured without availability. An unavailable database has a
performance measure of zero.

2.2.5 Manage Test Data Sets


Software testing is labor-intensive and accounts for nearly half of the cost of the
system development. Efficient testing requires high quality test data, and this data
must be managed. Test data generation is a critical step in software testing.

2.2.6 Manage Data Migration


Data migration is the process of transferring data between storage types, formats,
or computer systems, with as little change as possible.
Data migration is a key consideration for any system implementation, upgrade, or
consolidation. It is usually performed programmatically, being automated based on
rules.
Many day-to-day tasks a storage administrator has to perform can be simply and
concurrently completed using data migration techniques:
Moving data off an over-used storage device to a separate environment
Moving data onto a faster storage device as needs require
Implementing an Information Lifecycle Management policy
Migrating data off older storage devices (either being scrapped or off-lease) to offline or cloud storage

3. Tools
In addition to the database management systems themselves, DBAs use multiple
other tools to manage databases. For example, modeling and other application
development tools, interfaces that allow users to write and execute queries, data
evaluation and modification tools for data quality improvement, and performance
load monitoring tools.
3.1 Data Modeling Tools
Data modeling tools automate many of the tasks the data modeler performs. Some
data modeling tools allow the generation of database data definition language
(DDL). Most support reverse engineering from database into a data model. Tools
that are more sophisticated validate naming standards, check spelling, store
Metadata such as definitions and lineage, and even enable publishing to the web.

3.2 Database Monitoring Tools


Database monitoring tools automate monitoring of key metrics, such as capacity,
availability, cache performance, user statistics, etc., and alert DBAs and NSAs to
database issues. Most such tools can simultaneously monitor multiple database
types.

3.3 Database Management Tools


Database systems often include management tools. In addition, several third-
party software packages allow DBAs to manage multiple databases. These
applications include functions for configuration, installation of patches and
upgrades, backup and restore, database cloning, test management, and data clean-
up routines.
3.4 Developer Support Tools
Developer Support tools contain a visual interface for connecting to and executing
commands on a database. Some are included with the database management software; others are third-party applications.

4. Techniques

4.1 Test in Lower Environments


For upgrades and patches to operating systems, database software, database
changes, and code changes, install and test on the lowest level environment first –
usually development. Once tested on the lowest level, install on the next higher
levels, and install on the production environment last. This ensures that the
installers have experience with the upgrade or patch, and can minimize disruption
to the production environments.

4.2 Physical Naming Standards


Consistency in naming speeds understanding. Data architects, database
developers, and DBAs can use naming standards for defining Metadata or creating
rules for exchanging documents between organizations.
ISO/IEC 11179 – Metadata registries (MDR), addresses the semantics of data, the
representation of data, and the registration of the descriptions of that data. It is
through these descriptions that an accurate understanding of the semantics and a
useful depiction of the data are found.
The significant section for physical databases within that standard is Part 5 –
Naming and Identification Principles, which describes how to form conventions for
naming data elements and their components.

4.3 Script Usage for All Changes


It is extremely risky to directly change data in a database. However, there may be a need, such as an annual change in the chart of accounts structures, mergers and acquisitions, or emergencies, where direct changes are indicated due to the ‘one-off’ nature of the request and/or the lack of appropriate tools for these circumstances. It
is helpful to place changes to be made into update script files and test them
thoroughly in non-production environments before applying to production.
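A hedged sketch of the scripted-change approach (the table, values, and expected row count are hypothetical): the change is written as a script, run inside a transaction, verified, and only then committed, so the same file can be rehearsed in a non-production environment before it touches production.

    import sqlite3

    def apply_change(conn):
        """Apply a one-off data change inside a transaction, with a sanity check."""
        conn.execute("UPDATE chart_of_accounts SET segment = '5100' "
                     "WHERE segment = '4100'")
        changed = conn.execute("SELECT COUNT(*) FROM chart_of_accounts "
                               "WHERE segment = '5100'").fetchone()[0]
        if changed != 2:                 # expected effect, verified before commit
            conn.rollback()
            raise RuntimeError(f"unexpected row count {changed}; change rolled back")
        conn.commit()

    # Rehearsal against a throwaway (non-production) database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE chart_of_accounts (account TEXT, segment TEXT)")
    conn.executemany("INSERT INTO chart_of_accounts VALUES (?, ?)",
                     [("Travel", "4100"), ("Meals", "4100"), ("Sales", "3000")])
    conn.commit()
    apply_change(conn)
    print(conn.execute("SELECT account, segment FROM chart_of_accounts").fetchall())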

5. Implementation Guidelines

5.1 Readiness Assessment / Risk Assessment


A risk and readiness assessment revolves around two central ideas: risk of data loss
and risks related to technology readiness.

Data loss: Data can be lost through technical or procedural errors, or through malicious intent. Organizations need to put in place strategies to mitigate these risks. Service Level Agreements often specify the general requirements for protection.
Technology readiness: Newer technologies such as NoSQL, Big Data,
triple stores, and FDMS require skills and experience readiness in IT.

5.2 Organization and Cultural Change


DBAs often do not effectively promote the value of their work to the organization.
They need to recognize the legitimate concerns of data owners and data consumers,
balance short-term and long-term data needs, educate others in the organization
about the importance of good data management practices, and optimize data
development practices to ensure maximum benefit to the organization and minimal
impact on data consumers.

6. Data Storage and Operations Governance


6.1 Metrics
Data Storage metrics may include:

Count of databases by type


Aggregated transaction statistics
Capacity metrics, such as

Amount of storage used


Number of storage containers
Number of data objects in terms of committed and uncommitted blocks or pages
Data in queue

Storage service usage


Requests made against the storage services
Improvements to performance of the applications that use a service

Operational metrics may consist of:

Aggregated statistics about data retrieval time


Backup size
Data quality measurement
Availability

6.2 Information Asset Tracking


Part of data storage governance includes ensuring that an organization complies with
all licensing agreements and regulatory requirements. Carefully track and conduct
yearly audits of software license and annual support costs, as well as server lease
agreements and other fixed costs. Being out of compliance with licensing agreements
poses serious financial and legal risks for an organization.
6.3 Data Audits and Data Validation
A data audit is the evaluation of a data set based on defined criteria. Typically, an audit
is performed to investigate specific concerns about a data set and is designed to
determine whether the data was stored in compliance with contractual and
methodological requirements. The data audit approach may include a project-specific
and comprehensive checklist, required deliverables, and quality control criteria.
