Data Management Techniques Unit 3
1. Introduction
There are a number of different schemes used to represent data. The six most
commonly used schemes are: Relational, Dimensional, Object-Oriented, Fact-Based, Time-
Based, and NoSQL. Models of these schemes exist at three levels of detail: conceptual,
logical, and physical. Each model contains a set of components. Examples of components
are entities, relationships, facts, keys, and attributes. Once a model is built, it needs to be
reviewed and, once approved, maintained.
Data models comprise and contain Metadata essential to data consumers. Much of
this Metadata, uncovered during the data modeling process, is essential to other
data management functions: for example, definitions support data governance,
while lineage supports data warehousing and analytics.
Data models typically capture several kinds of information: category information,
resource information, business event information, and detail transaction information.
1.3.3.2 Domain
In data modeling, a domain is the complete set of possible values that an attribute
can be assigned. A domain may be articulated in different ways, but in every form it
provides a means of standardizing the characteristics of the attributes assigned to
it. For example, the domain Date, which contains all possible valid dates, can be
assigned to any date attribute in a logical data model or date columns/fields in a
physical data model, such as the attributes below (a small validation sketch follows the list):
EmployeeHireDate
OrderEntryDate
ClaimSubmitDate
CourseStartDate
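To make the idea concrete, here is a minimal sketch in Python, assuming a domain is represented as a single validation rule shared by every attribute assigned to it; the attribute names come from the list above, and the values are invented for illustration.

```python
from datetime import date

def is_valid_date(value) -> bool:
    """Domain check for the illustrative 'Date' domain: any datetime.date."""
    return isinstance(value, date)

# The same domain standardizes every date attribute that references it.
attributes = {
    "EmployeeHireDate": date(2020, 3, 16),
    "OrderEntryDate": date(2024, 11, 2),
    "ClaimSubmitDate": "2024-13-40",  # not a date object: fails the domain check
}

for name, value in attributes.items():
    print(name, "valid" if is_valid_date(value) else "INVALID")
```

Because all the attributes share one domain check, a change to the domain's definition propagates to every attribute that uses it.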
1.3.4 Data Modeling Schemes
The six most common schemes used to represent data are: Relational, Dimensional,
Object-Oriented, Fact-Based, Time-Based, and NoSQL. Each scheme uses specific
diagramming notations (see Table 9).
This section will briefly explain each of these schemes and notations. The use of
schemes depends in part on the database being built, as some are suited to
particular technologies, as shown in Table 10.
1.3.5.2 Logical
A logical data model is a detailed representation of data requirements, usually in
support of a specific usage context, such as application requirements. Logical data
models are still independent of any technology or specific implementation
constraints. A logical data model often begins as an extension of a conceptual data
model.
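As an illustration, a fragment of a logical data model can be captured as plain, technology-independent data. This is a hypothetical sketch, not any modeling tool's format; the entity, attribute names, and types are invented.

```python
# A technology-independent logical model fragment, captured as plain data.
# Types are logical ("string", "date"), not DBMS column types.
logical_model = {
    "Employee": {
        "attributes": {
            "EmployeeID": "integer",
            "LastName": "string",
            "EmployeeHireDate": "date",
        },
        "primary_key": ["EmployeeID"],
    },
}

print(logical_model["Employee"]["primary_key"])  # ['EmployeeID']
```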
1.3.5.3 Physical
A physical data model (PDM) represents a detailed technical solution, often using
the logical data model as a starting point and then adapting it to work within a set of
hardware, software, and network tools. Physical data models are built for a
particular technology. A physical model for a relational DBMS, for example, should
be designed with the specific capabilities of that database management system in
mind (e.g., IBM DB2 UDB, Oracle, Teradata, Sybase, Microsoft SQL Server, or
Microsoft Access).
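The step from logical to physical can be sketched as a mapping from logical types to the column types of a chosen DBMS. The mappings below are common defaults used only for illustration; a real physical design would consider many more vendor-specific capabilities.

```python
# Illustrative mapping of logical data types to physical column types
# for two target DBMSs (common defaults, not exhaustive or authoritative).
TYPE_MAP = {
    "oracle":    {"string": "VARCHAR2(255)", "integer": "NUMBER(10)", "date": "DATE"},
    "sqlserver": {"string": "VARCHAR(255)",  "integer": "INT",        "date": "DATETIME2"},
}

def physical_column(logical_type: str, dbms: str) -> str:
    """Resolve a logical type to a physical type for the chosen technology."""
    return TYPE_MAP[dbms][logical_type]

print(physical_column("date", "oracle"))     # DATE
print(physical_column("date", "sqlserver"))  # DATETIME2
```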
1.3.6 Normalization
Normalization is the process of applying rules in order to organize business
complexity into stable data structures. The basic goal of normalization is to keep
each attribute in only one place to eliminate redundancy and the inconsistencies
that can result from redundancy. The process requires a deep understanding of
each attribute and each attribute’s relationship to its primary key.
Normalization rules sort attributes according to primary and foreign keys. The
rules are organized into levels, with each level applying more granularity and
specificity in the search for the correct primary and foreign keys. Each level
comprises a separate normal form, and each successive level builds on and
includes the previous levels. Normalization levels include the following (a small
worked example of a 1NF decomposition appears after the list):
First normal form (1NF): Ensures each entity has a valid primary key, and
every attribute depends on the primary key; removes repeating groups,
and ensures each attribute is atomic (not multi-valued). 1NF includes the
resolution of many-to-many relationships with an additional entity often
called an associative entity.
Second normal form (2NF): Ensures each entity has the minimal primary
key and that every attribute depends on the complete primary key.
Third normal form (3NF): Ensures each entity has no hidden primary keys
and that each attribute depends on no attributes outside the key (“the key,
the whole key and nothing but the key”).
Boyce-Codd normal form (BCNF): Resolves overlapping composite
candidate keys. A candidate key is either a primary or an alternate key.
‘Composite’ means more than one (i.e., two or more attributes in an
entity’s primary or alternate keys), and ‘overlapping’ means there are
hidden business rules between the keys.
Fourth normal form (4NF): Resolves all many-to-many-to-many
relationships (and beyond) in pairs until they cannot be broken down into
any smaller pieces.
Fifth normal form (5NF): Resolves inter-entity dependencies into basic
pairs, and all join dependencies use parts of primary keys.
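Below is the worked example referred to above: a small Python sketch, with invented student data, showing a repeating group (a multi-valued attribute) being split into its own structure, the associative entity that resolves the many-to-many relationship, to satisfy 1NF.

```python
# Unnormalized: the multi-valued 'courses' attribute violates 1NF.
students = [
    {"student_id": 1, "name": "Ada",  "courses": ["Math", "Physics"]},
    {"student_id": 2, "name": "Alan", "courses": ["Logic"]},
]

# 1NF: split the repeating group out, keyed by (student_id, course),
# so every attribute is atomic and depends on its primary key.
student_rows = [{"student_id": s["student_id"], "name": s["name"]} for s in students]
enrollment_rows = [
    {"student_id": s["student_id"], "course": c}
    for s in students
    for c in s["courses"]
]

print(student_rows)
print(enrollment_rows)
```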
1.3.7 Abstraction
Abstraction is the removal of details in such a way as to broaden applicability to a
wide class of situations while preserving the important properties and essential
nature from concepts or subjects. An example of abstraction is the Party/Role
structure, which can be used to capture how people and organizations play certain
roles. Not all modelers or developers are comfortable with, or have the ability to
work with abstraction. The modeler needs to weigh the cost of developing and
maintaining an abstract structure versus the amount of rework required.
Abstraction includes generalization and specialization. Generalization groups the
common attributes and relationships of entities into supertype entities, while
specialization separates distinguishing attributes within an entity into subtype
entities
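A minimal sketch of the Party/Role structure follows, using Python classes to stand in for supertype and subtype entities; all names and fields are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Party:
    """Supertype: attributes common to all parties (generalization)."""
    party_id: int
    name: str

@dataclass
class Person(Party):
    """Subtype: attributes that distinguish people (specialization)."""
    birth_date: str = ""

@dataclass
class Organization(Party):
    """Subtype: attributes that distinguish organizations."""
    tax_id: str = ""

@dataclass
class PartyRole:
    """One party may play many roles (customer, supplier, employee...)."""
    party_id: int
    role: str

parties = [Person(1, "Ada Lovelace", "1815-12-10"), Organization(2, "Acme Ltd", "99-1234567")]
roles = [PartyRole(1, "employee"), PartyRole(2, "supplier"), PartyRole(1, "customer")]
print([r.role for r in roles if r.party_id == 1])  # ['employee', 'customer']
```

Generalization puts party_id and name on the supertype once; specialization keeps birth_date and tax_id only on the subtypes that need them.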
2. Activities
This section will briefly cover the steps for building conceptual, logical, and
physical data models, as well as maintaining and reviewing data models. Both
forward engineering and reverse engineering will be discussed.
3. Tools
There are many types of tools that can assist data modelers in completing their
work, including data modeling tools, lineage tools, data profiling tools, and
Metadata repositories.
4. Best Practices
Performance and ease of use: Ensure quick and easy access to data by
approved users in a usable and business-relevant form, maximizing the
business value of both applications and data.
Reusability: The database structure should ensure that, where
appropriate, multiple applications can use the data and that the data can
serve multiple purposes (e.g., business analysis, quality improvement,
strategic planning, customer relationship management, and process
improvement). Avoid coupling a database, data structure, or data object to
a single application.
Integrity: The data should always have a valid business meaning and
value, regardless of context, and should always reflect a valid state of the
business. Enforce data integrity constraints as close to the data as
possible, and immediately detect and report violations of data integrity
constraints (a minimal sketch of this principle appears after this list).
Security: True and accurate data should always be immediately available to
authorized users, but only to authorized users. The privacy concerns of all
stakeholders, including customers, business partners, and government
regulators, must be met. Enforce data security, like data integrity, as close
to the data as possible, and immediately detect and report security
violations.
Maintainability: Perform all data work at a cost that yields value by
ensuring that the cost of creating, storing, maintaining, using, and
disposing of data does not exceed its value to the organization. Ensure the
fastest possible response to changes in business processes and new
business requirements.
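The sketch below illustrates the integrity principle using SQLite (via Python's standard sqlite3 module) as a stand-in for an enterprise DBMS: constraints declared in the schema are enforced for every application that touches the data, and violations are detected and reported immediately. Table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Constraints declared in the schema are enforced "close to the data":
# every application writing to these tables is subject to the same rules.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL CHECK (amount > 0)
    )""")

try:
    conn.execute("INSERT INTO orders VALUES (1, 999, 10.0)")  # unknown customer
except sqlite3.IntegrityError as e:
    print("violation detected and reported:", e)
```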
Database administrators (DBAs) play key roles in both aspects of data storage and
operations. The role of DBA is the most established and most widely adopted data
professional role, and database administration practices are perhaps the most
mature of all data management practices. DBAs also play dominant roles in data
operations and data security.
1.1 Business Drivers
Companies rely on their information systems to run their operations. Data Storage
and Operations activities are crucial to organizations that rely on data. Business
continuity is the primary driver of these activities. If a system becomes unavailable,
company operations may be impaired or stopped completely. A reliable data
storage infrastructure for IT operations minimizes the risk of disruption.
Data Storage and Operations represent a highly technical side of data management.
DBAs and others involved in this work can do their jobs better and help the overall
work of data management when they follow a set of guiding principles.
1.3.3 Administrators
The role of Database Administrator (DBA) is the most established and the most
widely adopted data professional role. DBAs play the dominant roles in Data
Storage and Operations, and critical roles in Data Security, the physical side of data
modeling, and database design.
DBAs do not exclusively perform all the activities of Data Storage and Operations.
Data stewards, data architects, network administrators, data analysts, and security
analysts participate in planning for performance, retention, and recovery. These
teams may also participate in obtaining and processing data from external sources.
1.3.5.1 ACID
The acronym ACID describes the essential properties of reliable database
transactions:
Atomicity: All operations are performed, or none of them is, so that if one
part of the transaction fails, then the entire transaction fails.
Consistency: The transaction must meet all rules defined by the system at
all times and must void half-completed transactions.
Isolation: Each transaction is independent unto itself.
Durability: Once complete, the transaction cannot be undone.
Relational ACID technologies are the dominant tools in relational database storage;
most use SQL as the interface.
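A minimal demonstration of atomicity, using Python's built-in sqlite3 module: a simulated failure mid-transfer causes the whole transaction to roll back, so no partial update survives. The account data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
        raise RuntimeError("failure mid-transfer")  # simulated crash; the
        # second UPDATE (crediting account 2) never runs
except RuntimeError:
    pass

# Atomicity: neither update persisted, so the data remains consistent.
print(conn.execute("SELECT id, balance FROM account").fetchall())  # [(1, 100.0), (2, 0.0)]
```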
1.3.5.2 BASE
The unprecedented increase in data volumes and variability, the need to document
and store unstructured data, the need for read-optimized data workloads, and the
subsequent need for greater flexibility in scaling, design, processing, cost, and
disaster recovery gave rise to the diametric opposite of ACID, appropriately termed
BASE:
Basically Available: The system guarantees some level of availability to the
data even when nodes fail.
Soft state: The data is in a state of flux while being reconciled, and a
response is not guaranteed to reflect the most current data.
Eventual consistency: The data will eventually be consistent across all
nodes and databases, but not every transaction is consistent at every
moment.
2. Activities
The two main activities in Data Storage and Operations are Database Technology
Support and Database Operations Support. Database Technology Support is
specific to selecting and maintaining the software that stores and manages the data.
The databases being supported can serve very different usage patterns, for example:
Transaction-based
Large data set write- or retrieval-based
Time-based (heavier at month end, lighter on weekends, etc.)
Location-based (more densely populated areas have more transactions,
etc.)
Priority-based (some departments or batch IDs have higher priority than
others)
Managing the physical database environment involves attention to a controlled
environment, physical security, monitoring, and controls.
2.2.3.3 Create Storage Containers
All data must be stored on physical media and organized for ease of loading,
search, and retrieval. Storage containers may themselves contain storage objects,
and each level must be maintained appropriately for the objects at that level. For
example, relational databases have schemas that contain tables, and non-relational
databases have filesystems that contain files.
2.2.3.4 Implement Physical Data Models
DBAs are typically responsible for creating and managing the complete physical
data storage environment based on the physical data model. The physical data
model includes storage objects, indexing objects, and any encapsulated code
objects required to enforce data quality rules, connect database objects, and achieve
database performance.
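As a small-scale illustration, the sketch below implements a toy physical model in SQLite via Python's sqlite3 module: a storage object (a table), an indexing object, and an encapsulated code object (a trigger) enforcing a data quality rule. The object names and the rule itself are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Storage object from the physical model
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        last_name   TEXT NOT NULL,
        hire_date   TEXT NOT NULL
    )""")

# Indexing object chosen to support lookups by name
conn.execute("CREATE INDEX ix_employee_last_name ON employee(last_name)")

# Encapsulated code object: a trigger enforcing a data quality rule
conn.execute("""
    CREATE TRIGGER trg_employee_last_name
    BEFORE INSERT ON employee
    WHEN TRIM(NEW.last_name) = ''
    BEGIN
        SELECT RAISE(ABORT, 'last_name must not be blank');
    END""")

try:
    conn.execute("INSERT INTO employee VALUES (1, '', '2024-01-15')")
except sqlite3.IntegrityError as e:
    print("quality rule enforced:", e)
```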
2.2.3.5 Load Data
When first built, databases are empty. DBAs fill them. If the data to be loaded has
been exported using a database utility, it may not be necessary to use a data
integration tool to load it into the new database. Most database systems have bulk
load capabilities, requiring that the data be in a format that matches the target
database object, or having a simple mapping function to link data in the source to
the target object.
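The sketch below imitates a bulk load in miniature with Python's standard csv and sqlite3 modules: because the source extract already matches the target object's layout, only a trivial mapping is needed. The product data is invented.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, price REAL)")

# Source extract whose layout already matches the target object,
# so no separate data integration tool is needed.
extract = io.StringIO("sku,price\nA-1,9.99\nB-2,4.50\n")
rows = [(r["sku"], float(r["price"])) for r in csv.DictReader(extract)]

# executemany plays the role of a bulk-load utility here.
conn.executemany("INSERT INTO product VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM product").fetchone()[0])  # 2
```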
2.2.3.6 Manage Data Replication
DBAs can influence decisions about the data replication process by advising on
the available replication methods and their trade-offs.
3. Tools
In addition to the database management systems themselves, DBAs use multiple
other tools to manage databases. Examples include modeling and other application
development tools; interfaces that allow users to write and execute queries; data
evaluation and modification tools for data quality improvement; and performance
and load monitoring tools.
3.1 Data Modeling Tools
Data modeling tools automate many of the tasks the data modeler performs. Some
data modeling tools allow the generation of database data definition language
(DDL). Most support reverse engineering from database into a data model. Tools
that are more sophisticated validate naming standards, check spelling, store
Metadata such as definitions and lineage, and even enable publishing to the web.
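Reverse engineering can be illustrated in miniature with SQLite's catalog: the sketch below reads table and column Metadata back out of an existing database, which is the first step a modeling tool performs when deriving a model from a schema. The claim table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, submit_date TEXT)")

# Reverse engineering in miniature: read the catalog back into model Metadata.
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print("entity:", table)
    for cid, name, col_type, notnull, default, pk in conn.execute(
        f"PRAGMA table_info({table})"
    ):
        print(f"  attribute: {name} ({col_type}){' [PK]' if pk else ''}")
```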
4. Techniques
5. Implementation Guidelines