Data Governance Book
Related terms:
Data Governance, Data Model, Master Data Management, Metadata, Big Data Project, Metadata Management
Many data-quality projects will require data traceability to track information and
ensure that its usage is proper. Newly deployed or replaced interfaces might benefit
from a data traceability effort to verify that their role within the life cycle is seamless
or to evaluate whether they affect other intermediate components. Data traceability
might also be required in an auditing project to demonstrate transparency,
compliance, and adherence to regulations.
Understanding changes in data requires understanding the data chain, the rules
that have been applied to data as it moves along the data chain, and what effects the
rules have had on the data. Data lineage includes the concept of an origin for the
data—its original source or provenance—and the movement and change of the data
as it passes through systems and is adopted for different uses (the sequence of steps
within the data chain through which data has passed). Pushing the metaphor, we
can imagine that any data that changes as it moves through the data chain includes
some but not all characteristics of its previous states and that it will pick up other
characteristics through its evolution.
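The idea of data carrying its provenance and accumulating characteristics as it moves along the data chain can be sketched in code. This is a minimal illustration, not any particular tool's API; the `LineageRecord` class and its field names are assumptions made for the example.

```python
# Minimal sketch: a record that carries its provenance (origin) and logs
# each data-chain step applied to it, including before/after values.
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    value: object
    origin: str                                 # provenance: the original source
    steps: list = field(default_factory=list)   # ordered steps in the data chain

    def apply_step(self, step_name, transform):
        """Apply a rule to the value, recording the step and its effect."""
        before = self.value
        self.value = transform(self.value)
        self.steps.append((step_name, before, self.value))
        return self

rec = LineageRecord(value=" 1200 ", origin="orders_db.order_total")
rec.apply_step("trim", str.strip).apply_step("to_cents", lambda v: int(v) * 100)
print(rec.value)                    # 120000
print([s[0] for s in rec.steps])    # ['trim', 'to_cents']
```

Each step's before/after pair is exactly the information a traceability effort needs to explain how a downstream value came to differ from its source.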
Data Warehouses II
Charles D. Tupper, in Data Architecture, 2011
If this is done manually, the data discovery process will require months of human
involvement to discover cross-system data relationships, derive transformation
logic, assess data consistency, and identify exceptions.
Data discovery products such as Exeros and Sypherlink Harvester can mine both
databases and applications to capture the data and metadata needed to define the
core of a common business language, and store the result for actionable use. It
would take very little effort to turn that result into a corporate dictionary.
If you do look into the cost of ownership of data, I can guarantee that the total
amount spent will be surprising, and there is a good chance that management will
tell you they do not believe the number. However, it is quite common for the total
cost of data to be four to five times higher than expected.
• DI_Job_ID—The data integration job identifier generated by the data
integration tool. This identifier is a foreign key to the data integration
tool’s processing metadata. If that metadata is available, this link
provides a powerful mechanism to analyze data integration processing and
performance down to the level of a table’s row.
• SOR_ID—The system-of-record (SOR) identifier that ties this row to a
particular SOR. Use this when the table is sourced from multiple systems
of record, enabling each row to be tied to its specific SOR.
• DI_Create_Date—The date and time that this row was originally created
in this table. Often a database trigger is used to insert the current time, but
the data integration job could also insert the current time directly.
• DI_Modified_Date—The most recent date and time that this row was
modified in this table. Often a database trigger is used to insert the
current time, but the data integration job could also insert it directly. It is
standard practice to populate this column with an initial value far in the
future, such as “9999-12-31”, rather than leaving it NULL, so that queries
analyzing this column do not have to handle NULLs.
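The audit columns above can be sketched concretely. The following uses SQLite (via Python's built-in `sqlite3` module) to show a trigger populating `DI_Create_Date` and the `9999-12-31` sentinel defaulting `DI_Modified_Date`; the `customer` table itself and its trigger names are illustrative, not from the source.

```python
# Sketch of the data-integration audit columns described above, in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id      INTEGER PRIMARY KEY,
    name             TEXT,
    DI_Job_ID        INTEGER,   -- FK to the DI tool's processing metadata
    SOR_ID           TEXT,      -- ties the row to its system of record
    DI_Create_Date   TEXT,      -- set by trigger on insert
    DI_Modified_Date TEXT DEFAULT '9999-12-31'   -- sentinel instead of NULL
);
CREATE TRIGGER customer_created AFTER INSERT ON customer
BEGIN
    UPDATE customer SET DI_Create_Date = datetime('now')
    WHERE customer_id = NEW.customer_id;
END;
CREATE TRIGGER customer_modified AFTER UPDATE OF name ON customer
BEGIN
    UPDATE customer SET DI_Modified_Date = datetime('now')
    WHERE customer_id = NEW.customer_id;
END;
""")
conn.execute("INSERT INTO customer (name, DI_Job_ID, SOR_ID) "
             "VALUES ('Acme', 42, 'CRM')")
row = conn.execute(
    "SELECT DI_Create_Date, DI_Modified_Date FROM customer").fetchone()
print(row[1])   # 9999-12-31 -- the sentinel, until the row is first modified
```

Queries such as "rows not yet modified" become a simple equality test against the sentinel rather than an `IS NULL` check.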
Enterprise Architecture
Enterprise architecture is a broad topic, and we think most companies now realize
the importance of architecture as a foundation for success. A good architecture is
rarely noticed, but a bad architecture can restrict flexibility and hamper the business.
There are a few key areas where data governance and enterprise architecture need
to collaborate:
• Data sourcing
• Business rule application for business rules that address data-quality issues
Integrating data governance processes into architecture will require changing
established architecture processes. If enterprise architecture processes are not mature at
your organization, you may find that data governance needs to fit into ad hoc and
potentially poorly understood and executed architecture processes. The specific way
you integrate data governance with architecture processes will vary:
• Ensure that data governance checklists are provided to the architecture group.
• Change the organizational structure for the data governance group and
colocate the data governance and architecture resources in one overall group.
This approach has other ramifications and issues, but it is an often-encountered
model in many companies. We generally recommend against this approach
because it can greatly impact the effectiveness of the core data governance
jobs.
You may need to implement all of these approaches over time. Many companies
start with the easier step of assigning a data governance liaison to the architecture
group, then seek to influence the approval and architecture development
processes through checklists and approval changes. The Playbook does not dictate an
answer to data-sourcing questions, but it does provide insight into the context in
which a selected architecture must operate to be consistent with data governance
and the operations process.
Enabling this automation adds to the types of metadata that must be maintained
since governance is driven from the business context, not from the technical im-
plementation around the data. For example, the secrecy required for a company's
financial reports is very high just before the results are reported. However, once
they have been released, they are public information. The technology used to store
the data has not changed. However, time has changed the business impact of
an unauthorized disclosure of the information, and thus the governance program
providing the data protection has to be aware of that context.
Similar examples from data quality management, lifecycle management and data
protection illustrate that the requirements that drive information governance come
from the business significance of the data and how it is to be used. This means
the metadata must capture both the technical implementation of the data and the
business context of its creation and use so that governance requirements and actions
can be assigned appropriately.
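The financial-reports example above can be made concrete: the same stored bytes carry a different protection level depending on a business event. This is a hedged sketch; the `classification` function and the metadata field names are assumptions for illustration, not part of any governance product.

```python
# Sketch of context-dependent governance: the classification of a dataset
# is derived from business metadata (here, an earnings release date),
# not from the technology that stores it.
from datetime import date

def classification(metadata, today):
    """Return the protection level implied by the current business context."""
    if (metadata["type"] == "financial_report"
            and today < metadata["release_date"]):
        return "restricted"   # pre-release: unauthorized disclosure is high impact
    return "public"           # post-release: same data, different obligations

report = {"type": "financial_report", "release_date": date(2024, 2, 1)}
print(classification(report, date(2024, 1, 15)))   # restricted
print(classification(report, date(2024, 2, 2)))    # public
```

Nothing about the storage layer changes between the two calls; only the business context (the date relative to release) does, which is exactly why the metadata must capture that context.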
Earlier on in this chapter, we introduced the concept of the managed data lake
where metadata and governance were a key part of ensuring a data lake remains a
useful resource rather than becoming a data swamp. This is a necessary first step in
getting the most value out of big data. However, from the different big data solutions
reviewed in this chapter, big data is not born in the data lake. It comes from other
systems and contexts. Metadata and governance need to extend to these systems
and be incorporated into the data flows and processing throughout the solution.
Metadata play a very different role than other data warehouse data and are important
for many reasons. For example, metadata are used as a directory to help the decision
support system analyst locate the contents of the data warehouse, and as a guide to
the data mapping when data are transformed from the operational environment to
the data warehouse environment. Metadata also serve as a guide to the algorithms
used for summarization between the current detailed data and the lightly summa-
rized data, and between the lightly summarized data and the highly summarized
data. Metadata should be stored and managed persistently (i.e., on disk).
Metadata Management
Mark Allen, Dalton Cervo, in Multi-Domain Master Data Management, 2015
Figure 10.7 depicts a simplified data lineage to convey this idea. Notice the applica-
tion UI is being used as a connecting point. Business terms are mapped to labels on
the screens of multiple applications, which are mapped to databases, which in turn
can potentially be mapped to many other elements. This daisy-chain effect allows
any metadata object to serve as a starting point from which the flow of data can
be navigated.
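The daisy-chain idea amounts to a graph of metadata objects connected by mappings, navigable from any node. The following is a minimal sketch; the node names and the mapping graph are invented for illustration, and Figure 10.7 itself is not reproduced here.

```python
# Sketch of daisy-chained metadata: business term <-> UI label <-> DB column.
# Edges are bidirectional mappings; any object can be the starting point.
from collections import deque

mappings = {
    "term:CustomerName":    ["ui:AcctScreen.Name"],
    "ui:AcctScreen.Name":   ["term:CustomerName", "db:CRM.customer.name"],
    "db:CRM.customer.name": ["ui:AcctScreen.Name"],
}

def reachable(start):
    """Walk the mapping graph breadth-first from any metadata object."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in mappings.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Starting from the database column reaches the UI label and business term.
print(sorted(reachable("db:CRM.customer.name")))
```

Starting from any of the three objects yields the same connected set, which is the point: lineage can be traversed from a business term down to storage, or from a column back up to its business meaning.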
If you think you need DG technology, make sure you can actually implement and
support the tool. Even if you can identify with the above use cases, you also must
ensure that your organization is ready to use a DG tool, as readiness is a huge factor
in the decision-making process and the success of a DG program.