
Data Lineage

Related terms:

Data Governance, Data Model, Master Data Management, Metadata, Big Data
Project, Metadata Management


Data Quality Management


Mark Allen, Dalton Cervo, in Multi-Domain Master Data Management, 2015

Data Lineage and Traceability


Data lineage states where data comes from, where it goes, and what transformations are applied to it as it flows through multiple processes. It helps in understanding the data life cycle, and it is one of the most critical pieces of information from a metadata management point of view, as will be described in Chapter 10.

From data-quality and data-governance perspectives, it is important to understand data lineage to ensure that business rules exist where expected, that calculation rules and other transformations are correct, and that system inputs and outputs are compatible. Data traceability is the exercise of tracking access, values, and changes to the data as they flow through their lineage. Data traceability can be used for data validation and verification as well as data auditing. In summary, data lineage is the documentation of the data life cycle, while data traceability is the process of verifying that the data follows its life cycle as expected.

Many data-quality projects will require data traceability to track information and ensure that its usage is proper. Newly deployed or replaced interfaces might benefit from a data traceability effort to verify that their role within the life cycle is seamless or to evaluate whether they affect other intermediate components. Data traceability might also be required in an auditing project to demonstrate transparency, compliance, and adherence to regulations.
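To make the distinction concrete, here is a minimal sketch in Python (not from the book; the system names, fields, and rule descriptions are illustrative assumptions) showing lineage documented as a sequence of steps and traceability implemented as a check that observed data movement matched that documentation.

from dataclasses import dataclass

@dataclass
class LineageStep:
    """One documented hop in the data life cycle."""
    source: str          # where the data comes from
    target: str          # where it goes next
    transformation: str  # rule applied along the way (e.g., a calculation)

# Lineage: documentation of the expected life cycle (hypothetical systems and rules).
documented_lineage = [
    LineageStep("crm.orders", "staging.orders", "trim and standardize codes"),
    LineageStep("staging.orders", "warehouse.orders", "apply currency conversion"),
]

def trace(observed_steps, documented=documented_lineage):
    """Traceability: verify the data actually followed its documented lineage."""
    issues = []
    for expected, observed in zip(documented, observed_steps):
        if (observed.source, observed.target) != (expected.source, expected.target):
            issues.append(f"unexpected flow {observed.source} -> {observed.target}")
        elif observed.transformation != expected.transformation:
            issues.append(f"unexpected rule at {observed.target}: {observed.transformation}")
    return issues

# Example: an audit run compares what actually happened against the documentation.
print(trace(documented_lineage))  # an empty list means the data followed its life cycle as expected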



Data Management, Models, and Metadata
Laura Sebastian-Coleman, in Measuring Data Quality for Ongoing Improvement,
2013

Data Lineage and Data Provenance


Data lineage is related to both the data chain and the information life cycle. The word
lineage refers to a pedigree or line of descent from an ancestor. In biology, a lineage
is a sequence of species that is considered to have evolved from a common ancestor.
But we also think of lineage in terms of direct inheritance from an immediate
predecessor. Most people concerned with the lineage of data want to understand two
aspects of it. First, they want to know the data’s origin or provenance—the earliest
instance of the data. (The word provenance in art has implications similar to lineage;
it refers to a record of ownership that can be used as a guide for a work’s authenticity
or quality.) Second, people want to know how (and sometimes why) the data has
changed since that earliest instance. Change can take place within one system or
between systems.

Understanding changes in data requires understanding the data chain, the rules
that have been applied to data as it moves along the data chain, and what effects the
rules have had on the data. Data lineage includes the concept of an origin for the
data—its original source or provenance—and the movement and change of the data
as it passes through systems and is adopted for different uses (the sequence of steps
within the data chain through which data has passed). Pushing the metaphor, we
can imagine that any data that changes as it moves through the data chain includes
some but not all characteristics of its previous states and that it will pick up other
characteristics through its evolution.

Data lineage is important to data quality measurement because lineage influences expectations. A health care example can illustrate this concept. Medical claims submitted to insurance companies contain procedure codes that represent the actions taken as part of a patient’s health care. These codes are highly standardized in hierarchies that reference bodily systems. Medical providers (doctors, nurses, physical
therapists, and the like) choose which procedure codes accurately reflect the services
provided. In order to pay claims (a process called adjudication), sometimes codes
are bundled into sets. When this happens, different codes (representing sets) are
associated with the claims. This process means that specific values in these data fields
are changed as the claims are processed. Some changes are executed through rules
embedded in system programming. Others may be the result of manual intervention
from a claim processor. A person using such data with the expectation that the codes
on the adjudicated claims are the very same codes submitted by a doctor may be
surprised by discrepancies. The data would not meet a basic expectation. Without
an understanding of the data’s lineage, a person might reasonably conclude that
something is wrong with the data. If analysis requires codes as submitted and only
as submitted, then adjudicated claim data would not be the appropriate source for
that purpose.
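As a hedged illustration of how an undocumented transformation breaks expectations, the Python sketch below applies an invented bundling rule (the codes and bundle name are made up, not real procedure codes) and records the change as lineage notes.

# Hypothetical bundling rule applied during claim adjudication; codes are invented.
BUNDLES = {frozenset({"A100", "A200"}): "BUNDLE-A"}

def adjudicate(submitted_codes):
    """Replace known code combinations with a bundled set code and record the change."""
    codes = set(submitted_codes)
    lineage_notes = []
    for members, bundle_code in BUNDLES.items():
        if members <= codes:
            codes = (codes - members) | {bundle_code}
            lineage_notes.append(f"bundled {sorted(members)} into {bundle_code}")
    return sorted(codes), lineage_notes

adjudicated, notes = adjudicate(["A100", "A200", "Z999"])
# Without the lineage notes, the difference between submitted and adjudicated codes
# looks like a data error; with them, it is an expected, documented transformation.
print(adjudicated, notes)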


Data Warehouses II
Charles D. Tupper, in Data Architecture, 2011

Standard or Corporate Business Language


On an integration project, such as master data management, data lineage definition, or application and data consolidation, it is necessary to know what data you have, where it is located, and how it is related across different application systems. Software products exist today to move, profile, and cleanse the data. There are also products that address the discovery and debugging of the business rules and transformation logic that differ from one system to another.

If this is done manually, the data discovery process will require months of human involvement to discover cross-system data relationships, derive transformation logic, assess data consistency, and identify exceptions.

Data discovery products like Exeros and Sypherlink Harvester can mine both databases and applications to capture the data and metadata that define the core of a common business language and store it for actionable use. It would take very little effort to turn the result into a corporate dictionary.

It is critical after the compilation that the accumulated result be opened up to all enterprise businesses to resolve and define data conflicts and definitional issues. Even this can be done expeditiously with the use of a Wikipedia-type tool that allows clarifications to be done in an open forum. This both accomplishes the standardization of the language and resolves issues, while educating the corporation as a whole.
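A minimal sketch in Python (the business term, systems, table and column names are hypothetical) of how harvested discovery output might be stored as a corporate-dictionary entry and opened up for wiki-style clarification:

# Hypothetical corporate-dictionary entry assembled from data discovery output.
corporate_dictionary = {
    "Customer": {
        "definition": "A party that has purchased or contracted for a product.",
        "locations": [  # cross-system data relationships found by discovery tooling
            {"system": "CRM", "table": "cust_master", "column": "cust_id"},
            {"system": "Billing", "table": "account", "column": "customer_no"},
        ],
        "open_issues": ["Billing includes prospects; CRM does not."],
    }
}

# Business users append clarifications, wiki-style, until definitional conflicts are resolved.
corporate_dictionary["Customer"]["open_issues"].append(
    "Proposed resolution: exclude prospects from the standard definition."
)
print(corporate_dictionary["Customer"]["open_issues"])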



Engagement
John Ladley, in Data Governance (Second Edition), 2020

Ramifications and benefits


Understanding the data environment is critical for understanding how to get DG
operational. Most DG programs will eventually require some sort of data lineage
or provenance, that is, tracking where data starts, goes, is used, and who used it.
Very often a DG program will show immediate value when an understanding of the
data environment is presented to company risk management. You are assembling
the type of material that regulators will continue to request and insist on in greater
detail as time moves on.

If you do look into the cost of ownership of data, I can guarantee that the total
amount spent will be surprising, and there is a good chance that management will
tell you they do not believe the number. However, it is quite common for the total
cost of data to be four to five times higher than thought.


Data Integration Processes


Rick Sherman, in Business Intelligence Guidebook, 2015

Table and Row Updates


Data integration job audit data tracks the flow of data through the BI data architecture at the grain of the table. It is a best practice to track row-level audit data to better manage it, enable data lineage analysis, and assist in improving system performance. The template schema depicted in Figure 12.16 enables this type of audit data. The schema includes:

FIGURE 12.16. Data integration table—job audit columns.

• DI_Job_ID—The data integration job identifier is the job ID that the data integration tool generated. This identifier is a foreign key to the data integration tool’s processing metadata. If that metadata is available, this link provides a powerful mechanism to analyze data integration processing and performance down to the level of a table’s row.
• SOR_ID—This is the SOR identifier that ties this row to a particular system of record (SOR). Use it when the table’s rows are sourced from multiple systems of record; it enables each row to be tied to its specific SOR.
• DI_Create_Date—This is the date and time that this row was originally created in this table. Often a database trigger is used to insert the current time, but the data integration job could also insert the current time directly.
• DI_Modified_Date—This is the most recent date and time that this row has been modified in this table. Often a database trigger is used to insert the current time, but the data integration job could also insert the current time directly. It is a standard practice to populate this column with an initial value far in the future, such as “9999-12-31”, rather than leaving it NULL, to avoid dealing with NULLs when analyzing this column.
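A small sketch in Python (the job identifier, SOR code, and row contents are invented) of how a data integration job might stamp these row-level audit columns when a database trigger is not used:

from datetime import datetime

FAR_FUTURE = "9999-12-31 00:00:00"  # initial DI_Modified_Date, set far ahead to avoid NULL handling

def stamp_audit_columns(row, di_job_id, sor_id):
    """Add row-level job audit columns during the load (instead of relying on a trigger)."""
    row["DI_Job_ID"] = di_job_id          # foreign key to the DI tool's processing metadata
    row["SOR_ID"] = sor_id                # ties the row to its system of record
    row["DI_Create_Date"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    row["DI_Modified_Date"] = FAR_FUTURE  # replaced with the real timestamp on first update
    return row

# Example: stamping one incoming row from a hypothetical source system.
print(stamp_audit_columns({"order_id": 1001, "amount": 25.00}, di_job_id=42, sor_id="ERP_EU"))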


Data Governance as an Operations Process
Lowell Fryman, ... Dan Meers, in The Data and Analytics Playbook, 2017

Enterprise Architecture
Enterprise architecture is a broad topic, and we think most companies now realize the importance of architecture as a foundation for success. A good architecture is rarely noticed, but a bad architecture can restrict flexibility and numb the business.

It may seem obvious that enterprise architecture is important to data governance. However, data governance is rarely represented in enterprise architecture discussions or approvals. The technical architecture choice and the set of applications you select to interact with business users greatly affect the availability and quality of data. Using a data governance lens, selecting an architecture is actually choosing the best applications to manage your organization’s data and ensure it is available to others—important variations on the core data governance jobs. If the data governance operating model is newly implemented, it is probably not yet integrated into enterprise architecture.

There are a few key areas where data governance and enterprise architecture need
to collaborate:
• Data sourcing

• Tracking data quality explicitly

• Creating control points to support data monitoring

• Data modeling/data architecture that supports data lineage processing

• Master and reference data management

• Business rule application for business rules that address data-quality issues

• Third-party contracting and data transparency

Integrating data governance processes into architecture will require changing established architecture processes. If enterprise architecture processes are not mature at your organization, you may find that data governance needs to fit into ad hoc and potentially poorly understood and executed architecture processes. The specific way you integrate data governance with architecture processes will vary:

• Ensure that data governance checklists are provided to the architecture group.

• Change the architecture approval process to ensure that sign-off by a data governance leader is required.

• Ensure that there is a designated liaison to the architecture group.

• Provide architect training on data governance objectives and approaches.

• Change the organizational structure for the data governance group and colocate the data governance and architecture resources into one overall group. This approach has other ramifications and issues but it is an often-encountered model in many companies. We generally recommend against this approach because it can greatly impact the effectiveness of the core data governance jobs.

You may need to implement all of these approaches over time. Many companies start with the easier process of assigning a data governance liaison to the architecture group, then seek to influence the approval and architecture development process through checklists and approval changes. The Playbook does not dictate an answer to data-sourcing questions, but it does provide insight into the context a selected architecture must operate in to be consistent with data governance and the operations process.



Architecting to Deliver Value From a Big Data and Hybrid Cloud Architecture
Mandy Chessell, ... Tim Vincent, in Software Architecture for Big Data and the
Cloud, 2017

3.12 Metadata and Governance


Metadata is descriptive data about data. In a data warehouse environment, the
metadata is typically limited to the structural schemas used to organize the data in
different zones in the warehouse. For the more advanced environments, metadata
may also include data lineage and measured quality information of the systems
supplying data to the warehouse.

A big data environment is more dynamic than a data warehouse environment, and it continuously pulls in data from a much greater pool of sources. It quickly becomes impossible for the individuals running the big data environment to remember the origin and content of all the data sets it contains. As a result, metadata capture and management become a key part of the big data environment. Given the volume, variety, and velocity of the data, metadata management must be automated. Similarly, fulfilling governance requirements for the data must also be automated as much as possible.

Enabling this automation adds to the types of metadata that must be maintained, since governance is driven from the business context, not from the technical implementation around the data. For example, the secrecy required for a company's
financial reports is very high just before the results are reported. However, once
they have been released, they are public information. The technology used to store
the data has not changed. However, time has changed the business impact of
an unauthorized disclosure of the information, and thus the governance program
providing the data protection has to be aware of that context.

Similar examples from data quality management, lifecycle management and data
protection illustrate that the requirements that drive information governance come
from the business significance of the data and how it is to be used. This means
the metadata must capture both the technical implementation of the data and the
business context of its creation and use so that governance requirements and actions
can be assigned appropriately.
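A hedged sketch in Python (the dates and classification labels are purely illustrative) of the point that the governance action depends on business context such as time, not on the storage technology:

from datetime import date

def report_classification(today, release_date):
    """Same data, different protection requirement depending on business context."""
    return "PUBLIC" if today >= release_date else "RESTRICTED"

release = date(2024, 2, 1)  # hypothetical results-release date held as business metadata
print(report_classification(date(2024, 1, 15), release))  # RESTRICTED before release
print(report_classification(date(2024, 2, 2), release))   # PUBLIC once results are out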

Earlier in this chapter, we introduced the concept of the managed data lake, where metadata and governance are a key part of ensuring a data lake remains a useful resource rather than becoming a data swamp. This is a necessary first step in getting the most value out of big data. However, as the different big data solutions reviewed in this chapter show, big data is not born in the data lake. It comes from other systems and contexts. Metadata and governance need to extend to these systems and be incorporated into the data flows and processing throughout the solution.


Data Warehousing and Online Analytical Processing
Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

4.1.7 Metadata Repository


Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Figure 4.1 showed a metadata repository within the
bottom tier of the data warehousing architecture. Metadata are created for the data
names and definitions of the given warehouse. Additional metadata are created
and captured for timestamping any extracted data, the source of the extracted data,
and missing fields that have been added by data cleaning or integration processes.

A metadata repository should contain the following:

• A description of the data warehouse structure, which includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.

• Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).

• The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.

• Mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).

• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.

• Business metadata, which include business terms and definitions, data ownership information, and charging policies.
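As a rough, hypothetical illustration (in Python; the object and field names are not from the text), a single repository entry might combine several of these categories like this:

# Hypothetical metadata-repository entry for one warehouse table.
metadata_entry = {
    "object": "warehouse.sales_fact",
    "structure": {"schema": "star", "dimensions": ["date", "product", "store"]},
    "operational": {
        "lineage": ["oltp.orders -> staging.orders -> warehouse.sales_fact"],
        "currency": "active",
        "monitoring": {"error_reports": 0, "audit_trail": "enabled"},
    },
    "summarization": {"granularity": "daily", "aggregation": "sum(amount)"},
    "mapping": {"source": "oltp.orders", "cleaning_rules": ["default missing store to 'UNKNOWN'"]},
    "business": {"owner": "Sales Operations", "definition": "One row per order line per day."},
}
print(metadata_entry["operational"]["lineage"])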

A data warehouse contains different levels of summarization, of which metadata is one. Other types include current detailed data (which are almost always on disk), older detailed data (which are usually on tertiary storage), lightly summarized data, and highly summarized data (which may or may not be physically housed).

Metadata play a very different role than other data warehouse data and are important
for many reasons. For example, metadata are used as a directory to help the decision
support system analyst locate the contents of the data warehouse, and as a guide to
the data mapping when data are transformed from the operational environment to
the data warehouse environment. Metadata also serve as a guide to the algorithms
used for summarization between the current detailed data and the lightly summarized data, and between the lightly summarized data and the highly summarized
data. Metadata should be stored and managed persistently (i.e., on disk).


Metadata Management
Mark Allen, Dalton Cervo, in Multi-Domain Master Data Management, 2015

Connecting the Business and Technical Tracks


The management of business and technical/operational metadata is quite different, but obviously there is a connection. Behind business rules and definitions lie data elements, which exist in multiple systems throughout the enterprise. A mapping can be created between business terms and their equivalent technical counterparts, and through data lineage it is possible to establish this relationship. The result is astounding: one can locate a given business term and trace it to multiple applications, data sources, interfaces, models, analytics, reports, and other elements. This is the ultimate goal of metadata management: search for a business term, learn and understand its definition, and track it throughout the entire enterprise. Imagine how powerful this information is to data governance, data quality, data stewards, and business and technical teams.

Figure 10.7 depicts a simplified data lineage to convey this idea. Notice that the application UI is being used as a connecting point. Business terms are mapped to labels on the screens of multiple applications, which are mapped to databases, which in turn can potentially be mapped to many other elements. This daisy-chain effect allows any metadata object to serve as a starting point from which to navigate wherever data are flowing.

Figure 10.7. Simplified data lineage
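To illustrate the daisy-chain navigation (the mappings below are hypothetical and are not Figure 10.7 itself), here is a short Python sketch that traces a business term through its mapped elements with a simple graph traversal:

from collections import deque

# Hypothetical mappings: business term -> UI label -> database column -> report.
mappings = {
    "Customer Lifetime Value": ["CRM UI: 'CLV' field"],
    "CRM UI: 'CLV' field": ["crm_db.customer.clv"],
    "crm_db.customer.clv": ["Quarterly Revenue Report"],
    "Quarterly Revenue Report": [],
}

def trace_term(start, graph=mappings):
    """Walk the daisy chain from any metadata object to everything it connects to."""
    seen, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(graph.get(node, []))
    return seen

print(trace_term("Customer Lifetime Value"))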


Architecture and design


John Ladley, in Data Governance (Second Edition), 2020

Readiness for tools


Your strategy work may have indicated some sort of role for tools. Before you proceed, you need to confirm that your program will benefit from a tool and that you can effectively operate it. Here are a few scenarios to guide your thinking:

1. Highly regulated industry—Data lineage and discovery will support compliance. Obviously, metadata tools will document meaning. You still do not need to go buy tools until you know what your operating model looks like, but it will not be long before a tool will be most helpful.

2. Master data initiatives—A common, major data initiative is MDM. MDM flat out will not be sustainable, and therefore wastes a LOT of money, without DG. But supporting tools are not necessarily mandated until the DG activities are underway. Usually the MDM vendor supplies some sort of metadata. The useful metadata around MDM is often mapping old things to new. The master data should clear up semantic differences across business functions, so the need to manage common data definitions, standards, lineage, and reference data makes mapping and glossary-type products handy.

3. Advanced analytics/Big Data activity—This is an interesting area, as a lot of benefit can come out of a data science area without any DG oversight at all, but only to a point. At some point a data scientist will say “we are getting slowed down by data quality,” or inconsistent definitions, etc. Quite often, the data scientists, while quite expert in statistical methods, have no clue about data management. I have had data scientists tell me that “there may be an issue with data quality here. Have you heard of this?” At this point they want to write their own tool, but data discovery and data quality tools and statistical model management enter the discussion instead (hopefully).

4. Artificial Intelligence/Machine Learning—Probably the only area where I will get keenly interested in tools well before the other scenarios is AI. That is because AI, depending on the application of course, can go very well or horribly wrong, and sometimes it is hard to tell the difference. Given distortions in AI based on model bias, data quality, and the operationalizing of erroneous models, AI often requires proactive data profiling, discovery, and significant understanding of data lineage.

If you think you need DG technology, make sure you can actually implement and
support the tool. Even if you can identify with the above use cases, you also must
ensure that your organization is ready to use a DG tool, as readiness is a huge factor
in the decision-making process and the success of a DG program.

