Process of Data Quality.
Why You Don’t Get What You Thought You Asked For
By Michael P. Meier
Foreword
This book is about relationships. There, I said it. I’m going to make the case that business
management principles such as alignment, governance, leadership, and yes, quality, are based
almost exclusively on relationships, both in the modeling sense and in human relations. Then I’ll
show you what goes into a high-functioning relationship. Just as neglected or unrecognized or
misunderstood relationships can cause problems in our personal lives, they can cause disastrous
problems in our business lives. I’m going to come at this through the back door of information
(data) quality.
Like all data people1, I would like to think that the whole world is vitally concerned with the
quality of the information delivered to them. I’ll begin here by acknowledging this to be a
fantasy. That doesn’t mean that the world shouldn’t be vitally concerned—only that in many
cases the much more pressing concerns may be avoiding starvation or imprisonment or violent
death or losing a promotion or a bonus.
Still, one would think that in the U.S. and Europe at least, there might be some general and
consistent effort made toward ensuring the reliability and credibility of information. To those of
us who are data people, it can look as though there is indeed concerted effort. Alas though,
delusion—no matter how persistent—is still delusion.
1
Data People are those who understand (though there are degrees of understanding) that this thing called data is real and is not limited by
example. In other words, there is a real difference between data and this data.
In order that you might know more of the perspectives that inspire and give form to this book,
the Myers-Briggs personality assessment puts me well into the Intuitive category in the profile
INTP2. The Gallup StrengthsFinder (Rath, 2007), taken in 2013, reveals [Maximizer,
Connectedness, Ideation, Strategic, Learner] as my top five strengths. These strengths can be
looked at in two ways:
1. As filters through which we view the world
2. As tools that we use to attempt to manage our world
If you have a copy of the Gallup StrengthsFinder you will find guidance on how to “interpret”
and work with strengths other than your own. I attempt here to appeal to readers who are using
filters other than these. My hope is that at least one of the explanations, examples or illustrations
will succeed in passing understanding through your filters.
In 1982, upon “retiring” from the Air Force and a career as a German/Russian voice processing
specialist and intelligence analyst, I set myself a goal of understanding system development,
gaining command of all facets. Later, quality became part of the quest when the press began to
publish articles asking why system development results were frequently not well received. A
“study3” (The CHAOS Report) executed by the Standish Group in 1995 was widely quoted and
executives began asking why they were investing so much money into projects that were likely
to be “successful” only 16% of the time.
I quickly determined that information systems are precisely about information—that is, data.
Today this is not a universally held view. In system development it is frequently the view of the
developers that an information system is a big puzzle in which all pieces have to fit together
without damage to one another. In this view, the data is merely a component of the puzzle but
one which can be manipulated. Data often becomes the chewing gum that makes it look like the
pieces fit together but, as anyone knows who has actually used chewing gum to hold something
together, it leaves a residue that can make more permanent fixes problematic or even impossible.
Chewing gum fixes also have a tendency to fail at the most inopportune of times.
While I studied software engineering, its methods and tools, and the economics of software
development and project management, I kept my eye firmly on the prize—data. I will use the
terms data and information interchangeably for the most part, always recognizing that there is no
information without data but that the converse is not true.
As it turns out, the approach was sound because information/data is the core of every information
system and it is wrapped in layers upon layers of technology, science, psychology, communication,
business, management and economics. The information systems of an enterprise touch
everyone in it at every level. The task of managing the systems and the information wrapped in
them is as complex a task as any that exists on the planet today. It may be that to future
archeologists, historians, archivists and anthropologists, the large-scale systems in use today will
look much as the pyramids of ancient Egypt look to us. How did they do that with the tools they
had available to them?
2
INTP: Seek to develop logical explanations for everything that interests them. Theoretical and abstract, interested more in ideas than in social
interaction. Quiet, contained, flexible, and adaptable. Have unusual ability to focus in depth to solve problems in their area of interest. Skeptical,
sometimes critical, always analytical.
3
You are left to decide for yourselves as to the value of this survey/study. The fact remains, though, that it did cause a stir in the trade press that
leaked over into mainstream management publications. The fact is that the perception of quality in a system project is tied directly to delivery
delays. Perceived Quality is inversely proportional to the number and duration of delays.
I’ll go out on a limb here and assert that your data/information problems including inconsistency,
reliability, utility and simple lack of confidence and credibility are symptoms of widespread
problems in your business. A few years ago, it seemed that something called data governance
might have the potential to bring some of the information problems under control. What I learned
when I began to immerse myself in the world of data governance was that, like other business
ideals such as alignment, there is nothing there other than an idea. Please note that I do not belittle
ideas as powerful forces. Certainly liberty, freedom, justice are “only” ideas. At the same time
we need to recognize that those ideas are nothing if they do not become part of a vision that, in
turn, motivates action.
The appendices tell the story of my journey. The bottom line is that it is not possible to assert
governance in one community in the absence of governance in the surrounding communities.
Research into corporate governance will yield a treasure trove of organizational charts with
different labels on the boxes and extra dotted lines being the chief differentiators. The sad
conclusion was that governance in a business context is non-existent. Wait, though, there is one
form of governance that does exist almost universally in corporations and that is feudalism. You
may recall that in the feudal system, what is right and true is whatever the king or duke or baron
said was right or true. This insight caused me to adopt an altogether new approach to [data]
quality.
This is my magnum opus, the result of 30 years (as of today) in the information systems industry
and more than twice that in life. This work applies life experience to information systems issues
in general but specifically to issues of data and its quality.
We should all be able to accept that an information system of any utility at all in any industry
whatsoever is of value only to the extent that what we receive from it (information) is reliable
and credible. Anything less is the equivalent of a magic 8 ball or a Ouija board.
Prologue
[Apologies to James Michener, the master of the historical saga]
The business landscape lay unchanged for thousands of years. A Babylonian merchant’s bill of
lading captured in cuneiform on a clay tablet, once translated, would have been easily
recognizable to a merchant anywhere in the world of the early twentieth century. The status quo
seemed safe. The skies of commerce were clear, broken only by the occasional nimbus that
grudgingly leaked a bit of change to keep everyone on their toes.
Human to human communication, progressing from impressions in wet clay to ink on paper to
print to wire-limited electronic pulses, began to accelerate at a breathtaking rate. The shockwave
of improving communication capabilities pushed everything ahead and created a vacuum behind
that began to distort the landscape.
Timeliness of business communication had seen great improvement thanks to the telegraph,
telephone and radio. Goods still moved physically but more rapidly than ever before thanks to
the railroad and the independence from wind power on the seas.
The big money was in charge and barriers were in place to keep the financial flow in its designed
channels. As always, ensuring an adequate flow of wealth into “my” purse was the primary
concern. Skills were valued (though not too highly) and those possessing them could rely on a
steady, if unspectacular income.
Record-keeping was based on long-established practice with data captured on the pages of ledger
books in pencil or ink. Those who created the records were often called upon to interpret them
as well. Even in medicine, nurses were taught how to keep records concerning their patients
though these were always subject to change by the physicians, each of whom had his (invariably
his) own likes and dislikes which had to be accommodated.
When people disagreed concerning the meaning of the records of their business, it didn’t matter
much since there was little actual sharing of data—after all, and assuming that both were ethical,
of what use would it be to Aloysius to have the ledger books of Ambrose’s business? Of course,
if Aloysius was negotiating with Ambrose to buy his business, the stakes were increased.
Gradually, beginning with the harnessing of electricity, the world of business information (data)
began to expand. The great wars of the 20th Century influenced the process as well. Mechanical
computation machines came into existence and then into common usage. The complexity of
these machines, and the expense that resulted, limited them to special-purpose use by
the very wealthy and governments.
With the advent of the transistor near the mid-point of the century, the expensive calculators
were rendered obsolete and the world of information exploded in a supernova, the extent of
which is still being discovered. The rate of improvement in communications—already
breathtaking—accelerated at an undreamed-of rate. Analog voice transmissions, broadcasts of
real-time images of action and packetized digital data limited only by the receiver’s ability to decode
all combined to create new shockwaves and vacuum (push and pull) forces that tested the ability
of business to adapt.
Anything could now be treated as information, discrete units of which were called data. The
term came to be applied to the entire spectrum of information expressible as encoded changes in
voltage.
The information age had begun and for the first time in history data began to be felt as a financial
component of business.
Began is the operative word in the previous sentence. Within three decades a new field, software
engineering economics, had come into being and new sub-disciplines of software engineering
called validation and verification (V&V) were developed to begin to apply programming logic to
creating programs that would identify error conditions as they ran and, if forced to terminate,
would do so gracefully.
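In modern terms, the core of V&V can be sketched in a few lines: validate every input, and when an error condition is detected, report it and terminate gracefully instead of computing with garbage. Here is a minimal sketch in Python; the field and the bad record are invented for illustration.
```python
import sys

def parse_quantity(raw: str) -> int:
    """Validate that a field contains a non-negative integer quantity."""
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"quantity may not be negative: {value}")
    return value

def main() -> None:
    try:
        qty = parse_quantity("12x")  # an invalid record, for illustration
        print(f"processing quantity {qty}")
    except ValueError as err:
        # Identify the error condition and terminate gracefully,
        # reporting what happened rather than crashing mid-computation.
        sys.exit(f"input rejected: {err}")

if __name__ == "__main__":
    main()
```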
Things moved along at a hectic pace as we advanced from
programming by flipping switches
encoding machine (or assembly) language commands on punched cards
“higher-level” (more like natural) languages which were themselves treated as data and
given as input to software programs called compilers that emitted machine language
as the output
It became possible to combine simple programs in a spreadsheet cell, link it to other cells for input
and output and, without knowing anything about software engineering, create arbitrarily complex
logical constructs that could be saved or transmitted with or without their accompanying input
data. Finally it became possible to graphically manipulate icons representing programs on a
computer monitor, connect them using various kinds of connectors and essentially create a
pictorial spreadsheet.
The spreadsheet worked fine when digits were entered from the keypad—until a decimal point
(or a zero or a parenthesis or a quote or…) was inserted. Then the result became #Err. This
moment is when data first appeared on the cognitive radar of the average person.
Your Mission
You’ve decided you are the one to take on and solve all those data quality problems that plague
your organization. Well, somebody has to do it, right? The lack of data quality is costing us a
lot of money. Holding on to that money for the company would mean a more profitable
company and who doesn’t want a better profit margin?
There are a few things you should know before you commit yourself to this.
1. Data Quality is like dusting. Some of you will understand this and others won’t. What I
mean to say is that there will never be an end to it.
2. Data Quality is NOT about technology. Technology is the spotlight that makes [the lack
of] data quality so visible (and so expensive).
3. In times of rapid change, Data Quality issues are inevitable. Today, a company that isn’t
changing is dying so… The good news is that if you get good at this, your career is as
secure as any career can be.
4. Even though Data Quality is NOT about technology, you’re going to find that you will
need a very good foundation in technology processes and especially the processes
employed in your company in order to have any chance of identifying the right places to
apply pressure.
5. An approach to Data Quality that flits from one bunch of low-hanging fruit to another is
going to become a net additional cost (not what you want to be associated with).
6. In practice, there is no difference between a Data Quality program and a Data
Governance program. The goals are only slightly different and the methods may be
identical. The goal of Data Governance is the establishment of an auditable process
leading to consistently high quality and reliable data. The goal of Data Quality is
consistently high-quality and reliable data (which will require auditable processes).
Auditable means provably consistent.
7. You are going to learn that no one (and I mean absolutely NO one) wants to talk about
data. Learn to talk about other things and use them to illustrate the concepts you want to
teach.
8. You will not be able to do this alone. It’s going to take leadership on your part to
mobilize support and participation across the company.
9. It’s a really good idea to be able to communicate the data quality vision for consumption
by any audience. This will require you to be able to express it so that your audience gets
it (in their language, using their metaphors…).
10. Finally, your Vision is the only thing you will have to sustain you in this so make sure
that it is clear in your mind (and heart).
Two Approaches
There are two ways to approach data quality. Both involve a process that can be pictured as a pyramid.
Efforts focused on the top of the pyramid (bad customer addresses, for example) are generally
less costly on a per-project basis and can be turned around faster, while efforts involving the
lower segments of the pyramid will have broader scope and higher costs. In both cases, the
long-term rewards are proportional to the cost.
Resolving bad addresses to reduce the cost of mailings, for example, will be effective for only a
relatively short time before it must be done again (like dusting). To extend the time before a new
project team must be formed to repeat the fix, we could move downward to (for example) the
Technology Processes. We could institute checks to ensure that the validity of data (such as
customer addresses) is guaranteed by the front end (user interface) of the data collection
system.
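As a rough sketch of what such a front-end check might look like (the field names and rules below are invented; a production system would also call a postal validation service):
```python
REQUIRED_FIELDS = ("street", "city", "state", "postal_code")
KNOWN_STATES = {"CO", "NY", "TX"}  # truncated for brevity

def address_errors(addr: dict) -> list:
    """Return the reasons an address should be rejected at the point of entry."""
    errors = [f"missing {f}" for f in REQUIRED_FIELDS if not addr.get(f, "").strip()]
    if addr.get("state") and addr["state"].upper() not in KNOWN_STATES:
        errors.append(f"unknown state: {addr['state']}")
    zip5 = addr.get("postal_code", "")
    if zip5 and not (zip5.isdigit() and len(zip5) == 5):
        errors.append(f"postal code must be 5 digits: {zip5!r}")
    return errors

# The user interface refuses to save the record until the list comes back empty.
print(address_errors({"street": "12 Main St", "city": "Boulder",
                      "state": "CO", "postal_code": "8030"}))
# -> ["postal code must be 5 digits: '8030'"]
```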
Note that the sharp end of the pyramid can be attacked independently by one person or a very
small team while attacking lower (more foundational) levels will require coordination. For
instance, it would do little good in terms of reducing mailing cost to institute processes to
guarantee good addresses on input while leaving existing bad addresses uncorrected. All this
would accomplish would be to fix the cost at a level defined by the number of known
incorrect addresses. These considerations are quite apparent and lead to a high frequency of
what, from a historical perspective, can best be described as misguided choices. The efforts
near the top are often seen as “low-hanging fruit.”
Harvesting low-hanging fruit is not necessarily bad although, as a standard practice, it
discourages us from developing the infrastructure we need to harvest and use the bulk of our
information resource.
4
More accurately, throughput or volume per time interval is the measure of technology. Think of your organization as a pipe. A pipe of a given
internal diameter is capable of carrying a certain volume of liquid content. Now think of technology as the means of increasing the pressure to
make more of that liquid flow out the end of the pipe in a day. Like pipe, organizations are constructed differently, have surfaces that are more or
less smooth, more or less resistant to corrosion, etc. It is most often the pipe that fails and not the pressurizing technology.
One purpose of this book is to acquaint you with the world of data. We won’t create experts but we will provide enough background so that you will grasp the
issues. Another purpose is to apply principles already known to the business world to data
quality. Data Quality (DQ) is a hot topic and it represents significant cost. It will be worth the
effort to get a better handle on it.
Alignment of all parts of an organization has emerged in the business literature as something of a
mythical objective. Executives are urged to optimize the performance of their organization(s) by
aligning the component functions so that they can operate in concert instead of in competition
with one another. In concert is a nice metaphor in fact. A popular vision that is called up is the
orchestra prior to a performance. If you’ve been there you’ll recall the noise that fills the venue
as each performer makes certain that his instrument is tuned and ready to play. Then the
conductor asks for attention and everything becomes quiet and the silence is filled by
anticipation. On the downbeat the anticipation is replaced with music. This is the very picture of
alignment.
What the metaphor tends to obscure are the hours of relationship building that have preceded the
downbeat. Each individual musician has learned to relate to those around him and the conductor
has made certain that each now possesses a single vision of the work and the impressions that the
performance will produce for the customer. Individual mastery is simply the entrance
requirement. The performance demands mature relationships at several levels and those
relationships cannot come into being by accident.
Like so many business concepts, alignment has intuitive appeal and many consulting firms have
become wealthy by convincing executives that they have the key to obtaining alignment. Maybe
you have learned by now that they can’t deliver it. The best they can do is coach you through its
creation, no matter how long it takes.
The one thing you can expect from this book is that it will, on every subject, reflect the reality of
the information world. If it doesn’t reflect your reality, you are encouraged to take a closer look
at what you believe is real. So much of what we read today is best described as wishful thinking.
No wishes here. We’re going to start from the assumption that there may not be any such thing
as organizational alignment other than as an intuitively appealing concept. Take a deep,
cleansing breath now and roll up your sleeves because we’re going to create alignment
everywhere we go. We’re going to do it from the bottom up and the inside out.
We’ll spotlight some of the critical relationships to be developed.
We’ll learn how to build a high quality, effective relationship.
We’ll reveal some motivations of key players that will strengthen your hand as you
negotiate the relationship.
We’ll offer hints at what an aligned organization looks like in operation.
If your role is that of executive management, you only have to learn one sentence, supported by
one goal and implemented by committed action. The sentence is, “How can I help?” Of course
you should be satisfied that the person you’re talking with really intends to improve alignment.
If that person has a plan to get two or more segments (or even individuals) of your business
cooperating with each other—that’s alignment.
We’ll discuss this in more detail in the New Testament but the executive summary is this:
Alignment will happen when people understand what you’re looking for.
People will understand what you’re looking for when you can point to a handful of
examples.
The secret, then, is to
create or find a handful of examples.
The value in being able to say, “This is what I want,” can’t be overestimated.
There is no point in planning for data if there is no way to implement the process(es) that will generate it and
maintain its integrity. Likewise there is no reason to try to design a process for which the input
data will not be available.
As we think about process, keeping the data in mind will help us stay closer to reality. We have
a tendency, when thinking about process, to fantasize. We document the process we’re supposed
to have or the process we wish we had and make it the basis of our data management plans. If
your data management plans never go anywhere, the first place to look is the processes that are
the foundation. If they aren’t real, measurable and consistent (within specified parameters),
start over by injecting reality. This is the most prevalent cause of stagnation and perceived
failure.
You can see that much non-productive effort (cost) can be avoided. The same thought processes
will go into planning and scheduling as you make sure that the enabling data and processes are in
place and synchronized before beginning on dependent ones.
A quick example will make the case for the importance of consistency. The figure above shows
a watershed and indicates how water flows across and through a landscape. A landscape is
described by its topography which includes such information as elevations, physical
characteristics (sand, rock, etc.) and any man-made features (dams, roads, buildings…). Knowing
something about a watershed and its topography can help us avoid making some serious mistakes
if we are planning to build something. Heavy rains in the mountains near Boulder, CO, in 2013
resulted in massive inconsistencies in water run-off and are evidence that investing in
consistency guarantees can pay off in a big way.
Certainly the topography of the area had been mapped and major building projects over decades
had considered the flow of run-off, whether from snow melt or rain. Government (or
governance to use a term more familiar to the data landscape) had invested in check dams,
culverts, ditches, retention ponds, levees, etc. to make the flow more predictable in order to
minimize the risk for property owners.
Likewise, Data Management seeks to understand the organization’s topography and to constrain
the flow of data in order to make certain that everyone has the information they need and that
certain kinds of data can be diverted for special uses. We also want to ensure that we don’t get
overwhelmed by sudden surges in known kinds of data or the sudden appearance of a new kind
of data.
Think now of each process in your organization as a pipe, culvert, ditch, creek or river. These
channel the flow to produce a planned result. Boulder may want to take steps to increase the
capacity of the surrounding watershed to carry what is now recognized as a new peak flow. We
might do that in our organization as well by adding FTEs to absorb increasing flow. The
downside is that the flow may have inherent inconsistency so that our new FTEs may find
themselves next month with nothing to do. This is a far bigger problem for Boulder than it is for
the average business organization because we can use technology to increase the throughput
capacity of our processes. If we do it right, it will be scalable so that, for example, doubling
capacity next year will cost only a small fraction of the original effort.
Now you have an idea of the role of data management and a better idea of its value in the overall
organizational plan.
There are several practices that will make up your Data Management effort including:
data design—ensuring that you understand what data you (will) need
data capture—making sure that you have or are able to get all the data you will need
data storage—creating and operating the mechanisms to file your data and retrieve
it efficiently
data reporting—creating the ability to retrieve your data in context to paint the
kinds of pictures you need
While none of these practices requires the use of technology, it is almost certain that you will
find yourself incorporating technology and becoming increasingly dependent on technology if for
no other reason than because the sheer volume of data in any typical business soon becomes
overwhelming without the use of some sort of automation. This gives rise to some additional
practices as part of your Data Management effort. You may find yourself wanting to improve
your grasp of:
database management systems and their use—this skill is often labeled as “database
administration” or DBA
basic programming principles
basic file management and mass storage principles
basic networking concepts
You’ll find that creating consistency within these practices and then holding on to it will
consume most of your efforts. As a side note, it is far easier to create consistency (or any
change) than it is to hold on to it. If you don’t start out with the intention of making your process
changes permanent, entropy caused by the travails of life in business will soon make sure that
you’re right back where you started (or worse). When you first start, you won’t see this energy
expenditure as valuable, but it doesn’t take many iterations of rebuilding your data files from
scratch to begin to grasp the need.
Consistency is created through the use of “standardized” procedures. In a small company, this is
relatively easy. As the owner, you have only to create an effective procedure and then see that it
is used by you and a small number of employees. In a larger company this is much more
involved and may consume the efforts of multiple employees in entirely dedicated departments
such as Quality Assurance or Data Architecture or Software Methodologies or Technology
Infrastructure… In both the large and the small companies we invest in the standardization of
process because the payoff is so big.
As you begin to educate yourself, you’ll encounter terms such as Data Governance, Data
Integrity, Data Quality, Master Data Management, Metadata Management, Data and System
Architecture and so on. This book will provide a context for everything above and give you
some ideas for what to expect (or demand) from these various functions and practices.
This is the place for a word of caution. The more you learn about data and its management, the
greater the risk of vertigo. Just when you think you have everything nailed down, something
really important is certain to change. You may add a new product line that changes your
definition of customer. The government may issue a new regulation that forces you to handle
your data differently or to add one or more new types of data. There is an old story about the
cardiac surgeon who had taken his car in to have some repair made. The mechanic found out
that his customer was a heart surgeon and remarked that their jobs were much alike. They both
had to keep the engine going. The surgeon replied that there was a big difference—he had to do
his repairs while the engine was running.
My own feeling is that the data manager’s job is an order of magnitude more difficult because,
not only is the engine always running, it is also evolving at a blinding pace AND the patient is
wide awake and acutely aware of any pain that a data operation may cause. Add to that the fact
that the patient may be actively fighting against the operation and you have a task, the
satisfaction of which ought to be worth some serious money. Of course, it’s only data so you
may be willing to save the money and do it yourself. A mistake by the surgeon could mean
death for the patient. A mistake by the data manager wouldn’t result in anyone’s death and
probably not even the company’s death.
This is what makes this work necessary.
5
Please note: simplistic is a perception as opposed to simple which is obtained by comparison. Relationships in large organizations tend to be
seen through the filter of the org chart which hides the truth of their complexity and their importance.
The fact that this seems to be acceptable is evidence of the degree to which relationships have
been “simplified.” For example, the friend relationship, in real life a constantly changing and
infinitely complex human interaction, has become a simple link in the social media world. It
isn’t possible (or even useful) to blame social media for the problem though. Recall Ford’s
quality improvement program in the 80’s. Ford had seen the quality advances made by the
Japanese auto makers and noted a preference among potential customers for Ford cars with
Japanese components. A key component was the Quality Circle. (Ishikawa, 1981) The Quality
Circle was an attempt to recreate lost relationship in order to solve specific problems. A side
effect that was noted but never really measured was the reduction or elimination of
suboptimization. When we can actually see someone else suffering we tend not to make our own life
easier at their expense.
Ford had its share of troubles in getting the Quality Circles to work because of the level of
mistrust that had grown up during the period when actual relationship was absent. Another
possible reason for the difficulty was that the goal (defined by management) was improved
quality. Instead of focusing on the relationships and allowing them to influence quality, they
attempted to use a quality focus to motivate better relationships. At any rate, what had worked well
in Japan was more difficult to implement in the US.
We might productively ask why the quality circle concept worked in Japan but not in the US.
Since all of the research seems to be management-based rather than social- or relationship-based
we might find it difficult 30 years later to replicate the scenario. An anecdote from my own
experience may offer some insight.
In conversing with an application manager concerning the meaning and value of some pieces of
data that came out of that application, I was told that at any point in time the data was unreliable
but that over time it was reliable. I must have looked confused because an explanation was
offered in the form of a little story. At some time before, this manager had been doing data entry
with their staff and had received a batch of forms that were filled out incorrectly. They had
simply entered the incorrect data into the system, knowing that it would create work for someone
else. This action was rationalized with the remark that they were expected to clear (enter) the
forms by a certain time each day and could not take the time to investigate and correct the
discrepancies.
The department that had sent the forms had no process to correct them after the fact since the
patient was no longer present. The department that received the data after entry was accustomed
to getting incorrect data and had several processes and tools for clearing up discrepancies. “So
you see,” I was told, “we have processes for making the data correct but we don’t know exactly
when it becomes correct.” They were able to rely on its correctness eventually (for example
when the billing was sent out to the insurance company) but even then incorrect data was a fact
of life and everyone downstream had processes for correcting it.
The bottom line was that the idea of consistently reliable data was a non-starter. If this story
makes sense to you and you’re OK with the bottom line, then you can stop reading here. If you
begin to glimpse the cost of poor data quality here then you will definitely want to continue.
Because the data resource is an essential component to every process, its management becomes
visible to almost everyone. The only reason to apply the restriction “almost” is because in the
trenches data management problems often look like something else. Some will give you blank looks
and scratch their heads when they encounter the term, but if we can get them together and endure
the initial finger-pointing and accusations, we will demonstrate to their satisfaction that data
management—or its lack—is the real culprit.
Because of the incomprehensible complexity of the data management environment, effective
data management will become a question of priorities. In this world, the big picture is all the
more important because it will not be possible to advance at the same rate or even according to a
comprehensive plan across the entire enterprise. It will become absolutely critical to advance in
areas of opportunity without causing undue stress on blocked or currently stable areas.
Data management becomes a symphony in which the equivalent of strings, woodwinds, brass
and percussion receive planning and direction that is appropriate both individually and as a
whole.
Below is a diagram showing the major components of Data Management and their
interrelationships. We can move from one of these component functions to another and be very
productive if we keep in mind that they are symbiotic. In case “symbiotic” causes any
confusion, I mean that when one is healthy, all can be healthy and when one is enfeebled, none
can achieve its full potential. Each produces something that the others need and each needs
things generated by the others. To be perfectly clear, effective data management is a goal that
will not be attained until we have attained some level of competency in all of the symbiotic
disciplines.
The widest scope addressed in our big picture is summarized in the box at the top left of the
diagram. Everything else is rooted in the recognition that data and information and their
associated processes are resources in the same way that financial capital or skills and experience
are resources.
We will walk through this diagram and examine each function individually. Note that the small
box in the upper left provides the context for the diagram. “Data as resource” is an underlying
assumption without which we would find it difficult to justify any investment at all.
Please note, also, that these are functions and not projects. They are analogous to sales or
accounting. By extension, Data Management is likewise not a project but a function. Functions
are continuous. You’ll never be able to have a Data Management Wrap Party.
We can improve our abilities to do these functions and thereby improve the quality of the
product. If we like parties, we can set goals for improvement and celebrate the achievement of
those goals.
Process Management
We’ll start with Process Management since process is something with which most are already
familiar. Note that Process Management is an enabler for all of the other functions. If we can’t
adequately manage our processes, then we will not be able to
create and maintain a [data] warehouse
make use of an information infrastructure
ensure data quality
govern
make use of meta data to do the above
The diagram below shows how Process Management may be decomposed into more manageable
components. These components, and the components of the other data management functions,
are processes.
Process Management: Process management is a foundation for governance activities. Eventually,
data governance can be subsumed into process management. Virtually all activities that we
currently think of as data governance are really part of process management.
I.S./I.T. processes MUST BE INCLUDED.
The definitions of the processes are not as important at this point as is the idea that a function is
composed of processes and that it is the processes themselves that make the function useful.
Imagine, for example, that the Accounting or Bookkeeping function in your business was
defined only by its name. You may have experienced this in fact. Many start-ups have nothing
more than a title to give direction to activities concerned with keeping track of income and
expenses. Often the owner tries to do this himself.
A process is a sequence of activities that is repeated. For example, a recipe represents a process.
It’s worth noting here that a checklist IS NOT a process. A box on a checklist indicates
completion of some process. Completion of a checklist makes the preflight process an auditable
one. A pilot’s preflight checklist doesn’t specify how to establish that the aileron functions—
only that its operation has been verified. For this reason, a checklist is only as good as the
discipline of the person who uses it. It’s obvious that if the completed preflight checklist is
found at the crash site an investigator would have to begin by learning something about the
pilot’s nature, the circumstances during the preflight, and what kind of training he or she had
received—this before any investigation into the aircraft itself.
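The distinction can be made concrete in a few lines of Python (the step names are invented): the process is the executable procedure that knows how the work is done; the checklist is merely the record that each step reported completion, which is what makes the process auditable.
```python
from datetime import datetime, timezone

# The process: the actual work, each step knowing *how* to do its job.
def verify_aileron() -> bool:
    return True  # stand-in for the real physical check

def verify_fuel_level() -> bool:
    return True

PREFLIGHT_PROCESS = [("aileron operation verified", verify_aileron),
                     ("fuel level verified", verify_fuel_level)]

# The checklist: only a record *that* each step completed, and when.
checklist = []
for item, step in PREFLIGHT_PROCESS:
    done = step()
    checklist.append((item, done, datetime.now(timezone.utc).isoformat()))

# The checklist makes the process auditable; it does not perform the process.
for item, done, when in checklist:
    print(f"[{'x' if done else ' '}] {item} at {when}")
```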
Processes are frequently represented as process flow charts and these charts (or diagrams) are
used both to document the process for management purposes and to educate or train those who
are to execute the process. The diagram is especially useful when a process is executed
infrequently (as with a recipe).
The art of Data Management is the ability to juggle data and process while walking a tightrope of
resource constraints over a torrent of business needs and market fluctuations.
It will be necessary to create new processes as well as to examine and repair existing ones. At first
glance this seems an easy task—after all, we say, processes are merely what we do. Nothing
could be further from the truth and if there is one key thing to take away from this book it is this:
Principle: Your processes are the heart of your business and you are
about to do open heart surgery.
There is a corollary to this admonition and it is so closely linked as to be inseparable. We can
replace the word process with the word data and nothing will be lost in terms of truth or
importance or urgency. A picky person might replace heart with blood in order to preserve the
metaphor, but the bottom line is that one has no meaning without the other.
Corollary: Your data are the lifeblood of your business and you are
about to do a transfusion.
Of course any business owner or executive will say that profits (or at least a reliable revenue
stream) are the lifeblood of a business and that is also true. This book is concerned in large part
with demonstrating the relationship between process/data and profit. Improvement in data
management will yield improvement in profits.
The heart/blood metaphor can be extended. Great care must be taken when doing a blood
transfusion in order to avoid killing the recipient by mixing blood of different types. Mixing
data of different types or different meaning (semantics) can be just as deadly to your business.
The death won’t be sudden, it will be gradual as sub-systems bog down and become increasingly
expensive. Often there is a human cost as managers and employees are held responsible for the
cost increases, over which they have absolutely no control. Frustration and turnover increase.
Governance
Governance Components
Definitions: We must accumulate definitions but the heavy lifting is in creating a mechanism for
making the definitions available for review and use.
Architecture: We will define an overall architecture to allow governance mechanisms to be
inserted where needed.
Governance (frequently called Data Governance or DG) is the function that is all about
consistency. Think of it as the role of government. We want and need consistency and
predictability in our lives and businesses need those properties to an even greater extent than
individuals. In the midst of chaos (complete absence of consistency) planning is not even
possible. We are restricted to reacting to the most recent stimulus. Common functions can’t be
implemented because they rely on planning. The best we can do is to make resources available
and hope they are used. Waste is a given when consistency is absent.
When business leaders speak of governance, it is quite common for them to point to an
organization chart as though such a chart embodied governance. It does not. We should always
bear in mind that governance is essential in any community if we don’t want the alternative. The
alternative is anarchy or chaos in which might makes right.
Governance comes in many forms, some of which are indistinguishable from might makes right
(or anarchy). The one most often practiced in corporations is a feudal model in which a
“nobility” holds absolute power and “rules” (governs) by fiat. This is what is meant when org
chart is equated with governance.
This isn’t meant to be a treatise on forms of governance so let’s look at some ways of handling a
subset of governance, that is data governance, within the bounds of the prevailing form of
overall governance. One thing we have to recognize is that consistency is the goal. Everyone
will agree that they can adapt to virtually anything that is predictable. No matter how much we
may want to be adaptable though, adaptation is impossible in an environment of uncertainty.
Attempting to conform to uncertainty leads to unbearable stress and all of its side effects.
People—even our best employees or managers—will adapt to unbearable stress by ignoring or
tuning out that which is perceived as causing the stress. The most conscientious employees will
not be able to do that and will simply shut down or leave. What this means for the person
responsible for data governance is that consistency efforts will be created or imposed by the will
of the “noble” closest to the process. This, in turn, requires that the data management function
hold the biggest of big picture perspectives since it will be necessary to ask of a given manager
or supervisor only things that make sense to them and do them no harm. At the same time, all
such requests must be orchestrated in such a way that overall consistency of data creation, use
and management processes is achieved and maintained.
To all of you Data Governance people who think that DG is about creating and enforcing
standards, I remind you that history is chock-full of lessons about trying to apply the rule of law
(that is what you are trying to do) in a feudal system. Even monarchs have been unable to make
it happen (Tuchman, 1978). Rule of Law is only possible when there is faith that my life will be
improved under the law. Where this faith is lacking, there is no rule of law and there are no
standard processes or auditable results.
Information Infrastructure
Infrastructure is a familiar concept. In the technology world it most often refers to the servers
(processing capacity), storage (disk farms) and networking (with or without cables)—in other
words, the hardware, used in our enterprise. From a data management perspective, the hardware
is only part of the infrastructure and it all falls within Automation.
Infrastructure should be thought of as the enabling or foundational components of a function.
Such components must be accessible on an as-needed basis similar to the water supply (or
electricity or sewer) in your home or to the road system in your community. When infrastructure
fails it is often viewed as catastrophic.
Remember that all of this is functioning in the feudal system that is corporate governance today.
We must be constantly marketing (eliciting support) to each manager (and supervisor) with
targeted efforts recognizing the unique accountabilities and constraints that define their world.
Capabilities: Define the capabilities that will be needed to implement and maintain the
infrastructure. Monitor continuously for adequacy.
Automation: Without significant automation, we cannot hope to keep the data shelves
stocked. Continuously monitor for new opportunities.
Information infrastructure
Planning must be constant as those accountabilities and constraints change. Even if one—or a
handful—remain constant, the overall picture is like the sea or the atmosphere. Change is
constant and is the only reliable or predictable property within our big picture.
Metadata Management
No surprise that Data Management carries its own data management burden. This burden is
known as metadata management. Metadata is most often defined as data about data. How
appropriate that we ourselves must practice what we preach.
Meta Data: Metadata must be acknowledged and managed in accordance with the data it describes.
This may mean separate facilities and synchronization processes.
Everything that we ask of any other function in the enterprise, we must already have asked of
ourselves. We will be the pilot for all processes, standards, and tools. We will validate all
training, measurement, and reporting.
Because Data Management is so outward-focused, this will be the most difficult of all DM
functions to implement.
Data Quality
Consistency is the essence of quality and is the outcome of good management practices (read:
processes).
Data Quality: Quality data must be a product of Data Management. The assurance of quality is a
central reason for doing data management. Overlaps governance.
Standards: Overlaps with Meta Data and Governance. Consistency is the goal without which there
can be no improvement. Processes are designed around the standards.
Process Management: Without designed and documented processes, there is no framework for
applying standards.
One of the most important overlooked components is remediation. When we’re reporting some
metrics or statistics (let’s hope that those are synonyms), we often spend a fair amount of time
looking for suspect values so that we can exclude them from our report. We often fail to
recognize that those values should be reflected in the report—just under a different heading. For
example, suppose we need a report showing how much time (on average) was spent with patients
by each physician in a clinic and we want to report by month to even out fluctuations caused by
time off, etc. The first thing we realize is that we can’t answer the central question because the
actual time that the physician was in the exam room is not recorded.
We get the OK to use a standard time for a visit based on the coding for the visit (simple,
complex, new patient, physical…). Based on this we then find that we can’t generate the report
for the prior month until after the 15th of the month because that is the deadline for the coders.
We note that this is not a guarantee but merely a goal and that it is likely that a percentage of
visits will be uncoded when the report is run.
How much of this discovered information belongs on the report? How will missing information
be accounted for on the report? How many disclaimers can be included without damaging the
perception of reliable information that we work so hard to create?
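One defensible answer is to keep the suspect values on the report under their own heading rather than silently dropping them. Here is a sketch of the report logic in Python; the standard minutes and the visit records are invented.
```python
# Hypothetical standard minutes per visit code, agreed upon with the business.
STANDARD_MINUTES = {"simple": 10, "complex": 25, "new_patient": 40, "physical": 30}

visits = [  # (physician, visit code, or None if not yet coded)
    ("Dr. A", "simple"), ("Dr. A", "complex"), ("Dr. A", None),
    ("Dr. B", "physical"), ("Dr. B", None), ("Dr. B", None),
]

report = {}
for doc, code in visits:
    row = report.setdefault(doc, {"minutes": 0, "coded": 0, "uncoded": 0})
    if code is None:
        row["uncoded"] += 1  # surface the gap; don't silently drop the visit
    else:
        row["minutes"] += STANDARD_MINUTES[code]
        row["coded"] += 1

for doc, row in sorted(report.items()):
    avg = row["minutes"] / row["coded"] if row["coded"] else None
    print(f"{doc}: avg {avg} standard minutes over {row['coded']} coded visits "
          f"({row['uncoded']} visits not yet coded)")
```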
Now imagine that, instead of time, we want to track blood pressure by billing code. If you aren’t
involved in health care, you may be asking, “What does a billing code have to do with health?”
Remember what we said in the governance section. Consistency by fiat depends on making it
worthwhile to comply with a particular process. As it turns out, one of the few common
denominators in health care is compensation (sometimes known as money). In order to receive
compensation for a visit, the visit must be coded because that’s how insurance companies
determine what the visit is worth. In order to get the visit coded at the appropriate level, the
physician must supply enough information for the coder to make a determination.
Because of this, coding information gets more attention than most other kinds of information
about the visit, it has been the focus of attention longer, and it is the most reliable. Please note
that we have not yet addressed the quality of the blood pressure data. Now that you understand
why billing codes are so important in the reporting of health data, we continue.
When we try to average blood pressures, we find that the report won’t run. It terminates
unexpectedly with a data type error. As an investigative tool, we run a profile on the systolic and
diastolic fields in our Visit table and learn that there are many values in those fields that could
not possibly be valid. We know that these values should be integers (counting numbers with no
fractional parts). Yet we see decimal fractions (measuring rather than counting numbers). How
can this be? Further investigation shows that the fields are typed (stored in the database) as
character strings. Typing is what tells the computer how to handle the values of a field. Certain
operations are possible on text fields but not possible on number fields.
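A profile of this kind can be approximated in a few lines of Python. The sample values are invented, but they are typical of what such a profile turns up in a “numeric” column stored as text:
```python
def profile(values):
    """Classify the strings found in a numeric column that is typed as text."""
    counts = {"integer": 0, "decimal": 0, "other": 0}
    for v in values:
        v = v.strip()
        try:
            int(v)
            counts["integer"] += 1
        except ValueError:
            try:
                float(v)
                counts["decimal"] += 1  # parses, but has a fractional part
            except ValueError:
                counts["other"] += 1    # not a number at all
    return counts

# An invented sample of a systolic column stored as character strings.
print(profile(["120", "118", "98.6", "120/80", " 135", "n/a"]))
# -> {'integer': 3, 'decimal': 1, 'other': 2}
```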
Well, no problem, we can always do an explicit conversion of the text field to a number field
before attempting the calculation of the average. But wait, there is no conversion to numeric
(text to number). We have to convert to either an integer (counting) type or a floating point6
(measuring) type and we have both present. Now the question is how do floating point numbers
get into a box that should contain only integers? To answer this there are two paths to follow.
Programming is far less expensive the less it concerns itself with what people actually do. If the
concern is only what the system should do, the developer only has to talk to non-technical people
about specifications. If we have to worry also about how different people will use the system,
the cost doubles (at a minimum). As it happens, the nurse enters the vital signs, among which is
blood pressure, from notes they have taken and they typically don’t look at the screen when
doing so. A missing or extra <tab> puts the cursor in a box designated for systolic pressure
when they are entering temperature.
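The failure mode is easy to simulate (the field order and values below are invented): with heads-down entry, one missing keystroke shifts every remaining value into the wrong box.
```python
VITALS_FIELDS = ["pulse", "systolic", "diastolic", "temperature"]

def file_vitals(keystrokes):
    """Pair each entered value with an on-screen box, in tab order."""
    return dict(zip(VITALS_FIELDS, keystrokes))

print(file_vitals(["72", "120", "80", "98.6"]))  # correct entry
print(file_vitals(["120", "80", "98.6"]))        # one missing value/<tab>
# -> {'pulse': '120', 'systolic': '80', 'diastolic': '98.6'}
# The temperature lands in a blood pressure box, and no one looking
# away from the screen notices.
```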
This should be fixable though—right? The short answer is, No. Recall again that consistency
only happens when the important person feels a benefit. The physician looking at the screen
frequently didn’t even notice the error. They were looking for a set of values and their mind
immediately put what they were seeing into an appropriate template for “Vitals.” When it is
suggested that this kind of error causes problems for someone else the response may well be that
they don’t want to mess with a process that gets them the vitals that they want.
6
In mathematics these are called real numbers.
Now what can be done in terms of remediation? We have to spend a lot more on our report
because we need to filter out everything that can’t possibly be a blood pressure value. We can’t
go looking somewhere else for that blood pressure since we aren’t allowed to make assumptions
where a patient’s health information is concerned—that’s a good rule by the way.
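Here is a sketch of such a filter in Python; the plausibility window is invented for illustration, not clinical guidance. Anything failing the test is excluded from the average and counted for the disclaimer.
```python
SYSTOLIC_RANGE = (50, 260)  # invented plausibility bounds

def usable_systolic(raw):
    """Return the value as an int if it could plausibly be a systolic BP."""
    try:
        v = int(raw.strip())
    except ValueError:
        return None
    return v if SYSTOLIC_RANGE[0] <= v <= SYSTOLIC_RANGE[1] else None

column = ["120", "118", "98.6", "", "260000", "135"]  # invented field contents
usable = [v for v in map(usable_systolic, column) if v is not None]
excluded = len(column) - len(usable)

print(f"average systolic: {sum(usable) / len(usable):.1f} "
      f"(n = {excluded} visits had no usable value)")
```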
So now we have to include a disclaimer that the displayed average is not the true average since n
visits had no usable blood pressure values. When you visit with the nurses you should be
prepared for some pushback, too. They’ll insist that they took and recorded the BP—and they’ll
be correct. The fact that it didn’t get into the appropriate field in a database will be largely
irrelevant to them. And whose problem is it, anyway? The physician’s? The nurse’s? The
programmer’s? The business analyst’s? The data steward’s? Data Quality’s? Data
Governance’s? Data Management’s? Here is an orphan. An issue without a problem or an
owner. The usual course in these situations is, “You saw the problem—it’s yours.” So the
unfortunate report writer has some big decisions to make. What is the probability that the
resulting report will be viewed as a high quality product? Of course we can increase that
probability if we make the report look like there are no problems.
This is what a data quality issue looks like in practice. This example is from real life and is used
because it involves all of the normal suspects. Perhaps you’ll understand better now why your
report requests take so long to turn around. You’ll find that your DQ problems will involve the
same suspects:
human tendencies
economics
loose process
mis-filed or mis-typed information
To put a bow on this particular case study, dramatic improvement was realized only after:
financial incentives became available from insurers for consistency in gathering vitals
during a patient visit
tools were provided for nursing supervisors and managers to see mis-filed values in real
time
nursing managers and supervisors followed up with the same tools to ensure that the
problem was fixed
It is clear that technology could not be held responsible for the problem. It could be said that
when the patient’s chart was on paper there was no problem so, in that sense, technology is
precisely the problem. That path is a dead end given that the healthcare business is going to
depend on databases to meet the needs of patients and no one is going to be able to go back to
paper. Where does the problem lie? It lies in the tendency for all of us to simplify and
streamline wherever possible. And the beneficiary of that simplification and streamlining is
never “we” or “us.” The beneficiary is always “me.” This is the picture of suboptimization.
Two Sides of a Coin
You may have inferred at this point that data management is equal parts data and process. If so,
bravo! The implication is intentional and the inference is, therefore, intended. Data and process
are two sides of the same coin. Every process both consumes and produces data and every bit of
data comes from some process.
You may feel that this is pummeling an already deceased equine but be assured—there is no
possibility of over-emphasizing the importance of these concepts. Bitter experience has shown
over and over again that failure to grasp the essential relationship between data and process is the
cause of virtually all the problems that we see in initiatives such as process improvement, data
quality, data governance and so on.
You can’t depend on your I.T. department (if you have one) to help you out of this or even to
adequately explain it to you. I.T. has yet to realize after all these years that implementing a
development methodology is a data management problem. That is to say that it is a process
management problem.
The next time your CIO or IT Director is pressing for process management or a data quality
initiative or governance, you might casually ask how it has worked for her. If you really want to
turn the screws, ask for evidence of improvement in the form of data. This is not to question
motives but to test commitment. We have seen many attempts at governance or management
improvement fail because this foundation was compromised through trivialization, inattention or
simply self-delusion.
A typical approach might contain these steps or phases:
1. What isn’t working?
2. Why? What is causing sub-optimal performance?
3. What should we do? What changes are needed?
4. Implement the changes.
5. Measure the improvement.
Invariably—that is, 100% of the time—we will stall on phase 4, lose momentum and the
initiative will die.
What is the cause of the stalling, you ask. Well, I’m glad you asked; the answer will vary from
investigator to investigator but the root cause is always a combination of:
economics
commitment (or lack of commitment)
suboptimization
Agile development (Alliance, 2013) is a perfect example. Hold that thought, though, and we’ll
visit it in more detail a bit later.
About Quality
What is Quality?
Quality is a topic that generates animated and occasionally heated discussion whenever it is
raised. Zen and the Art of Motorcycle Maintenance (Pirsig, 1974) has spawned an international forum on
the metaphysics of quality (MOQ) and the discussion there is wide-ranging and includes
representation from all the major trajectories of philosophy.
[Figure: dimensions of Quality—Absolute (universal), Cultural, Personal, and Comparative (better of, best of)—with associated notions of Beauty, Elegance, Originality and Essence.]
In the late 1970’s and 1980’s there was a groundswell of interest in quality worldwide. This
came about because Japanese manufacturing, since 1945 the poster child for “cheap junk,”
suddenly (or so it seemed) began to woo customers away from US manufacturers. Automobiles
and then electronics taught us to view “made in Japan” with new respect. It seemed that
consumers were willing to pay extra for something they couldn’t find in US-made goods—the
perception of quality. Deming became a household name. In 1987 the Malcolm Baldrige
National Quality Award Program was established by an act of Congress. The purpose was to
promote the quality of US companies, in no small part because of the growing reputation for
quality of Japanese manufacturers.
Dr. W. Edwards Deming had been brought to Japan as a census consultant by Gen. Douglas
MacArthur and he introduced the concepts of Statistical Process Control (SPC), first developed
at Bell Labs by Dr. Shewhart, to Japan (Deming, 1982). Since the end of WWII, Japan had been
trying to get its manufacturing sector going again but by the late 1950’s had succeeded only in
flooding world markets with what was mostly recognized as junk. “Made in Japan” signified in
1957 something that was cheaply made of inferior materials and destined within hours for the
trash.
Titles such as Quality is Free (Crosby, 1979) and Quality Without Tears (Crosby, 1984) made
“zero defects,” “do it right the first time (DIRTFT)” and “price of nonconformance” part of the
business lexicon. In Search of Excellence (Peters, 1987) did the same for “business process
overhead” and “management by wandering/walking around.”
Ford adopted “quality circles” under the slogan “Quality is Job One.” After a rocky start
accompanied by critical press and stories of employee dissatisfaction, Ford pushed SPC
(statistical process control) out to its vendors and cut materials costs and rework dramatically. In
a decade they went from Fix Or Repair Daily to a highly respected product identity that was
competing on a quality basis with the Japanese giants, Toyota and Honda.
It is worthy of note that, in the 25 years since the Malcolm Baldrige National Quality Program
was launched, only 90 companies have qualified for the award. This despite the fact that, “Up to
18 awards are given annually across six eligibility categories: manufacturing, service, small
business, education, health care, and nonprofit” (NIST). It would be interesting to test the
award’s name recognition today. Quality seems to have lost substantial luster in the past 25
years. Today (2013) we like to talk about quality but the will to address the root causes of
defective product is largely lacking.
Try initiating a conversation about quality over lunch. Unless your lunch partners are all from
the Quality Department, you are going to be nonplussed at the response. First, you’ll get a
chorus of negative remarks about Quality being a drag on production or engineering. Then,
when they realize you’re serious about having the conversation, you’ll get a lot of NOTHING.
They haven’t even thought about quality.
If you force the issue, you’ll hear about appearance, costs, machine limitations, raw materials,
customers…but it becomes quite clear that quality is not seen as something over which they have
control. Now what?
The trade press (whatever the industry) is salted with stories of quality debacles although they
are not always identified as such. When gas tanks rupture in collisions or when baggage is lost
or the wrong limb is amputated we recognize a problem. It is unfortunate that the problem is
most often labeled as an error or mistake. Why is this unfortunate? Again, I’m glad you asked.
7 See Appendix A for a discussion of Errors—how to recognize them, correct them, and remediate them.
A Quality Function aims to eliminate #2 and #3. This is accomplished by formalizing the
production process and inserting measurement at critical points. Each of these measurements is
known as a KPI (Key Process Indicator). When one of these KPIs begins to trend “out of
control” (Deming, 1982), immediate action is taken to correct the problem. When we do this
right, we avoid all scrap and rework and produce only the kind of quality product we want to put
our name on.
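To make the idea concrete, here is a minimal sketch (in Python) of Shewhart-style three-sigma
control limits—one common way of operationalizing “out of control.” The KPI, the baseline
figures and the thresholds are invented for illustration; this is a sketch, not a prescription.

    # Sketch: flag KPI samples falling outside Shewhart-style 3-sigma limits.
    # The baseline figures are hypothetical (say, percent of complete records).
    from statistics import mean, stdev

    def control_limits(baseline):
        """Return (lower, center, upper) computed from a stable baseline run."""
        center = mean(baseline)
        sigma = stdev(baseline)
        return center - 3 * sigma, center, center + 3 * sigma

    baseline = [98.2, 97.9, 98.4, 98.1, 98.0, 98.3, 97.8, 98.2]
    lcl, center, ucl = control_limits(baseline)

    for sample in [98.1, 97.9, 96.2]:  # new KPI readings as they arrive
        status = "OK" if lcl <= sample <= ucl else "OUT OF CONTROL - investigate"
        print(f"KPI={sample:5.1f} (limits {lcl:.2f}..{ucl:.2f}) {status}")

The last reading trips the limit and triggers the immediate action described above, before a
batch of defective product ever reaches a customer.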
When we say, “Oops!,” the focus is on remembering enough about the situation to be able to
avoid the mistake in the future. We all do this dozens of times a day. The problem, of course, is
that this is primarily an individual process. It also goes by the name Experience. As we know,
some people are better at learning from experience than others. Sometimes we try to preserve
and pass on experience through constructs like checklists. A smart pilot walks around the
aircraft with a pre-flight checklist in hand EVERY time the aircraft is about to leave the ground.
5000 (or 50,000) feet in the air is a really bad place to utter, “Oops!”
When, for example, lost baggage is viewed as a quality issue, then the organization will devote
itself (to the limit of available resources) to identifying the processes involved, making changes
in those processes, and training employees in the execution of the new processes.
On the other hand, if we view lost baggage incidents as employee issues (errors), we put up
posters calling all employees to greater vigilance. We humiliate employees by documenting
performance issues. We post the number of bags handled without a loss or the number of days
without a baggage issue. All of these are intended to motivate employees to avoid mistakes.
Deming’s Red Bead Experiment (CITEC, 2012) demonstrates very clearly the futility of
focusing on employee performance or motivation.
Note the “limit of available resources” loophole. One of the contributions of Peters and Crosby
was to make a financial case for investing in this kind of quality effort. The cost of non-
conformance is a concept that opened eyes and checkbooks to get beyond the error paradigm and
make a real difference.
8 There is a joke about a business that set out to increase market share by means of price reductions. The rationale was that, “We’ll lose a little
on each sale but we’ll make it up in volume.” This is a long-term joke but a short-term “strategy.”
Technology plays an outsized role due to its capacity for executing millions of process steps in
the blink of an eye (a non-technical term that can be translated as milli-, micro-, or even
nano-seconds).
Having consumed the ideas presented in this book, you will recognize that you are not in control,
have never been in control and will never be in control. You will realize that control is not the
goal and is not necessary. Further, you will understand that you have in your hands right now all
the quality you can use and that when you can use better quality it will be there for you.
You will have absorbed many insights that will aid you as you seek to make your way in the
overlapping worlds of business and technology. You will be a more relaxed, peaceful and
productive person.
A Goal
The original impetus for this book came from the field of data and information quality. There
was a notion that principles of science and mathematics could be applied to data in the following
way:
We desire a definition of data/information quality that will facilitate the creation of automated
algorithms for the purpose of assigning a scalar value to the quality of an arbitrary data item or
data set. Assignment of a scalar “quality value” would allow anyone to choose between
equivalent data sources based on quality.
In simpler terms, we would like to be able to measure the quality of data so that we can compare
one with another.
Such a definition would also allow those who control such data sources to better market and to
receive appropriate compensation for the use of those sources having a greater quality value.
The initial purpose of this investigation was to determine whether such a definition is possible.
Many of us felt the need to establish inflexible boundaries around data quality for several
reasons.
To increase the level of productivity in the field which, like many technology-related
disciplines, was in the process of splintering into ever smaller sub-fields
To be able to apply technology to the process of creating and preserving quality
To provide commonly understood paths for research and scholarly efforts to create
deeper foundations for business applications
The Lesson(s)
The lesson to be learned is summed up in another aphorism that is found in every language:
If I had only known…
This will be the lament of one who is experiencing a result that was not foreseen. We all believe,
and for good reason, that the more information we have and the more reliable it is, the better our
decisions will be. Better decisions produce results that track more closely with what we
predicted. When we get the result we hoped for, we claim to have made a good decision.
It is not as apparent but even more important to understand that results are but one way to judge
the quality of a decision. The legal system in the U.S. makes use of the standard of “the
reasonable person” to assist in evaluating the quality of a decision. A decision is “good” or
defensible if a hypothetical reasonable person in possession of the same information and in the
same circumstance would decide in the same way.
You may be thinking that it’s obvious that we don’t know your Board of Directors or your boss
(or your spouse). “Aye, there’s the rub” was written for Hamlet by Shakespeare in
acknowledgement of the effect of the unknown. If your job or career depends on the result of the
decisions you make, you will feel acute pressure as you contemplate a decision.
We agree that mistakes should be celebrated as a tool of learning while simultaneously being
ever eager to publicly flog (or worse) those whose decision has negatively impacted our life. In
fact, we don’t even need a demonstrable negative impact—the perception, or even the belief is
sufficient.
All of this—fear of the impact of the unknown, fear of being held accountable for a decision with
resulting loss of credibility or status or power, and doubt about the trustworthiness of the
information we have available—is what leads to today’s concern for DQ (data quality).
You Are Here: Dimensions of Data Quality
Bear with me as we get a bit more into mathematical concepts than will be comfortable for most.
The goal is simply to paint a picture that shows the cause of our difficulties in bringing data
quality under control.
Things that can be measured are typically measured along accepted dimensions and these
dimensions are standardized so that, for example, if I say that a box measures 3x4x5, anyone can
imagine not only the basic shape (length, width, height) of the box but its relative size. Of
course we need to specify units for our measurements to form the complete picture.
An initial approach then would consist in identifying dimensions of data quality that might be
measured. Both the dimensions and the units of measure would have to be consistent across all
types of data or information so that comparisons could be made.
As part of our investigation of data quality and our attempt to define exactly what it is, we should
start by considering how to pin it down within the context of our universe. You may have heard
that some of the most recent theories of the nature of the universe have concluded that there are
not four dimensions but ten or eleven (by some accounts even more). When we try to pin down
data quality, it is easy to understand the need for more than four dimensions.
What at first seems to be a trivial task (I lost money or opportunity because of this data) becomes
complex very quickly when we try to stop this kind of loss from happening in the future. As you
know from your studies of arithmetic in elementary school, we quantify things by assigning their
quantity to a number line—a simple scale consisting of a sequence of integer (or scalar) values
originating at zero or nil and continuing for an interval equivalent to the quantity or count
(imagine a ruler with marks only at inch intervals). Let’s imagine that we can precisely
quantify data quality—that is our aim, after all.
How many scalars—but let’s make them vectors9 because we know that they are different and
don’t all run along sequences of integers—will it take? What are the ordered sets that define
them? In other words, can we define the dimensions of data quality?
Discrete pieces of data cannot be said to possess quality
A piece of data is like a speck of dust. All that can be said about it is, “There it is.” It is not until
these specks begin to aggregate that properties (including credibility or quality) begin to emerge.
I grant that those miracle workers of qualitative analysis on CSI (whether Las Vegas, Miami or
New York City) solve television crimes every week by deducing context from virtual specks of
dust. Realize that those analytical results are possible only by aggregating all of the discrete
specks of dust into a context that makes sense and points to a culprit. Nevertheless, if we come
across the speck of data, “red,” while we can bound it by discovering its definition and potential
applications, we can’t possibly assign to it anything remotely resembling a quality value.
9 A vector includes direction as well as quantity or measurement. Vectors are very useful because we can add them together to get a result that is
also a vector. A vector sum (or product—they can also be multiplied) can tell us a lot about what is actually happening. Imagine a vector
representing the force that moves your car. Now imagine another vector which represents the forces opposing that movement (friction, braking
force, wind resistance…). Since these two vectors point in opposite directions, their sum is the difference in their magnitudes. Of course we
need the units to be the same, so we would express both forces in newtons (a popular unit for engineers and physicists). When the opposing
force is greater than the force causing the motion, we come to a stop and it is possible to determine how long the stopping process will take.
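To put a number on the footnote’s example—a sketch only, with invented figures, and with both
forces expressed in the same unit (newtons) so they can legitimately be summed:

    # Net of a driving force and an opposing force acting along the same line.
    # Opposite directions are captured by opposite signs; figures are invented.
    drive = 4200.0    # N, the force moving the car forward
    oppose = -5300.0  # N, friction + braking + wind resistance combined
    net = drive + oppose
    print(f"net force = {net} N ({'decelerating' if net < 0 else 'accelerating'})")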
Even when we can associate it without any ambiguity to another piece of data, “wagon,” it is still
extremely risky to assert that [red, wagon] has any particular meaning let alone quality of
meaning. Another way of looking at this is that a word has meaning(s) and those meanings can
be altered by associating other words. A word alone is completely dependent on language and
culture.
[Figure: a Chinese ideogram.] This ideogram may be readily understood by a person versed in
the Chinese language and culture, but it carries virtually no meaning for anyone who doesn’t
read Chinese.
As we continue to aggregate new pieces of data to what we already have, we build a framework
or context within which it becomes possible to make judgments with respect to quality. We may,
for example, be able to say whether this context (aggregation of data pieces) has physical
properties or is a virtual construct. We might suggest uses or applications of this context and we
could begin to develop expectations around those uses.
getting a headache). Each type must be represented by an axis. A vector might be defined by
two ordered sets of values, where each value represents a point on one of the axes. It is not
possible to depict in two dimensions on paper something that exists in n dimensions, and these
dimensions are certainly not representable as single vectors, though they might bend to the idea
of a heap of vectors related by context (illustrated by the simple drawing above).
We will not be able to define either Definition or Purpose such that they can be quantified. Can
you see a way to make the values for these critical “dimensions” a series? Do we have the same
problem with other dimensions? What we could do is break Description into constituent parts
(e.g., name, definition, cardinality…) and, in fact, many software tools exist that do this. None
of those tools insist on a value for these parts, however10. They leave it up to the user whether or
not a value will be supplied, trusting one supposes, in the desire for quality that lives within each
of us. What happens to our quality assessment when some of the information needed is missing?
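As a sketch of what “insisting on a value” might look like—with hypothetical field names and a
deliberately naive scoring rule—consider:

    # Hypothetical sketch: score how completely the constituent parts of a
    # Description (name, definition, cardinality, ...) have been supplied.
    REQUIRED_PARTS = ("name", "definition", "cardinality")

    def description_completeness(element):
        """Fraction of required descriptive parts that carry a non-empty value."""
        supplied = sum(1 for part in REQUIRED_PARTS if element.get(part))
        return supplied / len(REQUIRED_PARTS)

    element = {"name": "patient_dob", "definition": "", "cardinality": "0..1"}
    print(f"completeness: {description_completeness(element):.0%}")  # 67%

The empty definition costs a third of the score—exactly the kind of missing information that
silently degrades any quality assessment built on top of it.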
People as a Dimension
As long as we’ve started down the path of dimensions, don’t we need to account for people in
some way? Any purpose that we might define will be dependent on one or more persons. If
we’re very lucky, we may find that many people share a single purpose. Realistically, though,
when is the last time, especially in a data quality context, that you encountered a situation in
which everyone involved agreed on the purpose of a data set?
There is an almost irresistible need to use data for more than its intended purpose. (Wand, 1996)
Very often, the architects and designers are called to a meeting and asked to add some additional
functionality to the [information] system. Frequently they are simply informed that this will be
done. It’s a rare situation when the architect says “no.” Sometimes the architects might dig their
heels in a bit and insist on additional time to examine the implications. If this is the message you
are getting, you should listen. If it isn’t what you are hearing, you might want to begin planning
for delays anyway.
10 That is, supplying values for these attributes is not mandatory in the tools’ data schema.
We have been disparaging the single-purpose data set (data silo) for more than two decades now
and it’s easy to become confused over the issues involved. There is a difference between
data as a sharable (enterprise) resource
the functional compatibility of a data asset.
The first is about access and availability. The second is about interoperability.
Comparison
Once we have identified the critical dimensions of data quality, we must develop at least one
way of assigning a quality value using those dimensions. We require a value that can be
compared in such a way that one data set may be credibly said to be of greater or lesser quality
than another.
Ideally, we would be able to rank several data sets. We might do this by comparison to a “gold
standard” of quality and assessing the subject data set’s variance from the standard.
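One conceivable scoring scheme—a sketch, not a standard; the records and the field-by-field
matching rule are invented for illustration—looks like this:

    # Quality as agreement with a "gold standard" record, field by field.
    def quality_score(candidate, gold):
        """Fraction of gold-standard fields whose values the candidate matches."""
        matches = sum(1 for k in gold if candidate.get(k) == gold[k])
        return matches / len(gold)

    gold = {"name": "Ada Lovelace", "dob": "1815-12-10", "city": "London"}
    source_a = {"name": "Ada Lovelace", "dob": "1815-12-10", "city": "Paris"}
    source_b = {"name": "A. Lovelace", "dob": "1815-12-10", "city": "Londn"}
    for label, src in (("A", source_a), ("B", source_b)):
        print(f"source {label}: {quality_score(src, gold):.2f}")  # A outranks B

The scores (0.67 versus 0.33) give us exactly the ranking we were after—for these records, under
this rule. The hard part, as the rest of this discussion shows, is agreeing on the rule.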
We see an opportunity to use money to guide our thinking. After all, people have been able to
use money for millennia and its evolution has been studied thoroughly. Of course we have
recently seen the money equivalent of a data quality disaster when a key type of money
(mortgages and their derivatives) suffered a loss of credibility that caused its quality and
therefore its value to fall right through the basement floor.
With this in mind, it would be useful to be able to monitor the quality value of a data set over
time. We need to know whether our quality value is remaining constant, increasing or
decreasing. In order to do this, it would only be necessary to monitor those dimensions that
change with time.
Time may not be the only variable or dimension useful for tracking changes in quality. Any of
the dimensions might equally well be selected for monitoring. Our dimensions must be well
defined so as to enable comparison of individual dimensions.
We can divide the problem space into three domains or channels called detection, mitigation and
remediation. Detection involves everything we do or use to recognize that a quality issue exists.
This can take the form of profiling—generally producing a statistical analysis of an entire data
set. The analysis can be of a pass/fail nature, counts of all the values in a field/column, or some
more complex analysis based on combinations of fields/columns.
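The simplest of these—counts of all the values in a column—fits in a few lines. A sketch with
invented rows; note how profiling surfaces the inconsistent casing without anyone having to
know in advance to look for it:

    # Sketch of basic profiling: value counts per column.
    from collections import Counter

    rows = [
        {"gender": "F", "state": "WI"},
        {"gender": "F", "state": "wi"},  # inconsistent casing shows up below
        {"gender": "U", "state": "WI"},
    ]
    for column in ("gender", "state"):
        counts = Counter(row[column] for row in rows)
        print(column, dict(counts))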
Mitigation includes the processes and tools used to correct or repair quality problems. Care must
be taken in mitigation activities to avoid making the situation worse. To ensure that this doesn’t
happen, we need a solid understanding of the meaning(s) and use(s) of the data. In some cases,
we have no opportunity to change data, even when changing it might improve its consistency
and/or overall quality. This is particularly true in medical, legal and scientific applications.
The final domain, remediation, means what we do to prevent further quality problems.
Remediation might include improvement in processes, standard procedures, policies, methods,
education or a host of other activities that affect data and its quality.
None of these channels are effective in isolation. These activities must be executed in concert in
order to achieve the desired result—a consistently reliable data and information environment
spanning the enterprise.
11 A critical mass is the amount of fissionable material needed to sustain a chain reaction (in a reactor or in a nuclear weapon). In common usage
it is that number of adherents to an idea that enables the idea to become self-sustaining.
system development without such support can become hopelessly mired in complexity and data
overload. Naturally, the involvement of multiple people implies multiple ideas about the “right”
way to do things.
First of all, they had goals in common. The goals included
Consistency
Project management
Quality of result
Documentation
They really only differed because of the kinds of suboptimization resulting from market
competition. Each chose to focus on some aspect that they could claim to be handling better than
anyone else. Years of competition resulted in too many choices for the development manager or
programmer and the almost inevitable choice to rely on the talent they had already invested in.
Sufficient critical mass was created around each of the main players to make them well-known
and financially successful (though they may quibble about this particular characterization).
The problem was that they had priced themselves out of the market. They were accepted in
larger organizations, but most of the development was happening in smaller organizations that
often couldn’t afford the products that made the methodologies feasible and productive.
Eventually, driven by that need to improve the community, they got together and began to
discuss how they could combine the critical masses to positively affect the community as a
whole. The product of this collaboration, the Agile Alliance, set out to promote a development
methodology that could be scaled to the organization using commonly available tools.
In one respect they have enjoyed success—it is doubtful that a developer or development
manager exists in the world today who has not heard of Agile and its unique dialect of scrum,
story, and sprint. In some other critical respects they have been less successful.
It’s an exciting concept for developer/programmers. Just look at the poster with all of the
implied motion. Remember humans (and all other animals) are attracted to movement. A close
look reveals nothing that we can relate to data management. Oh, sure, we find the term
architecture within the Strategic loop but it’s quite a leap from this context of activity, [iterative]
movement and instant gratification to the concepts we have been discussing here.
Agile is a compromise—a negotiated solution—in which those who command the most attention
have come away with all of the concessions. I’m not saying Agile isn’t a valid choice for
development. I am noticing that consistency appears nowhere on the poster, nor do standards or
governance. From a data management perspective, it’s difficult to see any points where I could
interface my processes with those of Agile.
In an ideal world, the CIO would be sitting on this and making sure that the interfacing did
happen, was working and was producing consistency. In the many worlds we live in, we can’t
even rely on the existence of a CIO role, and certainly not on the willingness or ability of that
role to make these things happen.
Agile seems to suffer from all the same problems concerning economics, hidden agendas and
personalities as any other business endeavor. Understanding these problems is the reason for the
Old Testament. One individual is collecting reports of Agile issues with the intent of finding
mitigation approaches. You can see these reports at
http://www.mountaingoatsoftware.com/blog/please-help-me-list-the-problems-with-using-agile-or-scrum.
[Figure: Agile poster.]
The following examples are included here in case this web page should disappear. Syntax,
grammar and spelling are unedited.
No Documentation at all please this is agile and it mean no documentation at all
No vision from Product owner "You have this hurdle let me get back" and we end up spilling
that task to next sprint and eventually the scapegoat
Kick the testers and analysts off the teams. They don't know the technologies we're using and
if they aren't writing code they just slow us down. Testing can be done after we've finished
the real work
Kick the developers out of the Sprint Reviews. These meetings are for program managers and
customers and only serve to distract developers from their real work
Sprints, no one ever Sprinted a world record in a Marathon
Do we have to estimate user stories when we have a fixed budget to meet the customer
requirements?
Estimation paralyzes team member(s) with fear of being wrong (especially when there seems
to be no penalty for estimation “errors”)
It is a metric, and if we get it wrong we will be punished. (e.g in bonus/ appraisal).
Paradoxically, we are told weekly that story points are only a measurement of complexity,
our productivity is defined by the number of story points. Agile may say this is not right, but
the business pays our pay cheque. And guess what: We do not have time for lunch any more -
Story points drop, then your salary may go to someone else
These are but a few suggestions. From the perspective of 30 years in the industry, my
impression is that these are the same things developers have always complained about except
that the dialect used to describe them has changed. A big problem in getting useful feedback is
that the managers seldom if ever respond and probably never even know that these forums exist.
We could theorize at length about the reasons, but the bottom line is that a key point of view is
missing.
Use Agile if you so desire, just make sure that you are integrating development and data
management processes. The purpose of this book is to provide you with the tools to be able to at
least direct this if not actually do it.
Agile is exactly like other development methodologies in this respect: the process is far more
exacting than its users are prepared to be. This is the least appreciated dimension of all things
having to do with technology: active management and governance are required.
People and technology have their own needs and tendencies and, without constant vigilance, will
move in those directions. If you are responsible for management or governance and your goals
are not fully aligned with those of the methodology and the people using the methodology, then
your own goals are likely to eventually supersede those goals to the detriment of all concerned.
You now have the foundation for an appreciation of what “data quality” and “data management”
mean. One pass isn’t enough to make anyone an expert, but it will provide signposts to the
in-depth knowledge that will get you to expertise. You are ready to see the world with new eyes.
In the beginning…
The goal here is to provide an overview of the history of “data processing.” By doing this it is
hoped that some readers will accept the challenge and change their focus from technology to
people. Those who do this will equip themselves for the long haul. They will become the
community that provides stability, continuity and consistency to the quest for quality information.
A sailboat can be an enjoyable and effective mode of transportation. It does, however, have a
limitation in that it can only progress when the wind blows. The manner in which progress is
made also depends on the wind. Sometimes progress must be made against the prevailing winds
and we are forced to tack repeatedly, never challenging the wind directly but using some of its
energy to achieve our goal by an indirect route.
[Figures: Tacking (progress against a prevailing wind); easy sailing on a beam reach.]
Anyone who develops information technology knows that the trip isn’t all a beam reach. We
have to be alert to changes in wind direction as well as changes in destination.
Imagine how much more difficult it would be if our boat had no centerboard or ballast. We need
people who will be the centerboard to keep us on track and ballast to keep us from capsizing as
well as rudder and sail.
The purpose of this book (Book Three) is to plant the seeds that will produce such people and to
arm them with new tools and approaches more suitable for navigating in the Information Age.
We cannot assume clear sailing. We must be prepared to change direction all too frequently
while keeping the goal constant.
There is no human endeavor that is not subject to external forces. There is also no external force
that can affect a clearly identified goal. When people share commitment to a goal and don’t
insist on suboptimizing, all goals are eventually attainable. Along the way it’s going to take four
kinds of people:
1. The Leader
2. The Manager
3. The Governer (one who governs, not the elective office)
4. The Doer
The Leader
A leader appears when most needed. This person is able to function when others are petrified
with indecision. This is the person who can prioritize and execute the details while preserving
the big picture. He or she is able to assess resource availability “instantly” and direct the
available resources to preserve the viability of big picture goals.
The Leader is able to accomplish this by decomposing the big picture goal into multiple sub-
goals or objectives and selecting the one that:
Has a good probability of success in the available timeframe
Has the best risk-reward value
Produces the result that is nearest the goal result
The Leader gathers as much information as the situation allows, assesses the credibility and
utility of the information, and makes one or more decisions in a time frame over which she often
has no control. Situations involving human life-or-death frequently must be handled within a
very limited timeframe. Situations in which the future of a business enterprise is at stake are
usually much more complex and have more extended time allotments.
Creating Leaders has been a goal for centuries, if not millennia. In a given life-or-death
situation, it is possible (though not necessarily likely) for almost anyone to emerge alive without
the exercise of any leadership abilities whatsoever. As the scenario becomes more complex,
greater need for leadership emerges. Often the acknowledgement of a leader becomes a survival
situation in itself. If there is no consensus concerning leadership, a ship full of leaders may still
sink.
From a data management standpoint, decision timeframes measured in minutes are irrelevant
unless the data is implicitly credible and of known utility—in other words, unless the situation
was anticipated. This is another thing that distinguishes leaders—they anticipate and prepare for
the moment when leadership will be required. They also recognize that the crisis is not likely to
take a form that is exactly like the one(s) they anticipate so they work hard to gain command of
all the resources that could be needed.
They prepare in advance for the information that will be needed when the crisis comes, making
sure that it is useful and credible.
The Manager
The Manager is handed an objective which is often part of a much bigger goal. The Manager’s
job is to achieve the objective at the lowest possible cost.
The critical information needed by the Manager includes
Current cost of production
Current production rate(s)
Cost of materials
Inventory
Labor cost(s)
as well as others of more transitory interest. From a data management standpoint, the common
theme is currency or timeliness. The Manager can make decisions based on trends of historical
costs, but to really achieve objectives at the lowest possible cost, up-to-the-minute data is needed.
If I have built an organization that is flexible and adaptable, I can take advantage of declining
costs in one area to offset increasing cost in another. To do this I need to know what the costs
are NOW rather than as of last month’s close. It seems obvious that timeliness of information
availability is important to anyone in business but to a Manager, it has real value and the closer
the manager is to the production line, the more value it has.
The Governer
This person is sometimes known—usually without much affection—as a bureaucrat. When we
speak of Data Governance we are talking about a bureaucracy. Now wait! Before you skip this
section, vowing to have nothing to do with Data Governance, you should understand the function
of the bureaucracy and the bureaucrat.
Let’s examine governance in general. We are all familiar with the concept of governance and we
regularly hold elections to determine who the Leaders of the governance effort will be. Note that
we do not elect the bureaucrats themselves though we do expect the elected leaders to create or
at least preserve a streamlined bureaucracy.
The function of governance is to make certain that the community has what it needs to function,
that it is secure from attack from within and without, and that the community experiences as few
unpleasant surprises as possible. The role of bureaucracy is to make certain that those things
continue regardless of who is elected leader. In other words—consistency.
Bureaucrats do not make the rules and they do not enforce the rules, they simply make sure that
the rules are carried out. If a governer (or bureaucrat) is given a procedure to follow, they will
follow it. This can be frustrating for the Managers and Doers but they also benefit because the
alternative to bureaucrats is autocrats (my way or the highway) or anarchy (might makes right).
A vital bureaucracy insulates the community from changes in leadership. In business today,
changes in leadership are frequent and can seem quite arbitrary. Governance in the form of a
bureaucracy is essential to consistency. As we shall see, consistency is essential to data
management. One small quibble though with the notion of data governance—all governance
must take place at the process level. Governance demands process and can’t exist without
process. The term data governance is confusing at best and misleading in fact. We cannot
govern data though we must govern the processes that surround data.
Governers are focused on trends. They need to know that whatever they have in their hands
meets the process specification. Then they need to know whether the process is stable,
improving or deteriorating. Business today focuses on leadership to the detriment of
consistency. It isn’t a good thing when everyone is asked to make decisions. First, many
employees don’t want to make decisions. Second, most employees don’t have the experience or
knowledge to make good decisions. Finally, consistency is impossible when everyone is
deciding.
Most of the decisions in an organization (or a community) must be embedded in process. Doing
this empowers everyone who executes the process. An example will shed light here.
In a small hospital setting there are a handful of surgeons and 3-4 operating rooms. Outpatient
surgeries (cataracts, arthroscopies, and joint implants) are the bulk of the procedures. There is a
process by which a patient is scheduled for a procedure to be performed by a surgeon in an OR.
The process involves a basic health assessment to determine whether the patient is a candidate
for anesthesia or has any infection that could pose a risk for the patient or the hospital.
All of the paperwork (data) associated with the various assessments, histories and questionnaires
goes to a ward secretary whose instructions are to collect the paperwork and make sure that all
the pieces are present the morning of the surgery.
The ideal (fantasy) process calls for the secretary to cancel the procedure if certain data is
missing or has values outside of acceptable ranges. Of course when the secretary does cancel,
she is immediately accosted for wasting the resources that have already been allocated. The
secretary hears, “We could have worked around that.” If sufficient time is available, the surgeon
or anesthesiologist may review the data and, with the benefit of greater knowledge and
experience, override the “decision” of the secretary. This only has to happen once or twice
before the secretary recognizes that the process she has been given is a fantasy. She begins to
gather the data as early as possible and pass it on to the medical staff.
Now she is berated for not canceling, thereby wasting resources. When it is pointed out to the
surgeons and anesthetists that they are asking a $15/hour, relatively uneducated (compared to the
surgeons) person to make a medical decision without any guidance, they realize their error.
They rebuild the process so that decisions are much more automatic and are made much earlier.
The process has decisions built into it and branches as appropriate to involve the proper medical
decision makers as soon as possible. The secretary no longer has to make decisions and is now
happy with her job which is now simply filing, copying and transmitting information.
Surgical teams are now assured of a procedure when they report for work and patients are no
longer sent home after appearing at 5:30 a.m. for pre-operative preparations. Good processes
make happy communities.
The Doer
Everyone is a Doer at times. In fact no one is any of these roles all the time and everyone is each
of these roles at some time. What we’re talking about here is tendency. In particular, a Doer is
someone who is given an objective and achieves it.
How does Data Management accommodate Doers? We can’t predict what kind of information
the Doer will need nor when or where it will be needed. In this respect the Doer and the Leader
have similar profiles. We find, however, that Doers need information at a finer level of detail
than do Leaders.
While a Leader or a Manager may be able to use inventory information at the level of dollar
value as a whole or by type, the Doer needs to know whether he can supply a customer with the
two units being requested. A lot of data management and IT time is spent satisfying the needs of
managers and leaders but the Doer is really the one who butters the bread.
Many roles in a business have a tendency to think in commodity terms about data and
information. We speak of records and terabytes and pages and other collective terms. None of
that is useful to the Doer. They need only one piece of information and it had better be credible
and delivered quickly and reliably. No other role is so dependent on the quality of the
information they are handed. No other role gets less attention from the data people.
When the Doer gets information that is suspect, they can’t just toss it back and demand better.
They have to roll up their sleeves and try something else or they have to verify the information or
they have to actually create the information they need. For example, the Doer may have to go to
the warehouse and actually count the number of pieces of a product that are currently available.
In more than one scenario involving more than one industry, we have seen evidence—ranging
from guesses to estimates to survey results—that your Doers are spending from 20-60% of their
time on the job getting the quality they need in the information they work with. For most of
them this is not directly part of their objective and is considered a negative impact on their
productivity. Here is an opportunity for the Manager or Leader to make their numbers look
much better, essentially at zero cost!
The Opportunity
Let’s spell it out. The opportunity is to take that 20-60% of labor cost and make it into additional
productivity. It’s zero cost because we can simply redirect already lost productivity into
relatively short-term efforts that will improve processes with resulting improvement in
information quality. It requires surprisingly little effort to make significant improvement in the
credibility of our information.
The most important reason for the lack of quality we currently experience is the notion that if I
take the time to fix a problem it will mean that I may not meet my objective. “Somebody will fix
it,” “It’s not my job,” and “We’ll fix it later” are all equivalent to burning money. There is a
name for the problem—it’s called suboptimization and we will discuss it in greater detail in
Book Eight.
37 26 36 (information)
Dress form, 37, 26, 3612
The fact is that almost everyone seems to need a context in order to discuss data. We are
comfortable discussing data about sports team performance or the financial performance of a
company or market sector. All of these data are known by a name such as RBI, ERA, P/E Ratio,
Days in A/R, FGPct, FTPct, Yds/Game, Yds/Att, Comp/Att, TD/Int… Most people bog down
quickly and lose interest when the discussion turns to describing the data itself.
12 This, in fact, is the principle behind XML, in which a set of data is tagged with a context so that it can be readily interpreted.
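A sketch of the principle in the footnote: the bare values 37, 26, 36 become interpretable once
each is tagged with its context. The tag names here are invented for illustration.

    # Tagging the dress-form measurements with context (footnote 12's point).
    import xml.etree.ElementTree as ET

    form = ET.Element("dress_form")
    for tag, value in (("bust", "37"), ("waist", "26"), ("hip", "36")):
        ET.SubElement(form, tag).text = value
    print(ET.tostring(form, encoding="unicode"))
    # <dress_form><bust>37</bust><waist>26</waist><hip>36</hip></dress_form>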
The reason for this is that data, like money, numbers, law, process, and executable logic (to
name but a few of the really useful ones we see every day), is an abstraction.
We can describe these things and provide examples (instances or manifestations) but we cannot
manipulate them directly. Abstractions are extremely useful and yet, because they are ideas, they
can create much confusion. When we confuse the idea with the example, we get instant
confusion.
We are accustomed to calling characters such as “4”, “8”, etc., numbers and for most
purposes, that causes no problems. Sometimes, though, we need to be more precise
and remember that those characters are numerals (Arabic numerals to be exact) and
that they represent numbers.
Arabic numerals by themselves can only represent one kind of number called an
integer (a counting number). This is why, in days gone by, there were so many names
for units of measure.
There was no way to determine, express or even conceive of a decimal fraction (a real
number). A field dimension was x feet and y inches or x chains and y rods (or yards,
furlongs, leagues...). At some point the notion of “less than one” became fractions. The first
fractions were ratios of integers such as 1 part in 2, or ½. Thus a new kind of
number was created—the rational number.
There’s no need to go deeper into numbers here but let’s apply the same thinking to money.
Most of us think of money and get a mental picture of coins or bills (currency). Some of us
might form an image of a bank statement or a Quicken screen or even a check or deposit slip.
Only those who understand money as an abstraction are equipped to imagine equities, credit
default swaps, futures, mortgages or debt in general as money.
[Figure: money instances.]
We have a similar situation with respect to data. Those who are able to use data once it is given
form, whether as a headline, a report, an article or a chart, are on one side (let’s call it the
concrete side). Those who are able to manage and create data, or to recognize it in manifestations
that don’t look like data, are on the other (the abstract side).
Obviously, many of us are in the in-between territory where we wander back and forth,
sometimes catching a glimpse of the larger world but content for the most part to confine ourselves to
headlines, tweets and statuses. Those on the concrete side accept the data as it appears and are
often confused when (or if) they act on their data and find that things don’t turn out as expected.
On the abstract side there is continuous concern over the reliability of data. This group
recognizes that when we build with inferior products (unreliable data) the result will be
unreliable. They want to increase the reliability (confidence, trust, predictability) in their lives
by identifying and eliminating whatever it is that causes unreliability.
Over the millennia we humans have improved our understanding of number and developed
mathematics, which is a system of reliable rules—that is, rules that consistently produce results
that are objectively and pragmatically useful. One of the means by which this was accomplished
was the introduction and refinement of an algebra. An algebra is a system of symbolic notation
and operations that can be used to uncover and record the nature of the abstraction.
We see again and again that abstractions are “tamed” by the introduction of some symbolic
formalism that allows for precision in communication about the abstraction. A flow chart, for
example, is much more precise than a narrative (natural language) description of a process. This
is particularly true when the flow chart is produced by someone with a grasp of the abstraction.
Executable logic yields to a combination of process and data flow diagrams.
Not so very long ago, there was considerable discussion and debate over the efficacy of one
symbolism over another. There were several different versions that each spotlighted some
specific aspect of process or data flow. Now we choose our tools, not because of the rigor of the
symbolic algebra at their root, but because of their cost or their ease of use. We are expected to
use these tools and so we do use them, without ever recognizing their place in the world of
abstraction.
Perhaps no warning is needed, but to avoid any liability situation and to increase your appetite
for what is to come, this is a good time to point out that thing people (those who live in the world
of the concrete) are often viewed by the idea people (those who can live in abstraction) as their
legitimate prey. This is easily seen when we consider swindles and confidence (con) games.
One person has all the power based on the ability to make unreliable data (usually about money)
look reliable. The concrete person accepts the data at face value and winds up transferring
money to a person he will never see again.
The case of Bernie Madoff illustrates the spectrum involved. His prey were not all
currency/concrete people. Some had a degree of sophistication about money and data and were
functioning in the abstract world. All were temporarily blinded (having recovered their ability to
see since Madoff’s bankruptcy and conviction) by the desire to increase their money holdings
(wealth).
Perhaps you don’t see the relevance of these warnings now, but be assured that this happens
every day in the world in which we live. It isn’t always about money. Sometimes it’s about
control or power. Sometimes it’s about influence or promotion or simply credit or appreciation.
People use data to gain an advantage and our job is to help them (on this side of legal).
We might quibble about ethics or morality, but we must be clear about what it is we’re doing.
We must be aware that even “bad” data (known to be unreliable) can be and is being put to use.
What do we call a person who makes a living by inducing people to part with something of value
in exchange for something of lesser value? This occupation is known as sales and the difference
in values is known as profit. Please, this is not a value judgment—the customer gets to satisfy a
need and both parties agree with their eyes (mostly) wide open. The purpose here is simply to
make sure that we’re all clear about our purpose.
If marginally reliable data—which is, practically speaking, the same thing as known-unreliable
data—can be put to use in legally returning a profit, and in fact may be the key to that profit,
then what is the value of making all data reliable? The real value of a data quality effort may be
the ability to discern which data are reliable, and the level of that reliability.
We’ll revisit this later, but keep this thought on the back burner of your mind as you proceed.
Lately we have seen a new kind of data user emerge, the hoarder. “Big Data” is a concept
created to appeal to the information hoarder. “What might you be missing?” is the question that
engages them.
Imagine that we live in a sea of data. We are like fish in that sea who need data to live. Or
maybe we are dolphins in that sea who come to the surface to live and only immerse themselves
to find food. Most of our data quality effort is like the person who attacks any impurity that he
sees as he swims along. He has absolutely no impact on the quality of the sea. By the way, if
it’s more comfortable for you, you can replace “sea” with “lake” or even “pond.” The big data
people are using specially designed strainers which they place in interesting currents within the
sea.
The person who desires to clean up the sea must become knowledgeable about many things that
are not, themselves, of the sea. Because the sea is the result of every process within our world,
governance of those processes must be instituted and consistently applied over an extended
period of time before she will see the results she is after. Even when the desired results become
visible, relaxing our bureaucratic vigilance will result in immediate loss of quality.
Data Quality—More Than Meets the Eye
As we move forward toward a view of data quality that allows us to create and use a language
specific to DQ issues, descriptions and solutions, let’s take a minute here to examine the
behavior of data.
Certainly, one of the attributes of quality data is that it is well-behaved. In other words, it
consistently delivers value according to principles that are applicable because of its type, domain,
range, relationships, maturity, purpose(s)…
It is useful at this point to differentiate between static and dynamic properties of data. Any DQL
(data quality language) that we might define should work well where static properties are
concerned. When we begin to consider dynamic properties, the task becomes much more
complex. The greater the number of dynamic properties, the greater will be the complexity.
Our chances of designing a DQL will be significantly greater if we can restrict ourselves to static
properties only. Before we can do that, we have to understand the dynamic properties and assess
their relative importance. Can we carve them out of the discussion? Will excluding them
compromise our DQL’s capabilities?
Looking back at the list in paragraph 2 above, the first three properties (type, domain, range)
might be thought of as static. These are the focus of our modeling efforts or, if we only pretend
to do modeling, of our programming efforts. There is a tangent here that we’ll resist for now, but
at some point we have to come back to it. The question of how data is initially defined is huge
and the effect of initial definition on the lifetime of a datum and in particular on its quality is not
to be underestimated.
For now, though, we’ll put that on the back burner. We expect the individual pieces of data to
possess a definition (usually called a description), and our DBMS requires that we say what kind
of data it is (in terms of structure and permitted computational uses). Is it variable-length text
strings, a specified number of characters, integer, floating point, money, date/time, etc.?
It is vital that we remember that, even though data/information and its management can be
defined and discussed without reference to technology, technology can have a lot to say about
the costs associated with failure to manage. The human mind can take a set of numerical values
(e.g., 1, 1.5, twenty, pi, -12, 3/8) and make immediate sense of them. If asked, we could add them,
multiply them, average them, order them or perform a host of other operations on the set. The
computer CAN NOT. If we command a computer to add (1.5, twenty), we will get a failure. If
we command our computer to average the values found in a field or column in a table and one of
the values is 3..2, it will fail.
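A sketch of exactly that failure, using the set of values from above: the machine happily sums
what it can parse and refuses the rest.

    # The human reads (1, 1.5, twenty, 3..2) effortlessly; the machine rejects
    # anything that does not fit its declared type.
    values = ["1", "1.5", "twenty", "3..2"]
    total, rejected = 0.0, []
    for v in values:
        try:
            total += float(v)
        except ValueError:
            rejected.append(v)  # "twenty" and "3..2" land here
    print(f"sum of parsable values: {total}, rejected: {rejected}")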
Technology demands consistency. Humans like consistency but can function in its absence.
Therein lie all of our problems. Humans must exert extra effort to achieve the consistency that
computers and technology demand. To the extent that we rely on technology, we must be
willing to exert the extra effort required to keep it happy.
It is surprising how many data are defined to the DBMS as varchar. It probably shouldn’t be so
surprising since all of our modeling tools allow us to set a default type and the default for the
default is always varchar(n). This is the default because it guarantees that any value supplied by
a user will be accepted by the DBMS. In other words, it makes life easier to use the varchar data
type. Other decisions that compromise eventual quality are made at this point, also because they
make life easier for someone in the near term while quality of result is a long-term concern. This
will also be an avenue for future exploration.
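A sketch of the trade-off, using invented field handlers: the varchar-style field accepts anything
a user supplies, while a declared type rejects bad values at the point of entry, where they are
cheapest to fix.

    # Hypothetical illustration: permissive (varchar-like) vs. typed acceptance.
    from datetime import date

    def accept_varchar(value):
        return str(value)  # everything gets in; the quality debt comes due later

    def accept_date(value):
        return date.fromisoformat(value)  # bad values fail NOW, at entry

    print(accept_varchar("02/31/1999"))  # happily stored
    try:
        accept_date("02/31/1999")
    except ValueError as e:
        print("rejected at entry:", e)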
The final three items in the list (relationships, maturity, purpose) are dynamic in the sense that
their values can and will change, sometimes rapidly and usually unexpectedly. Let’s take the last
first. Purpose, as “fit for…,” will change whenever we’re not paying attention. We hope that
our stewards will be on top of this but pragmatically (everyone likes pragmatism), they may be
too close to the business itself so that changing business needs or drivers loom so large as to
overshadow defined purpose which then fades to insignificance.
Maturity is also dynamic. We expect maturity to change over time. When we think of data
maturity (if we do), we include stability in all the other properties, quality metrics that have
flattened out, recognition within the enterprise and probably several other aspects.
Finally, we have to face relationships. We’re not very good at relationship management. This is
as context-free as any assertion can get. Some of us wouldn’t recognize a relationship if it sent
us a valentine. Others pile all sorts of unwarranted expectations on top of our relationships and
then wonder where the quality has gone.
It all starts in the modeling or definition phase. Chen (Chen, 1976), when he invented a
graphical notation for describing data, gave equal weight to entities and relationships. Both had
a two dimensional symbol and the opportunity to possess attributes. For many reasons, not least
perhaps that tool developers didn’t grasp the importance of relationship, “data modeling” tools
eventually turned a multi-dimensional, real thing into a single line segment that is only present at
all as a clue to the schema generation software to copy the identifier from one of the linked
entities into the attribute list of the other. It is labeled a foreign key so that the database engine
can build an index.
Although examples are often counter-productive in the discussion of data quality, one example
may illustrate the role of relationship in completing the semantic of a data set. PATIENT is such a
common entity in the health care marketplace that no one even bothers to define it. It is a set of
“demographics” by which we mean the attributes and it has relationship with PHYSICIAN or
PROVIDER. It probably also has relationship with VISIT or ADMISSION, ORDER, PROCEDURE,
PRESCRIPTION, SPECIMEN and other “entities” of specific interest to the enterprise such as
EDUCATION_SESSION, CLAIM…
It doesn’t take long to figure out that the relationship between patient and physician is more
complex than can be accommodated by a single foreign key. A physician can “see” a patient,
refer a patient, treat a patient, consult (with) a patient, admit a patient…the list goes on and on.
Each of these relationships has real meaning or semantic value and may even be regulated by an
outside body. Typically, these are implemented by a single foreign key attribute for each.
Sometimes they are called out explicitly as associative entities—frequently in events such as
VISIT, ADMISSION, ORDER, PROCEDURE, PRESCRIPTION. The only problem in this case is that
they are no longer recognized as relationships.
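As a sketch of the alternative—all names invented for illustration—the relationship can be
modeled as a thing in its own right, with a type and attributes of its own, rather than collapsing
into a bare foreign key:

    # Hypothetical: the patient-physician relationship as a first-class object.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PatientPhysicianRelationship:
        patient_id: str
        physician_id: str
        kind: str                 # "sees", "refers", "treats", "admits", ...
        effective_date: str
        regulated_by: Optional[str] = None  # outside body, when applicable

    rel = PatientPhysicianRelationship("P-1001", "D-42", "refers", "2013-05-01")
    print(rel)

The point is not the particular structure but that “refers” survives as an explicit, attributed fact
instead of dissolving into an anonymous line between two boxes.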
Now, imagine a situation in which an in-utero procedure is scheduled on a fetus. You may be
aware that transfusions, heart valve repair and a host of other medical procedures are actually
being performed on the fetus while it is still within the mother’s womb. So, who is a patient? If
the facility also terminates pregnancies for any reason you can see the conundrum. Medicine
doesn’t allow for terminating the life of a patient (Dr. Kevorkian is an interesting case study but
not for this discussion). At the same time, we would like to sometimes treat the fetus as a
patient, perhaps for reasons of safety. We also experience the lack of values for attributes that
we may have tagged as mandatory, e.g., DOB, SSN.
It is only when we explicitly talk about relationships that these issues emerge. Relationships cast
light on the entity from all angles.
Relationships also represent the business processes that inform the purpose of the data. Often,
undocumented meaning gets attached to data. Two analysts will get together and agree that for
the purpose of this analytic, this combination of attribute values will be included (or excluded).
For a given ETL (Extract, Transform, Load) job, we decide that an attribute value that isn’t on
the approved list will be replaced with “&”. The adjustments to business processes are constant
and usually undocumented and unnoticed. Until we can point to a documented process or
relationship, we have no way of capturing and dealing with changes.
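To see how invisible such an adjustment is, consider a sketch of that ETL step in SQL. The names (staging_orders, region_code, approved_region_codes) are hypothetical, chosen only for illustration.
-- A sketch only: any value not on the approved list is silently
-- replaced with '&', a business rule that lives nowhere but in this job.
UPDATE staging_orders
SET    region_code = '&'
WHERE  region_code NOT IN (SELECT code FROM approved_region_codes);
Unless that rule is documented as part of the relationship between the job and the data it feeds, no downstream consumer will ever learn why '&' keeps appearing in the data.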
What’s the difference between an association and a relationship? Somewhere in there we’ll find
clues about dynamic quality properties. One thing leaps out as a property of quality and a
property of relationship—expectation. When we claim that something has quality, we establish
an environment in which it is permitted to have certain kinds of expectations. The same is true
of relationship. When two parties or entities enter into relationship they agree as to the
expectations they will have of each other.
In our quest to define quality for data, we will be forced to document expectations and to monitor
accountability with respect to those expectations. We’ll give the relationship much more
attention later.
Foundation—Machines and Logic
We begin our investigation at the most basic level—that of the machine. Whatever Matter (as in
“Does it really matter?”) means to a machine, it is quite different from what it might mean to a
higher consciousness or even to your boss (the Dilbert influence). At the machine level,
mattering takes the form of whether or not to activate an error condition (switch).
As you may be aware, the computing machine is actually a continuum that is anchored by
silicon-based integrated circuits and memory, hosting an instruction set (the ISA or instruction
set architecture) which in turn is the basis for firmware. Firmware is the platform on which the
operating system (OS) is built. The firmware layer is specific to the processor type.
At the level of the machine, all data resolves to a sequence of bytes identified by an
address in memory. Obviously, we’re leaving out a lot of details. This chapter is a synopsis of
an entire semester of Microcomputer Architecture.
The byte is an 8-digit binary value. Binary means that there are only two values in our
mathematics, 0 (zero) and 1 (one), therefore each digit can be represented by an electronic
construct that is either on (1) or off (0). The machine neither knows nor cares what the value
represents and simply performs commanded operations, e.g., and, nand (not and), or, xor (one
but not both), +, -, /, * (higher order operations that can be expressed in terms of and, or, …), etc.
The only time any of the data matters is when the machine performs an operation resulting in a
memory (or register) overflow. In this case an operation produces a result that is too large to fit
into the assigned memory without losing some of the bits (binary digits). To the Logic that
requested the operation, this is a symptom of what is potentially a bigger problem. It means that
the data wasn’t understood well enough to reserve a memory extent adequate to hold the result.
Each stroke of a key on the keyboard generates a byte-value. Each value is a number from 0 to
255 that represents a character in the ASCII (or some other) character set. Somewhere in the
logic between the keyboard and the hardware, there must be recognition that, for example, a set
of keystrokes represents a number rather than a string of characters. If you use a spreadsheet
program you probably have experience in telling the program that a cell’s content is to be treated
as a number with zero or more decimal places.
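The distinction is easy to demonstrate. Most SQL dialects (MySQL, SQL Server and PostgreSQL among them) provide an ASCII function, and it shows that the keystroke '7' and the number 7 are different things entirely.
-- The character '7' arrives from the keyboard as byte value 55,
-- not as the number 7. Some layer of Logic must decide which was meant.
SELECT ASCII('7') AS byte_value,    -- returns 55
       7          AS numeric_value; -- the number itself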
You will have seen that cells behave differently if they hold numbers versus characters. For one
thing, text defaults to left-justified within the cell while numbers are right-justified by default.
An attempt to perform an arithmetic operation on a cell containing text will produce an error
(except in a few programming languages—most notably M or MUMPS—which make assumptions that allow the program to continue executing).
The goal of the Logic (higher-level program) is to identify and trap errors or potential errors
before they are delivered to the machine. This effort is worthwhile because the Logic should be
able to determine the cause of the error condition and take action to warn or correct. Allowing it
to pass through to the machine will result in an “overflow” error condition. This error is usually
accompanied by a memory or register address and possibly a screen full of byte values. The
machine will halt its handling of the input data at this point because that is the only safe thing to
do. This is rarely seen anymore as these potential error conditions (bad data) are trapped and dealt with at a higher level. In the case of the spreadsheet, “#VALUE!” is the recognition of an error condition before it is passed on to the machine.
Please accept sincere apologies for this chapter, but it is necessary to create an awareness that the
machine isn’t the problem. The chapter is short because it is dense (a semester in a few
paragraphs) but more detail is not needed for our purposes.
We’ve satisfied ourselves that the machine is not the object of our focus. Its requirements
(expectations) are simple and straightforward. In essence it throws all the responsibility back on
the participant who delivers the data. It faithfully and efficiently carries out operations on bytes
(pieces of data), confident that it won’t be given data that it can’t handle. It promises to avoid
potential harm by simply stopping and waiting for the problem to be cleared.
The characteristics of quality that matter to the machine are only that the instructions given must
fit the data. Fit is interpreted very literally. The instruction must set aside enough memory to
hold the result of an operation without loss of information. The machine’s only responsibility is
to minimize damage by halting operations as soon as an error condition is recognized.
Our investigation must proceed upward in the abstraction continuum to the Logic or software
layer(s). We have seen here that things that appear from a safe distance to be monolithic or
homogeneous may, in fact, be layered and highly complex. As we proceed toward Logic, we will
come to recognize almost unbounded complexity.
13
http://www.answers.com/topic/lemma
levels for data quality. Let’s assume though that we do wish our system to be associated only
with quality data.
Motivation 2: Innovation
We’ve got this great idea for a software “solution” that will take the marketplace by storm.
Everything that we said about Motivation 1 applies to this one as well. Now, though, Marketing
gets impatient and, in order to be first to market, begins to sell something that really only exists
as an idea—or maybe a prototype. This has the effect of producing punishing timelines and inducing shortcuts.
We’ll consider the human side of this in more detail later, but for now it’s enough to understand
that all of that validation that we discussed above and much of the modeling will start to look
like dispensable fluff. We don’t need it if we use text data types and control the data used in our
demonstrations. It’s only important to our aim of being first to market with our solution that we
demonstrate a product that appears to be doing something that our customers need. Need we
mention that our customers have never told us that the quality of their data is important to them?
In fact, in many cases (only fear of being labeled a cynic prevents me from saying most) the
customer has never thought about the quality of their data and how their data might provide a
more realistic test. They look at the sample data and see that it looks just like the ideal data that
they produce. It’s an easy sell to get them to believe, “This is the result you will see from this
product.”
Please don’t misunderstand. We aren’t blaming the seller of the solution. If anyone is to blame
in this it’s the buyer who doesn’t understand his own situation. Let the buyer beware is a sound
warning that has been ignored for millennia. Remember the Bernie Madoff story in the chapter
on abstraction?
14
As an aside, this seems to cast a different light on pragmatism. Is a prototype pragmatic? How pragmatic is it?
Humans come and go
Humans are not interchangeable
Each is unique
Each has a unique way of perceiving anything
Maturity level (and therefore, perspective) changes over time
Skill level changes over time
Knowledge level changes over time
Human means changing. Humans must change and adapt just to live with one another
successfully. For a large portion of humanity, though, change equates to uncertainty and
uncertainty evokes fear. For those who advocate change then, it is critical to do so in a way that
reduces uncertainty and therefore fear.
Fear can be overcome or avoided by some tried-and-true methods, among them:
Education
Training
Practice
Motivation
Leadership
Fear is one of the forces that act to inhibit change and although it is a powerful force, it is still
but one. Another that is even more difficult to address is inertia. If you were a physics student,
you may be saying, “Inertia isn’t a force.” In physics terms it is not a force. Inertia is the
tendency of a mass at rest to remain at rest or for a mass in motion to stay in motion. Inertia
must be overcome by some force. In human terms, inertia is often known by other names such
as lazy, unmotivated, satisfied, content, comfortable, inattentive, ignorant, powerless, and on and
on or, conversely, driven, committed, goal-oriented, focused, etc.
Whenever a person or an organization is described with one of these adjectives, we can be sure
that inertia is in control.
There is another aspect of inertia that we should be aware of because, while some of the
adjectives imply lack of motion, others allow for undeviating motion. Let’s remember that the
other side of inertia is that a body in motion tends to remain in motion (along the same
undeviating path).
In any case, it takes a force to overcome or change (re-direct) inertia.
We have a tendency to organize things, to classify, label, count and file. Moreover, we like to
specialize and be the best. Specialization and being the best go hand-in-hand. Think of the
Guinness Book of World Records. If I want to be in the book I look at all of the existing records
and then pick one that I already like and am proficient at. Then I specialize just a bit. I juggle 14
cleavers but I do it while standing on one foot. In the same way, if I am marketing myself I can
claim to be the best at Agile (Alliance, 2013) development in a manufacturing market (or a
healthcare market…)
The first few programmers when computing was in its infancy could call themselves the best and
who could argue? As the potential of the computing machine came to be understood, more and
more people who were intelligent and talented began to call themselves programmers and they
were in demand. Gradually, though, competition for the available work grew and people began
to look for ways to claim that they were the best.
At the same time, computer processing (as it was then called) was maturing as a discipline, if not
a profession. The methods for producing good software were being researched, analyzed and
improved. New process steps were created which created the need for new experts.
Marketing was also maturing as a function and this had a direct effect on specialization. As new
products were introduced it was necessary to differentiate them from already-existing products.
What better way to do that than by claiming that the new product is “the best.” Frequently a new
capability or functionality had to be created in order to claim “new and improved.” The new
functionalities provided an avenue for a technician or developer to claim it as his own and once
again be the best or the only.
Decades of this have led us to where we find ourselves today—in an impenetrable morass of
largely meaningless acronyms and technical jargon. Now, a programmer can claim expertise
merely by listing the hottest new technologies on his resume or by liberally seeding his
conversation with acronyms and jargon. Of course the impression of expertise may not be
sustainable but that’s not the point.
The flip side is that the consumers of information technology are easily cowed by references to
things that aren’t understandable to them—even to the point of claiming, “I’m non-technical!”
and making the claim a badge of honor. The ugly result of unnecessary specialization is that a
huge gulf has been created between the technology people and everyone else. In a conversation
with someone from the “business side” a mention of I.T. will very often elicit some eye-rolling
and statements like,
“I can’t even talk to them about something like this.”
“I.T. wants to do it their own way.”
“What comes out of I.T. won’t be what I asked for or what I need.”
“I hate dealing with I.T.”
The fundamental problem, and the reason why [data] quality is intransigent as an issue today, is
precisely the gulf that was exposed in the previous section. A little story might help to illustrate.
A certain church had “adopted” (in 2008) a tribe of hunter-gatherers in east Africa. The hunter-
gatherer culture is nomadic in order to follow the food supply. Nomadic means that the tribe has
no permanent structures and no need for any such.
Your inclination may be to dismiss this tribe as an anachronism with a comment like, “They
should wake up and get into the 21st Century.” A longer term viewpoint might be that since these
are some of the last nomadic people on the planet, any problem will take care of itself if we
simply allow time to pass.
A church (such as this particular one) wants to demonstrate to these people that they are linked to
other people whom they don’t know and that these other people care about them. In an effort to
do this, the church created a “mission trip” and offered its members the opportunity to be
emissaries (and a chance to go to east Africa).
Meetings were organized to discuss the trip and what would happen. In these meetings, a
recurring topic dealt with what the emissaries would “do for” the people of the tribe. Some
wanted to build (a church or school or clinic) and proceeded rapidly to discussing materials and
fund-raising until they were reminded that this tribe had no need or desire for such structures and
wouldn’t know what to do with them if they suddenly appeared.
The tribe lives within an inertial bubble of tens of thousands of years of nomadic culture in which
the only useful knowledge pertained to finding sufficient food and water to stay alive. Medical
and other services were delivered within that bubble by going to where the people were and
providing what they needed.
The Amish and other groups here in the U.S. are similar to this tribe in terms of the very visible
gulf between them and the technology-dominated world in which most of us live.
What we are saying is that this same kind of gulf exists between the typical “end-user” and those
who are comfortable with the manipulation of technology. Now, obviously we are over-
simplifying and there is a spectrum that runs from technophile to technophobe. The
programmers are at one extreme and chances are that your V.P. is at the other. Even
“manipulation of technology” involves a spectrum. There is manipulation as in productive use
and manipulation as in creating from new ideas.
All of this is simply background for the observation that it’s no wonder that data quality or
technology quality in general doesn’t seem to matter to most people.
We, to whom it does matter, seem unable to grasp the existence of a culture that doesn’t see
things in the same way we do. Even though we can’t see the quality of the observation skills of a
hunter-gatherer or the quality of a particular draft horse or a quilt or the use of light and the brush
strokes of a piece of art, we don’t feel at all inadequate. Now we want to infer that others are
somehow less because they can’t see the quality, not just in a piece of data, but in a data
architecture, a process step or a data profile.
In order for that kind of quality to matter, a person must see the whole. A holistic orientation is
an absolute requirement in order for quality (of any variety) to matter. How much it matters
depends on the scope of the portion of the whole that one’s holistic perspective is able to
accommodate.
How does one become able to expand his holistic view? Experience enhanced by learning is the
only way to bring more of everything into your personal universe. Another secret: once you
think you have nothing left to learn—you won’t. Beware, though, life has a way of moving you
off of that pedestal and it is frequently unpleasant. It’s only when that happens and you accept
that everyone you meet has some new knowledge or new perspective that will expand your
universe—only then will your understanding of holistic begin to change and expand. Holistic
must be defined in terms of all the stakeholders collectively. To restrict it to the perspective of
only one—even if that one is you—is to guarantee a sub-optimal result.
We have all seen and lived these sub-optimal results. Whenever we trade some aspect of data
quality for hitting a time-box or a budget or for making someone’s life just a little easier, we
have suboptimized. We do this again and again because we focus on the local benefit and get
tunnel vision.
Why do we insist on getting tripped up by this? We have only to look at the benefits to
understand the why. What is the benefit of making someone’s life a little bit easier? If that
someone is you, then the benefit is in the sigh of relief or the “one less thing to worry about” or
the few extra minutes each day.
If it’s someone else who benefits, we get the credit, the accolades, the commendations,
recognition, enhanced reputation, credibility, or simply the chip that we collect and hold to
exchange for some future favor. And what did we have to trade for this? Only something that
no one may ever notice and if they do, it will be well into the future. In fact, we hope to have
moved on to greener pastures by then. It’s an easy trade—a no-brainer.
It takes immense self-discipline not to suboptimize. The choice for suboptimization often isn’t
even apparent and is buried deep within the candy and baked goods and covered in chocolate.
We have seen that suboptimization is the natural outcome of human nature. How, then, can
suboptimization be driven out in the face of the natural, unthinking and self-validating processes
and tendencies that produce it in the first place?
There can be but one answer but that answer, discipline, is itself burdened with connotations of
effort and suffering that seem to make it a non-starter. We all know that we need discipline and
especially self-discipline. If it were easy though, we wouldn’t have an epidemic of tobacco use
or obesity on our hands; the US Army would not have had to water down the physical
conditioning in its basic training programs; and addiction would be a much smaller problem.
We have very few models of discipline that we can hold up as standards or goals. The
organization itself will be accused of hypocrisy since it is clear to all observers that the discipline
it attempts to force on itself through policies and standard practices is only intended to apply to
those who must be controlled. Those policies and procedures are most often used as a club to
force submission under threat of termination of employment as specified in the employee
handbook.
Budgeting processes provide one example. Managers devote considerable time to assembling
and justifying a budget request each year. Even before they get a chance to present their request
they are notified that all requests must be cut x%. Frequently these demands to “trim the fat” are
repeated multiple times. A manager must find ways to get through several rounds of trimming
and still arrive at the endpoint of this process with sufficient funding to deliver on
accountabilities or else risk being seen as ineffective or redundant.
All of this comes in the guise of bottom up budgeting. Let’s see what we need to make the
business successful—that is the motivating tease. You, the manager, are important and it’s
important for us to understand what is needed to make your organization productive.
Of course, the CFO has long since created a forecast and all of the repetitive fat trimming of the
“budget process” has been designed to get to the forecast number. In this scenario,
suboptimization is guaranteed. Even if the various department managers had been working
toward some common goal, they will have forsaken those expenditures as extraneous “fat” long
before the end of the process.
Attentive observers will also have noticed that the biggest expenditures tend to bypass the
budgeting process completely.
The corporate world is chock full of examples such as this. Individuals are consistently pitted
against one another—not by design—because of inattention and lack of discipline. We insist on
allowing ourselves to be victimized by arbitrary time lines and budgets.
The most effective defense against the tendency to suboptimize is unrestricted information flow.
When people lack information or believe that they do, they assume a defensive posture. In
defense, they begin to hold back information themselves and then to adopt self-centered
approaches to guarantee their own survival. “How do I know you’re not playing me?” leads to
“Your problems are your problems. I have to make sure I have what I need.”
It begins to look like we will always need to work in a suboptimized environment. This has been
a known problem for decades. The Rand Corporation published a study of suboptimization in
1952 (Rand Corporation, 2012). Even with wide-spread recognition of the problem it continues
to flourish. Given its hardiness, it would be wise for us to learn how to co-exist, minimize
damages, and succeed anyway.
The central, though unspoken, theme of suboptimization is self-interest. Self-interest is the
primordial goo we often hear about in discussion of the origins of life (on Earth). In fact, self-
interest is the basis of evolution and suboptimization is the mechanism.
We shouldn’t wonder then when we encounter it in discussion of data quality. Because
suboptimization will always be with us, our best chance of coping is simply to identify it and
plan around it. Enlightened self-interest is still self-interest but it means softer boundaries and
expanded opportunities.
So much of our lives is expended (wastefully) by attempting to stand in the path of self-interest.
The lesson to be learned is that self-interest is a train and that our only hope is to get out ahead of
it and lay some new track with a switch so that we can divert the train.
A train without track is a train wreck. This is what you will have if you attempt a course change
without laying the groundwork.
When we stand in front of a train believing that we can (and should) divert it from its path the
potential outcomes are limited in number. The inertia of the train is such that re-directing it
while keeping it intact requires a different approach to the application of force. Unless we have
built new track, there are only two outcomes and both are unpleasant.
The most likely outcome is that we become history. Of course it won’t be the history that we
envisioned. It will be a kind of history that will produce the exact opposite of the outcome we
really wanted. For years afterward we will be that someone who tried it and disappeared. We
will have simply become an impediment to change, guaranteeing that the new day we were
seeking is pushed even farther into the future.
The other outcome is a train wreck. We’ve all seen train wrecks in the movies. It’s fascinating
to watch and we marvel that the locomotive is in the river while those in the last car continue to
dine or converse or nap, oblivious to what is about to happen to them.
A [good] manager never forgets those in the last car even while applying power and then brakes
at the right moments to make optimal progress without going off the track. Those in the last car
hurtle along with no sensation of hurtling. In order to maintain that illusion, the manager must
avoid sudden changes in direction, sudden braking, sudden acceleration—sudden anything.
How do you do that when your most reliable tool is the org chart? If all of the track, both old
and new, is in your box on the org chart you can proceed with some hope of success. But is your
box composed of boxes? Are other managers, other groups, other communities, other interests
involved? If so, the job of surveying and track-laying just got much more complicated.
We begin to see that the organization may not be up to the task. Organization simply lacks any
insight into interests and motives. Those who draw borders on maps often look for guidance
from terrain features such as mountains and rivers, assuming that these natural boundaries will
have divided interests as well. How surprising, then, that artificial boundaries, even when they follow natural ones, become motives for strife when cultures/communities with
disparate attitudes and customs are divided and thrown together in new mixes that “aren’t
natural.”
Those divergent cultures may be made to cooperate in order to get the train on the new path but
they will always be resentful of one another, will follow different leaders and will seek ways to
make themselves prosperous even (and maybe especially) if it means taking something from
their neighbors.
An additional opportunity for problems comes from the fact that in the information age, the train
and its tracks may be entirely virtual. “Neighbors” may not have physical proximity.
Communities/Cultures along the virtual track may differ not only in attitudes but also in
language. To a certain extent we acknowledge and accept suboptimization as inevitable. Then we fall victim to dissociative identity disorder, passing from a pragmatic
acceptance of and appreciation for the difficulties to a self-centered mode in which we fume at
our inability to get what we need.
Later we’ll examine the concept of “what I need” in some detail but for now we leave the subject
of [human] people to resume our exploration of quality itself.
15
The 1980s and 1990s were the golden age of software development methodologies. We had learned a lot about how to create high-quality, low-
maintenance systems and there were many CASE (computer assisted software engineering) products available that could be configured to enforce
your methodology of choice. There was a problem though—putting all that quality in was slow. The process was so slow that sometimes the
need for the system had evaporated before anything was ever delivered. At this point a group of well-known methodologists got together and
created the Agile Alliance. The Agile Alliance distilled what were considered to be the most important aspects of all of the methodologies in use
and repackaged them in an iterative (as opposed to a “waterfall”) model that was designed to deliver results quickly and use the end
user/customer for quality control prior to launching the next iteration. The result of this is that a new kind of “quality” was born. Unfortunately
Agile didn’t resolve any of the issues we are dealing with here.
item. In fact, in those days programmers actually recognized that they were tagging or labeling a
location in memory. The entire concept of a data item was an abstraction still hidden in the mists
of the future.
What was found at the address represented by the tag? A binary value (a sequence of zeros and
ones representing a number in base-2) could represent anything. In the beginning it was numbers
only. IBM extended BCD (Binary Coded Decimal) to six bits rather than four and by doing so was able to encode uppercase alpha (for alphabetic) characters as well. By expanding to eight bits (Extended BCD Interchange Code or EBCDIC) lowercase alpha characters as well as
some special characters could be encoded. The Univac computers from Sperry Corporation used
a 9-bit byte in order to make use of an expanded instruction set.
ASCII (American Standard Code for Information Interchange) is a 7-bit code defining 128 characters; stored in an 8-bit byte, it leaves room for extended sets of up to 256. This code is used worldwide today. All of this is interesting but not
critical knowledge. The important concept here is that the Latin letters and the Arabic numerals
that we think of as data are recorded in electronic memory, transported as electronic impulses
and decoded by programs (Logical constructs). These are the grains of sand on which our
quality foundation must be built.
At this point we had the ability to store virtually any kind of data in computer memory. This
wealth of capability naturally came with a price. How could a program differentiate between
different kinds of tagged memory? What if one program stored integer numbers in memory and
then another program retrieved that memory and tried to interpret it as alpha[numeric] values?
The need for data typing was born.
Because memory was severely rationed (due to its cost), programmers developed many tricks to
overload a memory address in such a way that one program could see a computational value
while another could see a character or non-computational value. Some of these tricks are still in
use today although the difference in interpretation and use may be much more subtle.
A programmer may define a data item, usually called a variable, which really represents a tag for
an area of memory. They may insert a comment that this variable will be used to hold values
representing, for example, a person’s body temperature. The assumption is that this program
will interpret the sequence of bits assigned to this tag as a real or floating-point number which is
simply a number with a decimal point in it. The programmer sees no need to go into greater
detail in the definition because all who use this system know that a body temperature will have
one digit to the right of the decimal and that valid body temperatures are those read from a fever
thermometer and for a living human will range between about 75F (23.9C) and 112F (44.4C).
Maybe you can see a problem looming already. How do we know whether the value we find
was expressed in Celsius or Fahrenheit? A human looking at the value would assume based on
known valid ranges. A computer program can’t assume and would need to apply rules present in
the code. What will be the basis for the rules? The definition would be the place to begin the
search.
This problem is manageable, though applying rules would certainly increase the number of
processor cycles needed and therefore the response time of the program. To complicate things a
bit more, we may find values like 60 or 120. These are not believable as body temperatures.
Could our rules differentiate? What would our program do with such values? How would they
be interpreted? Over time an examination of the data might suggest that the input program was
allowing intended respiration values to be stored in our body temp variable. Maybe we would
even have enough associated information to track these entries to a specific nurse who always
took a patient’s vitals in a certain order and always recorded them in the same order.
What is our defense? Think of the definition as a package, like an egg carton, that helps the customer decide what should go into the container and whether what we find in it actually belongs there.
The best possible defense is to include in the definition enough specifications to determine
without ambiguity whether a given value should be stored in this variable and, with the same
lack of ambiguity, determine whether a value returned in this variable is valid. But wait, it isn’t
enough to create this specification list, we also have to make it available to future programmers
who might want to modify either the input or the output logic. This is all possible within the
schema definition for a table in any relational database management system. We can specify that
a value is required, that it be a real number, that it be between 75.0 and 112.0 (if Fahrenheit is the
standard) or between 23.9 and 44.4 (if Centigrade). To simplify we can store values in
Fahrenheit units and allow the program to convert to Centigrade at the user’s option.
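As a minimal sketch, assuming Fahrenheit as the standard and using hypothetical table and column names, the entire specification can travel with the data:
-- A sketch only: the definition expressed as declarative constraints.
CREATE TABLE VITAL_SIGNS (
    patient_id INTEGER   NOT NULL,
    taken_at   TIMESTAMP NOT NULL,
    body_temp  REAL      NOT NULL                -- required, a real number
        CHECK (body_temp BETWEEN 75.0 AND 112.0) -- Fahrenheit range assumed
);
Any insert or update that violates the definition is rejected at the container itself rather than being discovered later in a report.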
Oh, yes, and one more thing—we have to build such definition/specification into our quality
control procedures so that we make sure they are applied wherever this data item is used.
When an inadequate definition is associated with a data item, the next person who tries to use it
will have questions. It must be assumed that the person who created the initial definition is
unknown or, if known, no longer available. The questions won’t be answered. The best-case
scenario has you tracking down the person responsible for the definition only to find out that you
have already seen everything they accumulated.
The diligent person will analyze the data values in context to deduce or simply infer the answers
needed to fill in the blanks. In many cases the person, believing that they possess sufficient
knowledge and experience and in the interests of convenience or time-savings, will assume the
answers, claiming insight that will later turn out to be unreliable. When we discover that the
values currently in our database are not consistent with the definition published for them, what
more can be said? What do we really know now about the ones that appear to be consistent with
the definition?
The original definition—if it exists at all—will be expanded to cover the new situation. Business
will proceed as usual until someone tasked with creating a report or someone searching for
actionable information in a report suddenly notices that subtotals are incorrect. Minimal
investigation reveals that the headings for subtotals include several different kinds of things, not
all of which are expected. We have uncovered a DQ issue.
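The minimal investigation usually amounts to a profiling query along these lines (the names are hypothetical):
-- A sketch only: list every distinct value being used as a subtotal
-- heading, with its frequency. The surprises appear near the bottom.
SELECT product_category, COUNT(*) AS row_count
FROM   sales
GROUP BY product_category
ORDER BY row_count DESC;
Alongside the expected categories we find stray codes, misspellings and values that clearly belong to some other attribute entirely.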
We can recognize that the definition we have is inadequate in some way, but what definition was
being used when the unexpected values were enshrined? We have at least two choices in
addition to pretending we never noticed (which is a far more common choice than we might
wish).
We can root out the offending values (this is known as data cleansing), but consequences come
immediately to mind. What do we do with them? Can they simply be deleted? What about
related information which, by the way, also does not fit the original definition as we have understood it?
The second choice, segregating the offending data values and their associated data to a new data
item, also has consequences. Will they need to be associated with the “real” data? Can we
extend the definition to cover the unexpected values? What effect will that have on existing
reports, forms, etc.?
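A sketch of that second choice, again with hypothetical names, preserves the evidence instead of destroying it:
-- A sketch only: segregate non-conforming rows for analysis
-- rather than deleting them outright.
INSERT INTO vital_signs_quarantine (patient_id, taken_at, raw_value)
SELECT patient_id, taken_at, body_temp
FROM   vital_signs
WHERE  body_temp NOT BETWEEN 75.0 AND 112.0;

DELETE FROM vital_signs
WHERE  body_temp NOT BETWEEN 75.0 AND 112.0;
The quarantined rows keep their clues about the cause of the problem, clues that outright cleansing would erase.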
This kind of problem is routine. Nothing remains the same for long and particularly in the
domain of business. Mergers and acquisitions are a never-ending source of DQ issues.
Entrepreneurial initiatives within a company comprise another fertile garden.
It will not be productive to try to fix blame or even to do a root-cause analysis. The cause is
change, whether the change be due to growth (planned or unplanned), market pressures,
technology, or simply opportunity. The real questions require answers that will not come from
I.T. Unfortunately, it is going to be very difficult to get anyone to pay attention long enough to
understand the questions, let alone develop answers that will demand analysis, planning and
decision-making. People just have a blind spot where data and specifically DQ issues are
concerned.
Recasting this as a process issue will result in a greatly increased level of comfort. People very
naturally understand that there may be a better way to do something. If we can show them what
a quality result looks like and then show them a process by which they can produce such a
quality result, the enemy is on the run. Turning the issue into something tangible will create
longer attention spans and increased tolerance for engagement. Everyone grasps the need to re-
examine processes when a raw material or component is changed. The need for new processes
for new products is apparent. The problem, then, is to make a DQ issue look like a process
issue.
This is the point at which current data quality efforts fail. We have shown ourselves unable to
view quality problems through a process filter. Instead, we bravely step forward and volunteer
to fix the bad data. It will take a strong leader to resist this tendency and get the herd of cats
back into the cage.
When we begin to view data as a raw material, component or product, we will immediately begin
to see opportunities for the quality control (QC) of data. Like any other QC effort, though, we
will find that we can’t anticipate everything. We will be able to provide guidelines that will help
us sort out the unanticipated situation and get us back on track.
The original definition of a datum, set of data, or information object in an organization is
absolutely critical in laying a foundation for quality. A good foundation is not a guarantee of
successful data management but a poor or non-existent foundation is a predictor of problems.
The problems may range from occasional unreliability to macro level failure affecting an entire
organization, market segment, or even a nation.
It may and probably will be impossible to put the genie back in the bottle. Repairing “bad” data,
while it may get us over one more hurdle, is fraught with danger at best and at worst, the process
of repair erases clues about the cause of the problem. Too many times, the repair actually
succeeds in creating even bigger problems that may not be noticed until long after it is too late to
undo the operation.
Knowing that it will be costly and very likely impossible to walk back a problem to its source
when the source has never been documented, we should be highly motivated to spend whatever
time is needed to ensure that we have documented the source of the data and the uses and
relationships in which it is involved.
Why don’t we do this then?
16
They had to walk a fine line because they could always be criticized for wasting memory which was quite expensive in those days—but that’s
another story. This is also the source of the infamous Year 2000 or Y2K problem. Programmers allocated two characters for storing the year
portion of a date. This was entirely reasonable in 1959 since who could have guessed that this code would still be in use 41 years hence.
returned n characters where n is the number of bytes reserved. When the forces of change made
even their most extravagant estimates inadequate, tool developers looked for another and better
way.
Thus was born the VARCHAR or variable length character string. This data type solved a host of
problems in that a programmer could name a maximum length for the data item and leave it up to
the compiler to ensure that memory would not be wasted. The computer only used memory
enough to contain the contents of the string and kept track of how long it was. It retrieved only
the actual string value and not all of the memory reserved for it. Those who weren’t there cannot
possibly imagine the sigh of relief that ensued.
Almost immediately VARCHAR became the type of choice for a default setting. We had the best
of all possible worlds (or so it was thought in the glow of newfound freedom). It actually took a
bit of time to realize that after-glow and myopic fog have much in common. In truth, were we to
examine the default settings used by our modelers and programmers right this minute, we would
find that upwards of 99.99% are using VARCHAR(n), where n is often 255, as their default. You
may actually want to make a note of any that you find who do not use default settings because
they may be able to explain to you some of the problems you are experiencing (or they haven’t
discovered yet that defaults are a feature).
Recall our discussion of body temperature and its definition. The use of a VARCHAR data type to
store what is actually a real (floating point) number is the root of that problem. If a data item, for
example BODY_TEMP, were declared as REAL (or FLOAT or SINGLE or DOUBLE…), with
appropriate range values established, it would be impossible to store values of “NR” (not
recorded) or 120 or 50. If we made it mandatory as well, we could ensure that our nurse or
clinicians always provided a BODY_TEMP value.
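If BODY_TEMP was instead declared as VARCHAR, the damage has to be hunted down after the fact. Here is a sketch using PostgreSQL-style regular expression matching (the operator varies by dialect; the names are hypothetical):
-- A sketch only: find everything in a VARCHAR 'temperature' column that
-- is not even shaped like a decimal number ('NR', 'n/a', '98..6', ...).
SELECT body_temp, COUNT(*) AS occurrences
FROM   vital_signs
WHERE  body_temp !~ '^[0-9]+(\.[0-9]+)?$'
GROUP BY body_temp
ORDER BY occurrences DESC;
Values like 120 or 50 pass this shape test and require a further cast and range check, which is exactly the work a REAL declaration with a CHECK constraint would have done for us.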
Why don’t we do these things? The only reason is suboptimization. We shift the cost of error
detection and handling to someone else far removed. We keep the cost of development down
and create a much larger cost that includes distrust, lack of credibility, inconsistency and real
money for some faceless person in some other part of the company.
The fix is to adopt quality assurance processes that assure that the data product won’t just be
thrown into any available container but will be placed into a container designed specifically to
protect it. Be assured that attempts to do this will get a lot of pushback accompanied by wailing
and gnashing of teeth. This is the standard response to any attempt to eliminate suboptimization
and the best way to deal with it is constancy of purpose together with trust. “You’re not going to
be held to old metrics. We’re going to monitor this for a while before we establish new metrics.”
There is an economic or financial tradeoff to be discussed. If we know that we must pay now or
pay later, we very often choose the later option simply because we hope that we’ll be better able
to afford it later or that we’ll be the lucky ones who never experience the breakdown.
The default use of VARCHAR(n) is an excellent illustration of this tradeoff. We trade relatively
instant gratification in the form of a usable system for the certainty of high maintenance costs
and DQ issues later. The tradeoff is especially easy because no one talks about future costs.
Everyone congratulates himself for being part of a successful project, everyone updates their
resume, and a few are promoted. When the costs hit, there is an entirely new cast of characters.
The new cast goes to work without adequate specifications or definitions, making “educated”
guesses about the intention of the original project team. They often change the meaning of
things just enough to make the next maintenance effort even more costly. The result of this
iterative model is an entirely new system after 1-2 maintenance iterations. A side effect is a
substantial amount of data that is useless or nearly so. How do we even begin to calculate the
cost of this?
17
A readable description of the normal forms can be found on Wikipedia (www.wikipedia.com) with a search for “database normalization.”
sense we need. When relational data management emerged it was thrust into an already well-
developed and mature data management environment. Many compromises were then introduced
in order to create a better fit into that environment. The more we understand of the principles
that led to relational data management, the better equipped we are to recognize the compromises
and decide when to use them and when to ignore them or even to exert the effort needed to
overcome them.
As data processing advanced from its infancy, the need to store and access ever increasing
volumes of data virtually demanded an increasing reliance on technology. Remember,
technology is about speed. We were doing everything that we now do before the computer was
invented but it took longer and used more people.
The history of programming is dominated by the idea of processing. In fact, data processing was
an early name for what is now known as IT (information technology) or IS (information
systems). Programmers carried the burden of technology application. They eventually became
known by new names like developers, software developers or software engineers. They were
differentiated as systems vs. application programmers. Sometimes they were assembly language
programmers or firmware programmers or PLC (programmable logic controller) programmers.
Programmers by whatever name continue to dominate the technology world though they may
now do it from Mumbai or Singapore.
Record vs Set
In the dark ages of system development the system accepted data records on paper cards (or
tape). The data could contain all the information that could be recorded on an 80-column card
carrying one character per column. The invention of magnetic tape input allowed the records to
be arbitrarily long, provided they were separated by a special character. The last record was also
followed by another special character, the EOF (end of file).
The next development was the disk drive which allowed for reading what appeared to be random
records (read record number 11223344 for example). Further advances included new structures
for files so that the records could be stored in smaller segments scattered about the disk (or even
several disks) and still be accessed sequentially or randomly as needed.
Developer/Programmers, creative as they are, found new algorithms to better manipulate the data
in all of these varied formats (because, due to the frugal nature of business folks, they were all
kept around until a “business case” could be made for replacing each one). These algorithms
soon became part of the languages used to write the programs. The languages, therefore, were
almost universally oriented to record-at-a-time processing.
Now comes a new development that trumped all the media developments (cards, magnetic tape,
disks), structure developments (sequential, ISAM, linked list, hashed…), the algorithm/language
developments (assembler, Fortran, COBOL, Modula, Ada, C (and all its variants), Smalltalk,
Java…). We no longer needed to process one record at a time. With the advent of the relational
dbms (database management system), not only could we leave the how of accessing data records
to the dbms, we could actually specify via a relational query language (the most widely known of
which is Structured Query Language or SQL) what kind of records we wanted and even which
part of the record.
We could ask for
The set of order records created since 5:00 PM on July 15 of this year.
The set of customers for order records created since 5:00 PM on July 15 of this year.
The set of products for order records created since 5:00 PM on July 15 of this year.
And the really beautiful part—we could manipulate the resulting set in the same way.
We could ask for
The (zip codes from the Customer_Address table) where customer ID is the same as
customer ID of the (Orders table where Order_Date is greater than 5:00 PM on July 15 of
this year).
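In SQL that request might read as follows, assuming table and column names consistent with the description (the date is made literal for the sketch):
-- The zip codes of customers with orders created since 5:00 PM on July 15.
SELECT ca.zip_code
FROM   Customer_Address AS ca
WHERE  ca.customer_id IN
       (SELECT o.customer_id
        FROM   Orders AS o
        WHERE  o.Order_Date > '2013-07-15 17:00:00');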
Further, we could change the entire set with a single query! No longer were we forced to plod
through each record of the Orders file checking whether the creation date met our specification
and, if so, writing the record to a new temporary file. Then plod through each record of the
Customer file, comparing the customer ID to the customer IDs in our temporary file and writing
the Zip Code of the matching records to yet another temporary file for eventual reporting. Now
we could tag each of those customers with some marketing indicator (such as “Current” or
“Recent”, etc.) with a single command and let the dbms decide how best to accomplish it. And,
by the way, the dbms has optimization algorithms built into it that far exceed the abilities of the
average programmer.
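The tagging operation, sketched the same way with the same hypothetical names:
-- One statement replaces all of the plodding; the dbms decides how.
UPDATE Customers
SET    marketing_indicator = 'Recent'
WHERE  customer_id IN
       (SELECT customer_id
        FROM   Orders
        WHERE  Order_Date > '2013-07-15 17:00:00');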
This is a good time to give database administrators a plug. The DBA comes with an
understanding of the internal workings of the dbms and can guide both programmers and
architects to deliver the quickest possible response times. If you are dependent on relational
databases, then you need a DBA. This role will pay for itself very quickly. Remember this, too,
the DBA’s raw material consists of memory and processor cycles. With sufficient raw materials
a good DBA can deliver unimagined consistency.
The bottom line is that now we only have to define the set of data we want to work with and it
will be delivered to us.
18
In fact, SQL emerged as the “winner” not least because it was one of the first languages to incorporate the cursor. It was also receiving heavy
backing from IBM at the same time.
programmers and their use meant that a query expressed in the given language no longer
produced a predictable (that is consistent) result. In addition, the entire concept of a relational
dbms was predicated on the assumption that the data was to be organized following relational
principles. These principles became known as normalization rules. (Database Normalization,
2013)
If the tables of a database do not conform at least to 1st normal form (1NF), it may not even be
possible to specify a query that will produce the set we want. The first rule of relational design is
that a relation (or table) may only contain information about one kind of thing. Often a file in
pre-relational days might consist of records that represented a history of some activity. This
allowed the programmer (or the file clerk) to pull a single record (think file folder) about a client
(for example) and thereby have all information available about that client. Medical records were
(and are) a good example of this.
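A before-and-after sketch, heavily simplified and with hypothetical names, shows what the reorganization looks like:
-- Before: one 'record' mixes many kinds of things, pre-relational style.
-- CLIENT_HISTORY(client_id, name, visit1_date, visit1_notes,
--                visit2_date, visit2_notes, ...)

-- After: one kind of thing per table, related by keys.
CREATE TABLE CLIENT (
    client_id INTEGER PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE VISIT (
    visit_id   INTEGER PRIMARY KEY,
    client_id  INTEGER NOT NULL REFERENCES CLIENT (client_id),
    visit_date DATE NOT NULL,
    notes      VARCHAR(2000)
);
With this design, "all visits for this client" is a single query instead of a program.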
If files like this are loaded into relations (tables) most queries will require some programming to
get the answer we want. Obviously the “right” thing to do would have been to reorganize the
data, loading it into multiple, related tables with each containing one kind of thing. If we did the
obvious thing, we could enjoy all the benefits of relational database management BUT if I’m a
programmer I’m not getting paid to punt problems off to someone else. In fact, in any given suboptimized organization, reorganizing the data will be an extremely rare occurrence. My schedule is practically a sure bet to finish first in any such competition.
If I’m a programmer, I may not know enough about normalized relational design to actually be
able to improve things. In many areas of technology a poor effort can be worse than no effort. I
do, however, have at my disposal cursors and stored procedures and triggers that, together, will
let me overcome the data design deficiencies.
It can’t possibly be emphasized enough that
One of the most important motivations for the creation of the relational model for data
manipulation is to avoid the need to call on a programmer every time you want some
data.
Use of cursors, stored procedures and triggers guarantees job security for the
programmer.
It saves no time at all to perform a single operation repetitively when it can be done once on an
entire set. For example, in a crowd of people we need to identify those with a valid ticket to
move them to another venue. We could poll each member of the crowd, ask for their ticket,
examine the ticket for validity and then move them, one by one, to the alternate venue.
Or we could ask the crowd to move to the alternate venue if they are holding a ticket that meets a
set of specifications. Which would be your choice? A programmer will see the value of set-at-a-
time in the “real world” but will often fall back on what he knows best in the software
development world.
This is creating problems for you. The problems include
1. Unnecessary complexity which resolves to cost
a. In development
b. In maintenance
c. In communication
2. Inconsistency in the database (we know this as poor data quality)
3. Longer response times (frequently MUCH longer)
One last time, fairness dictates a reminder at this point. We have done much to encourage
this behavior on the part of our developers. We could have made the effort to reorganize our
data resource (normalization) to deliver set results in all cases but, because of ignorance and
perceived cost, we didn’t (and don’t). We could become a little more knowledgeable about the
entire subject of data management (as you are doing by reading this book) so that we are better
equipped to ask relevant questions and judge whether the answer is smoke or substance. We can
at least recognize that the problems do not have technology causes so that we can insist on
answers that are not couched in techno-speak.
Apply update x to all <your state here> customers
not
If this is a <your state here> customer apply update x
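Rendered as a sketch, with the set form in plain SQL and the record-at-a-time form indicated only in outline (cursor syntax varies by dialect; names and values are illustrative):
-- Set-at-a-time: one statement over the whole set.
UPDATE Customers
SET    status = 'x'      -- 'update x' made concrete for the sketch
WHERE  state = 'MN';     -- <your state here>

-- Record-at-a-time, the habit to unlearn:
--   open a cursor over Customers;
--   fetch each row in turn;
--   if state = 'MN' then update that one row;
--   repeat until no rows remain.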
19
Note that this term is used generically to indicate a high-functioning, quality organization and not in reference to the
trademarked training and certification programs. Six standard deviations, represented in statistics as σ (sigma), from the mean of
a process whose performance can be represented as a bell-shaped curve will include virtually 100% of results. This process is the
ideal of consistency.
Current and On-going Status of Data Management
When you become lost or disoriented, the best first step is to sit down where you are and take an
inventory. This is a MacGyver approach. MacGyver was a TV phenomenon in the late 1980s.
He was known for using materials at hand to escape from seemingly inescapable scenarios. The
scenarios often involved bombs with timers counting down the minutes and seconds to doom.
Our situation may not be as deadly but the simple methodology of understanding what you have
and how you can use those things to construct what you don’t have is a useful one in real life.
Survival experts advise
Don’t panic
Stay where you are unless you are sure you can make your situation safer
Use what you have at hand to improve your situation
Prepare for rescue
Prepare to stay alive until help arrives
In our history, we learned to recognize some of the perils that have forced us into the current
situation. We reviewed some approaches to staying alive by harvesting low-hanging fruit if
necessary and planting crops if possible. We learned the names of some of the “beasts” that we
have to look out for—inconsistency and suboptimization.
Now it’s time to learn some techniques, not only for staying alive but for finding our way back to
civilization. Recall the example of the orchestra, where rehearsal is the key to beautiful
alignment and individual mastery is the ticket for admission. This book does not represent
individual mastery in any area. It is full of signposts and trail markers pointing the way to the
areas that must be mastered. It’s too much to hope that you will find or become the one person
who has mastered all of the instruments. You will probably need several different virtuosos but
you must remember that rehearsal together is essential. Such rehearsal must be part of the job
description for each.
Setting Course
Here are some roadblocks that have already been encountered by others in your position. The
more we know about roadblocks, the better we can overcome them.
1. Corporations are spending a rather large portion of their budget on data quality
mitigation—it just isn't recognized as such.
2. A substantial portion of the FTEs within your company (which is just like any other in this respect) owe their jobs to the need for quality information—once they realize
that you mean to move them to the excess pool, they may not be supportive of your
efforts.
3. We fail to take action that would reduce quality "problems" because it costs too much, even though the actions are well understood and within the grasp of even junior employees.
4. We can't present a value statement that can be implemented unless it relates to a very
specific subset of the data resource.
5. We create new data quality issues every day by not doing what we ourselves are capable of doing and then attempting to make others accountable.
When your vision for your organization explicitly includes quality data, you already have 80% of
what you need. W. Edwards Deming is famous for many reasons but one of his contributions is
useful no matter where you stand on Statistical Process Control. He gave us the famous 14 Points
which, like Covey’s 7 Habits, are a worthwhile addition to any office wall.
1. Constancy of Purpose
A read of Total Quality Control for Management (Nemoto, 1987) shows very clearly the result
of adherence to these principles. The question is whether you, within the business culture of the
United States (or Great Britain, Germany, France, China…) in 201x, can generate the same
constancy of purpose.
Let’s be clear that we are discussing a transformation here. Nothing less will get the results we
require. It takes 21-28 days to form a new habit according to PsyBlog (How Long To Form A
Habit?, 2009) and the attention span of our culture seems to be growing shorter rather than
longer, so this is no small thing. Constancy of Purpose is what drives all of the other principles.
Constancy of Purpose isn’t a completely unfamiliar concept. It is very similar to Vision and
Mission. A “man on a mission” has constancy of purpose that lasts until the mission is
accomplished. A person with a vision may sustain constancy of purpose driven by the vision for
a lifetime. Constancy of purpose isn’t the same as single-mindedness, which won’t allow us to
address anything else until that focus has been resolved. Constancy of Purpose can be
goal-driven or it can be method- or process-driven. It’s best to allow the method to emerge but
sometimes we know in advance the method that must be used.
If our purpose is to stand at the summit of Everest, many intermediate goals suggest themselves
and we will have a choice of methods up to a certain point. From that point onward, however,
we know that putting one foot in front of the other over and over again is the only method.
If our purpose is to walk across the Golden Gate Bridge, our methods will depend on
Where we are now
Whether there is a target date
Financial resources available
Physical resources available
Our purpose is to have demonstrably credible, reliable, timely information where it is needed
within our organization. Again, we must be clear that this isn’t simply a nice thing that will give
us an advantage over certain of our competitors. This is a matter of survival—or at least
profitability which is closely related to survival.
14. Create a structure in top management that will push every day on these 13 points.
When all of management has bought into the new culture (including these Points), the new
culture has been born. A new language will emerge in the boardroom, in internal
communications, in lunch table conversation…
Managers will notice that it is easier to communicate ideas to subordinates. People will begin
to exchange data on purpose and they will speak in terms of measurements. New ideas will be
tested by measurement and only the best will survive, regardless of their source. New things will
assume critical importance while the “facts” that we formerly considered important will vanish
altogether.
Please recognize that it is not possible to bring undefined processes into control. Most
managers and business owners would stand before a group of peers and maintain that they know
what the processes are for which they are accountable. They would almost certainly be wrong.
Unless you have instituted audit procedures that consistently produce process metrics that
accurately reflect production experience, you simply cannot say that you know what your
processes are.
Abstraction
Things get difficult for many people because of something that should actually make them
easier. If you have studied philosophy at all you will have heard of Idealism. In layman’s terms,
our world—at least the part we have built—is constructed of ideas. The natural world may be
less malleable. Even Kant would have run from a hungry tiger.
The notion of abstraction is in such widespread use that most will not even be aware of it. Part
of being human is the ability to use abstraction to reduce complexity.
We all should become a little more comfortable with various kinds of abstraction. Consider two
common abstractions that we will explore shortly: money and data. For each of them we could
name (or picture) several instances.
Abstraction is both an idea and a tool. When we ask, “What do these things have in common?”
we are engaging in the process of abstracting: the commonality is the unifying idea or theme,
and a collection built around it is an abstraction. Each item that we add to the collection is an
instance of that unifying idea. Abstraction is an extremely useful tool in problem analysis and
problem solving.
It is often difficult to discern the difference between the abstraction or idea and an instance of
that abstraction. For example, someone holding a $5 bill (or note) asks “What is this?” and we
answer, “It’s money.” We would be exactly correct if we answered that it is an instance of
money. Money is instantiated in so many forms that it represents a real problem for most people.
If we put on a table a $5 bill, a postal money order, a stock certificate, a bank statement, and an
active insurance policy and asked, “What do these have in common?” we might wait a long time
before anyone recognized that they are all instances of money.
Similarly, if we laid on the table a grocery list, a spreadsheet, a customer list, a catalog, a USB
drive, and a graph or chart and asked the same question, we might have to ask a hundred people
before any recognized that they are all data. We might never hear that they are all instances of
data.
The concept of instance goes hand-in-hand with abstraction. We can all empathize with former
Justice Potter Stewart of the Supreme Court who, in attempting to define “obscene” in 1964,
wrote, "I shall not today attempt further to define the kinds of material I understand to be
embraced…[b]ut I know it when I see it…" What he was saying is that we can recognize
instances without fully understanding the abstraction. Note that he apparently was engaged in
listing various instances in an attempt to get to the abstraction.
When it comes to data, this is what we all do. We can all recognize what we have before us as
data without recognizing that it is an instance of data. By now you are probably thinking that
this particular discussion has gone far enough. While discussions like this one create confusion
and bewilderment in the casual or neophyte observer (a group that includes almost everyone), it
is still important to drive home the idea of data as abstraction.
Just as there are rules or algorithms about proper handling of money that do not depend on the
instance before us, there are rules about managing data that transcend any instance. These rules
need not be understood by everyone but, if data is important to us in the same way that money is
important, we need to find someone who does understand them and include that person in
decision-making.
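To make the point concrete for the programmers among us, here is a minimal sketch in Python
(every name in it is mine, invented purely for illustration) of a rule that applies to the abstraction
rather than to any particular instance of it.

from abc import ABC, abstractmethod

class Money(ABC):                  # the abstraction
    @abstractmethod
    def value(self) -> float: ...  # every instance must be able to answer this

class Bill(Money):                 # one form an instance can take: a $5 bill
    def __init__(self, denomination: float):
        self.denomination = denomination
    def value(self) -> float:
        return self.denomination

class MoneyOrder(Money):           # another form entirely
    def __init__(self, face_amount: float):
        self.face_amount = face_amount
    def value(self) -> float:
        return self.face_amount

def net_worth(holdings: list[Money]) -> float:
    # The rule transcends the instance: it never asks which form it was handed.
    return sum(item.value() for item in holdings)

print(net_worth([Bill(5.0), MoneyOrder(100.0)]))  # 105.0

The net_worth rule is the point: it works for a bill, a money order, or any other instance of
money, which is exactly the sense in which rules about managing money (or data) transcend the
instance before us.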
Do you know how the big change you are contemplating will affect your money? You will make
sure you do understand the potential impact before you launch the change. Do you also know
how your data will be affected? You might want to discover those potential impacts as well.
Relationship Concepts
Len Silverston and Paul Agnew have written a book entitled The Data Model Resource Book
Volume 3: Universal Patterns for Data Modeling (Silverston, 2008), which is self-explanatory if
you happen to be a data modeler.
If you aren’t a data modeler, please bear with me because I have something for you in return for
your patience.
This is an excellent book, complete and thorough. I bought the book (only an author can really
appreciate the commitment this represents) and eagerly looked through the chapter titles. What I
was looking for was a chapter on relationships.
For about 20 years I have held an opinion about the role of relationships in ER (entity-
relationship) modeling, more commonly known as data modeling, and for about 15 of those
years I have been talking about it to anyone who will listen. I started this book in 2006 (it is
now 2013) but it simply wasn’t ready to be written yet.
In 2012, Danette McGilvray (Granite Falls Consulting) presented a “Meet the Expert” session for
the IAIDQ (International Association for Data and Information Quality). She did a great
presentation on 12 dimensions of data quality based on her recently published book, Executing
Data Quality Projects (McGilvray, 2008). I had submitted a question asking whether the 12
dimensions applied to relationships as well as entities and their attributes. Danette gave the
answer that I expected—she had not considered relationships.
The last and best motivation was the short shrift delivered by the Universal Patterns book. Now,
I know that relationships are different from entities and I’m willing to let Len and Paul and
Danette off the incompleteness hook (I’m sure they’re relieved) but not until I publicly
acknowledge their role in motivating the work you are engaged in reading (actually Book 3
through Book 6).
Now, for you “normal” people (non-data people), this chapter will offer insights into
relationships in general. You might deduce from the name “relational database” that relations
play a significant part. We know that relationships play a critical part in our “real-world” lives as
well, so this is one of those serendipitous times when we can improve our understanding of our
organizational (work) life as well as our personal life.
I find that it is not possible to talk about relationships in the data design sense without diving
fairly deeply into the mechanics of relationships in general. So, my offering to you is that if you
are “working on” a relationship with someone or if someone is insisting that you “work on” a
relationship with them, you will find concepts here that will be both useful and productive.
So, if your interest is in gaining some additional insight concerning your interactions with
another person or you are struggling with the organization of information (data), this is for you.
For the more technology-oriented, a short (13 slides) slide deck is available for download.
The word, interactions, was used above as a poor substitute for relationships. Because I didn’t
want to overuse relationship, I chose a word that means much less. Relationship as a concept
has a power that can’t be easily replaced. The thesaurus built into Microsoft Office suggests a
handful of possible replacements for relationship.
None of these expresses what we mean by the word relationship. None even come close. At the
risk of alienating half of my potential readers: if there is one thing that is the essence of the
difference between female and male thinking, it is the recognition of relationship as something
real, something that can be “worked on.” Sisters, I’m going to do my best here to make
relationships real for everyone. Brothers, trust me, this is going to help you.
We all want improved relationships with our loved ones, friends, and coworkers. We all need to
organize the flood of information in our world—some of us do it for a living. In either case,
relationships are the key and understanding what they are and how they work is critical for us to
feel, and be, successful.
If you are looking for a deeper understanding of entity-relationship modeling because that’s how
you earn your living, you won’t mind the inter-personal parts because you also have
inter-personal relationships in your life and they always need deeper understanding and attention.
If you are interested only in the inter-personal aspect, but are deeply and genuinely interested in
understanding and improving, then you will benefit from the more “technical” treatment because
it will give you additional tools that you may use as you “work on” your relationship. I also
promise to warn you if the discussion might go beyond relationships and into purely technical
issues.
Understanding the relationships around us is vital, whether you are in workplace politics or a
romance.
If we work in the world of business and information technology and we want processes and
systems that continue to function even when a new department pops up or an old one disappears,
then we need to go deeper into the underlying relationships and avoid being distracted by
traditions or personalities.
The stability in our lives comes from the core—our relationships. Your author and expert asks
that you grant him credibility in the area of relationships based upon the following:
44+ years of marriage to one woman
more than 30 years of experience with relational databases
training as a group leader with Marriage Enrichment, Inc.
more than 30 years of experience with entity-relationship modeling and analysis
training with Befriender Ministry and Stephen Ministry
author of “A Philosophy of Data Modeling”, Database Programming & Design, 1988
many presentations on modeling for meaning
work with 3M, Mayo Clinic, Unisys Defense Systems, BNSF, Northwest Airlines and several
smaller corporations in multiple industries involving every aspect of entity-relationship
modeling
If you are a woman, you’re probably thinking that there is no way a man could ever have a clue
about relationships. If you are a man, you may be thinking that you should have picked up a
different book.
Lesson one, then, is that both men and women can be wrong about relationships.
For those interested in modeling data, you have been creating “relational” models with minimal,
misleading and for the most part useless information about relationships. Stick with me and
learn how to create models that are intuitive, expressive and, above all, useful. Why? Once
again, I’m glad you asked.
Data Models (or ERDs or ER Diagrams or whatever you want to call them) are extremely useful
in these ways:
As a discussion and clarification tool
As a means of establishing boundaries
As context for
o Defining terms
o Defining relationships
o Testing assumptions
If the modeling is done with the business users it makes an excellent blueprint for system
user interface design
Generating a database schema (the definition of the tables in a database) is a potentially useful
byproduct of modeling, but it is a poor sole purpose. My recommendation would be to create
your model for the above reasons (you can call it a conceptual model if you like), then create
another model from that in order to have something to feed the schema generator. The problem
is that the “physical” model that we want in order to generate the schema has to contain a large
quantity of information that is of no use to the business. Additionally, there is never any
differentiation in the physical model between entities and relationships that have business origins
and those that have programming origins. This lack of clarity makes the physical model nearly
useless for the purposes of quality in information.
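As a hedged illustration of that drift (every table and column name below is hypothetical),
compare a conceptual entity with the kind of physical table a schema generator typically
produces. Nothing in the generated table itself distinguishes business origins from programming
origins; that knowledge survives only in comments, if at all.

# Conceptual entity: business meaning only.
conceptual_customer = {
    "name": None,          # business origin
    "credit_limit": None,  # business origin
}

# A plausible generated physical table: business columns plus plumbing.
physical_customer_ddl = """
CREATE TABLE customer (
    customer_sk  INTEGER PRIMARY KEY, -- surrogate key: programming origin
    name         TEXT NOT NULL,       -- business origin
    credit_limit NUMERIC,             -- business origin
    row_version  INTEGER,             -- optimistic locking: programming origin
    created_ts   TEXT,                -- audit column: programming origin
    updated_ts   TEXT                 -- audit column: programming origin
);
"""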
Motivation
Does a human being exist who has not, from time to time, been baffled by a particular
relationship? It must be true. Females the world over roll their eyes when the topic is males and
relationships. The males do the same thing, but for different reasons.
We can imagine many reasons why this may be true. Females live in relationship and the
nuances are critical. Males seem to live life in between relationships—that is, in the “space”
between relationships. Males, perhaps a majority of them, seem to prefer that competition define
all relationship. The “Norwegian bachelor farmer” made famous by Garrison Keillor is the
prototype. Someone so befuddled by the intricacies of relationship that he chooses to forego
them entirely.20
We know, for example, that males and females have different ways of communicating. It is also
true that two people of the same gender may be equally unable to communicate clearly because
they both speak and listen through the filter of their past. When their pasts are sufficiently
different, they find themselves stuck in a frustrating battle to be understood.
Consider that humans populate every continent of the world (save Antarctica) and every type of
terrain and climate. They build homes of mud, sticks, skins, leaves, lumber, steel and cardboard.
They dine on virtually anything that grows or moves. They suffer from common and uncommon
maladies. They come from small families, large families, extended families, single-parent
families, loving families and dysfunctional families.
Given all that variety and diversity, the probability of any two people being able to communicate
effectively without long practice is vanishingly small. But to make matters even more
complicated, mix in gender and insist that the topics to be communicated include emotion and
feelings.
Nearly every human interaction (nearly is only to avoid the quibbling that would result if I said
all) has an emotional overtone that can cause one of the parties to get hooked into an emotion
from their past. These hooks are huge barriers to communication.
We see the same problems in work relationships. Communication difficulties lead Sally to
expect something that Mike is unprepared to deliver. The expectation may be based on
experience with a previous employer, a previous boss, a previous job description, or even family
experience.
20
By the way, I have actually met two such men and visited with them. They were brothers who learned to live with each other from the cradle
and never saw a good reason to include anyone else in their lives. Yes, they were of Norwegian descent and yes, they were farmers. There was
not a single frill in their home which was heated by the same wood stove they cooked on. The walls were covered with newspaper purely as a
way of reducing drafts. The lesson is that any relationship is possible and potentially long-lasting when the parameters have been agreed to by
the parties.
We don’t often think of process-process boundaries or organization-organization boundaries as
comprising relationship, but those regions are, in fact, rich in relationship. We ignore these
relationships in our modeling efforts at great peril to the fidelity of resulting systems. Failure to
treat these relationships in our models leads to frustration, dissatisfaction and abandonment.
Relationship is vitally important in the management of data and information within a business. If
we are unable to relate pieces of information to one another, we won’t be able to create the
comprehensive picture that the leaders need in order to make good decisions.
Relationships come in three different types, which might be called
Chosen: These are the relationships we dream of and seek out.
Mandated: These are dictated to us. They include both family and work relationships.
Constructed: These are the ones we get to define from the ground up.
There is great potential for debate on the question of whether Chosen relationships are not also
Constructed. I want to separate them based on the perception that, while Chosen relationships
may be customized, they already have some form, including a name, roles and basic expectations
defined by social convention, before any party or parties elect to enter. These are broad
characterizations and, depending on your perspective, there may be wide overlap between the
types or classes.
The premise of this chapter is that all of these types of relationship share a common pattern. If
we can gain mastery of one type, we could use that mastery to generate success in the other
types.
Because every kind of relationship involves at least two parties, it stands to reason that every
relationship includes and is dependent on communication. What is the nature of the dependence?
Are certain kinds of communication better than others? This chapter focuses on the
communication that must happen. We’ll take apart relationship to examine its components; the
answer is there somewhere.
The bottom line is that we must find the key areas where communication must happen—for
instance in business process—and adopt a formula that will ensure that complete, necessary
communication takes place.
This rationale may find a sympathetic ear in business analysts and data modelers, but, if your
interest is confined to getting your husband, boyfriend or partner to fill an emotional need in you,
it may be a stretch at this point. Fear not! The same approach will be productive no matter what
the relationship. I promise a payoff for you.
Let’s say that your need is for commitment from your partner. That’s a fairly common
complaint. Many relationships dissolve when one person is unable to feel commitment from the
other(s). This is no less true in business relationships.
We’re going to treat each other like adults and we’re going to be honest with ourselves and with
each other as we move forward. Relationship, whether business, personal or data, is NOT
dependent on emotion or feelings. Rather, relationship is about choices. I’m not going to pretend that
I understand how a woman thinks or feels, nor will I pretend that I understand how any given
man thinks or feels. That simply isn’t necessary.
We’ll assume here that we all have the same basic kinds of needs as defined by Maslow
(Maslow, 1943) and the physical survival needs are being met.
The program that we’ll follow is to tease apart relationship at various phases of its lifecycle.
We’ll examine
1. Relationship formation
2. Relationship growth and maturity
3. Relationship dissolution
We’ll also look at what it takes to get a usable status report at any point in time during the
lifecycle.
Finally, we’ll use interpersonal relationships as the specimen to be dissected. We do this
because these are the most familiar. If we can gain a better understanding of personal
relationship and can see that business relationships are no different, we’ll finally be in position to
create some real improvement. We’ll extend the lessons out to other types as needed to ensure
that the model stays consistent.
Everyone has experience with types of personal relationships, which makes them ideal examples.
Business relationships (Mandated and Constructed), especially the kinds that are documented in
databases, have more formal overtones that require only slightly different handling.
Some Examples
From here on, we’ll use a simple notation to represent relationships. In simplest form, a
relationship is a link between two people (or things including events, but we’ll stick with people
for now). Below is a representation of a relationship.
The horizontal bar in between the two boxes represents the relationship. Of course, this diagram
says nothing at all about the nature of the relationship and doesn’t acknowledge the possibility of
more than one relationship between me and you.
Here is another diagram (a simple model) that possibly conveys a bit more information.
An unmarried woman who is a romantic will immediately recognize some possible names for
this relationship including meet, find, and marry.
It’s worth brief consideration here to wonder about “someone” and how he or she might name
the relationship. “Why?” you may ask. Remember that question.
Let’s assume that no one wants to “trap” anyone else into any relationship. Such a relationship,
involving an unwilling or unknowing party, could not produce positive results. Are you still
with me?
So, if we accept this and want to build a relationship involving willing and even enthusiastic
partners so that we can improve our lives and theirs, then we must understand what the building
blocks are. If we do not, then we run a very real risk of creating what may be perceived as a trap
or snare and resented as such.21
What are the properties of a relationship that we can sense? How are these sensations perceived
by those involved?
Modeling
The modeling of relationships using a graphical notation took off in 1976 with the publication of
Peter Chen’s article (Chen, 1976) on Entity-Relationship Modeling. It came out at the perfect
time to help fill a perceptual gap between adherents of the new (at that time) “relational”
database structure and the established network or CODASYL22 structure.
One of the attractions of the relational structure was its mathematical completeness. Operations
on “relations” are predictable mathematically. Of course this had immense appeal to academics
of the computing world, not least because it allowed for the construction of a “query language”
with operations similar to addition, subtraction, multiplication and division.
The ER notation suggested by Chen drew attention because a diagram could be turned into
(“mapped onto”) a relational schema (database file structure) in a very straightforward way.
ER notation used the mathematics of set theory in such a way that a relational schema derived
from an ER model diagram retained all of its mathematical predictability.
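For readers who would like to see that predictability at work, here is a toy sketch in Python (the
sample data is invented) that treats relations as sets of tuples, with operations as well behaved as
ordinary arithmetic.

# A relation is a set of tuples; set algebra gives us predictable operations.
employees = {("Ann", "Sales"), ("Ben", "Sales"), ("Cal", "IT")}
managers  = {("Ann", "Sales")}

union        = employees | managers                       # like addition
difference   = employees - managers                       # like subtraction
sales_people = {t for t in employees if t[1] == "Sales"}  # selection
names        = {name for (name, dept) in employees}       # projection

print(union, difference, sales_people, names)

Whatever the contents of the sets, the results of these operations are fully determined, which is
the mathematical completeness that so appealed to the academics.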
One of the features of Chen’s graphical notation was that relationships had their own
two-dimensional symbol (Chen, 1976). This gave relationships weight equivalent to entities; it
made them first-class elements of the model. Over the years, though, the notation and its use
shifted in several ways.
21
The format of these two statements is known as a value proposition and I acknowledge the contribution of Gwen Thomas who first showed me
the power of this construct.
22
“Network” and “CODASYL” are distinguished from relational data management in that their implementation is not portable. All access to data
is accomplished by means of programs. There are primitive commands that go with the model, chief of which is “next” or “get next”.
It was streamlined so that more information could be conveyed with fewer symbols (less
ink)23.
Emphasis shifted to the automated generation of program code from diagrams. This was
called CASE (Computer Aided Software Engineering) and included several modeling
notations including Process Flow, Data Flow and ERD, each of which was often provided in
multiple formats to fit your favorite development methodology.
Emphasis shifted from processes to data and an entirely new approach to software
development emerged called Information Engineering. This was really only made possible
because of ER modeling.
The sands continued to shift resulting in the dominance of data organization and management
concerns and the relegation of process to the design of the user interface.
As we attempted to model more and more complexity, diagram real estate became increasingly
valuable. It turned out that the relational completeness of the diagram apparently didn’t need
anything substantial from a relationship. In fact, a single line connecting them could express all
relationships between two entities. This perception, though untrue, has had a profound effect on
both data design and data management. It is not enough to simply link two tables of data. The
amount of meaning (and therefore quality) that can be extracted from such a simplistic
“relationship” is absolutely trivial. Many of the issues that have seemingly popped up since then
are a direct result of trivializing relationship.
With that change, the meaning of relationship was gone for good. Relationship had become
merely the means by which a record or set of records in one database table could be linked to a
record or set of records in another table. This was all very well for database administrators, but it
has seriously hampered the efforts of those who persist in their attempts to model the world that
an information system serves rather than the system itself.24
All of this change was driven by the folk wisdom that “processes change, data doesn’t.” While I
have uttered these words myself on many occasions and I still believe them, if pressed hard we
have to admit that it is the kinds of data that don’t change. In other words, we have found that
this bit of folk wisdom, like most, has a grain of truth at its core, and that failure to comprehend
the limits of that truth has led to widespread misapplication.
The individual pieces of data are highly subjective in their qualities and therefore quite
changeable. When the new CEO assumes control, it isn’t just the organizations and processes
that change, but the very definition of key pieces of data.
I won’t bore you to death with additional history. The net result of these tides has been that the
relationship itself has virtually disappeared from the practice of ER modeling. I suspect that if
females had been involved in greater numbers while all this was going on, the result may have
been a bit different. They would not have been so quick to devalue relationship.
23
Edward Tufte (Tufte, 1997) advocates making all visual distinctions as subtle as possible while still being clear and effective. “Less ink” is a
rule of thumb in the quality of a representation.
24
This raises the question, “Why Do We Model?” This question was the subject of at least one doctoral thesis (Simsion, 2006). It seems that
design rather than description is the purpose for modeling data. Beyond this, however, is the question of what to model. It is argued that the
greatest design benefits result from an approach which begins by modeling the portion of the world into which the desired system will fit
(descriptive modeling). We then extract the portion of the world directly affecting or affected by the system and, within that [descriptive] model,
we design the data structures for the system itself. For a summary of modeling whys see (Simsion, 2005).
The only reason that relationships still exist at all in the typical model is referential integrity:
the ability to automatically generate database procedures or triggers that ensure two files in the
database can be successfully linked. For those of a pragmatic nature, that is enough of a
reason.
Unfortunately, for reasons that will soon be demonstrated, successfully linking two or more files
may not provide the basis of a sound business system. Referential integrity is not enough.
Let’s look at a simple example.
Consider the family tree (or pedigree). Many are interested in this and would like to manage the
data for various reasons. We identify three kinds of things that will be important: person,
parent, and child. You may notice that parent and child are also persons. This is true but, as
you will see, in our database they are relationships.
I made it sound as though family tree and pedigree were equivalent. They are not, although they
share many characteristics. A Pedigree is only concerned with genetic relationships while a
family tree is concerned with family relationships (which need not be genetic, as in the case of
adoptive children). Because we’re preserving our options, we want to make our database serve
both purposes.
Person
This is a table of Persons and contains attributes of a person. These attributes are pieces of
information (data) that we might want to associate with a Person. Note that we relate Person to
itself by including a field, MotherID, which is a pointer to another row in the table.
There are many problems with this “relationship” which can be expressed in ER form in the
notation at left or, with slightly more meaning, in the notation at right.
What about the Person instances for which MotherID is unknown (null)? For a family tree we
can fill in the PersonID of virtually anyone. “The Person thought of as Mother” would be an OK
definition or name for the attribute. For Pedigree purposes we’ll need more rigor. We know that
every Person is linked to a female Person who gave birth. Knowing this rule doesn’t necessarily
help us to populate our table. For one thing, how do we interpret null values for MotherID? For
another, how do we know whether the Person referred to by MotherID is the “birth mother” or
“Person thought of as Mother”?
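A minimal sqlite3 sketch (the column names beyond PersonID and MotherID are my own, for
illustration) shows just how little the naive structure can tell us.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT,
        mother_id INTEGER REFERENCES person(person_id)  -- Person related to itself
    )
""")
con.execute("INSERT INTO person VALUES (1, 'Ann', NULL)")  # unknown? none recorded? why?
con.execute("INSERT INTO person VALUES (2, 'Ben', 1)")     # birth mother, or "thought of as Mother"?

The database accepts both rows happily; neither of the two questions above can be answered
from the data alone.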
How can we redraw our model so that we can answer these questions? We first recognize that
we need to define some attributes of the relationship itself. We are naturally led to the notation
at right above, which actually gives us something to which we may attach attributes. We could
treat the Mother relationship as just another entity, and this is the path our tools force us to take.
In ERwin, a very popular “data modeling” tool, we can tell the schema generator to generate all
relationships as entities. Another option is to generate only many-to-many relationships as
entities.
Do you see how the nature of the relationship itself is becoming obscured? The DBA only wants
to make sure that referential integrity is defined. Every time a Mother is inserted into our
database, we want the dbms to ensure that a corresponding Person record exists. What about the
reverse? If we want to be assured that (for our Pedigree) every time a Person record is inserted
into the database, a Mother record gets created, then we make the relationship mandatory. If we
must create a Mother record, what attributes will be required? If we make the MotherID a
required field then every time we insert a Person, we must be prepared with the PersonID of the
Mother. Probably we can’t guarantee that but that ability depends on the “business processes”
surrounding the system. Finally, what should happen if the Mother record is deleted or updated?
Do we want the dbms to automatically apply updates made to Mother to corresponding Persons?
What if we delete a Person record or a Mother record? If we find we have made some mistake
and we delete a Person, do we also want any connected Mothers to be deleted? If we delete a
Mother, do we want all Persons for whom she functions as Mother to also be deleted?
In many cases the business processes are not consistent enough to allow the dbms to
automatically do anything. Well then, if we can’t do it automatically, what are the options? We
either have to pretend that these needs will never trouble us or we have to build logic that will
prompt us when/if they do emerge and then carry out our desires, whatever they may be, at that
time. Knowing what we now know about motivations and economics, which—if any—of the
choices will be made? Do you think that knowledge of those business processes should be part
of what we know about the relationship? How can we get that to happen?
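One way to make these decisions explicit (a sketch, not the only possible design, and all names
are mine) is to promote the Mother relationship to its own table, give it an attribute, and declare
the delete behavior we have chosen.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
con.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""
    CREATE TABLE motherhood (
        child_id  INTEGER NOT NULL REFERENCES person(person_id)
                      ON DELETE CASCADE,   -- deleting the child drops the link
        mother_id INTEGER NOT NULL REFERENCES person(person_id)
                      ON DELETE RESTRICT,  -- refuse to delete a linked mother
        kind      TEXT NOT NULL CHECK (kind IN ('birth', 'adoptive')),
        PRIMARY KEY (child_id, kind)       -- one birth and one adoptive mother per child
    )
""")
con.execute("INSERT INTO person VALUES (1, 'Ann')")
con.execute("INSERT INTO person VALUES (2, 'Ben')")
con.execute("INSERT INTO motherhood VALUES (2, 1, 'birth')")

Whether CASCADE, RESTRICT, or something else is right is not a database question at all; it is
a question about the business processes surrounding the system, which is precisely why that
knowledge belongs with the relationship.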
If we can, it is just possible that a substantial amount of stress can be lifted from our lives. It’s
also possible that we can relieve similar stress on our data systems.
Beginning with the properties of the relationship class itself, we note that it has both a
description and a purpose. The description will lay out the basic ground rules as in: A baseball
team will consist of a sufficient number of members to compete under the rules of the game of
baseball (league rules being the final authority) with a maximum of nine active players and a
sufficient number of reserve players to guarantee that the required minimum number is always
available. You might wish to add additional description regarding the various positions and
skills that must be included or suggest a set of roles without which the baseball team
(relationship) will be incomplete. That’s up to you as a user of the pattern.
When a new member joins the team, he or she will accept a role and subscribe to the purpose.
The purpose is an essential part of the relationship since, without agreement as to purpose there
will never be agreement as to role expectations. For example, a team whose purpose is to return
a profit to the owner(s) will have far different expectations than the team whose purpose is to
win a championship.
Even in a relationship such as marriage
def: Marriage is a lifetime commitment to join the lives of two persons.
it is critical that the parties agree on a purpose at the time the relationship is instantiated. If one
person’s purpose is to raise a family and the other’s is to live a life of romance, some negotiation
is going to have to take place in order to make the relationship last. Clearly, we have pared the
definition down to the bare bone, and not all would agree with it as it stands here.
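For the modelers among us, the class-level properties just described can be sketched directly.
This is a hedged illustration; the field names and sample values are mine, not a prescription.

from dataclasses import dataclass, field

@dataclass
class RelationshipType:
    name: str
    description: str               # the ground rules
    purpose: str                   # what the parties agree they are for
    roles: list[str] = field(default_factory=list)

baseball_team = RelationshipType(
    name="baseball team",
    description="Enough members to compete under league rules: nine active "
                "players plus sufficient reserves.",
    purpose="win a championship",  # a profit-driven team would differ here
    roles=["pitcher", "catcher", "manager"],
)

marriage = RelationshipType(
    name="marriage",
    description="A lifetime commitment to join the lives of two persons.",
    purpose="raise a family",      # the other party may answer differently
    roles=["spouse", "spouse"],
)

Two instances of the same relationship type with different purposes will, as the text says, carry
far different role expectations.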
One further example will help to show the importance of purpose and definition to a relationship.
Nearly all of us have experienced an attraction to another person. This is almost always
accompanied by anxiety, uncertainty, indecision, and sometimes outright fear as we realize that
this one is different. What are the questions we need answers to?
Name
o What name shall I give to this?
o What name is he/she giving to this?
Role
o What shall I call myself in relation to him/her?
o What shall I call him/her in relation to me (friend, boyfriend…)?
Purpose
o What do I want?
o What does he/she want?
The answers to these questions suggest expectations and obviously, if the parties are giving
incompatible answers, their expectations can’t be aligned. Without serious negotiation, the
relationship is doomed. Every expectation must be answered by a corresponding intention, also
known to both parties.
Our pattern denotes the many-to-many situation that exists with respect to relationships and roles
by means of an entity type called ROLE-EXPECTATION. Every relationship involves the
expectations of the participants. When expectations are clear everything can flow along
smoothly. When they are not clearly articulated or when they are not put into words at all, the
parties have virtually no chance to create a positive and productive relationship.
A specific entity (an instance) may be filling one or more roles in various relationships. A role is
filled by one or more entity instances at any point in time and over time. We all understand that
the role of husband can have a history over the lifetime of a marriage relationship. Similarly, the
role of buyer in a contractual relationship may be held by multiple individuals at any time and is
almost certain to be filled by more than one person over the lifetime of the contract.
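A brief sketch of the ROLE-EXPECTATION idea follows; the catalog entries are invented for
illustration. Each row ties one role, in one relationship type, to one stated expectation, which is
how the many-to-many is resolved.

from dataclasses import dataclass

@dataclass(frozen=True)
class RoleExpectation:
    relationship: str  # e.g. "marriage", "contract"
    role: str          # e.g. "husband", "buyer"
    expectation: str   # stated, not assumed

catalog = [
    RoleExpectation("marriage", "husband", "anticipate unstated expectations"),
    RoleExpectation("contract", "buyer", "pay within 30 days of invoice"),
    RoleExpectation("contract", "seller", "deliver goods as specified"),
]

def expectations_of(role: str) -> list[RoleExpectation]:
    # The same role may appear in many relationships, and a relationship
    # carries many role expectations: the many-to-many the pattern denotes.
    return [e for e in catalog if e.role == role]

print(expectations_of("buyer"))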
The complete picture of any relationship consists of potentially many roles. Much as we
appreciate simplicity, it also seems likely that a given role might be found in more than one
relationship.
For example, BUYER might be a role in a retail relationship, a real estate transaction or in a
contractual relationship. To bring this home, MAN or MALE are roles found in many
relationships, as are WOMAN and FEMALE. An important aside: does your model include one of
these as an entity? These roles are often confused with entity types and may often be found as
categories. Do you see how the MALE and FEMALE roles are different from the values for gender
that we might assign to a PERSON entity type?
You will note that the “traditional” line segment(s) are used here only to convey information
about cardinality and optionality. A line segment, no matter how richly adorned with avian feet,
bars or circles—even labels, cannot convey the meaning of a relationship. At best it can only
convey the idea of a relationship, which must be fleshed out if we are to make our model really
useful.
We all know that expectations are part of any relationship. Each role expects certain behaviors
from the other role(s). Speaking only for myself, this is the source of frequent irritation and
occasional unhappiness. It isn’t the expectations we know about that cause problems, it’s the
ones we don’t know about. In our personal relationships we can’t hope to get all of the
expectations out in the open up front. In fact, one of the expectations is that you will learn to
anticipate my expectations. It sounds pretty hopeless doesn’t it? I wonder how many would
choose the relationship knowing this one up front. Think of the typical data quality breakdown.
Isn’t the issue that we (or someone) failed to anticipate the expectations of someone else?
Our pattern says to expect expectations, and that’s the benefit of the pattern. It’s only when we
don’t have any idea that there will be expectations that we get into serious trouble. Knowing the
pattern allows us, at a minimum, to be alert in situations that can be expected to involve
expectations.
It would be unreasonable to think that all of this could be negotiated at the inauguration of a
relationship. Relationships (and their expectations) evolve over time. Having the relationship
pattern in mind helps us to be alert to hints of changes and should prompt a renegotiation
(discussion) when changes are noticed. In the personal realm, these discussions should satisfy
the “working on” expectation.
We might also notice the idea that someone who is not a party to the relationship may be affected
by it in some way. In the case of a marriage relationship, we will have to introduce in-law roles
and manage those expectations, but what about bystander roles?
A pure bystander will not be in any formal relationship with our parties and for that reason
cannot be allowed to have any expectation regarding the relationship. They should only be
considered to the extent that the “bystander” entity is involved with one of the parties in yet
another relationship of some kind. For example, my boss and I share a relationship of our own,
and our roles might include expectations regarding marriage relationships to which one or the
other of us is a party.
It is critical to note at this point one other essential difference between a pattern and an instance.
We have to remember—and this cannot be over-emphasized—that the pattern describes a class
of relationship instances. This means that what is being documented is any instance within the
class. What we are doing is generalizing.
We have been warned about generalization since childhood and yet we find it so useful that we
risk the problems that generalization may present in order to reap the reward of being able to
simplify the way we deal with the world. Be warned: although all marriages share some critical
roles and the expectations for those roles are formalized to the extent that audiences of thousands
will laugh uproariously at a joke that references one of the expectations, it is still dangerous to
rely on that expectation in your particular instance. It certainly aids in negotiating expectations,
though, to have a starting point already defined.
This is one of the ways in which the first class, Chosen relationships, differs from the other two
classes. As we reflect further on the three classes of relationship, we will see that as we move
from Chosen through Mandated to Constructed, conformance to expectation becomes more and
more the norm. We will also see that if we handle our Constructed relationships the same way
we have been accustomed to handling our Chosen ones, we will experience the same kinds of
turmoil in our work life that we have in our personal life.
Obstacles to Relationship
Many women believe themselves to be experts in the practice of relationship, but have no ability
to be honest with themselves. How could they? They never had a model for open, vulnerable
and honest communications.
Most men, on the other hand, prefer to be oblivious to the needs of relationship. We may not
even understand that there is a relationship until we begin to experience some of the effects.
Perhaps we become the object of anger when we meet a friend at the Home Depot and spend the
rest of the day on the golf course. Maybe we come face to face with what can only be
recognized as disappointment directed at us and we have no idea concerning the cause of that
disappointment.
We simply allow our “significant other” to dictate the rules. Our SO tells us that it’s our fault
and if we would only be different, everyone would be happy.
There is a saying that is old and trite, and it holds a lot of truth: “A woman picks a man hoping
he’ll change and a man picks a woman hoping she won’t [change].” What if this is literally true?
What problems could it cause? What is the best way to deal with this before it becomes a
problem for my relationship?
Come with me for a minute while I take your situation out to the extremes in order to illustrate
this point.
Imagine for a moment that you are from a mountain village in the Himalayas, have never even
seen anyone whom you haven’t known from birth, never seen a light switch or electrical outlet,
nor anything printed and that you wake up one day in the middle of a desert. You are in the
company of a Someone who has lived a life similar to yours in its isolation but in the midst of the
Amazon basin.
That’s a picture of most relationships. You find yourselves “together” for some reason that is
not immediately apparent and you know that your life could be better or at least easier if you
could rely on the other. Can you imagine that your odds of surviving are vanishingly small if
you try to do it on your own in a world full of things that you don’t understand?
You try to talk to the Someone but it results only in confusion and frustration for both of you.
You wonder if it’s worth it. Then you look around you and see an unforgiving and bewildering
emptiness in every direction.
The idea of turning your back on the Someone and simply walking away loses its appeal. You
can clearly see yourself withering to insignificance along that path. What are your choices then?
Some of us find that the frustration and discontent are too much to bear and we do turn our backs
and walk away. We hope that somewhere in that emptiness is a Someone who will be able to
communicate with us. Maybe the Someone agrees with this decision—or maybe not. Maybe
s/he follows at a distance, staying close, but not too close, hoping not to be left alone.
Others of us can’t stand the thought of being alone in the world. We resolve to make the best of
a bad situation. We contribute what we can and grudgingly accept the contributions of the
Someone, even though we don’t like what s/he has to offer. We never learn to communicate
although we do learn to get along. We develop habits and rituals to take the place of
communication. We coexist.
Then there are those who throw themselves into the creation of a relationship. They eagerly
listen to what the Someone has to say and eventually learn to communicate fluently in the
Someone’s language. They find valuable traits, abilities and knowledge in the other. They offer
themselves freely and learn how their own traits, abilities and skills complement those of the
other. Together they find that they can do things that neither could do alone. They survive and
then they thrive.
Of course the variety of approaches to relationship includes many more than these three, but we
can think of these three cases as two near the extremes and one somewhere in the middle.
How do we bring this example back to the workplace? The work situation isn’t really that much
different from the personal/social example. We accept a job and show up full of hope and
expectations. After a few days or weeks (or maybe a few hours) we suddenly realize that we
aren’t where we assumed. We realize with a sense of panic that we might be in completely
unfamiliar territory.
We look around at our co-workers and see for the first time that they each have their own hopes
and expectations and that some of those are pretty radically different from ours. Some of these
co-workers are part of our team and it becomes apparent that the team may have a different
understanding of where we’re headed than we do.
The three options are still available and the only real difference is the level of emotion involved
(although some of us can create a lot of emotion, often called drama, in any given situation).
What we often find most difficult to do is to simply commit to the relationship and request the
help of the other inhabitants to understand where we are, where we’re headed, and what resources
we have available to get there.
What about an inter-department relationship? How might the expectations of the Sales
department change? Will the intentions of the Fulfillment department ever change? Do any of
these changes affect Shipping and Receiving or the Warehouse? How do these changes impact
Management expectations? Can you see why these deserve attention? How much attention are
they currently getting? Management expectations change and new reports (or new dashboards)
are requested. Does Fulfillment get new dashboards to help their intention to meet management
expectations (goals)?
We are generally content if we identify a foreign key (order_number) that is included in an
invoice record or in a shipment record so that the order can be linked to the invoice and/or to the
shipment. That simple tactic conveys nothing about the promises made by Sales or the
expectations of the Customer.
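As a hedged sketch (the table and column names are assumptions, not a prescription), compare
the bare foreign key with a relationship table that records what was actually promised.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE "order"  (order_number INTEGER PRIMARY KEY);
    CREATE TABLE shipment (shipment_id  INTEGER PRIMARY KEY,
                           order_number INTEGER REFERENCES "order");

    -- The relationship itself, with the expectations made explicit:
    CREATE TABLE fulfillment_commitment (
        order_number         INTEGER REFERENCES "order",
        promised_date        TEXT,  -- what Sales told the customer
        promised_by          TEXT,  -- who made the promise
        customer_expectation TEXT   -- what the customer believes was agreed
    );
""")

Referential integrity holds in both designs; only the second has any hope of telling us whether
the promise made by Sales was kept.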
The Spectrum
In case you aren’t familiar with the idea of a spectrum, let’s take just a few seconds to create a
mental image. Even if you already know what a spectrum is, it’s still useful to make certain that
we’re all on the same mental page.
In order to see a spectrum, we need a prism. In the most familiar example, water droplets in the
air act as a prism, bending sunlight so that all of its color components are visible. This is a
spectrum and we know it as a rainbow. In general, we use the term spectrum to give the idea that
something is not necessarily as homogeneous—which means having a single
structure or composition—as we might be tempted to think. Don’t worry; we’re not going to
go into a detailed explanation of the science. In fact, while it’s good to remember that most
things are not as simple as they seem, it’s also very useful to remember that it isn’t necessary to
go into a lot of detail about how and why when we can get quite far on knowing what.
Imagine then that relationship, like sunlight, has components that require something like a prism
to see. What would make a good relationship prism? I nominate time.
If we look at a relationship (or even a type of relationship) over a period of time, we will see its
component states and events spread out for inspection.
(Figure: Time as a prism)
Some of the names that we might choose for these states will fit, while others may be
inappropriate for the same reasons. While anyone,
regardless of gender identification or orientation or age or experience can probably describe this
relationship, they may use different words to identify the events or states of the relationship.
You don’t need to have experienced this kind of state or even this relationship to discuss it or to
talk about your expectations at a given point in the spectrum.
The important thing is to see that this is a single relationship extending over an interval of time.
A core set of expectations is involved throughout the lifetime of the relationship. Other
expectations change or evolve. Is this kind of relationship unique, or even unusual? It takes
only the briefest of reflection to realize that any personal relationship must change as its
inhabitants change.
Type or Instance?
So far, we have been discussing relationship types rather than instances. Types of interpersonal
relationships include friend, spouse, sibling, parent-child, family, neighbor, babysitter, teacher-
student, teacher-parent and so on. We can understand without explanation (intuitively) that I
may have a sibling (type) relationship with five others (in my case), but my relationship
instances with each of those five are distinct and unique.
Type is another kind of abstraction that is extremely important. You may have heard of
something called “object oriented” in the context of system development. In the earliest days of
computing, there was but one type, the binary integer. It took almost no time to realize that one
type just wouldn’t work. There are many kinds or types of numbers in addition to integers. To
refresh yourself regarding the history of types in computing, you may want to take another look
at the Old Testament, Book 6, Machines and Logic. As we continue to work at getting
comfortable with the notion of types of relationships versus instances of those types, it’s time to
set aside instances as a topic of discussion. Our goal is not to dissect any specific
relationship instance. What we want to do is to learn enough about the basic building
blocks of any relationship to be confident that we can take one apart and put it back together—
better than it was before.
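In programming terms the distinction looks like this; the sketch is minimal and the names are
mine.

# One type, defined once; five distinct and unique instances of it.
class Sibling:
    def __init__(self, name: str):
        self.name = name  # a property that receives a value per instance

siblings = [Sibling(n) for n in ("Ann", "Ben", "Cal", "Dee", "Eva")]

print(all(isinstance(s, Sibling) for s in siblings))  # True: all the same type
print(len({id(s) for s in siblings}))                 # 5: each instance distinct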
(Figure: Another perspective)
Maybe you have considered this relationship. What is my relationship to the world? How do I
fit in? It is useful for each of us to spend some time considering this relationship. This really
isn’t a good example for exposing any of the building blocks—except one. No relationship can
be disassembled without acknowledging the perspective from which we are working. We can
recognize this relationship while recognizing, too, that “the world” has no perspective that is
useful to the discussion. Also notice that switching the position of the parties makes no
difference. What we take away is the realization that my relationship to the world is entirely
dependent on my perception. My perception of the world (which is really a class) will affect my
relationship with any instance from that class. In the same way my perception of
manager or sales or customer will affect my relationship with any individual from one of those
classes. We easily recognize this law in our lives and we have a name for it: prejudice. Clearly
prejudice has its own spectrum and is inescapable. It’s one thing to act out of prejudice as a
reflex and another to stay aware of the prejudice as we decide what our actions will be.
Before we start the dissection, a quick review of terminology used so far will be useful.
Remember that communication is one of the keys to a successful relationship, so it’s vitally
important that we continue to speak the same language as we move forward.
We are still taking the inhabitant perspective. The inhabitant is one of the parties involved in a
relationship and perspective is the point of view of one of the inhabitants.
A type or class is like the form or mold used to define all instances. A type contains the features
or properties shared by all instances. An instance is a specific case of a type in which the type’s
features or properties all have values—in other words, every named property has been filled in.
Our goal is to get you into a position such that, if you select the type of relationship you want to
inhabit, you can feel capable of managing the parts that are your responsibility and offering
support in the parts that aren’t your responsibility. We all recognize—don’t we?—that all
inhabitants of a relationship succeed or fail together.
Let’s just repeat that before we proceed:
All inhabitants of a relationship succeed or fail together.
The figure below is the simplest possible depiction of a relationship. It shows two parties
connected by something. We call the something a relationship.
(Figure: Entity A connected to Entity B)
This diagram provides no information about the nature of the relationship. It could represent
“farmer milking cow” or “man committed to woman” or “customer complaint about product.”
We need to get more meaning from our diagram.
Women like to think of “working on our relationship.” By this is meant something very
complex. They seem never to tire of discussing relationships. Men, on the other hand, seem
entirely willing to ignore the complexities of a relationship, preferring to simply adapt to or cope
with each nuance as it presents itself.
To males, a relationship is frequently something that someone else is talking about. It is often a
mystery. To females, a relationship is something that is intuitive and real, though it may still be
largely mysterious and elusive.
You have already received an introduction to the idea that a relationship can be represented in a
diagram. A simple pictorial representation has been developed for documenting and analyzing
the nature of the relationships between entities. From now on, I am going to use the words entity
and party interchangeably to represent a person, place, event or thing that may be involved in a
relationship type.
A picture of a relationship might look like those below. This picture simply says that the Entity
on the left is related to the Entity on the right. So far, not very useful. We need to know—even
if we are male—more about this relationship in order to bring it into our life and use it.
(Figures: woman related to man; Sally related to Mike)
The first way to give meaning to a relationship is to name the entities involved. If we add names
to our example we might get something like the Sally-Mike diagram.
I’m still not satisfied that I know what the relationship is. Many different relationships suggest
themselves. This is the point in interpersonal relationships where uncertainty and fear enter the
equation.
(Figure: Relationship Examples)
woman married to man
woman engaged to man
woman reports to man
woman supervises man
woman sister of man
We find it very difficult to name a relationship most of the time. Is the relationship one of
friendship or is it love? Is there a difference? If we don’t name it, we’re done. We can go no
farther. Think of the name as a handle that we can use to begin “working on” the relationship.
This is the premise that lies at the foundation of this book. First, that it is frequently difficult to
put a name on a relationship and second, that naming it is essential for understanding and the key
to getting the most from the relationship. Naming all of the parts is essential if we are to derive
any benefit from the relationship. Are there additional parts that we haven’t met? Follow me.
Those who model relationships between things for a living often don’t. That is, they claim and
may even believe that they are modeling entities and their relationships, but their models often
go no deeper than the examples above.
Nearly all of their efforts go into understanding the entities themselves. Don’t start to feel
superior now. That is apparently the most natural approach. In fact, in our own personal
relationships isn’t it true that it may be impossible to actually name a relationship without first
getting to know ourselves? If what I really need is someone to cook meals for me, can I enter
into a relationship named “love” with any expectation of mutual benefit? Possible, maybe, but
it’s a distinctly improbable outcome.
In general, relationships are the most complex things that we deal with on a daily basis.
Successful relationships are the foundation of everything important that humans and their virtual
counterparts, corporations25, do.
We work hard to gather up all of the attributes or properties of an entity. We can learn her name,
her phone number, her address, her birthday, her astrological sign, her dress size, her social
security number, her driver’s license number, her height and weight—we can record every
possible fact about her but, without naming the relationship, we will never know what to do with
all of that information.
From these examples you can see the huge increase in meaning that comes from naming the
relationship. In the figures on the preceding page we see ambiguous relationships. This is the
kind of relationship in which many people spend their lives. If we happen to know Sally and
Mike, the unnamed relationship is not quite so ambiguous. We can imagine some potential
names for it simply because of the additional contextual clues provided by the entity names.
It is generally frowned upon to belabor a point, beat a dead horse, preach to the choir, or carry
coal to Newcastle, but the idea that relationships must be named is so critical that it’s worth the
risk.
We have discovered the first essential property of the relationship type. Relationship instances
have a name.
25
A recent candidate for the US Presidency said that “corporations are people.” This is true in the courts but it is clearly dangerous for any
individual human to think that they can behave as a corporation does.
Two Names are Better than One
[Figure: relationships named from both sides — woman “married to” man; woman “engaged to” man; woman “reports to” man / man “manager of” woman; woman “supervises” man / man “supervised by” woman; woman “sister of” man / man “brother of” woman]
Isn’t it frustrating when you have named your relationship but he has a different name? If you
have felt that frustration, then you understand perfectly the rule that a relationship instance has
two names.
In fact, a relationship is composed of two equal and opposite parts that need to be in harmony.
Each participant sees the relationship differently. Each may have a different name for the con-
nection that exists between them.
When we acknowledge, understand and accept both names, then it becomes possible to exploit
the relationship. The word exploit is used here after much thought. The word tends to have
negative connotations today, but it means simply “to utilize productively.” That’s what we’re
after. We want the relationship to work for us—to produce what we need from it.
The relationship below looks much like the middle one in the previous diagram except that one
of the relationship names has been changed. What is the difference between the two? Mentors
gives a much different impression than manager of. Which one is true? Which one gives more
information about the possible interactions? Does it make any difference whether the man is
doing the mentoring or the woman is?
In our relationships with others, it is the need for both parties to provide a name for the
relationship that gives rise to so much difficulty. When one person says “I love you” and the
other responds with “I like you a lot, too” there is very clearly a mismatch. The labels we assign
are much more important than Shakespeare would have us believe.
When Romeo says, “What’s in a name?” his commentary is about the importance or lack of
importance of names to entities or parties. His friends and family have made a big deal out of
family names. In their world, the relationships available to a party named Montague could
include a party named Capulet only if the nature of the relationship was negative—one of
enmity.
Capulet can have nothing to do with Montague. Romeo has already made up his mind to ignore all the history and present tension around those names. He is determined that, despite all of that, Romeo may love Juliet and expect love to be returned. He comes up with a good example, the rose, that carries the day.
Shakespeare goes on to show us, however, that Romeo is wrong. Naming is about recognizing the essence of a thing. We ignore names at our own risk. The names Capulet and Montague, with
all their history, caused problems that made appropriate communication impossible. Lack of
communication led to misunderstanding and the deaths of both Romeo and Juliet.
To communicate the essence of something, we can apply a name that both describes and evokes
the essential properties of the thing. When we say “rose,” we create the expectation of a form
and an aroma. We are left to wonder only about color or developmental stage (bud or full
bloom).
What is the essence of a relationship? Friend, boyfriend, fiancé, husband, enemy,
victim—all are possible names for a given relationship. Each different name leads us to expect
something different. The existence of a name provides a handle that can be used to begin the
work but it is vital to understand the essence of a relationship before we can give it the proper
name. If we settle on a name too quickly, we run the very real risk that the set of expectations that the name evokes will not be acceptable to both parties. Often we choose a non-threatening name first and then negotiate the real name. Because females are so much more adept at this than males, they may be willing to take advantage and maneuver the male party into agreement long before the understanding arrives. Remember what was said earlier about
entrapment?
Romeo is right when he tells us that the characteristics or properties of something don’t change when we use a different name. The relationship(s) between the Capulet and Montague families do not necessarily govern the relationship(s) between Romeo and Juliet. Please note, however, that
while Romeo and Juliet are free to establish their own relationship, they are never entirely free
from the relationships between their families.
Romeo misleads us terribly when emotions and expectations enter the picture. Of course, we
would all be much better off if we could rid ourselves of any expectations in our relationships,
but in our world when someone says, “I love you,” for example, she—especially the first time—
hears something like, “I’m going to adore you, put you on a pedestal and satisfy your every need
for romance…”
He may hear, “I’m ready to take care of you and be there when you come home.” Unfortunately,
when I say “I love you” I may actually believe that I’m prepared to meet that expectation. This
is a good time for a reminder. Each party in a relationship has to work hard at honesty and not
just with the other party(ies). It’s equally hard to maintain honesty with yourself. But, if we’re to be successful at relationships, honesty is crucial.
All of this obviously applies to personal relationships, but what does it have to do with data and
information?
Have you ever told someone what you thought they wanted (or needed) to hear? How did that
work for you? For them? Do you think that the problems may have arisen because your
relationship wasn’t what you thought it was?
Anyone who believes that the quality of data and information is independent of the humans
involved is certain to experience a life of frustration and disappointment. On the other hand,
those who work hard at the relationships, cultivating honesty and understanding expectations,
will experience far less of that frustration and disappointment.
Advanced Relationships
Very few relationships ever proceed to the advanced stage. If you have been married for 30, 40,
50 or more years, it doesn’t mean that you have developed an advanced relationship. You may
have become co-dependent over the years. Or you may have remained in that relationship only
because any other choice involved too much fear or pain.
In the case of an organizational or architected relationship, it may persist long after it has ceased
to add value, simply because no one accepts ownership or the architect’s role. In many cases, the
inhabitants each perform according to the expectations they have come to understand through
habituation. Each person measures success by the degree to which they are satisfied with what
they receive from their supplier and the apparent satisfaction of the person to whom they deliver.
The problem in these relationships is that there is not sufficient personal pain or disruption to the
organization to trigger a desire for change. A relationship can be said to develop inertia or
momentum.
Recall from your time spent in high school science class that inertia is the property of a mass in motion that must be overcome in order to stop the motion or change its direction. Inertia is what causes the pain when a falling body hits the ground. Strictly speaking, momentum is the product of mass and velocity (p = m × v) and inertia is the resistance to any change in motion, but I will use the two terms loosely and interchangeably here. Can we think of a relationship as having mass and
velocity? Think of the mass of a relationship as definition and expectations. We can think of a
relationship as having velocity in the sense that it is moving in a certain direction because of the
intentions of the parties. A relationship’s inertia is created from its birth. It is compounded of the
expectations and the intentions of the inhabitants (parties). Like a snowball rolling downhill,
inertia increases the longer the expectations and intentions remain unchanged.
A relationship can develop a great deal of inertia early if the inhabitants work to define it and
develop healthy expectations. Movement can be quite rapid. Early inertia can last a long time.
Inertia has the property of direction which it inherits from its velocity. Many relationships start
in a positive direction and the ones that don’t usually have a short life. How much energy do you
think it will take to change the direction (and thus the inertia) of a relationship that has existed
unchanged for a long time?
Think about all the relationships in your organization and how they have contributed to the
current state of your data. We exposed those relationships in the Old Testament. It may well be
the case that many, most or even all of these relationships may need to change in order to effect a
noticeable change in the quality of the data. One thing is certain: ignoring the relationships and
inserting a big, complex, new piece of software or a new methodology is going to produce
turmoil in the relationships without materially affecting the quality of the output.
Like the snowball halfway down the mountain, we can’t hope to simply stop it without the application of considerable force, and the result will be destructive. What we can do is apply a series of “nudges”, relatively small changes that, together over time, will change the direction.
Consider the marriage relationship which goes on for some period of time before one party (or
both for potentially different reasons) comes to realize the momentum is not in a positive
direction. At this point we have a choice to either jump out or begin to do what is necessary to
change the momentum. The most frequent mistake is to demand that “we” (or more honestly,
“you”) need to put more effort into the relationship. Even if “you” is more honest, it is also the
least effective. We all want to believe that we are doing everything we can and are therefore
blameless regarding the momentum of the relationship. Do you see, though, that a relationship—any relationship—is a balancing act? The expectations and intentions of one party must always be coordinated with those of the other party (or parties). It’s a tug-o-war with intent balancing expectation such that neither party gets pulled into the mud hole.
Sometimes the coordination (adaptation) just happens and sometimes it involves negotiation and
compromise. It always takes time. Often the time frame begins to test the patience of one or
more of the parties. This is where experience becomes important. Newer marriages dissolve
when a party becomes impatient and has no experience with gradual accommodation. One party
will try to stop the snowball or make a large change in its direction with disaster as the result. A
series of patiently applied nudges coupled with negotiation and compromise will often produce
the desired momentum change with a minimum of turmoil.
We also must understand that change is a given and therefore adaptation and the willingness to adapt must also be givens in order to preserve the relationship long term. What kind of people
does this take? In a nutshell, it takes mature people.
“Mature” is one of those words whose meaning seems clear on the surface, but which can cause
a lot of confusion without a carefully established context. Emotional Maturity is a relatively
recent concept created because of this need for better context. Simply, emotional maturity is the
ability to recognize and manage your emotions before they get control. It requires the ability to
be present in interactions through openness, honesty and demonstration of concern for the well-
being of the other.
Consider the following two ways of dealing with a problem.
1. “When I see a backlog piling up, I am concerned about my ability to deal with it. Can we
find a way to reduce or eliminate those backlogs?”
2. “You need to stop letting things pile up. It makes extra work for me.”
It doesn’t matter whether the backlog is composed of dirty laundry or warehouse orders: under the first approach we will feel like a collaborator, while under the second we will feel like the scapegoat. One encourages negotiation and the other throws up a wall.
Do the parties in your relationships exhibit this kind of maturity? Do you see it in the
relationships of others around you? Most people have never seen maturity in action and so have
no way to model it in their own lives. People who exhibit maturity should be placed where their
example can affect as many others as possible. What’s the alternative?
If years together don’t make an advanced relationship, what does? By what standard do we
measure the quality or strength of a relationship? This question needs to be answered whether
we are the inhabitant of a relationship or the architect of it.
Let’s go back to the template model for a relationship. Recall that there are at least two roles
involved and potentially many expectations for each role. The advanced relationship is one in
which the inhabitants have collaborated on naming the relationship and the various roles they
will play. Beyond that, they have clearly stated their expectations and each has agreed to do his
or her best to meet the other’s expectations and has formed the intent to do so.
In business, we may not find much tolerance when expectations are not met. There may even be
penalties. This may also be true in personal relationships, but the advanced relationship often
exhibits much more tolerance of failure. Of course if there is never success or if one inhabitant
simply stops trying to meet expectations, there may be severe penalties, up to and including
dissolution of the relationship. In fact, after enough failed expectations, the inhabitants will
simply cease to acknowledge the relationship altogether. Each will begin to behave as if the
others don’t exist. This is the beginning of active subspecialization.
A Data Example
As you know, there is considerable interest today in genes and genetics. An important
component of this interest is the pedigree. A pedigree is the record of the genetic precursors of
an individual. We know of its application to animal breeding. Someone who is breeding dogs or
thoroughbreds is vitally concerned about the pedigrees of the mating pair.
Many people today are also interested in ancestry and actively build and maintain family trees.
A family tree is similar to but not as rigorous as a pedigree. A family tree will often include
family relationships that are legal rather than genetic. For example, a family tree will generally
include marriages and adopted and/or step children, neither of which are of interest to the
geneticist.
Let’s build a database that will let us store and retrieve pedigree data. In order to increase the
market for our product, let’s also stipulate that it may be used to document family trees and that
the customer must be able to extract one or the other without ambiguity. It’s important to remember that we will never be able to retrieve information that isn’t stored in the database and, if it isn’t in the model, it isn’t in the database.
Our architects (data modelers) interview everyone involved and come up with the model shown
above. It is understood that Person will have additional attributes such as date of birth, family
and given names, and current contact information. We may also want to keep a date of death as
well as other dates.
If we convert this model to relational tables we will wind up with three tables.
Person (DOB, Family_Name, Given_Name, Death_date, Current_Address)
Maternity (Mother, Child, Genetic)
Paternity (Father, Child, Genetic)
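For readers who want something concrete, here is a minimal SQL sketch of those three tables. Treat it as one plausible reading, not the definitive schema: the surrogate key PersonID is my own addition (as listed, Person has no reliable key) and the Genetic flag is rendered as a boolean.

    CREATE TABLE Person (
        PersonID        INTEGER PRIMARY KEY,  -- surrogate key, added for illustration
        DOB             DATE,
        Family_Name     VARCHAR(100),
        Given_Name      VARCHAR(100),
        Death_Date      DATE,
        Current_Address VARCHAR(200)
    );

    CREATE TABLE Maternity (
        Mother  INTEGER NOT NULL REFERENCES Person(PersonID),
        Child   INTEGER NOT NULL REFERENCES Person(PersonID),
        Genetic BOOLEAN NOT NULL,  -- distinguishes pedigree data from family-tree data
        PRIMARY KEY (Mother, Child)
    );

    CREATE TABLE Paternity (
        Father  INTEGER NOT NULL REFERENCES Person(PersonID),
        Child   INTEGER NOT NULL REFERENCES Person(PersonID),
        Genetic BOOLEAN NOT NULL,
        PRIMARY KEY (Father, Child)
    );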
At this point almost everything we know about the relationships (Maternity, Paternity) is implied
by the configuration of the diagram. The modeler (architect) is satisfied that all of the
requirements have been met. If we give this schema to the programmers, we have at best a 50-50
chance of being satisfied (let alone happy) with the result. Unfortunately, the programmers are
more likely to receive a model such as this
or this, which conveys a bit more meaning but would not produce the same database schema.
In the case of the first model the schema would contain a single table
Person (mother, father, DOB, Family_Name, Given_Name, Current_Address).
We will have lost our genetic information. It is possible to tell the generator to implement
relationships as tables. In that case we will have three tables.
Person (DOB, Family_Name, Given_Name, Current_Address)
Mother (MotherPerson, ChildPerson)
Father (FatherPerson, ChildPerson)
The second model will generate either one or five tables (Person1 and Person2 are not real entities but copies of Person created to make the diagram read better).
The tables would be either
Person (Genetic_Mother, Genetic_Father, Other_Mother, Other_Father, DOB,
Family_Name, Given_Name, Current_Address)
Or
Person (DOB, Family_Name, Given_Name, Current_Address)
Genetic_Mother (MotherPerson, ChildPerson)
Genetic_Father (FatherPerson, ChildPerson)
Other_Mother (MotherPerson, ChildPerson)
Other_Father (FatherPerson, ChildPerson)
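To make the requirement to “extract one or the other without ambiguity” concrete, here is a sketch of a pedigree query against the five-table form. The PersonID key, the join columns, and the :person placeholder are my assumptions; the table lists above name no keys.

    -- Genetic parents of one person: the pedigree view of the five-table form.
    SELECT gm.MotherPerson AS genetic_mother,
           gf.FatherPerson AS genetic_father
    FROM Person p
    LEFT JOIN Genetic_Mother gm ON gm.ChildPerson = p.PersonID
    LEFT JOIN Genetic_Father gf ON gf.ChildPerson = p.PersonID
    WHERE p.PersonID = :person;  -- :person is an illustrative parameter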
About this time someone notices that “Other Mother” and “Other Father” don’t exactly equate to
“Adopted” or “Step” and we have completely misplaced the notion of marriage. Even though
our database could answer the request to generate a family tree for Person A, the information that
makes it look like a family tree, the information that warms it up and adds family to it is
completely missing.
Why is this happening? How did the process fail?
“I have a meeting to go to.” This statement is usually accompanied with a slight emphasis on the
word meeting and a bit of an eye roll that lets us know that the speaker really has better things to
do. If you were thinking about a new dream house for your family and the architect asked for a
meeting, would you reschedule other things and make the meeting a priority? Of course the
architect must respect your time and make the meeting about you and your dream.
People don’t like to be summoned to a meeting where they will be asked to think about
something they had never even considered and then be chastised for not having ready or
appropriate answers to questions about someone else’s dream. If you really want to sour people
on meetings, make them attend a series of meetings that really produce something and then give
the final decision over to someone who was never a participant, has no knowledge of the
meetings and has a completely different focus.
Let me tell you a story about an actual experience.
The Moral
The moral for the advanced relationship and its architecture is this. If what you’re reading here
has been making sense and if you want to bring advanced relationships to life in your
organization and in your life, then it is essential that you give the architecting process a chance to
work. Tragedies such as the one in the story can be avoided and they must be avoided if we wish
to preserve credibility in any relationship.
We can’t be partly or occasionally or mostly mature. Here is truth.
If I withhold truth because I believe you are withholding truth, then there is no truth. As soon as
I withhold truth, you are forced to make decisions without adequate information. Even if you
intend truth, it isn’t truth. When either party withholds truth from the relationship there is no
truth. In the absence of truth there are no good decisions.
In the absence of truth, no relationship will survive.
The winners become the ones with the most money when the dust settles. This may seem
appealing but remember that the winners will have sacrificed their credibility. Having “won” in
this way, they may find it very difficult, if not impossible, to architect a productive relationship
in the future. Everything is going to cost more in the absence of trust.
Here is another attempt at a model for our Pedigree/Family Tree database. This one exhibits additional
effort and contains some symbols that will be useful to a database administrator or a
programmer. The bottom line is that it enables us to document virtually any kind of relationship
that a person may be party to. We can see who the partner was in that relationship as well as its
duration.
Because we have made each of the original relationships an entity/table in its own right, we are
now able to assert much better control in terms of expectations and intentions. For example, you
will notice symbols at either end of the connecting line which no longer represents a relationship
to us. Now it is simply a vehicle to convey some of the Expectation information.
We indicate the expectation that a Person will be the mother of zero or more other persons.
Another way of expressing this expectation is that a Person may be the Mother of another
Person. In the Maternity relationship we indicate the mother by recording the MotherID which is
really a PersonID as shown by the FK1 notation. ChildID is also a PersonID (FK2). Note that if
the Person is not a mother, there will be no instance of Maternity in which the Person’s PersonID
will match a MotherID. On the other hand we do have an expectation that each Person’s mother
is known. For Pedigree purposes this is essential hence we expect that a Person has exactly one
Mother. This is sometimes expressed as “one and only one”.
You’ll see that there is a Child relationship that does not assume a genetic link. It allows for
arbitrarily defined instance types (ChildType) such as “adopted” or “foster” and provides the
opportunity to record a duration for the relationship so that a person could be a child (ChildID) in
more than one instance. A Person need not have any children, therefore that participation is
optional (zero or more).
The Relationship relationship is included for Family Tree purposes in order to record marriages
or other such relationships of interest. We could record the end of a marriage by supplying an
end_date for the instance. These arbitrarily defined relationship instances may well be of use to
the person researching the pedigree and might be used to record background information learned.
The Notes section here states the expectations with respect to the relationship.
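As a concrete, and hedged, rendering of those expectations, here is how the refined model might land in SQL. Column names not given in the text (PersonID, ParentID, begin_date) are illustrative assumptions. Notice how the “at most one mother” half of the expectation becomes a declared constraint, while the “at least one” half still depends on the process that loads the data.

    CREATE TABLE Person (
        PersonID    INTEGER PRIMARY KEY,  -- surrogate key, assumed as before
        DOB         DATE,
        Family_Name VARCHAR(100),
        Given_Name  VARCHAR(100)
    );

    CREATE TABLE Maternity (
        MotherID INTEGER NOT NULL REFERENCES Person(PersonID),  -- FK1 in the diagram
        ChildID  INTEGER NOT NULL REFERENCES Person(PersonID),  -- FK2 in the diagram
        PRIMARY KEY (MotherID, ChildID),
        UNIQUE (ChildID)  -- at most one mother per child; "exactly one" also needs
                          -- an existence check the schema alone cannot express
    );

    CREATE TABLE Child (
        ParentID   INTEGER NOT NULL REFERENCES Person(PersonID),
        ChildID    INTEGER NOT NULL REFERENCES Person(PersonID),
        ChildType  VARCHAR(20),    -- e.g. 'adopted', 'foster'; no genetic link assumed
        begin_date DATE NOT NULL,
        end_date   DATE,           -- open-ended while the relationship persists
        PRIMARY KEY (ParentID, ChildID, begin_date)
    );

    CREATE TABLE Relationship (
        Person1ID        INTEGER NOT NULL REFERENCES Person(PersonID),
        Person2ID        INTEGER NOT NULL REFERENCES Person(PersonID),
        RelationshipType VARCHAR(20),  -- e.g. 'married'
        begin_date       DATE NOT NULL,
        end_date         DATE,         -- supplied when, say, a marriage ends
        PRIMARY KEY (Person1ID, Person2ID, begin_date)
    );

The particular dialect is not the point. The point is that every expectation we manage to declare here is one the database will enforce for us without further negotiation.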
It was my goal here to choose a situation that any reader could understand and provide examples
to illustrate what is involved in “working on” a relationship. There is no single correct solution
and there are nearly unlimited ways to arrive at a solution. What each path will have in common
is that it will have documented as many expectations as possible (or at least as necessary) and it
will have recorded the intentions of the parties in terms of fulfilling the expectations.
The notations used in these examples are useful but you can certainly create your own (although these have the advantage of having been tested through much use by many people over a relatively long time).
26
When the term “real time” first came into use it meant electron-real and not human-real time.
There was a need when designing and building electronics for the space program or for defense
systems, for example, to ensure that one component could communicate with another
instantaneously to avoid creating loops or deadlocks that would render an entire system
inoperable. Today most uses of the term refer to a timeframe measured in seconds (or even
minutes) rather than micro-seconds—literally thousands or millions of times slower.
custom building a data handler. The rules of relational data design need not apply. There can be no doubt that unusual circumstances may call for unusual solutions; however, serious and protracted thought should be given before abandoning the gains represented by relational data management. If you are already in need of a programmer whenever you have a new question to answer from your data, or if you hear the term denormalized from your staff, you probably have already abandoned many if not most of the advantages of relational data management. You may be in a NoSQL situation while still relying on SQL.
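The relational gain at stake is easy to illustrate: a new question should need a new query, not a new program. A sketch against the Maternity table assumed earlier:

    -- A new question answered with a query rather than a program:
    -- how many children are recorded for each mother?
    SELECT MotherID, COUNT(*) AS children
    FROM Maternity
    GROUP BY MotherID
    ORDER BY children DESC;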
Marketing will always result in new names for things if not new things per se. Be assured that
very little that comes onto the market is actually new. Technophiles will always advocate for the
new tool because they know that the next time they interview for a position, all of those new
names will have become essential as a way of demonstrating their commitment to learning and
adaptation. Interviewers typically have no way of ascertaining a candidate’s level of mastery and
must rely on gauging their level of interest. Your needs are not the needs of the previous
employer and only mastery will give value. Learning to handle every new tool that appears on
the horizon takes a lot of time during which the master would have solved your issue by bringing
the appropriate people together.
Commitment to Action
Every aspect of human life, from the most elemental to the esoteric and spiritual, is plagued by a
tendency to wait for some favorable combination of events. “As soon as…” “If only…”
“When…”
We’ll launch our effort at the right time and until then we’ll keep doing what we have been
doing. There’s a story about a frog sitting on a log who decided, after a long time between
passing flies, to jump off and go to another part of the pond where the insects were more
abundant. “As soon as the sun goes down,” he said to himself.
When the sun settled below the cattails, he prepared to jump into the pond. “Maybe I should
wait until the middle of the night when that big pike is resting.” This thought held him on the
log. At midnight as he once again began to gather himself to begin his swim, he thought of the
raccoons that came to the pond to hunt each night. “It really would be better to wait until dawn.”
As the sun began to shine through the birch trees, thoughts of daytime perils kept him firmly
attached to his log. Day after day, the circumstances never seemed just right until one day a big
dragonfly came by and the frog watched it circle around until it was within reach. At the proper
moment he opened his mouth to snag his meal—and learned that he had become too weak. Now
he saw clearly the value of all of his decisions. They were worth nothing because none had been
accompanied by action. He was committed to eating and staying alive but not committed to the
actions that made those goals attainable.
You are the only one who can do this. If you don’t do it, who will?
Each person and group in the chain from “wouldn’t it be nice if” or “we really need” to “this
isn’t a complete picture” or “this isn’t what I needed” has reasons or rationales for doing what
they did. In most cases the reasons were based on guesses about what was really wanted and
why. Often the instigator of the whole thing spent less than two minutes thinking about what
was needed, why it was needed, and how it would be used. Does that person really expect
everyone down the line to know what’s in his mind? Of course not. They assume without
thinking that those who will produce the result will come back with additional questions and that
the discussion will bring things into focus. It’s true but very sad that those at the ends are often
much more aligned than those in the middle. So called “middle management” are frequently
caught in a no-man’s land where they believe themselves without reliable allies. Truth may
appear non-existent and relationships go unrecognized.
The beginning of the chain expects discussion and negotiation, the end of the chain needs
discussion and negotiation. All of the relationships as we move down the line are undefined with
the possible exception of time box and budget expectations. No one knows the upstream person
or organization well enough to guess what they need and no one knows the downstream person
or organization well enough to guess what they need or what they are capable of. It’s time we
worked on aligning the processes through architected relationships.
If the data architects or modelers worked on architecting those relationships, much more useful
questions and discussion would be the result. If the architects recognized those relationships and
sought information about them as though they existed, the parties would be forced to
acknowledge them as well. This could only improve alignment.
If governance processes included architecting, documenting and monitoring relationships,
organization structure might look and function completely differently.
What if…
An historian compared history to a parade. We, in the present, are at the tail of the parade but clearly part of it. The parade follows a winding route up a mountain. As we march along,
we can see bits of the past as it passes by. Sometimes we can even see long segments of it. The
revelation is that each time we catch a glimpse of the past, it is from a new perspective. History
seen from the perspective of today’s march may look entirely different than it did yesterday and
so may have a different meaning.
The information technology industry very rarely looks back. The route seems to be changing in front of us and it seems to take all of our concentration just to stay on the path. It is a shame because seeing the forests and swamps that were traversed in the past, and the way in which they were surmounted, could help us to see better ways of getting by today’s detours and roadblocks.
Truths
In the information/data marketplace we are accustomed to hearing the expression “a single version of the truth.” This expression has become a mantra, and there are those among us who appear to believe that it is a useful goal.
When we’re young and ambitious and all-knowing it can seem like the truth is ours to safeguard.
We hold it out in front of us as we smash through obstacles and crush opposition. At some point,
though, we exchange that initial truth for another—and then another and another. We never go
back to mend the damage that we did with our immature, less developed truths.
Here are some Truths that we can ride to the finish line.
It’s not about me.
“I know enough to be dangerous” is a prophecy.
We arrived where we are now by some process. If we ignore that process we will be back
here again.
“Cynical” is a label rarely applied by the mature person but often applied to the mature
person.
The quality of data is a useful concept if and only if
1. There exists a standard that can be applied
2. The processes that produce the data are understood and under control
3. The uses of the data are understood and agreed upon
4. Issues concerning quality are based on exception(s) to one or more of the above.
Governance of data or anything else is for the purpose of predictability and consistency.
Governance is by the consent of the governed. That is, those within the scope of governance
efforts must perceive that they are benefitting from the governance. Without such consent
we have a domination system. In a domination system quality is whatever the person at the
top says it is.
Management of data or anything else is for the purpose of effectiveness and efficiency.
These may be reflected in lower costs but this may not be the case in suboptimized scenarios.
Leaders, like teachers, emerge when they are needed and it is incumbent on the rest of us to
acknowledge them.
Your Mission
You’ve decided you are the one to take on and solve all those data quality problems that plague
your organization. Well, somebody has to do it, right? I mean the lack of data quality is costing
us a lot of money. Holding on to that money for the company would mean a more profitable
company and who doesn’t want a better profit margin?
There are a few things you should know before you commit yourself to this.
1. Data Quality is like dusting. Some of you will understand this and others won’t. What I mean to say is that there will never be an end to it.
2. Data Quality is NOT about technology. Technology is the spotlight that makes [the lack of] data quality so visible (and so expensive).
3. In times of rapid change, Data Quality issues are inevitable. Today, a company that isn’t changing is dying so… The good news is that if you get good at this, your career is as secure as any career can be.
4. Even though Data Quality is NOT about technology, you’re going to find that you will need a very good foundation in technology processes and especially the processes employed in your company in order to have any chance of identifying the right places to apply pressure.
5. An approach to Data Quality that goes from one bunch of low-hanging fruit to another is going to be a net additional cost (not what you want to be associated with).
6. In practice, there is no difference between a Data Quality program and a Data Governance program. The goals are only slightly different and the methods may be identical. The goal of Data Governance is the establishment of an auditable process leading to consistently high quality and reliable data. The goal of Data Quality is consistently high-quality and reliable data, which will require auditable processes. Auditable means provably consistent.
7. You are going to learn that no one (and I mean absolutely NO one) wants to talk about data. Learn to talk about other things and use them to illustrate the concepts you want to teach.
8. You will not be able to do this alone. It’s going to take leadership on your part to mobilize support and participation across the company.
9. It’s a really good idea to be able to communicate the data quality vision for consumption by any audience. This will require you to be able to express it so that your audience gets it (in their language, using their metaphors…).
10. Finally, your Vision is the only thing you will have to sustain you in this so make sure that it is clear in your mind (and heart).
Two Approaches
There are two ways to approach data quality. Both involve a process that looks like this.
The difference between the two is that one deals with some instance of data (customer, patient,
procedure, visit, lab, order, invoice, etc.) while the other deals with the processes that surround
any data within the organization. The difference lies in the word “this” within the decision that
“We have to do something about this.”
The one that is used most of the time is to attack a specific issue that has become apparent in
terms of its cost to the organization. This is repeated as new issues arise. The other is used only
in organizations that are high-functioning. These are the organizations that have adopted a capability/maturity based vision for themselves and are on the path to the Malcolm Baldrige Award, ISO quality certification or CMMI Level 5. The diagram shows, in pyramid form, the
The Process
Here is a process for reducing costs, eliminating complexity, asserting control and living happily
(if not joyously) with your data. At some point in your life you have heard (or maybe even said)
that the bigger (read more mature) person must take the first step in mending a relationship. This
is absolutely true and you are now that person, having come this far with me. That means that
the next step is ENTIRELY YOUR RESPONSIBILITY. The people and organizations (be they
neighbors, coworkers, teams, departments, divisions or competitors) may not even be aware that
they are in a relationship with you. Were you aware when you started reading this book?
The carrot dangling in front of you consists of
reduced frustration
better communication
fewer (ugly) surprises
improved efficiency and effectiveness
reduced costs (both in money and energy)
If these are worth it to you then you’ll want to start with these steps:
Make a list of those relationships. (Maybe you want to focus on just one at the
beginning.)
Name them as they are now and as you want them to be.
What are your expectations for each? Again, it may help to list current and desired
expectations.
What are your intentions for each? Current and desired is still a good idea.
What information do you need to keep about each? For example you may want to have
accessible the inception date of the relationship (or anniversary), schedules, net payment
requirement, etc.
Who is the person you want to partner with? Recognize that organization-to-organization
relationships are always based on one or more person-to-person relationships. Even a
historically bad inter-organization relationship can be re-architected and restored by
choosing different people to be the core. These people need not be visible on the org
chart. It is only necessary that they be opinion leaders.
Are your expectations and intentions different for this person than for the relationship? If
the problem relationship involves a department, for example, you may pick out a person
in the department who is an opinion maker and build a relationship with that person as a
first step in getting to the inter-departmental relationship that is the goal. Remember that
group that you wanted to be part of when you were a kid? You could find a way in by
making friends with one of the members and letting them bring you in. That’s what
we’re talking about here.
How formalized does the relationship need to be? We often go too far in formalizing,
thereby setting the stage for the eventual breakup. Intentions are key here. Do you
intend to enforce all expectations all the time? If so, you need a contract which is a
special kind of relationship that can be enforced in court. If, however, you intend to
communicate in those instances when expectations are not met, and to negotiate
improvement, then you need much less in terms of formality. Beware of formality. It is
much easier and less costly to get what you need by talking about it over coffee than by
using attorneys and courts.
What do you know about the potential partner?
What do you think their expectations and intentions are?
What do you want them to be? Look at your expectations and intentions again. Are they
still reasonable?
When you are clear about the relationship you want and why, approach the other party
and lay out your cards. Show them the steps you’ve gone through and the carrots that are
motivating you. Do not try to “close the deal” at this point. It took you time to get here
and you may have to walk them back to where you started and give them time to get back
here on their own. Remember the spider web.
If you have come here by considering “we” and “us” instead of simply “me” you will at
least have made an impression and softened resistance. You must expect, though, that
when they come back, their expectations and intentions may be somewhat different than
those you attributed to them. That is the first step in a successful relationship.
The negotiation is never completed in a strong and well-architected relationship. Think
of it as preventive maintenance. The biggest (and ugliest) surprises happen when we take
something for granted and the name on the relationship doesn’t matter in that case.
Epilogue
Those who become lost in the wilderness always find themselves going in a circle. It can be
extremely debilitating mentally and spiritually to invest yourself completely in something for an
extended period, risking your well-being and your career, only to find yourself back where you
started.
The first law of survival when you are lost is to find a place that is safe and offers the possibility
of water and food AND THEN STAY PUT! Wandering in circles is good exercise but it also
represents risk. What if we get stuck in a place that doesn’t offer the basic survival needs?
You are going to become lost. This book should be considered a basic survival guide. Equipped
with it you should be able to recognize your surroundings and situation. Using it, you should be
able to plot a path that will get you closer to your destination.
At the very least, it is evidence that someone has been here before and got out, scarred but alive.
Consider this, then, the blaze mark on a tree or the message in a bottle. Someone has been here
once and gotten through relatively intact and able to describe the journey. Those entering this
wilderness would be wise to listen. There are plenty of hazards not discussed here but heeding
this advice will help you to get farther faster and you may be the one who conquers the
problems.
Appendices
The following appendices are offered as bread crumbs to be followed for those interested in
seeing how this book came into being. They are a collection of writings that represent
developmental steps. As such they may be useful to provide a stable foundation for some of the
intuitive leaps. If you need them, here they are.
More on Perspective
I worked for a company that had the hardest time getting new things started on schedule. Some
new product or line of business would be announced with a grand opening date. Often the date
was only a few weeks away. The problem was—over and over again—that the people who had
to actually make the new thing happen found out about it at the time of the announcement.
So guess what happened? Either the schedule slipped and slipped again or the doors opened on
something that was incomplete and holding all the pieces together was extremely hard on
everyone involved. Is that what you guessed?
The root cause of this, as it turned out, was lack of the needed perspectives. Highly placed
persons believed that they could make all the commitments for many of the enabling functions.
In principle, this was true and needs were recognized and responded to AT A HIGH LEVEL. As
any general or coach will tell you, the best strategy is only as good as the troops or players who
have to execute it.
When we finally got a group together to analyze the situation, it became clear that no one person
could know all that must be known to plan and implement the project. What was required was a
meeting—as soon as possible—of representatives of all the business functions involved as well
as those who are involved in everything (facilities, telecom, network...). All the perspectives only
emerged in a group setting where people with specialized knowledge could bounce ideas around.
Personal perspective expands in a group setting.
Again I ask, what perspectives are required for a successful BI effort?
Information or Systems?
For those of you who work in I.S. or I.T. or any of the variants, a question. Is your organization
about information or is it about systems or maybe it's about technology?
My sense is that many more people are in it for the systems (programming) or the technology
(networking, servers, wires, boxes) than for the information.
Just to get all the cards on the table, I'm asking this from the perspective (there's that word again)
of someone who has been having his nose bloodied for years because of a stubborn streak that
keeps on insisting that it is about the information and that everything else is supporting cast,
walk-ons and extras (the Academy Awards are a recent memory).
The term ontology has become trendy in the relatively recent past. It simply means a specification for a concept. A concept is an idea, and many times it never progresses beyond that. Rarely, an idea like freedom or liberty needs little or no specification to make it useful. Many ideas, like stewardship, on the other hand, need quite a lot of specification before they become useful.
Information Systems/Technology appears to be in need of some ontological work. I.S./I.T. are ideas that require a context. They are found in the context of a business. The business, in turn, has a context, but we don't need to go that far for the purposes of this discussion.
Businesses need to produce, dispense, store, manage many kinds of things and all of them are
physical save one—information. Because information is a concept in its own right, it quite often
gets pushed out of the way in favor of the physical things that compete for our attention by virtue
of form, color or sound. These things require physical space and unless we do something, they
will soon pile up and make it impossible to get anything done.
Quiet information or data, on the other hand, makes no demands and is consequently ignored.
Remember, though, that the business has an I.S. or I.T. organization because every now and then
someone needs a specific piece of data or a chunk of information and needs it now. Sometimes
the data has just come into existence and other times it has been languishing in a "data file" for
days, months or years.
How do we find that set of ones and zeros and turn it back into the concrete abstraction that the
business needs? Friends, that takes data as well. Device names, drives, folders, files, instances,
records, fields, indices, values—all of that is data. In I.S., we understand the need to keep that
kind of data reliable. We create systems and they are data as well. We understand the need to
maintain our system data: product, version, build, component, QA status... and the implications
of not doing so.
Frequently the Data people (data administration, data architecture, data stewardship, data
governance, database administration...) are part of I.S. or I.T. and we're content with that as long
as they are directing their attention outward, toward the business. As soon as they begin to
exhibit interest in us and our handling of our own data, we start to feel resentment, frustration
and even anger. "Who are they to tell us how to do our job?"
Friends, and I am sincere in my use of the term, programming is programming and data is data.
The Data people can help you and they want to help you and, most of all, they need to help you
in order to close the loop. They are being held responsible for the quality of the data resource and
the processes that create and manage the resource. You represent a huge exposure as far as they
are concerned. When you re-learn to associate your system with the information that flows
through it, I hope you will also learn to value what the Data people are offering.
Information Systems, Information Services, Information Technology: let's refocus on the reason
and purpose of those efforts. You can benefit from the consistency that results from standard
processes. You can benefit from better data management capabilities. We can all benefit from
understanding our shared purpose—the best information for the business we're part of.
Oxymorons in Abundance
No Governance in Data Governance.
No Intelligence in Business Intelligence.
No Leadership in Corporate Leadership.
Let me hasten to say that these are not intended to be value judgments. To be fair and truthful (which are tough sells) I should modify these statements a bit.
1. There is little governance in data governance, little intelligence in business intelligence
and little leadership in corporate leadership.
2. The question to be asked is not "why" but "how". Whenever we are faced with something
unexpected, we are used to responding, "Why?" I've learned that why? almost always
puts people on the defensive and that communication effectively shuts down when people
become defensive.
If we change our question to how?, we can focus on processes and look for cause rather than
fault. So, how does it come about that data governance so often lacks any vestiges of
governance?
The first task is to differentiate between leadership, management and governance. This isn't
about people—an individual may be capable of doing all three—but it is about tendencies.
Here's a breakdown that might help this to make sense.
Leadership is about change.
Management is about effectiveness and efficiency.
Governance is about consistency and stability.
My take, after watching these processes work for many years, is that we find it nearly impossible
to keep these three functions separated. When we are doing management while talking about
leadership or doing leadership when we are talking governance, we not only confuse ourselves
but also the community we are working within.
For example, we make a leadership decision to create data governance. Then we turn the task
over to managers. In reality, leadership is required all the way out to the line organizations.
Managers cannot make effective or efficient something that they do not understand and have not
bought into. A too-early transition from leadership to management will give birth to confusion, frustration, burnout and the failure of the initiative before it even gets to the governance phase.
Continued tomorrow.
Measuring Governance
I apologize. I said I would address this yesterday. We do have to get back to the guerrilla
movement on the frontiers of the empire, but let's take a little time to look at measurement of
governance.
First, let's agree that data governance is like any other governance except that it focuses on data. A governance program directed at process or at competency or whatever would have the same characteristics. OK, I'll attempt a justification for that statement.
What do we ask of a data governance process? What are the objectives? By the way, I use the
term process here in the sense of a set of activities that are ongoing and have a consistent
purpose. The purpose of the data governance process is to:
Optimize the value of the data resource by ensuring that the capture, storage, retrieval, destruction and use of the resource are done in accordance with established policy, procedure and standards.
Do you buy it? If not, I'd be pleased to discuss alternative purposes, but the remainder of this
discussion is based on this purpose.
Based on the purpose of data governance then, several perspectives on measurement suggest
themselves. The most obvious one is the QA (quality assurance) perspective. How are we doing
at following established standards? It is tempting to count the number of standards, policies and
procedures because counting is easy to do and there is a tendency among the governors to equate
many laws with good government. Strangely enough, among the governed the emphasis is on the
quality of the laws rather than their quantity. A small number of effective and easily understood
standards may deliver more benefit than a larger number of over-specialized or esoteric ones.
The most effective measurement will be part of the standard or process itself, but some
organizations may find it useful in getting governance going, to do retrospective analysis to see
how well/consistently processes are being applied. Health care makes extensive use of the "chart
review" to gather this kind of data retrospectively. Measurement intrinsic to the process or
standard has the potential to be much more nuanced and useful than that done retrospectively
simply because all of the context is available.
Clearly, though, the nature of the metric(s) is very much determined by the process or standard itself. For this reason, it makes no sense to discuss metrics or KPIs (key performance indicators), a special kind of metric, without first establishing the process context.
Other perspectives might differentiate among standard, process, and policy or might measure in
conjunction with the data life cycle, specific subject areas or specific usages.
One last point, should you be tempted to think in terms of measuring accountability: accountability in the absence of a standard is really approval.
No governance mechanism can exist for long based on approval. Each change in "leadership"
will create massive turmoil as everyone seeks to reorient to a new approval model.
Standards Clarification
A bit of a postscript to the last post:
I can almost hear the snorts of disgust. Many in healthcare will be quick to dismiss the last post
by telling themselves that "we have standards." I can't allow them to let themselves off so easily.
Of course healthcare employs standards. I was, for a brief time, part of a newly formed HIMSS
task force on standards. Healthcare has a wealth of standards, none of which are truly standards
in that all use words such as should and unless and if possible.
Healthcare has not seen fit to develop a framework for standards, nor an ontology by which to bring sense and meaning (and thereby value) to the hundreds of standards vying for attention. In
truth, anyone in healthcare can say without fear of contradiction that "we have standards" and
none of those assertions mean anything.
If n people or organizations are doing the same work using n (or even n-1) standards, it cannot be
said that the work is being done in accordance with standards. This is said routinely by each of
the workers but to those who view the work from an objectifying distance, it is quite clear that n
standards is no better than no standard.
Measurement of process in healthcare has a long way to go before SPC (statistical process control) principles can be applied.
How, by what process, can healthcare practitioners be brought to believe in the power of process
standards through which measurement standards can be developed? Whose interests are satisfied
by the status quo?
Re-Branding
Mass marketing seems to be an American (United States) invention and may be the single most
impactful innovation of the last century. Please note that I make no value judgment. We each
have to make up our own mind whether the impact was positive or negative.
Certainly, it has served to increase wealth so if your standard is ROI then you would have to
view mass marketing as a positive development.
The downside effects are much more difficult to measure—plus virtually nobody wants to talk
about the downside. Just as clearly, people have been convinced that they "need" something that
they didn't even know existed. To that extent, a lot of raw materials were consumed and a lot of
byproducts were produced because of the success of marketing.
Perhaps the biggest downside from my own perspective is the continual re-branding of
technology practices. The effect of this is that everyone is on their heels all the time. We are
bombarded with new acronyms and substantial effort must be expended to learn about them.
Unfortunately, the common result of this effort is the realization that this "new" thing is really a
20 year-old concept with a new name.
Those not equipped to realize this invest even more time and energy in trying to make this new
thing be their magic carpet without ever discovering what it was that kept the rug from flying the
first time around. Technology folks are easy marks since they often are completely unconcerned
with history—newer is obviously better after all.
Lots of money is being made but society is paying the price. Healthcare is the perfect example.
Technology churn is costing billions at a time when everyone recognizes that costs are out of
control. No worries though, we'll just focus all the lights on insurance premiums, thereby
diverting attention away from the decision makers.
BI and Re-Branding
And by the way, there seems to have been a re-branding of "BI" for the mass market.
As recently as a year ago, BI (Business Intelligence) meant something special. Different kinds of
information displayed for very specific purposes. Now it seems to mean "reporting" (although
"BI" is a lot more edgy than "reporting" so probably worth more money.)
If you are buying "BI" and paying BI prices but getting basic reporting then you are a victim of
mass marketing and re-branding (see previous post).
Changing
I have been doing a few presentations of late on the subject of "Guerrilla Governance" which is
about the application of guerrilla principles to the quest for good [corporate and data]
governance.
The central theme is commitment founded on a vision and how to use that to create community,
communication and credibility. Through it all, the message is that complaining, wishing and
waiting have not produced results, are not producing results, and will never produce results.
I learned that I already have what I need and now I'm working to get that message out. The raw
material, the resource used to power the change we need is in plentiful supply. It's the pain,
frustration, and lack of fulfillment encountered in everyday work life. Even if I do not feel it,
others all around me express these feelings every day.
The norm of work life is approval-seeking. The rare business has created a system of standard
processes and metrics that frees its employees from the need to seek approval. These are the
CMMI Level 5 companies and the Malcolm Baldrige Award winners. The vast majority invest a
handful of people with authority by virtue of a title and force everyone else to seek their approval
in order to change anything.
If you get this, it's up to you to change it. Alignment is the grail sought by management. It is
thought that alignment will produce the "well-oiled machine." The problem is that the "folk
wisdom" of the executive suite and the board room seems to be that the basis of alignment—
vision—is something best kept close. In truth, alignment springs from a common vision; a shared
vision is the shortest path to alignment.
If the leader of the company isn't actively sharing their vision with each and every employee of
the company, then it isn't happening. Reliance on staff meetings to promulgate the vision is very
much like the old party game of telephone. Who knows what the person at the other end is really
hearing? There are other visions out there—I have one myself. Whoever you are, whatever your
job, I urge you to hold yourself accountable to the grandest vision within you until it is replaced
by one even more grand. Be responsible for the change you need, but remember that the change
IS you.
Note to Sec. Sebelius
Secretary Sebelius,
I appreciate very much your stated position (according to Healthcare IT News) that technology
adoption in healthcare is not enough, that interoperability of technology is also necessary for
healthcare reform. I wonder how much you know about interoperability of healthcare
information systems. I wonder only because there is nothing in your published biographical
information that leads me to believe that you have any in-depth background in a technology
discipline.
I don't mean for this to sound like criticism—it isn't—I think your position is a correct one and
your advisers have done a good job. I wonder if you are aware, though, that there has been talk
of interoperability for several years within the healthcare marketplace and there have even been
claims of the achievement of interoperability. There has even been a "certification standard"
published purporting to validate system interoperability.
All of this isn't worth the effort it took me to type the words. The reason for this "much ado about
nothing" is simply that there is no incentive within the marketplace for the level of cooperation it
would take. Technology of all kinds is the cash cow of healthcare and no one involved has any
reason at all to kill that cow or even to bring it into the barn.
In the early 1980's, the Department of Defense had a very similar problem. Each branch (Navy,
Army, Air Force, Marine Corps, Coast Guard) had its own procurement structure and its own pet
contractors. There were no standards and all that was necessary for a contractor to be successful
was to maintain some level of credibility with the procurement officer(s) involved. The result
was that (for example) Army units in the field couldn't talk to units of other services because
their communications equipment was incompatible. Logistics was a nightmare because of the
variety of spare parts that had to be maintained and computer systems incorporated the "dialect"
of the purchasing service and could not exchange information with the systems of the other
services. This is only the surface of the problem. The technological diversity went much deeper
as well, to the point where it was a major procurement effort to get two systems to communicate.
NASA was developing plans for an international space station and realized that they were going
to have to fundamentally change the way that systems were specified, developed, and
implemented if there was ever to be any hope of success.
The Defense Dept. took control of the situation through an initiative called Software Technology
for Adaptable, Reliable Systems (STARS). DoD mandated that processes and methods (and their
documentation byproducts) as well as tools and other technology used in the creation of systems
be standardized for the purposes of reducing costs and delivering a level of interoperability.
Healthcare operations and all of their vendors (virtually everyone outside the walls of the DoD
and the Software Engineering Institute at Carnegie Mellon University) remain blissfully unaware
of any of this history, all the while enjoying its fruits.
I want you to know that I believe interoperability can be obtained, but not without the institution
of new paradigms and some major upheavals in the technology vendor community. I have
dedicated 13 years of my life to laying some foundations where I can and I fervently hope that
you have the commitment and the political will to see this through. Without that, government
efforts are likely only to increase costs.
Coaching
It occurs to me that many people probably don't understand what coaching is or how they might
benefit. Since I am advertising myself as a data management coach, the first task in marketing
myself may be to do some education on what should be expected from a coach and differentiate
coaching from consulting.
If your only exposure to coaching is youth activities or watching your favorite team on TV you
may have an idea that coaches call the shots, that they direct, and are to be obeyed. Nothing
could (or should) be further from the truth. My job as your coach is to understand what your
capabilities are (as well as those of your "team") and to use that knowledge to help you find ways
of attacking your goals that are likely to lead to success.
You might also have developed the idea that coaches are cheerleaders and that one of their main
jobs is motivation through exhortation. Again, not true. While I will be quick to affirm strengths
and celebrate success, I will not create unrealistic expectations. A coach's goal is to help you to
understand the most effective ways at your disposal for addressing the problems and challenges
that will confront you.
A youth soccer example will illustrate. If you are fast and by nature aggressive, you can succeed
as a defensive player by attacking the ball and taking it away from your opponent before they
have a chance to score. If you are not the fastest player on the field and are a bit passive or
hesitant, you can still produce a good result for your team by merely staying between the ball and
the goal and delaying your opponent until help arrives or by forcing the play out to the edge of
the field.
In data management, similar principles can be applied. An aggressive, direct approach may
succeed for some while a more calculated and collaborative approach may work better for others.
In any case, you will want your coach to be able to help you find the successful path, which calls
for experience as well as expertise on the coach's part. One of the least appreciated values a coach
provides lies not in what you do but in what you DON'T do. Your coach wants you to be
successful and will help you avoid situations in which you can't or are unlikely to succeed.
You have knowledge, talent—all the raw materials for success or you wouldn't be where you are.
Sometimes what you don't have is time or some specialized expertise and in that case you will
want a consultant who can come in and get 'er done. But sometimes this is counterproductive
because you won't be able to keep calling the consultant back each time you need a change or
repair. If you have some time, a coach may be a better alternative since he will leave you with
success strategies and tactics that you can continue to apply.
You want your coach to be at your shoulder, ready to answer your questions but also to be asking
you questions continuously to help organize your thought processes. In that sense a coach is
more than a teacher and more than a mentor. A teacher will not be responsible for the application
of the subject matter. A mentor may be standing by at the end of a phone line. The coach will be
there with you.
Guerrilla Governance
In the March 14, 2009 post "Guerrillas and Governance" I introduced the notion that, because of
long-time inattention to the needs of the people/workers on the frontier (organizational
boundaries), systems of governance will have been developed there and may have been in use for
a long time.
In most cases, this governance will be relatively crude and inadequate. In the modern context, it
might be something as simple as "We don't accept those after 2 PM so that we give ourselves
time to get them done before 5:00."
What we, as guerrilla leaders, should perceive comes in two parts:
1. This group is dealing with a problem and has a "process" in place for doing so.
2. There is a problem. It is recurring. It has a cost.
If we feel the need to introduce a new level of governance that eliminates the problem rather than
dealing with it on a repeating basis, we must take into account both of these parts.
There is a ready-made community here and they have banded together for mutual protection. We
dare not dismiss that fact or we will create opposition that will resist us to the bitter end. Until we
take the time to make them feel (not just understand) that we really want to help them with their
problem—not ours—they will resist all of our efforts.
The dialogue goes something like this:
you: It looks as though you are experiencing problems with [form, file, request...].
they: You wouldn't believe the kinds of /@#*(^ we get. And it's most of the time.
you: So what do you have to do when you get one like that?
they: When that happens, we have to [lists multiple process steps needed to remediate]. That's
why we have to have a cut-off at 2:00.
you: So, if I understand this right, you are getting unusable or unacceptable input from [another
boundary function]?
they: That's right. They just don't seem to care how much we have to work.
you: What happens when you complain to them?
they: They just say that it's their job to generate [forms, files, requests] and it's our job to process
them.
you: I think there's a good chance that we could guarantee that you wouldn't have to do any of
those process steps you told me about or, if you did, it would be rare. Would that make your lives
easier?
they: Absolutely. How would you do that?
you: First, we should put together a meeting. I've already talked with them and, believe it or not,
they are dealing with similar problems and similar frustrations. I think the solution to your
problem is the same as the solution to theirs. To be sure of that, we need to meet, because there
are still a couple of things I need to get clarified. Will you help?
they: Tell me when and where. I can't meet on Tuesdays at all.
And so it begins. You will use their pain to elicit their cooperation. Their cooperation creates a
new community. Community action guarantees compliance. A newly empowered community is
a breeding ground for improvement of many kinds.
This is guerrilla governance. The only requirement to get started is a goal. You will need to be
able to articulate the goal over and over again in many different dialects. In many cases, you will
only want to expose the part of your goal that your audience is able to comprehend. Never try to
hide the fact that there is more. You'll simply answer all questions openly and honestly and never
insist that anyone needs to understand your perspective. "We'll improve our understanding as we
go." is a good way to postpone dealing with difficult questions until more education has
occurred.
Always remember, you can't do this without them. Their commitment is vital. Talk freely to
management about progress and remember that management has pain as well. You're a leader.
Authority for What?
In conversations about data governance, the word "authority" came up repeatedly. It was never
clear to me what the scope of this authority was to be or how it was to be used. I
finally asked the question, "Authority for what?" You may have heard that responsibility without
authority is the recipe for stress and burnout. I thought to pursue this line of thinking as a way to
discover what was meant by data governance. If I know the nature of the authority, I should be
able to deduce the nature of the responsibility. The question never received an answer. What I
got was blank looks.
I felt a strong need to get to the bottom of this since the word "enforce" or "enforcement" was
also used several times. I was becoming extremely uncomfortable.
Friends, if people do not accept governance and cooperate with it, then the governance model
needs to change. We do not need enforcers. We need arbiters, mediators and facilitators. More
than anything else we need teachers. I've heard it said that we all do the best we know how and
when we know better, we'll do better.
Controls and attempts to control do not work in governance. They only create bottlenecks and
delays that encourage people to find other ways. In our local civil government, we call it red tape
and bureaucracy. For example, building permits are required for many home improvements. The
reasons for this requirement are excellent. The permit and the resulting inspections (audits)
protect the current and future homeowner by ensuring that the project is safe. In spite of the
obvious benefits, many do-it-yourself homeowners avoid the permit process because the process
is obscure, the standards must be discovered, it can be inconvenient, it adds to the cost and is
known to produce delays. Furthermore, the only way for the scofflaw to be caught is through an
inspection, and absent a permit the authority has no occasion to inspect. Note that contractors
licensed by the authority are much more likely to comply.
Contrast this to the governance of traffic on roadways. Standards are clearly displayed; drivers
must pass a licensing test demonstrating both physical capacity and knowledge. Law
Enforcement (To Serve and Protect) is primarily tasked with monitoring compliance (which their
mere presence guarantees). Compliance metrics are gathered via various kinds of technology and
governance changes (to speed limits, traffic signals, etc.) are made based on these audits. What if
we had a committee at each intersection with the sole authority to direct traffic?
As you can see, governance requires an initial framework (competence, licensure), a coherent set
of standards (coherent in the sense of both understandable and integrated), and monitoring/audit
capabilities. Anything else is extra and may even get in the way.
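Translated to data, those three ingredients might look like the following minimal sketch in
Python. Every name and rule here is a hypothetical illustration, not a prescription:

    # Framework (licensure): who has demonstrated competence to write data.
    LICENSED_WRITERS = {"intake_app", "billing_app"}

    # Coherent standard: one understandable, integrated rule set.
    def meets_standard(record):
        return "patient_id" in record and "dob" in record

    # Monitoring/audit: compliance is observed and recorded, not policed.
    audit_log = []

    def submit(writer, record):
        compliant = writer in LICENSED_WRITERS and meets_standard(record)
        audit_log.append((writer, compliant))
        return compliant  # authority stays passive; the log drives governance changes

    submit("intake_app", {"patient_id": "123", "dob": "1990-07-04"})  # compliant
    submit("legacy_job", {"patient_id": "123"})                       # surfaces in the audit
    print(audit_log)

As on the roadways, the value lies in visible standards and routine measurement; nothing blocks
the road, and the audit trail tells the governors where the standards need to change.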
The result of good governance is a community that enjoys consistency, predictability and safety
and is mostly free from nasty surprises. The authority that is present is passive and present only
to deal with issues that don't fit within the governance structure. If authority is needed
everywhere, there is no governance anywhere.
We The People
We the People of the United States, in Order to form a more perfect Union, establish Justice,
insure domestic Tranquility, provide for the common defence, promote the general Welfare, and
secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this
Constitution...
Article 1 of this constitution describes a representative form of governance, recognizing that the
needs for deliberation and timely decision making can best be met in this way. This was
particularly true in a time when travel was by foot or by horse (or other animal propulsion) or by
water, propelled whether by wind, oar or paddle.
Two thoughts come to my mind:
1. What might this article say if written today?
2. There has been no need to modify the principles set forth during the ensuing 222 years.
All of this leads to a third thought. If the goals of corporate governance are substantially the
same as those set out in the preamble:
more perfect Union—Every CEO wants the company to operate as a unit, with a single
purpose
establish Justice—A sense of justice is a prerequisite for people to focus on their duties
and responsibilities.
insure domestic Tranquility—Inter-personal and inter-organizational dissension is a
primary cause of lost productivity.
provide for the common defence—The company must defend its position in the
marketplace and each employee is critical to that defense.
promote the general Welfare—This goes hand-in-hand with justice. It's human nature to
want things to be better.
secure the Blessings of Liberty—Personal liberty is always subject to the other goals.
then maybe we ought to consider whether the method should be the same.
It's hard for me to consider data governance (which is where I'm coming from) in a vacuum. The
goals of data governance are substantially the same goals outlined above. Defense is about
defending the integrity of the data resource. Union is about consistency. Justice and welfare are
about everyone living by the same rules (thus producing consistency).
I don't want to make data governance sound so impossibly complex that we throw up our hands
in surrender. The message I'm transmitting is that we have models to use. We do not have to
reinvent governance.
One of the difficulties in any governance model is to come up with a definition or picture of "the
governed". We go through life happily assuming that everyone else is "just like me" in terms of
their wants and needs. Mostly that works, but every now and then, we run into someone who
isn't "just like me." When that happens we have two choices. Either we try to make the other
person just like me or we adapt our view of "me" so that it includes some new parameters. In
corporate life, it is exceedingly dangerous to assume that anyone in a role different from ours is
"just like me."
Even if we restrict ourselves to data governance, we find that we have to include as "governed"
many who are filling different corporate roles and are definitely not "like" us. Again, I go back to
the American Colonies in the mid-eighteenth century. Imposing or trying to impose a set of rules
on people whose lives and needs I don't understand is destined for failure. The secondary
message is: either include everyone in designing the rules or (poor second choice) understand the
needs of the others before designing the rules.
Everything I see and hear about data governance is from the point of view of the person whose
role is management of the data resource. There isn't a single person in the marketing department
who would ever conceive of the need for data governance. Of course, we can spend time
learning to speak the marketing language and becoming familiar with marketing problems; then we
can show them that some kind of governance is needed and they will agree. They might even
agree to invest some time on a committee. Eventually, though, they're going to wonder if this is a
good use of their most precious resource—time.
Making laws (standards) is a messy process. Much of the data governance effort is about the
process (identifying stakeholders, building consensus, the political side of things), while the
standards and processes themselves become a very small box on a big diagram. My thought is that we don't
even know the stakeholders until we understand the processes. The political side is essential, but
there is a lot of good we could be doing if we would focus on the processes and standards.
I keep saying this because, while there may be similarity in the way two corporations handle
governance, I have serious doubts whether it will ever be possible to export one company's
solution to others. The political implications of forcing an outsider's will on a population would
cause "failed" to be stamped on the effort nearly immediately.
Bottom line: You're on a burning platform. Don't wait for someone to save you. What do you
have? What can you do? Do it!
Can and Should
Can and Should are in constant tension. They both imply something that has not yet happened—
in other words, they both are in the future. So here's the key question:
Do you want your future to be composed of cans or do you want a future of shoulds?
Should is closely related to could.
If you could do what you should do, would you do it? If you should and could but don't, what
kind of future do you have before you?
Is your past characterized by "might have", "could have", "would have", "should have", or as my
father was fond of saying, "mighta, woulda, coulda, shoulda?"
What's the difference between could and can? It might be knowledge or it might simply be
practice. For many people, the biggest difference is the realization that there is something beyond
"I can." Parents fill this role as do teachers, mentors and good friends. The process of revealing
the new world of could is known as coaching.
What we should do is a function of goals, history and current context. Most of us get paid to
know what should be done. Most of us also take the easy way out and do what we can
rather than what we could or should. In fact, "Do what you can," has become a universally
accepted surrender. When the boss says it, it means that
1. they don't know what should be done
2. they don't know what could be done
3. they don't want to be bothered with knocking down roadblocks
4. they don't really care about the outcome
When I say it ("I did what I could.") it means
1. I know what should have been done
2. I know that I could have done more
3. I told them but they wouldn't listen
4. I was not committed to a quality result
We nearly always allow ourselves to choose the familiar path. When faced with a choice
between can and could, we choose to do what we have done in the past—can.
We cannot get the data quality we need unless we have the governance we need and we can have
neither if we continue to do as we've always done. This is macro as well as micro advice.
Governance is not committees and steering groups, though it may have need of such. Data
quality is not one definition, though that may be helpful. Both are about contextual consistency
and predictability. This goal could and should be achieved in whatever ways are appropriate to
the context within which the consistency is desired.
Consistency is a product of process and the foundation of improvement. Once the process
produces consistent output, you have freedom to classify and categorize its output in whatever
ways are suitable to its customers. We are currently engaged in trying to classify, warehouse and
use inconsistent products created by inconsistent processes.
What could we do? What should we do?
Haves and Have Nots
When I speak of have here, it should be clear that I'm referring to resources. Less clear but no
less important things to have include:
need (acknowledged)
commitment
known cause of pain
First and foremost is availability of resources to be applied to making improvements. Data and
information quality diseases have much in common with human diseases in terms of diagnosis
and treatment. There is much discussion today concerning the state of health care in the U.S. The
discussion focuses not on diagnosis or treatment—those aspects are well understood (if
imperfectly practiced)—but on paying for the diagnosis and treatment.
It seems that financial resources or the lack of financial resources is the single most important
determinant of physiological well-being. If we examine the whys behind this, we soon see that
expectations have much to do with it. The person without financial resources learns to expect
that some problems will be chronic and learns to live with them, perhaps at a lower level of
function. The financially well-off person learns to expect that every problem has a cause and a
cure and that time and money will produce the expected well-being.
Neither is absolutely correct and both sets of expectations produce advantages as well as
disadvantages.
We can apply the lessons of health expectations to data quality. Larger or wealthier companies
expect that they will be able to attack a quality issue with sufficient resources to conquer it.
Smaller or less well-off organizations will not feel able to dedicate one or more people to the
issue and will elect to "do the best they can" (see previous post). Small business leaders will see
that everyone must be involved in the solution for it to work and that alone will cause them to
turn away from a frontal attack and "make do." Large business leaders may believe that the right
manager or leader with sufficient resources can bring it off.
Again, neither is absolutely correct.
A person or an organization resigned to living with pain is always going to find it difficult or
impossible to improve while a person or organization immersed in full scale battle with the
problem may well miss opportunities for improvement.
As it turns out, a "data quality" campaign is like a campaign against bacteria—almost
meaningless. Because the scope and scale of the campaign preclude considerations of nuance, we
find that we make enemies from within the ranks and everything degenerates until nothing is
happening. We can make progress against a specific bacterium or a specific quality issue but we
soon realize that we can't hold those gains without creating a framework within which we can
establish trust, confidence and consistency. That framework has come to be called data
governance. In the case of physiological disease, the framework is Medicine.
Whether you're a have or a have not, the resource issue turns out to be far less important than we
might have thought. Consider expectations first.
Can we live with or adapt to the pain?
Have we already adapted? How?
What limitations are imposed by the adaptation?
We can choose to treat symptoms, cure the disease, or prevent the disease. Which is
within our reach? What can we do? What should we do?
In most cases, the best choice is to treat symptoms while making lifestyle changes to prevent the
disease. Sometimes we have to cure the current disease or we die before we can implement the
lifestyle changes. The point is that we always have options. A specific option must consider the
past, present and future. A combination of options may produce the best result. Last but not least,
have and have not is not really about resources but is about expectations. Commitment is often
born of desperation when we realize that we just can't tolerate the future implied by our current
expectations. Now we're really ready to do something meaningful.
What Do You Do When Things Aren't Working The Way You'd Hoped?
Let's pick a context first because a) this problem is pervasive in the world I live in (how about
you?) and b) the context will determine our course of action. I'll use my own life as an example.
I have spent my life seeking to understand my environment so that I could have a chance of
staying out of hot water by being able to predict outcomes. I actually got pretty good at the
predicting part but was never able to translate that into the staying out of hot water part. It turns
out that when you see a result coming that is unwelcome to everyone, hot water is the least of
your worries.
Of course I could have kept quiet and just let things happen but the problem with that is that
almost invariably a minor course change would have prevented the outcome. It always seemed
reasonable to attempt that minor change. Just as invariably there were political implications
involved in any changes to the published plan. Bottom line: my career is littered with "you were
right's" that came three years after I moved on.
So, if you would learn anything from my example, maybe it would be that "being right" carries
no value. Maybe it's that you should just keep your head down and wait for the seniority
promotion or for retirement. Maybe the lesson is that you do what you can and the rest belongs
to someone else.
I will say that over time I have achieved objectives that others considered "impossible" because I
was willing to take risks. The problem there, of course, is that if the objective was considered
unachievable, then no one is prepared when they find themselves standing inside the walls.
I think that this is also the story of data management (to include what has come to be known as
data governance). Organizations have been talking about data management for nearly thirty years
now and there are hundreds if not thousands of experts who will tell you exactly what you should
do to enjoy the benefits of good data management practice. What none of them will tell you, because
a) you don't want to hear it and b) you wouldn't hire them, is that there is no proven
methodology—no set of practices and tools, skills and technology—that will guarantee results.
Why should this be? You would think that in 30 years someone would have stumbled across
something that will deliver predictable results. The answer lies in the subject matter. "Data" is a
concept understood by everyone. Everyone in the boardroom has their favorite data. The issue at
the root of all problems is that "everyone" is seeing data "as through a glass, darkly."
The inability to communicate about data and reach a consensus is what is keeping us from our
objective. To this add the cult of personality that defines the management—let's call it
governance—of the corporation. The decision makers understand nothing of the underwater
portion of the data iceberg, seeing only the table, graph or dashboard that's in front of them.
What must be managed is the abstraction that is data and not the values that are only the visible
portion. When we try to do anything with the abstract, we find that there are side effects on the
visible portion that cause VIP personalities to convulsively respond in exactly the least useful
way.
You can get useful results if your objective is modest. For example, it is possible to get two
business functions that are exchanging data (or three or more that have a symbiotic relationship
based on data) to take consensus action to stop what is often a great deal of daily pain. The
intractability is encountered when we attempt to broaden the scope to cross departmental or
divisional boundaries. The goals and methods of data management are counterintuitive to those
raised in the power politics of corporate "success."
We usually find ourselves managing data as a commodity, "how much", "how many", "what is
the cost", "who produces", "who consumes", "spoilage rate", "how fast"... While these all have
an attraction in that the answers can be easily captured in one of those tables, graphs, dashboards,
none deal with the underlying problem of managing the abstraction. Data is the most complex
thing that a corporation attempts to manage. It is more complex even than money.
The pity is that we treat data as if putting it into a "piggy bank" solves all our problems. You
heard it here first:
Technology is no answer—technology can help us sort different kinds of values into
different piggy banks, no more.
Technical skills (modeling, DBA, quality...) are no answer. The cashier makes use of
such skills to keep his/her drawer in order and reconciled.
People skills by themselves can't achieve any result except perhaps building meaningless
consensus.
That's enough clues. If you call, don't bother to tell me what DBMS or CRP system or BI tools
you're using. None of those things are of interest until the final stages of a solution. I don't expect
any calls because too much credibility is wrapped up in the current initiative—whatever it is.
When it fails to produce results, a new personality will step in and you'll start the cycle anew.
Someone, someday may actually be willing to take a risk to stop the pain. I'll be retired or
deceased by that time but maybe you'll have learned from this what you should be searching for.
Bibliography
Agile Alliance. (2013). Agile Alliance::Home. Retrieved May 31, 2013, from Agile Alliance:
http://www.agilealliance.org/
Bernstein, A. J., & Rosen, S. C. (1989). Dinosaur Brains: Dealing with All Those Impossible
People at Work . New York: Ballantine Books.
Casey, S. (1998). Set Phasers on Stun: And Other True Tales of Design, Technology, and Human
Error . Santa Barbara: Aegean Publishing Company.
Chen, P. P. (1976). The Entity-Relationship Model—Toward a Unified View of Data. ACM
Transactions on Database Systems, 9–36.
CITEC. (2012, 11 14). Deming's Red Bead Experiment. Retrieved 06 10, 2013, from YouTube:
http://www.youtube.com/watch?v=R3ewHrpqclA
Computer Aided Software Engineering. (2013, 05 25). Retrieved 06 14, 2013, from Wikipedia:
http://en.wikipedia.org/wiki/Computer-aided_software_engineering
Crosby, P. (1979). Quality Is Free: The Art of Making Quality Certain . New York: McGraw-
Hill.
Crosby, P. (1984). Quality Without Tears: The Art of Hassle-Free Management. New York:
McGraw-Hill.
Database Normalization. (2013, 07 15). Retrieved 07 18, 2013, from Wikipedia:
http://en.wikipedia.org/wiki/Database_normalization
Deming, W. E. (1982). Out of the Crisis. Boston: MIT Press.
Dijkstra, E. (1982). Selected Writings on Computing: A Personal Perspective, Texts and
Monographs in Computer Science. Springer-Verlag.
English, L. P. (1999). Improving Data Warehouse and Business Information Quality: Methods
for Reducing Costs and Increasing Profits. Hoboken: John Wiley & Sons.
Gilovich, T. (1993). How We Know What Isn't So: The Fallibility of Human Reason in Everyday
Life. New York: The Free Press.
Helbig, Steinwender, Graf, & Kiefer. (2010). Action observation can prime visual object
recognition. Springer Open Choice: Experimental Brain Research, 251-258.
How Long To Form A Habit? (2009, 09 21). Retrieved 05 10, 2013, from PsyBlog:
http://www.spring.org.uk/2009/09/how-long-to-form-a-habit.php
Maslow, A. (1943). A Theory of Human Motivation. Psychological Review, 370-396.
McGilvray, D. (2008). Executing Data Quality Projects: Ten Steps to Quality Data and Trusted
Information. Burlington, MA: Morgan Kaufmann.
Nemoto, M. (1987). Total Quality Control for Management: Strategies and Techniques from
Toyota and Toyoda Gosei. Englewood Cliffs, NJ: Prentice Hall, Inc.
NIST. (n.d.). Baldrige Performance Excellence Program. Retrieved May 8, 2013, from Malcolm
Baldrige National Quality Award Program: http://www.nist.gov/baldrige/enter/apply.cfm
Peters, T. J., & Waterman, R. H. (1987). In Search of Excellence: Lessons from America's
Best-Run Companies. New York: HarperCollins.
Rand Corporation. (2012, 07 27). Suboptimization in Operations Problems. Retrieved 05 31,
2013, from Rand: http://www.rand.org/pubs/papers/P326.html
Rath, T. (2007). StrengthsFinder 2.0. New York: Gallup Press.
Reason, J. (1990). Human Error. Cambridge, UK: Cambridge University Press.
Redman, T. C. (1992). Data Quality: Management and Technology. New York: Bantam Books.
Silverston, L., & Agnew, P. (2008). The Data Model Resource Book Vol 3: Universal Patterns
for Data Modeling. Indianapolis: Wiley.
Simsion, G. (2006). Data Modeling: Description or Design? (Doctoral Thesis). Melbourne:
University of Melbourne, Department of Information Systems.
Simsion, G. (2005, March 1). There's a Lot of New Stuff to Say About Data Modeling.
Retrieved July 30, 2013, from Information Management: http://www.information-
management.com/infodirect/20050311/1022729-1.html?zkPrintable=1&nopagination=1
Tzu, S. (n.d.). The Art of War.
Wand, Y., & Wang, R. Y. (1996). Anchoring Data Quality Dimensions in Ontological
Foundations. Communications of the ACM, 39(11), 86-95.
i Maximizer
People who are especially talented in the Maximizer theme focus on strengths as a way to stimulate personal
and group excellence. They seek to transform something strong into something superb.
Connectedness
People who are especially talented in the Connectedness theme have faith in the links between all things. They
believe there are few coincidences and that almost every event has a reason.
Ideation
People who are especially talented in the Ideation theme are fascinated by ideas. They are able to find
connections between seemingly disparate phenomena.
Strategic
People who are especially talented in the Strategic theme create alternative ways to proceed. Faced with any
given scenario, they can quickly spot the relevant patterns and issues.
Learner
People who are especially talented in the Learner theme have a great desire to learn and want to continuously
improve. In particular, the process of learning, rather than the outcome, excites them.
(Rath, 2007)