Data Quality and Master Data Management with Microsoft SQL Server 2008 R2
Dejan Sarka, Davide Mauri
Table of Contents
Table of Contents ............................................................................................................................. 1
Foreword .......................................................................................................................................... 6
Acknowledgements .......................................................................................................................... 8
About the Authors .......................................................................................................................... 11
Chapter 1: Master Data Management ........................................................................................... 12
Types of Data .............................................................................................................................. 13
What Is Master Data?............................................................................................................. 14
Master Data Management ..................................................................................................... 16
MDM Challenges ................................................................................................................ 21
Data Models ............................................................................................................................... 23
Relational Model .................................................................................................................... 23
Dimensional Model ................................................................................................................ 31
Other Data Formats and Storages .......................................................................................... 33
Data Quality ................................................................................................................................ 36
Data Quality Dimensions ........................................................................................................ 36
Completeness ..................................................................................................................... 37
Accuracy ............................................................................................................................. 38
Information......................................................................................................................... 38
Consistency......................................................................................................................... 39
Data Quality Activities ............................................................................................................ 42
Master Data Services and Other SQL Server Tools .................................................................... 47
Master Data Services .............................................................................................................. 47
Entities ................................................................................................................................ 49
Attributes............................................................................................................................ 49
Members ............................................................................................................................ 49
Hierarchies.......................................................................................................................... 49
Collections .......................................................................................................................... 50
Versions .............................................................................................................................. 50
Other SQL Server Tools for MDM........................................................................................... 51
Summary .................................................................................................................................... 54
References .................................................................................................................................. 55
Chapter 2: Master Data Services Concepts and Architecture ........................................................ 56
Master Data Services Setup ....................................................................................................... 57
Installation of MDS Components and Tools ........................................................................... 57
Setup of MDS Database.......................................................................................................... 58
Setup of MDS Web Application .............................................................................................. 59
Master Data Manager Web Application .................................................................................... 62
Explorer .................................................................................................................................. 63
Version Management ............................................................................................................. 63
Integration Management ....................................................................................................... 63
System Administration ........................................................................................................... 64
User and Groups Permissions ................................................................................................ 64
Models ........................................................................................................................................ 65
Models .................................................................................................................................... 65
Entities and Attributes ........................................................................................................... 66
Hierarchies.............................................................................................................................. 77
Derived Hierarchy............................................................................................................... 78
Explicit Hierarchy ................................................................................................................ 79
Collections .............................................................................................................................. 84
Business Rules ........................................................................................................................ 85
Importing, Exporting and Managing Data .................................................................................. 89
Import Data ............................................................................................................................ 89
Managing Data ....................................................................................................................... 91
Export Data ............................................................................................................................. 93
Multiple Versions of Data ........................................................................................................... 95
MDS Database Schema .............................................................................................................. 98
Staging Tables ....................................................................................................................... 101
Summary ..................................................................................................................................103
References ................................................................................................................................ 104
Chapter 3: Data Quality and SQL Server 2008 R2 Tools ............................................................... 105
Measuring the Completeness ..................................................................................................106
Attribute Completeness .......................................................................................................107
XML Data Type Attribute Completeness .............................................................................. 109
Simple Associations among NULLs ....................................................................................... 112
Tuple and Relation Completeness........................................................................................ 118
Multivariate Associations among NULLs .............................................................................. 120
Profiling the Accuracy............................................................................................................... 126
Numeric, Date and Discrete Attributes Profiling .................................................................128
Strings Profiling .................................................................................................................... 131
Other Simple Profiling ..........................................................................................................134
Multivariate Accuracy Analysis ............................................................................................ 136
Measuring Information ............................................................................................................142
Using Other SQL Server Tools for Data Profiling ......................................................................145
SSAS Cubes ........................................................................................................................... 145
PowerPivot for Excel 2010 ...................................................................................................149
SSIS Data Profiling Task ........................................................................................................153
Excel Data Mining Add-Ins....................................................................................................156
Clean-Up ............................................................................................................................... 160
Summary ..................................................................................................................................161
References ................................................................................................................................ 162
Chapter 4: Identity Mapping and De-Duplicating ........................................................................163
Identity Mapping ...................................................................................................................... 165
Problems............................................................................................................................... 166
T-SQL and MDS String Similarity Functions ..........................................................................166
Preparing the Data ...........................................................................................................169
Testing the String Similarity Functions .............................................................................174
Optimizing Mapping with Partitioning .............................................................................176
Optimizing Mapping with nGrams Filtering .....................................................................180
Comparing nGrams Filtering with Partitioning ................................................................ 188
Microsoft Fuzzy Components ...................................................................................................190
Fuzzy Algorithm Description ................................................................................................ 190
Configuring SSIS Fuzzy Lookup ............................................................................................. 192
Testing SSIS Fuzzy Lookup ....................................................................................................200
Fuzzy Lookup Add-In for Excel.............................................................................................. 201
De-Duplicating .......................................................................................................................... 203
Preparing for Fuzzy Grouping ............................................................................................... 204
SSIS Fuzzy Grouping Transformation ................................................................................... 206
Testing SSIS Fuzzy Grouping .................................................................................................211
Clean-Up ............................................................................................................................... 213
Summary ..................................................................................................................................215
References ................................................................................................................................ 216
Index .............................................................................................................................................217
Foreword
Dejan Sarka
"If all men were just, there would be no need of valor," said Agesilaus, king of Sparta (444 BC-360 BC). Just from this quote, we can tell that Agesilaus was not too keen on fighting. Actually, Agesilaus never hurt his enemies without just cause, and he never took any unjust advantage.
Nevertheless, the ancient world was just as imperfect as the contemporary world is, and
Agesilaus had to fight his share of battles.
If everyone always inserted correct data into a system, there would be no need for
proactive constraints or for reactive data cleansing. We could store our data in text files, and
maybe the only application we would need would be Notepad. Unfortunately, in real life, things
go wrong. People are prone to make errors. Sometimes our customers do not provide us with
accurate and timely data. Sometimes an application has a bug and makes errors in the data.
Sometimes end users unintentionally make a transposition of letters or numbers. Sometimes we
have more than one application in an enterprise, and in each application we have slightly
different definitions of the data. (We could continue listing data problems forever.)
A good and suitable data model, like the Relational Model, enforces data integrity through the
schema and through constraints. Unfortunately, many developers still do not understand the
importance of a good data model. Nevertheless, even with an ideal model, we cannot enforce
data quality. Data integrity means that the data is in accordance with our business rules; it does
not mean that our data is correct.
Not all data is equally important. In an enterprise, we can always find the key data, such as
customer data. This key data is the most important asset of a company. We call this kind of data
master data.
This book deals with master data. It explains how we can recognize our master data. It stresses
the importance of a good data model for data integrity. It shows how we can find areas of bad
or suspicious data. It shows how we can proactively enforce better data quality and make an
authoritative master data source through a specialized Master Data Management application. It
also shows how we can tackle the problems with duplicate master data and the problems with
identity mapping from different databases in order to create a unique representation of the
master data.
For all the tasks mentioned in this book, we use the tools that are available in the Microsoft SQL Server 2008 R2 suite. In order to achieve our goal, good quality of our data, nearly any part of the suite turns out to be useful. This is not a beginners' book. We, the authors, assume that you, the readers, have quite good knowledge of the SQL Server Database Engine, .NET, and other tools from the SQL Server suite.
Achieving good quality of your master data is not an easy task. We hope this book will help you
with this task and serve you as a guide for practical work and as a reference manual whenever
you have problems with master data.
Acknowledgements
Dejan Sarka
This book would never have been finished without the help and support of several people. I
need to thank them for their direct and indirect contributions, their advice, their
encouragement, and other kinds of help.
In the first place, I have to mention my coauthor and colleague from SolidQ, Davide Mauri. As an
older guy, I have followed his career over several years. I am amazed by the amount of
knowledge he gained in the past few years. He has become a top speaker and recognized author
of several books and articles. Nevertheless, he retained all the vigor of youth and is still full of
ideas. Davide, I am so proud of you. I always enjoy working with you, and I am looking forward
to our further cooperation.
Together with three other colleagues from SolidQ, Itzik Ben-Gan, Herbert Albert, and Gianluca Hotz, I form a gang of four inside the company, called the Quartet. It is not just an unofficial group; our official duty in the company is to certify and confirm places for company parties. Our endless
discussions during conferences, hikes, or time spent in pubs are an invaluable source of insight
and enlightenment. Besides general help through our socializing, all three of them have made a
concrete contribution to this book.
Herbert helped with the technical review. Gianluca didn't review the book officially; nevertheless, he read it. His old man's grumbling was always a sign to me that I had written something inaccurate or even wrong. Itzik was not directly involved in this book. However, this book would never have been written without him. This is the ninth book I have contributed to so far. I would never even have started writing if Itzik hadn't pushed me and involved me in coauthoring his book seven years ago. Itzik has invited me to contribute to four of his books so far, and we are already starting the
fifth one together. To my friends from the Quartet, thank you for all of the precious time we are
spending together!
SolidQ is not just a regular company. First of all, we are all friends. Even more, we have the best
CEO in the world, Fernando Guerrero. Fernando, thank you for inviting me to become a part of
this wonderful group of people, and thank you for all of your patience with me, and for all of
your not only technical but also life advice! And thanks to all the other members of this company; because together we number more than 150 worldwide experts in SQL Server and related technologies, I simply cannot list every single colleague.
Besides concrete work with SQL Server, I am also involved in theoretical research at The Data Quality Institute. Dr. Uroš Godnov helped me with my first steps in data quality, and he continues to educate me. Although he has some problems with his health, he is always available to me. Uroš, forget what Agesilaus said! We need courage not just because of our
enemies; sometimes we need courage because of ourselves. Stay as brave as you are forever!
I cannot express enough how much I appreciate being a member of the Microsoft Most
Valuable Professional (MVP) program. Through this program, MVPs have direct contact with the
SQL Server team. The help from the team in general, and from the Master Data Services part of the team in particular, is extraordinary. No matter how busy they are with developing a new version of SQL
Server, they always take time to respond to our questions.
Finally, I have to thank my family and friends. Thank you for understanding the reduced time I
could afford to spend with you! However, to be really honest, I did not miss too many parties
and beers.
Davide Mauri
Dejan already wrote a lot about our friendship, our company, and everything that allows us to continue to enjoy our work every day. But even at the cost of seeming repetitious, I'd also like to thank all my SolidQ colleagues, who are not only colleagues but friends above all. I learn a lot from you, and each discussion is mind-opening for me. I cannot thank you enough for this!
I would like to say a big thank you to Dejan, Itzik, Gianluca, and Fernando for being examples, truly mentors, not only through words but also with facts: You really give me inspiration and you show me each day, with your excellent work, your determination, your honesty, and your integrity, the path one has to follow to be someone who can make a difference, from a professional and ethical point of view. I couldn't have found better colleagues, friends, and partners. Thanks!
I'd also like to thank my Italian team specifically. Guys, we're really creating something new, setting new standards, finding clever ways to solve business problems, making customers happy, and giving them the maximum quality the market can offer, while being happy ourselves at the same time. This is not easy, but we're able to work together as a well-trained team, enjoying what we do each day. Thanks!
Last but not least, of course, a big thanks also to my wife Olga and my newborn son Riccardo:
You really are the power behind me. Olga, thanks for your patience and for your support!
Riccardo, thanks for your smiles that allow me to see the world with different eyes! Thank you!
Davide Mauri
Davide Mauri is a SolidQ Mentor and a member of the Board of Directors of SolidQ Italia. A well-known Microsoft SQL Server MVP, MCP, MCAD, MCDBA, and MCT, as well as an acclaimed speaker at international SQL Server conferences, Davide enjoys working with T-SQL and relational modeling and studying the theory behind these technologies. In addition, he is well-grounded in Reporting Services, .NET, and object-oriented principles, and he has deep knowledge of Integration Services and Analysis Services. This gives him a well-rounded expertise around the Microsoft Data Platform and the vision and experience needed to handle the development of complex business intelligence solutions. He is a course author for SolidQ, including seminars about upgrading to SQL Server 2008, co-author of the book Smart Business Intelligence Solutions with SQL Server 2008, and author of the well-known DTExec replacement tool DTLoggedExec (http://dtloggedexec.davidemauri.it).
Chapter 1: Master Data Management

This chapter discusses, among other topics, data models and data quality.
Types of Data
In an average company, many different types of data appear. These types include:
Metadata: this is data about data. Metadata includes database schemas for transactional and analytical applications, XML document schemas, report definitions, additional database table and column descriptions stored by using SQL Server extended properties or custom tables, application configuration data, and similar (see the extended property example after this list).
Semi-structured data: this is typically in XML form. XML data can appear in standalone files, or as part of a database (a column in a table). Semi-structured data is useful where the metadata, i.e. the schema, changes frequently, or when you do not need a detailed relational schema. In addition, XML is widely used for data exchange.
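To make the extended-properties point concrete, here is a minimal T-SQL sketch of storing a column description as metadata; the property name and the description text are our own illustration, not taken from the book.

EXEC sys.sp_addextendedproperty
    @name = N'Description'
    ,@value = N'Product size; NULL when size is not applicable.'
    ,@level0type = N'SCHEMA', @level0name = N'Production'
    ,@level1type = N'TABLE', @level1name = N'Product'
    ,@level2type = N'COLUMN', @level2name = N'Size';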
Any of these entity sets can be further divided into specialized subsets. For example, a company can segment its customers based on previous sales into premier and other customers, or based on customer type, such as persons and companies.
For analytical applications, data is often organized in a Dimensional Model. A popular name for the Dimensional Model is Star Schema (although, to be precise, a dimensionally modeled database can include multiple star and snowflake schemas). The name comes from the shape: there is a central fact table and surrounding dimension tables, or dimensions. Fact tables hold the data we are measuring, namely measures. Dimension attributes are used for pivoting fact data; they give measures meaning. Dimensions give context to measures. Fact tables are populated from
Please note the last bullet. Master Data Management can be quite costly and very intensive in terms of resources used, including man hours. For a small company with a single operational system, probably no specific MDM tool is needed. Such a company can maintain master data in the operational system. The more we reuse master data (in multiple operational, CRM, and analytical applications), the bigger the ROI we get.
Central MDM, single copy: with this approach, we have a specialized MDM application, where we maintain master data, together with its metadata, in a central location. All existing applications are consumers of this master data. This approach seems preferable at first glimpse. However, it has its own drawbacks. We have to upgrade all existing applications to consume master data from the central storage instead of maintaining their own copies. This can be quite costly, and maybe even impossible with some legacy systems. In addition, our central master metadata has to be a union of all the metadata from all source systems. Finally, the process of creating and updating master data could simply be too slow. It could happen, for example, that a new customer would have to wait for a couple of days before submitting the first order, because the process of inserting customer data with all possible attributes involves contacting all source systems and simply takes too long.
For the last two approaches, we need a special MDM application. A specialized MDM solution could also be useful for the central metadata storage with identity mapping approach and for the central metadata storage and central data that is continuously merged approach. SQL Server 2008 R2 Master Data Services (MDS) is a specialized MDM application. We could also write our own application. Other SQL Server tools, like SSIS and SSAS, are helpful in the MDM process as well. However, for the last two approaches to MDM, MDS is the most efficient solution.
MDM Challenges
For a successful MDM project, we have to tackle all challenges we meet. These challenges
include:
Authority
Who is responsible for master data? Different departments want to be authoritative for their part of the master data, and authority for master data can overlap in an enterprise. We have to define policies for master data, with an explicit data stewardship process prescribed. We also define data ownership as part of resolving authority issues.
Data conflicts
When we prepare the central master data database, we have to merge data from
our sources. We have to resolve data conflicts during the project, and, depending
on the MDM approach we take, replicate the resolved data back to the source
systems.
Domain knowledge
We should include domain experts in an MDM project.
Documentation
We have to take care that we properly document our master data and metadata.
No matter which approach we take, MDM projects are always challenging. However, tools like
MDS can efficiently help us resolve possible issues.
Data Models
It is crucial that we have a basic understanding of the data models used in an enterprise before we start an MDM project. Details of data modeling are out of scope for this book; only the minimum needed is covered here. We are going to introduce the Relational Model, the Dimensional Model, and, briefly, other models and storage formats.
Relational Model
The relational model was conceived in the 1960s by Edgar F. Codd, who worked for IBM. It is a
simple, yet rigorously defined conceptualization of how users perceive and work with data. The
most important definition is the Information Principle, which states that all information in a
relational database is expressed in one (and only one) way, as explicit values in columns within
rows of a table. In the relational model, a table is called a relation, and a row is called a tuple,
which consists of attributes.
Each relation represents some real-world entity, such as a person, place, thing, or event. An
entity is a thing that can be distinctly identified and is of business interest. Relationships are
associations between entities.
A row in a relation is a proposition, like "an employee with identification equal to 17, full name Davide Mauri, lives in Milan."
The relation header, or the schema of the relation, is the predicate for its propositions. A predicate is a generalized form of a proposition, like "an employee with identification EmployeeId(int), full name EmployeeName(string), lives in City(CitiesCollection)."
Note the name / domain pairs of placeholders for concrete values. The domain, or the data type, is the first point where an RDBMS can start enforcing data integrity. In the previous example, we cannot insert an EmployeeId that is not an integral number. We cannot insert a city that is not in
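As a minimal sketch of how a schema enforces domains and constraints, consider the following T-SQL; the table, column, and constraint definitions are ours, invented for illustration, and are not the book's example database.

CREATE TABLE dbo.Cities
(
    CityId int NOT NULL PRIMARY KEY,
    CityName nvarchar(50) NOT NULL
);

CREATE TABLE dbo.Employees
(
    EmployeeId int NOT NULL PRIMARY KEY,    -- domain: integral numbers only
    EmployeeName nvarchar(100) NOT NULL,    -- NOT NULL prevents unknown names
    CityId int NOT NULL
        REFERENCES dbo.Cities (CityId)      -- only cities from the Cities relation are allowed
);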
[Figure: the Orders table modeled as a single relation, with customer attributes (CustomerId, CustomerName, Address, City, Country) and a relation-valued OrderDetails{ProductId, ProductName, Quantity} attribute.]
[Figure: the Orders table flattened so that each order line is a row, with a composite primary key on OrderId and ItemId and repeated customer and product attributes.]
[Figure: the order data split into an Orders table (keyed on OrderId) and an OrderDetails table (keyed on OrderId and ItemId, with ProductId, ProductName, and Quantity).]
[Figure: the normalized schema with Customers, Orders, OrderDetails, Products, Cities, and Countries tables related through foreign keys.]
[Figure: the normalized schema extended with Persons (BirthDate) and Companies (NumberOfEmployees) as subtype tables of Customers.]
Dimensional Model
The Dimensional Model of a database has a more deterministic schema than the Relational Model. We use it for Data Warehouses (DW). In a DW, we store merged and cleansed data from different source systems, with historical data included, in one or more star schemas. A single star schema covers one business area, like sales, inventory, production, or human resources. As we already know, a star schema has one central (fact) table and multiple surrounding (dimension) tables. Multiple star schemas are connected through shared dimensions. An explicit Time (or Date, depending on the level of granularity we need) dimension is always present, as we always include historical data in a DW. The star schema was introduced by Ralph Kimball in his famous book The Data Warehouse Toolkit.
A star schema is deliberately denormalized. Lookup tables, like Cities and Countries in our example, are flattened back into the original table, and attributes from those lookup tables form natural hierarchies. In addition, we can also flatten specialization tables. Finally, we can add multiple derived attributes and custom-defined hierarchies. An example of a star schema created from our normalized and specialized model is illustrated in figure 6.
[Figure 6: a star schema with the FactOrders fact table (CustomerId, ProductId, DateId, Quantity) and the DimCustomers, DimProducts, and DimDates dimensions; DimCustomers includes the flattened City, Country, PersonCompanyFlag, PersonAge, and CompanySize attributes.]
The data in an ODS has limited history and is updated more frequently than the data in a DW. However, the data is already merged from multiple sources. An ODS is often part of a CRM application; typically, it holds data about customers. An MDM solution can actually replace, or integrate with, an existing ODS.
Some data is semi-structured. Either the structure is not prescribed in so many details, or the
structure itself is volatile. Nowadays semi-structured data usually appears in XML format. We
can have XML data in files in a file system or in a database. Modern relational systems support
XML data types.
XML instances can have a schema. For defining XML schemas, we use XML Schema Definition (XSD) documents. An XSD is itself an XML instance, with defined namespaces, elements, and attributes, that expresses a set of rules to which an XML instance must conform. If an XML instance conforms to its XSD, we say it is schema-validated. Here is an example of an XML schema:
<xsd:schema targetNamespace="ResumeSchema" xmlns:schema="ResumeSchema"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:sqltypes="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
elementFormDefault="qualified">
<xsd:import
namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
schemaLocation="http://schemas.microsoft.com/sqlserver/2004/sqltypes/sqltypes
Reading XML is not much fun. Nevertheless, we can extract some useful information. From the highlighted parts of the code, we can conclude that this is a schema for resumes, probably for job candidates. It allows two elements which, according to their names, describe skills and previous employment. Skills can appear only once, while previous employment can appear multiple times. In order to maintain data quality for XML data, we should enforce validation of XML instances against XML schemas. SQL Server supports XML schema validation for columns of the XML data type.
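As a minimal sketch of such validation, the following T-SQL binds a column to an XML schema collection; the collection name, the deliberately trivial schema, and the table are our own illustration, not the resume schema shown above.

CREATE XML SCHEMA COLLECTION dbo.ResumeSchemaCollection AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Skills" type="xsd:string" />
  </xsd:schema>';
GO
CREATE TABLE dbo.Candidates
(
    CandidateId int NOT NULL PRIMARY KEY,
    Resume xml(dbo.ResumeSchemaCollection) NULL  -- only schema-valid XML can be stored here
);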
Every company also has to deal with unstructured data. This data is in documents, spreadsheets, other computer formats, or even on paper only. If this data is important, if it is
Data Quality
Data quality is inextricably interleaved with master data management. The most important goal of an MDM solution is to raise the quality of master data. We should tackle data quality issues in any MDM project. Nevertheless, data quality activities, such as data profiling, finding the root cause of poor data, and quality improvements, can be independent of an MDM project as well. An enterprise can define data quality policies and processes through existing applications only. However, a specialized MDM solution can ease the implementation of those policies a great deal.
Before we describe data quality activities, we have to decide for which aspects we are going to measure and improve the quality of our data. Data quality dimensions capture specific aspects of the general term data quality. Measuring data quality, also known as data profiling, should be an integral part of the implementation of an MDM solution. We should always get a thorough comprehension of the source data before we start merging it. We should also measure improvements of data quality over time, to understand and explain the impact of the MDM solution. Let us start with data quality dimensions, to show what we can measure and how.
Timeliness tells us the degree to which data is current and available when
needed. There is always some delay between change in the real world and the
moment when this change is entered into a system. Although stale data can
appear in any system, this dimension is especially important for Web applications
and sites. A common problem on the Web is that owners do not update sites in a
timely manner; we can find a lot of obsolete information on the Web.
Ease of use is a typical dimension that relies on user perception. This dimension depends on the application and its user interface. In addition, users can perceive usage as complex simply because they are undereducated.
Intention: is the data the right data for the intended usage? Sometimes we do not have the exact data we need; however, we can substitute the needed data with data carrying similar information. For example, we can use phone area codes instead of ZIP codes in order to locate customers approximately. Although phone numbers were not intended for analyses, they can give us reasonable results. Another, worse example of unintended usage is using a column in a table for storing unrelated information, like using the product name to store product
Trust: we have to ask users whether they trust the data. This is a very important dimension. If users do not trust the data in operational systems, they will create their own little, probably unstructured, databases. Integration of master data from unstructured sources is very challenging. If users do not trust the data from analytical applications, they will simply stop using them.
Finally, we can describe some schema quality dimensions. A common perception is that schema
quality cannot be measured automatically. Well, this is true for some dimensions; we cannot
measure them without digging into a business problem. Nevertheless, it is possible to find
algorithms and create procedures that help us in measuring some part of schema quality. The
following list shows the most important schema quality dimensions with a brief description of
how to measure them when applicable.
We can ask why there are duplicate records. An answer might be that operators frequently insert a new customer record instead of using an existing one. We should ask the second why: why are operators creating new records for existing customers? The answer might be that operators do not search for the existing records of a customer.
We then ask the third why: why don't they search? The answer might be that the search would take too long.
The next why is, of course, why does it take so long? The answer might be that it is very clumsy to search for existing customers.
Now we ask the final, the fifth why: why is searching so clumsy? The answer might be that one can search only for exact values, not for approximate strings, and an operator does not have the exact name, address, or phone number of a customer in memory. We have found the root cause of duplication: for this example, it is the application, specifically the user interface. Now we know where to put effort in order to lower the number of duplicates.
Of course, the five whys is not the only technique for finding root causes. We can also just track a piece of information through its life cycle. By tracking it, we can easily spot the moment when it becomes inaccurate. We can find some root causes procedurally as well. For example, with quite simple queries we can find that NULLs are in the system because of a lack of subtypes. No matter how we find root causes, we have to use this information to prepare a detailed improvement plan.
An improvement plan should include two parts: correcting existing data and, even more importantly, preventing future errors. If we focus on correcting only, we will have to repeat the correcting part of the data quality activities regularly. Of course, we have to spend some time correcting existing data; however, we should not forget the prevention part. When we have both parts of the improvement plan, we start implementing it.
Implementation of corrective measures involves automatic and manual cleansing methods. Automatic cleansing methods can include our own procedures and queries. If we have known logic for correcting the data, we should use it. We can solve consistency problems by defining a single way of representing the data in all systems, and then replacing inconsistent representations with the newly defined ones. For example, if gender is represented in some system with the numbers 1 and 2, while we define that it should be represented with the letters F and M, we can replace the numbers with letters in a single update statement, as sketched below. For de-duplication and merging from different sources, we can use string-matching algorithms. For correcting addresses, we can use
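A minimal sketch of such a single-statement correction follows; the table and column names are hypothetical, chosen only to illustrate the idea.

UPDATE dbo.Customers
   SET Gender = CASE Gender
                  WHEN N'1' THEN N'F'
                  WHEN N'2' THEN N'M'
                  ELSE Gender       -- leave other values untouched
                END
 WHERE Gender IN (N'1', N'2');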
[Figure: a star schema for tracking data quality over time, with FactTables (NumRows, NumUnknownRows, NumErroneousRows) and FactColumns (NumValues, NumUnknownValues, NumErroneousValues) fact tables and DimTables, DimColumns, DimDates, and DimStewards dimensions.]
Staging tables for importing and processing data from source systems
Subscription views for systems that can retrieve master data directly from the
Hub
[Figure: the Master Data Services hub architecture; source and subscribing systems such as ERP, CRM, SharePoint, and DW applications connect to the hub through web services and synchronization, while the hub provides metadata, business rules, workflow, data stewardship, versioning, entities, and hierarchies.]
Entities
Entities are central objects in the model structure. Typical entities include customers, products,
employees and similar. An entity is a container for its members, and members are defined by
attributes. We can think of entities as tables. In a model, we can have multiple entities.
Attributes
Attributes are objects within entities; they are containers for values. Values describe properties
of entity members. Attributes can be combined into attribute groups. We can have domain-based attribute values; this means that the pool of possible values of an attribute comes from a lookup table, related to the corresponding entity of the attribute.
Members
Members are the master data. We can think of members as rows of master data entities. A
member is a product, a customer, an employee and similar.
Hierarchies
Another key concept in MDS is Hierarchy. Hierarchies are tree structures that either group
similar members for organizational purposes or consolidate and summarize members for
analyses. Hierarchies are extremely useful for data warehouse dimensions, because typical
analytical processing involves drilling down through hierarchies.
Derived hierarchies are based on domain-based attributes, i.e. on relationships that already exist in the model. In addition, we can create explicit hierarchies, which we can use for consolidating members in any way we need.
Summary
At the beginning of this book about SQL Server Master Data Services, we started with some theoretical introductions. We defined what master data is. Then we described master data management in general. We mentioned explicit data governance actions and the operators who manage the master data; these operators are called data stewards. We also introduced different approaches to master data management.
Then we switched to master data sources and destinations. For sources, it is crucial that they take care of data integrity. We briefly discussed the Relational Model, the most important model for transactional databases. We also mentioned normalization, the process of unbundling relations, a formal process that leads to the desired state of a database, in which each table represents exactly one entity. In this second part of the first chapter, we also introduced the Dimensional Model, a model used for analytical systems.
Master data management always has to deal with data quality. Investments in an MDM solution make no sense if the quality of the data in our company does not improve. In order to measure data quality, we need to understand which data quality dimensions we have. In addition, we introduced the most important activities dedicated to data quality improvements.
In the last part of the chapter, we introduced SQL Server Master Data Services. We mentioned the key MDS concepts. We also explained how we can use other elements of the SQL Server suite for master data management and data quality activities. It is now time to start working with MDS.
References
Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)
Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling, Second Edition (John Wiley & Sons, 2002)
5 Whys on Wikipedia
Chapter 2: Master Data Services Concepts and Architecture

Among the topics this chapter covers is managing multiple versions of master data. The chapter starts with Master Data Services setup; installing the MDS Web application requires the following Windows features and IIS role services:
Application Development
o ASP.NET
o .NET Extensibility
o ISAPI Extensions
o ISAPI Filters
Security
o Windows Authentication
o Request Filtering
Performance
o Static Content Compression
Management Tools
o IIS Management Console
HTTP Activation
Non-HTTP Activation
Windows PowerShell
Explorer
Explorer is where all data stored inside Master Data Services can be managed. Models, entities,
hierarchies and collections can be navigated and their data can be validated with the help of
defined business rules.
Direct updates or additions to the Members, the actual data stored in entities, can be done here. It's also possible to annotate such data to explain why a change has been made, so that everything can be tracked and kept safe for future reference. It's also possible to reverse transactions if some changes made to the data have to be undone.
Version Management
Once the process of validating Master Data is finished, external applications can start to use the verified data stored in Master Data Services. Since applications will rely on that reference data, it's vital that no modification at all can be done to it; still, data will need to change in order to satisfy business requirements. By creating different versions of the data, it's possible to manage all these situations, keeping track of changes and creating stable and immutable versions of Master Data.
Integration Management
As the name implies, Integration Management allows managing the integration of the Master Data with the already existing data ecosystem. With Integration Management, it's possible to batch import data that has been put into the staging tables and to monitor the import results. It is also possible to define how data can be exposed to external applications via defined Subscription Views.
System Administration
System Administration is the central place where the work of defining a Master Data Model
takes place. With System Administration it is possible to create models, entities, attributes,
hierarchies, business rules and everything offered by Master Data Services in terms of data
modeling.
Models
As explained in Chapter 1, Master Data Services organizes its data with the aid of different concepts that deal with different aspects of the data. All these concepts allow the definition of a Data Model. To start getting confident with Master Data Services concepts and technology, in the next paragraphs we'll create a model to hold customer data. By following a walk-through of all the steps needed, we will get an example of a functional Master Data Services model.
Models
The highest level of data organization in Master Data Services is the Model. A model is the container of all other Master Data Services objects and can be thought of as the counterpart of a database in the Relational Model.
As a result, the first step in creating any solution based on Master Data Services is to create a Model. From the System Administration section of the Master Data Manager portal, select the "Model" element from the "Manage" menu.
Customer
City
StateProvince
CountryRegion
Name
Code
Attributes can be of three different types. The type defines which kind of values an Attribute can handle:
Free-Form: allows the input of free-form text, numbers, dates, and links.
File: allows you to store any generic file, for example documents, images, or any kind of binary object.
Firstname
Lastname
EmailAddress
Phone
Address
City
Hierarchies
Hierarchies are tree-based structures that allow you to organize master data in a coherent way, which also facilitates navigation and aggregation of the data itself.
Explicit Hierarchy
It is also possible to define a hierarchy without using the relationships between entities: we may need to create a hierarchy using some additional data not available in any source system, by adding new data manually, directly in the Master Data Services database.
Collections
Collections are nothing more than groups of Explicit Hierarchies and Collection Members. They are useful for creating groups of members that are not necessarily grouped from a business perspective, but that are convenient to group in order to make life easier for the user. For example, if someone is in charge of managing Customers from Spain, Portugal, and South America, (s)he can do a more efficient job if (s)he can find all of those customers in a single group. Here's where Collections come into play.
Collections can be created on a per-Entity basis and, just like Explicit Hierarchies, they can also have their own Consolidated Members. Again, member management is done in the Explorer functional area. After having created a Consolidated Member to represent the collection, it's possible to add members by clicking on the "Edit Members" menu item, visible after clicking on the drop-down arrow near the collection we want to manage.
Business Rules
Business Rules are one of the most powerful features of Master Data Services. They allow the definition of rules that assure the quality of data. A business rule is basically an If...Then sentence that specifies what has to be done if certain conditions are met.
For example, let's say that an email address, to be considered valid, needs to contain at least the at (@) character.
Business Rules are created in the System Administration functional area, by accessing the "Business Rule Maintenance" page through the "Manage" menu item. New rules can be created by clicking on the usual "Add" button, and the rule name can be set by double-clicking on the existing name that has to be changed. After having selected the entity for which the Business Rule has to be created, the rule can be defined by clicking on the third icon from the left:
Figure 26: SPECIFYING DETAILS OF THE "MUST CONTAIN THE PATTERN" ACTION
Import Data
Importing data into Master Data Services is a batch process based on three main tables:
mdm.tblStagingMember
mdm.tblStgMemberAttribute
mdm.tblStgRelationship
The first table, mdm.tblStagingMember, is used to import members and their Code and Name system attributes. All other user-defined attributes have to be imported using the mdm.tblStgMemberAttribute table. If we need to move members into an Explicit Hierarchy or add members to a Collection, we'll use the mdm.tblStgRelationship table too.
Without going into deep technical details, to import member data we first need to create the members and then populate all the user-defined attributes, in that sequence.
For the first attempt, we'll load data for the CountryRegion Entity, taking the values from the AdventureWorksLT2008R2 sample database.
To extract and load the data into mdm.tblStagingMember, we'll use the following T-SQL code:
USE AdventureWorksLT2008R2;
GO
SELECT DISTINCT
CountryRegion
FROM
SalesLT.[Address]
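The listing above shows only the extraction part. A complete batch also inserts the distinct values into the member staging table; the following sketch assumes the MDS database is named MDSBook, that the model created earlier is named Customer, that member type 1 denotes a leaf member, and that the staging table exposes ModelName, EntityName, MemberType_ID, MemberName, and MemberCode columns. Verify these names against your MDS database before running it.

USE AdventureWorksLT2008R2;
GO
INSERT INTO MDSBook.mdm.tblStagingMember
    (ModelName, EntityName, MemberType_ID, MemberName, MemberCode)
SELECT DISTINCT
    N'Customer'         -- model name (assumed)
    ,N'CountryRegion'   -- target entity
    ,1                  -- leaf member (assumed)
    ,CountryRegion      -- member name
    ,CountryRegion      -- member code; here we simply reuse the name
FROM SalesLT.[Address];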
After executing the T-SQL batch from SQL Server Management Studio, we'll have three rows in the member staging table of our MDSBook sample database. To notify Master Data Services that we want to start the import process, we have to use the "Import" feature available in the Integration Management functional area.
Managing Data
Now that we have our data inside Master Data Services, we can use the Explorer functional area to explore and manage the data. Here we can select the Entity we want to manage simply by choosing the desired one from the "Entities" menu.
Since we've just imported the CountryRegion data, we may want to check it. The CountryRegion Entity page will be as shown in Figure 30.
Export Data
Now that we have imported, validated and consolidated our data, we may need to make it
available to external applications. This can be done by creating a Subscription View from the
Integration Management functional area.
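Once a subscription view exists, any application with access to the MDS database can read the master data with plain T-SQL. The sketch below uses hypothetical schema, view, and column names; the actual names depend on how the view is defined.

-- Run against the MDS database (MDSBook in this chapter's example).
SELECT
    Code
    ,Name
FROM mdm.CountryRegionView;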
The view, along with other columns, has flattened the hierarchy using three levels, as
requested:
CM: a table that keeps track of which Entity Member is in which Collection
Staging Tables
As we learned in the previous section, each time a new entity is created, an associated database table gets created as well. In such a dynamic and ever-changing environment, creating a standard solution to import data into the Master Data Services database so that it fits the defined model can be a nightmare.
Luckily, Master Data Services comes to help, giving us three standard tables, all in the mdm schema, that have to be used to import data into entities, to populate their attributes, and to define the relationships of hierarchies.
These three tables are called staging tables and are the following:
mdm.tblStagingMember
mdm.tblStgMemberAttribute
mdm.tblStgRelationship
By using them, it's possible to import data into any model we have in our Master Data Services database. A fourth staging table exists, but it isn't used actively in the import process, since it just reports the status and the result of the batch process that moves data from the aforementioned staging tables into entities, attributes, and hierarchies. This table is the mdm.StgBatch table.
You can populate the staging tables with regular T-SQL inserts and bulk inserts. After the staging
tables are populated, you can invoke the batch staging process from the Master Data Manager
Web application. After the batch process is finished, you should check for the possible errors.
Please refer to Books Online for more details about populating your MDS database through the
staging tables.
Following this process, and with the aid of the three staging tables, it is possible not only to populate entities with fresh data (which means creating new members), but also to:
Create collections
Summary
In this chapter, the quick walk-through of the Master Data Manager Web application gave us an overview of MDS capabilities. We also learned how to work with Master Data Manager. Of course, not everything is as simple as we have shown in this chapter. Nevertheless, this chapter helps us match elements of a master data management solution with a practical implementation using Master Data Services.
The first part of this chapter is just a quick guide to installing MDS. This should help readers start testing it and use this chapter as a quick walk-through of MDS. Then we defined MDS Models and the elements of a Model. We also showed how to import and export data. Finally, we discussed versioning of our master data.
We have to mention that for a real-life solution, we should expect much more work on importing data. Before importing data into an MDS model, we have to profile it, check its quality, and cleanse it. Typical actions before importing data also include merging the data from multiple sources and de-duplicating it.
In the next chapter, we are going to explain how we can check data quality in our existing systems by exploiting tools and programming languages included in and supported by the SQL Server suite. The last chapter of this book is dedicated to merging and de-duplicating data.
We are not going to spend more time and space on the Master Data Services application in this book. This is a general MDM book, and MDS is just a part of a general MDM solution. In addition, the MDS that comes with SQL Server 2008 R2 is just a first version and is, in our opinion, not yet suitable for production usage in an enterprise. We suggest waiting for the next version of MDS for real-life scenarios. Nevertheless, the understanding of how the current version of MDS works that we gained in this chapter should definitely help us with successful deployment and usage of the next version.
References
Chapter 3: Data Quality and SQL Server 2008 R2 Tools

This chapter covers, among other topics, measuring the completeness, accuracy, and information of our data.

Measuring the Completeness
Closed world assumption: all tuples that satisfy relation predicates are in the
relation;
If we have the reference relation, we can measure population completeness by simply comparing the number of rows in our relation and in the reference relation. Of course, just having the number of rows in the reference relation is sufficient information for measuring our population completeness.
In a relational database, the presence of NULLs is what defines completeness. NULLs are the standard placeholders for unknown values. We can measure attribute completeness, i.e. the number of NULL values in a specific attribute; tuple completeness, i.e. the number of unknown attribute values in a tuple; and relation completeness, i.e. the number of tuples with unknown attribute values in the relation.
Attribute Completeness
Let us start with a simple example: finding the number of NULLs in an attribute. We are going to analyze attributes of the Production.Product table from the AdventureWorks2008R2 demo database.
The samples are based on the SQL Server 2008 R2 sample databases, which can be downloaded from CodePlex.
You can execute the queries in SQL Server Management Studio (SSMS), the tool shipped with SQL Server. If you are not familiar with this tool yet, please refer to Books Online.
First, let's find which columns are nullable, i.e. allow null values, in the Production.Product table with the following query that uses the INFORMATION_SCHEMA.COLUMNS view:
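A minimal sketch of such a query (the exact column list the authors selected may differ) is:

-- List the nullable columns of the Production.Product table
SELECT COLUMN_NAME
      ,DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = N'Production'
  AND TABLE_NAME = N'Product'
  AND IS_NULLABLE = 'YES'
ORDER BY ORDINAL_POSITION;

Presumably, a follow-up SELECT over a few of these nullable columns of Production.Product then produced the partial results shown next.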
Partial results, showing only a couple of columns and rows, are here:
ProductId  Name             Color  Size  SizeUnitMeasureCode
1          Adjustable Race  NULL   NULL  NULL
2          Bearing Ball     NULL   NULL  NULL
3          BB Ball Bearing  NULL   NULL  NULL
...
316        Blade            NULL   NULL  NULL
You can easily see that there are many unknown values in this table. With a simple GROUP BY
query, we can find the number of nulls, for example in the Size column. By dividing the number
of NULLs by the total number of rows in the Production.Product table, we can also get the
proportion of NULLs.
SELECT
 Size
 ,COUNT(*) AS cnt
FROM Production.Product
WHERE Size IS NULL
GROUP BY Size;

SELECT
 100.0 *
 (SELECT COUNT(*) FROM Production.Product
  WHERE Size IS NULL
  GROUP BY Size)
 /
 (SELECT COUNT(*) FROM Production.Product)
 AS PctNullsOfSize;
By running these two queries, we find out that there are 293 NULLs in the Size column, which is 58% of all values. This is a huge proportion; from this percentage alone, we might suspect that size is simply not applicable to all products. However, size is a common attribute for products, so we would expect mostly known values. We could continue checking other nullable attributes with similar queries; however, we will see later that the SQL Server Integration Services (SSIS) Data Profiling task is well suited to finding NULLs in a column.
query() method, which returns part of the xml data in xml format;
nodes() method, which allows you to shred an xml data type instance to
relational data.
All of the XML data type methods accept XQuery as an argument. XQuery expressions allow us to traverse the nodes of an XML instance to find a specific element or attribute. The value() method accepts an additional parameter, the target SQL Server scalar data type. For the modify() method, XQuery is extended to allow modifications; the extension is called XML Data Modification Language, or XML DML. You can learn more about XQuery expressions and XML DML in Books Online.
For checking whether an element is present in the XML instance, the exist() method is the right one. First, let us create an XML instance in a query from the Production.Product table by using the FOR XML clause:
SELECT
p1.ProductID
,p1.Name
,p1.Color
,(SELECT p2.Color
FROM Production.Product AS p2
WHERE p2.ProductID = p1.ProductID
FOR XML AUTO, ELEMENTS, TYPE)
AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319
ORDER BY p1.ProductID;
The subquery in the SELECT clause generates an XML data type column from the Color attribute. For the sake of brevity, the query is limited to returning seven rows only. The first five have NULLs in the Color attribute, and the XML column returned does not include the Color element. The last two rows include the Color element.
We are going to use the previous query inside a CTE in order to simulate a table with an XML column where the Color element is missing for some rows. The outer query uses the exist() method to check for the presence of the Color element.
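A sketch of that query, modeled on the XSINIL variant the chapter shows next (only the exist() test differs), might look like this:

-- Rows whose ColorXml instance does not contain a Color element at all
WITH TempProducts AS
(SELECT p1.ProductID
       ,p1.Name
       ,p1.Color
       ,(SELECT p2.Color
         FROM Production.Product AS p2
         WHERE p2.ProductID = p1.ProductID
         FOR XML AUTO, ELEMENTS, TYPE)
        AS ColorXml
 FROM Production.Product AS p1
 WHERE p1.ProductId < 319)
SELECT ProductID
      ,Name
      ,Color
FROM TempProducts
WHERE ColorXml.exist('//Color') = 0;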
The outer query correctly finds the first five rows from the CTE. The XML specification also allows an element to be present but have no value. In such a case, a special attribute, xsi:nil, should appear inside the nillable (the XML term for nullable) element. Therefore, in order to find all incomplete XML instances, we also have to check for the xsi:nil attribute. In order to check for the xsi:nil attribute, we have to create it in some rows first. We are going to slightly change the last query. In the CTE, we are going to include the XSINIL keyword in the FOR XML clause of the subquery. This will generate the Color element for every row; however, when the color is missing, this element will have an additional xsi:nil attribute. Then, with the outer query, we have to check whether this attribute appears in the Color element:
WITH TempProducts AS
(SELECT p1.ProductID
,p1.Name
,p1.Color
,(SELECT p2.Color
FROM Production.Product AS p2
WHERE p2.ProductID = p1.ProductID
FOR XML AUTO, ELEMENTS XSINIL, TYPE)
AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319)
SELECT ProductID
,Name
,Color
,ColorXml.value('(//Color)[1]','nvarchar(15)')
AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color[@xsi:nil]') = 1;
Of course, the CTE query returns the same small sample (seven) rows of products with
ProductID lower than 319. The outer query correctly finds the first five rows from the CTE.
SELECT
 1 AS ord
,Size
,SizeUnitMeasureCode
,COUNT(*) AS cnt
FROM Production.Product
GROUP BY Size, SizeUnitMeasureCode
UNION ALL
SELECT 2 AS ord
,Size
,SizeUnitMeasureCode
,COUNT(*) AS cnt
FROM Production.Product
GROUP BY SizeUnitMeasureCode, Size
ORDER BY ord, Size, SizeUnitMeasureCode;
Before showing the results of this query, let us add a comment on the query. The result set unions two result sets; the first SELECT aggregates rows on Size and then on SizeUnitMeasureCode, and the second SELECT aggregates rows on SizeUnitMeasureCode and then on Size.
Partial results (the same rows appear for both values of ord):
Size  SizeUnitMeasureCode  cnt
NULL  NULL                 293
38    CM                   12
62    CM                   11
70    NULL                 ...
...
XL    NULL                 ...
Note the rows where Size is NULL. Whenever Size is null, SizeUnitMeasureCode is also null; however, the opposite is not true. In addition, SizeUnitMeasureCode is null for all sizes expressed as character codes, like L and M, and not null for numeric sizes, except for size 70. We can conclude that there is a strong relation between the nulls in these two columns. Of course, the size unit measure code tells us in which unit the size is measured when the size is numeric; for example, in the second row of the result set, we can see that size 38 is measured in centimeters. When the size is expressed as a character code, the measure unit makes no sense. However, we can see that something is wrong with size 70; this one should have a measure unit. Either the measure unit is missing, or size 70 itself is erroneous and should not be in the relation. By researching unknown values, we can find potential errors. In addition, when researching the root cause of the unknown values, we can omit the SizeUnitMeasureCode column; we already know where the nulls in this column come from. Therefore, from this pair, we can limit our research to the Size column only.
If we do the same analysis for the Weight and WeightUnitMeasureCode columns, we will find
that we can omit the WeightUnitMeasureCode column from further research as well. Finally,
we can do the same thing for the ProductSubcategoryID and ProductModelID columns, and will
find out that whenever ProductSubcategoryID is null, ProductModelID is null as well. Therefore,
we can also omit the ProductModelID from further completeness checking.
How can we prevent missing size measure units? The answer lies in the schema. We can introduce a check constraint on the SizeUnitMeasureCode column that would not accept null values for numeric sizes, or we can create a trigger on the Production.Product table that enforces the same rule.
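For illustration, a check constraint along these lines could look like the following sketch; the constraint name and the test for a numeric size are our own choices, and WITH NOCHECK is used so that existing violating rows (such as size 70) do not block its creation.

-- Require a measure unit whenever the size starts with a digit
ALTER TABLE Production.Product WITH NOCHECK
ADD CONSTRAINT CK_Product_NumericSizeHasUnit
CHECK (Size IS NULL
       OR Size NOT LIKE '[0-9]%'
       OR SizeUnitMeasureCode IS NOT NULL);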
(Figure: a redesigned schema with three tables. Products has ProductId as its primary key, a Name column, other columns, and a SizeId foreign key; Sizes has SizeId as its primary key, a Size column, and a SizeTypeId foreign key; SizeTypes has SizeTypeId as its primary key, with SizeTypeName and MeasureUnit columns.)
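The next analysis looks at the Class attribute by product subcategory. The query itself is shown here only as a sketch, reconstructed from the description and the result columns that follow (the CTE name and exact expressions are ours):

-- Percentage of each Class value within its product subcategory
WITH SubcategoryClasses AS
(
SELECT ProductSubcategoryID
      ,Class
      ,COUNT(*) AS NRowsInClass
FROM Production.Product
GROUP BY ProductSubcategoryID, Class
)
SELECT ProductSubcategoryID
      ,Class
      ,NRowsInClass
      ,100.0 * NRowsInClass /
        SUM(NRowsInClass) OVER (PARTITION BY ProductSubcategoryID)
        AS PctOfSubCategory
FROM SubcategoryClasses
ORDER BY PctOfSubCategory DESC;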
This query uses Common Table Expressions (CTE) to calculate the number of rows in
subcategories and classes. The outer query uses the CTE to calculate the percentage of rows of a
class in a specific subcategory from the total number of rows of the same subcategory by using
the OVER clause. The result set is ordered by a descending percentage. If the percentage is
close to 100, it means that one class is prevalent in one subcategory. If the value of the class is
NULL, it means that the class is probably not applicable for the whole subcategory. Here are
partial results of this query.
ProductSubcategoryID  Class  NRowsInClass  PctOfSubCategory
...                   NULL   ...           100
...                   NULL   ...           100
...                   NULL   ...           100
18                    NULL   ...           100
We can easily see that the Class attribute is not applicable for all subcategories. In further
research of reasons for NULLs of the Class attribute, we can exclude the rows where the values
are not applicable. We can also check which other attributes are not applicable for some subcategories. The following query, for example, returns the products that have a NULL in any of the interesting columns, limited to subcategories where the Class attribute is applicable:
SELECT
ProductId
,Name
,Color
,Size
,Weight
,ProductLine
,Class
,Style
FROM Production.Product
WHERE (Color IS NULL OR
Size IS NULL OR
Weight IS NULL OR
ProductLine IS NULL OR
Class IS NULL OR
Style IS NULL)
AND
(ProductSubcategoryId NOT IN
(SELECT ProductSubcategoryId
FROM Production.Product
WHERE ProductSubcategoryId IS NOT NULL
GROUP BY ProductSubcategoryId
HAVING COUNT(DISTINCT Class) = 0));
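The queries that follow use a small helper function, dbo.ValueIsNULL; a minimal sketch of it (the sql_variant parameter type is our assumption) is:

-- Returns 1 when the argument is NULL, 0 otherwise
CREATE FUNCTION dbo.ValueIsNULL (@value sql_variant)
RETURNS int
AS
BEGIN
 RETURN CASE WHEN @value IS NULL THEN 1 ELSE 0 END;
END;
GO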
With this function, it is easy to write a query that calculates the number of NULLs in each interesting row. Note that the following query refers to the interesting columns only, and limits the result set to rows with an applicable Class attribute only.
SELECT
 ProductId
 ,Name
 ,dbo.ValueIsNULL(Color) +
  dbo.ValueIsNULL(Size) +
  dbo.ValueIsNULL(Weight) +
  dbo.ValueIsNULL(ProductLine) +
  dbo.ValueIsNULL(Class) +
  dbo.ValueIsNULL(Style)
  AS NumberOfNULLsInRow
FROM Production.Product
-- Limit to subcategories where the Class attribute is applicable
WHERE ProductSubcategoryId NOT IN
 (SELECT ProductSubcategoryId
  FROM Production.Product
  WHERE ProductSubcategoryId IS NOT NULL
  GROUP BY ProductSubcategoryId
  HAVING COUNT(DISTINCT Class) = 0);
ProductId  Name     NumberOfNULLsInRow
802        LL Fork  ...
803        ML Fork  ...
804        HL Fork  ...
We can say that tuples with more NULLs are less complete than tuples with fewer NULLs. We
can even export this data to a staging table, repeat the tuple completeness measure on a
schedule, and compare the measures to notice the tuple improvement. Of course, this makes
sense only if we can join measures on some common identification; we need something that
uniquely identifies each row.
In the Production.Product table, there is a primary key on the ProductId column. In a relational
database, every table should have a primary key, and the key should not change if you want to
make comparisons over time.
For relation completeness, we can use two measures: the total number of NULLs in the relation
and the number of rows with a NULL in any of the columns. The following query does both
calculations for the Production.Product table, limited to rows with an applicable Class attribute
only.
SELECT
'Production' AS SchemaName
,'Product' AS TableName
,COUNT(*) AS NumberOfRowsMeasured
,SUM(
dbo.ValueIsNULL(Color) +
dbo.ValueIsNULL(Size) +
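A complete sketch of this query, producing the two measures and the result shown next (the SIGN-based expression for counting rows with NULLs is our own choice), is:

SELECT
 'Production' AS SchemaName
 ,'Product' AS TableName
 ,COUNT(*) AS NumberOfRowsMeasured
 -- Total number of NULLs over the six interesting columns
 ,SUM(dbo.ValueIsNULL(Color) + dbo.ValueIsNULL(Size) +
      dbo.ValueIsNULL(Weight) + dbo.ValueIsNULL(ProductLine) +
      dbo.ValueIsNULL(Class) + dbo.ValueIsNULL(Style))
   AS TotalNumberOfNULLs
 -- Number of rows that have a NULL in at least one of those columns
 ,SUM(SIGN(dbo.ValueIsNULL(Color) + dbo.ValueIsNULL(Size) +
      dbo.ValueIsNULL(Weight) + dbo.ValueIsNULL(ProductLine) +
      dbo.ValueIsNULL(Class) + dbo.ValueIsNULL(Style)))
   AS NumberOfRowsWithNULLs
FROM Production.Product
WHERE ProductSubcategoryId NOT IN
 (SELECT ProductSubcategoryId
  FROM Production.Product
  WHERE ProductSubcategoryId IS NOT NULL
  GROUP BY ProductSubcategoryId
  HAVING COUNT(DISTINCT Class) = 0);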
SchemaName  TableName  NumberOfRowsMeasured  TotalNumberOfNULLs  NumberOfRowsWithNULLs
Production  Product    237                   222                 61
We can continue with such measurements for each table in the database. We can also store the
results in a data quality data warehouse, as proposed in chapter 1 of this book. This way, we can
measure improvements over time. After all, one of the most important goals when
implementing a MDM solution is data quality improvement.
SELECT
 CustomerKey
,FirstName
,LastName
,BirthDate
FROM dbo.vTargetMailDirty
WHERE BirthDate =
(SELECT MIN(BirthDate) FROM dbo.vTargetMailDirty)
OR
BirthDate =
(SELECT MAX(BirthDate) FROM dbo.vTargetMailDirty);
The results reveal suspicious data: the oldest person was born in the year 1865.
CustomerKey  FirstName  LastName    BirthDate
99002        Eugene     Huang       1865-05-14
11132        Melissa    Richardson  1984-10-26
Finding suspicious data mostly translates to finding outliers, i.e. rare values that are far out of bounds. We can use a similar technique for continuous numeric values.
With a couple of standard T-SQL aggregate functions, we can easily get an idea of the distribution of values, and then compare the minimal and maximal values with the average. In addition, the standard deviation tells us how spread out the distribution is in general. The less spread out it is, the closer the values lie to the mean.
SELECT
 MIN(Age) AS AgeMin
,MAX(Age) AS AgeMax
,AVG(CAST(Age AS float)) AS AgeAvg
,STDEV(Age) AS AgeStDev
FROM dbo.vtargetMailDirty;
The result shows us something we already knew: the oldest person in the data probably has a wrong Age, and since Age is calculated from BirthDate, the BirthDate has to be wrong as well.
AgeMin  AgeMax  AgeAvg            AgeStDev
26      146     50.1018518518519  13.1247332792511
Before we move to discrete attributes, let us mention how to interpret these basic descriptive statistics. We should expect a normal, Gaussian distribution of ages around the average age. In a normal distribution, around 68% of the data should lie within one standard deviation of either side of the mean, about 95% of the data should lie within two standard deviations of either side of the mean, and about 99% of the data should lie within three standard deviations of either side of the mean. The minimal age is less than two standard deviations from the mean (50 - 2 * 13 = 24 years), while the maximal age is more than seven standard deviations from the mean. There is a very, very low probability for data to lie more than seven standard deviations from the average value. For example, there is already less than a 0.5% probability for data to fall outside the interval that ends three standard deviations from the mean. Thus, we can conclude there is something wrong with the maximal value in the Age column.
How do we find outliers in discrete columns? No matter whether they are numeric, dates, or strings, they can take a value from a discrete pool of possible values only. Let's say we do not know the pool in advance. We can still try to find suspicious values by measuring the frequency distribution of all values of an attribute. A value with a very low frequency is potentially an erroneous, or at least suspicious, value.
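A sketch of a query for this, reconstructed from the description that follows and from the result columns shown below (the exact expressions are ours), is:

-- Frequency distribution of Occupation, with cumulative values and a text histogram
WITH FreqCTE AS
(
SELECT Occupation
      ,COUNT(*) AS AbsFreq
      ,CAST(ROUND(100.0 * COUNT(*) /
         (SELECT COUNT(*) FROM dbo.vTargetMailDirty), 0) AS int) AS AbsPerc
FROM dbo.vTargetMailDirty
GROUP BY Occupation
)
SELECT f1.Occupation
      ,f1.AbsFreq
      ,(SELECT SUM(f2.AbsFreq)
        FROM FreqCTE AS f2
        WHERE f2.Occupation <= f1.Occupation) AS CumFreq
      ,f1.AbsPerc
      ,(SELECT SUM(f2.AbsPerc)
        FROM FreqCTE AS f2
        WHERE f2.Occupation <= f1.Occupation) AS CumPerc
      ,REPLICATE('*', f1.AbsPerc) AS Histogram
FROM FreqCTE AS f1
ORDER BY f1.Occupation;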
The query uses a CTE to calculate the absolute frequency and the absolute percentage, and calculates the cumulative values in the outer query with correlated subqueries. Note again that this is not the most efficient query. However, we want to show the techniques we can use, and performance is not our main goal here. In addition, we typically do not execute these queries very frequently.
In the result of the previous query, we can see the suspicious occupation. The 'Profesional' value is present in a single row only. As it is very similar to the 'Professional' value, we can conclude that this is an error.
Occupation      AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
Clerical        68       68       13       13       *************
Management      134      202      25       38       *************************
Manual          64       266      12       50       ************
Profesional     1        267      0        50
Professional    157      424      29       79       *****************************
Skilled Manual  116      540      21       100      *********************
Strings Profiling
Catching errors in unconstrained strings, like names and addresses, is one of the most challenging data profiling tasks. Because there are no constraints, it is not possible to say in advance what is correct and what is incorrect. Still, the situation is not hopeless. We are going to show a couple of queries we can use to find string inconsistencies.
In a character column, strings have different lengths in different rows; typically, the lengths follow either a nearly uniform or a normal distribution. In both cases, strings that are extremely long or short might be errors. Therefore, we are going to start our profiling by calculating the distribution of string lengths. The following example checks the lengths of middle names.
SELECT
LEN(MiddleName) AS MNLength
,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY LEN(MiddleName)
ORDER BY Number;
The vast majority of middle names are either unknown or one character long.
MNLength  Number
18        1
2         1
NULL      235
1         303
Of course, it is easy to find middle names that are more than one character long with the next
query.
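A minimal sketch of such a query (selecting the columns that appear in the result below) is:

-- Middle names longer than a single character
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,MiddleName
FROM dbo.vTargetMailDirty
WHERE LEN(MiddleName) > 1;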
We see that one middle name is definitely wrong (of course, this is the one we added
intentionally). In addition, the middle name that is two characters long might be written
inconsistently as well. It looks like middle names should be written with a single letter, without a
dot after it.
CustomerKey  FirstName  LastName  MiddleName
11377        David      Robinett  R.
99000        Jon        Yang      VeryLongMiddleName
Sometimes we know what strings should look like. This means we have patterns for the strings. We can check for basic patterns with the LIKE T-SQL operator. For example, we would not expect any letters in phone numbers. Let's check them with the following query.
SELECT
CustomerKey
,FirstName
,LastName
,Phone
FROM dbo.vTargetMailDirty
WHERE Phone LIKE '%[A-Z]%';
From the abbreviated results, we can see that there are some phone numbers that include letters. It seems like some operator consistently uses the prefix 'Phone:' when entering phone numbers.
CustomerKey  FirstName  LastName   Phone
12003        Audrey     Munoz      Phone: ...
12503        Casey      Shen       Phone: ...
13003        Jill       Hernandez  Phone: ...
13503        Theodore   Gomez      Phone: ...
14003        Angel      Ramirez    Phone: 488-555-0166
More advanced pattern matching can be done with regular expressions. Regular expressions can be treated as the LIKE operator on steroids. T-SQL does not support regular expressions out of the box. However, we are not powerless. Starting with version 2005, SQL Server supports CLR objects, including functions, stored procedures, triggers, user-defined types, and user-defined aggregates. A simple .NET function, in either Visual C# or Visual Basic, can do the work. Going into the details of using CLR code inside SQL Server is out of the scope of this book. Nevertheless, the CLR project with this function is included in the accompanying code, and we can use it. Here is the Visual C# code for the function.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
The Boolean function accepts two parameters: the string to check and the regular expression. It returns true if the string matches the pattern and false otherwise. Before we can use the function, we have to import the assembly into the SQL Server database, and create and register the function. The CREATE ASSEMBLY command imports an assembly. The CREATE FUNCTION command registers the CLR function. After the function is registered, we can use it like any built-in T-SQL function. The T-SQL code for importing the assembly and registering the function is as follows:
CREATE ASSEMBLY MDSBook_Ch03_CLR
FROM 'C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\MDSBook_Ch03_CLR.dll'
WITH PERMISSION_SET = SAFE;
GO
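The CREATE FUNCTION statement that registers dbo.IsRegExMatch is given here only as a sketch; the class name (UserDefinedFunctions, the Visual Studio default for SQL CLR projects) and the parameter types are our assumptions.

CREATE FUNCTION dbo.IsRegExMatch
 (@input nvarchar(max), @pattern nvarchar(max))
RETURNS bit
AS EXTERNAL NAME MDSBook_Ch03_CLR.UserDefinedFunctions.IsRegExMatch;
GO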
SELECT
 CustomerKey
,FirstName
,LastName
,EmailAddress
,dbo.IsRegExMatch(EmailAddress,
N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})')
AS IsEmailValid
FROM dbo.vTargetMailDirty
WHERE dbo.IsRegExMatch(EmailAddress,
N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})')
= CAST(0 AS bit);
CustomerKey  FirstName  LastName  EmailAddress                  IsEmailValid
99001        Eugene     Huang     eugene10#adventure-works.com  0
You can learn more about regular expressions on MSDN. In addition, there are many websites where developers freely exchange regular expressions.
SELECT
 CustomerKey
,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY CustomerKey
HAVING COUNT(*) > 1
ORDER BY Number DESC;
CustomerKey  Number
99000        ...
Let us only briefly mention how we can profile XML data. In the completeness part of this chapter, we mentioned the XML data type methods and showed how we can use them in T-SQL queries. For accuracy, we used the value() method of the XML data type to extract element and attribute values and represent them as scalar values of the built-in SQL Server data types. After that, we used the same methods as we used for finding inaccuracies in attributes of built-in scalar data types. Therefore, dealing with XML data type attributes does not differ much from what we have seen so far in the accuracy part of this chapter.
Finally, we have to mention that SQL Server supports validating a complete XML instance against an XML schema collection. In an XML schema collection, we can have multiple XML schemas in XSD format. An XML instance has to validate against one of the schemas in the collection, otherwise SQL Server rejects the instance. Similarly to what we mentioned for check constraints, we should also try to use XML schema collection validation in our databases, in order to enforce the data integrity rules.
Figure 6: SETTING THE ALGORITHM PARAMETERS FOR THE DECISION TREES ANALYSIS
17. Save, deploy and process the model (or the whole project).
18. Click on the Mining Model Viewer tab. Change the Background drop-down list to
False. Now you can easily spot what leads to invalid phone numbers. Darker blue
nodes mean nodes with more invalid phone numbers. In the example in figure 7
it seems that having House Owner Flag set to 0, Yearly Income between 75,650
and 98,550, and Commute Distance not between zero and one mile, is a good
way to have an invalid phone number.
Measuring Information
The next hard data quality dimension is information. The amount of information in data is not
important from a correctness point of view; however, it makes sense to know it if we want to
use our data for business intelligence, for analyses.
From mathematics, more precisely from Information Theory, the measure for the amount of information is entropy. Before quoting the formula, let's explain the idea behind it.
In real life, information is the same thing as surprise. If a friend tells you something and you are not surprised, this means you already knew it. Now, let's say we have a discrete attribute. How many times can we get surprised by its value, i.e. its state? For a start, let's say we have two possible states, with equal distribution. We would expect one state; in 50% of cases we would be surprised, because we would get the other state. Now imagine we have a skewed distribution, and one state has 90% probability. We would expect this state, and be surprised in only 10% of cases. With a totally skewed distribution, where a single state takes all of the probability, we would never have any chance to be surprised.
Now let's say the attribute has three possible states, with equal distribution. We still expect one specific state; however, in this case, we would be surprised in 67% of cases. With four possible states and equal distribution, we would be surprised in 75% of cases. We can conclude the following: the more possible states a discrete variable has, the higher the maximal possible information in this attribute. The more equally the distribution is spread across the possible values (i.e., the more uniform it is), the more actual information is in the attribute. The entropy formula reflects this:
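The formula is the standard Shannon entropy; for an attribute with n possible states and state probabilities p_i:

H = - \sum_{i=1}^{n} p_i \log_2 p_i

The maximal possible entropy for n states is \log_2 n, and the normalized entropy is the actual entropy divided by this maximum.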
The query calculates the number of distinct states in a CTE. This information is used to calculate the maximal possible entropy. Then the query calculates the actual entropy in derived tables in the FROM part of the outer query, and finally normalizes it by dividing the actual entropy by the maximal possible one.
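The query itself is shown here only as a sketch for a single attribute, and with a simpler structure than the one just described (a CTE and a single aggregate query); the view and column names (dbo.vTargetMailDirty, Education) are our assumptions.

-- Actual, maximal, and normalized entropy of the Education attribute
WITH ProbCTE AS
(
SELECT Education
      ,COUNT(*) * 1.0 /
        (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS p
FROM dbo.vTargetMailDirty
GROUP BY Education
)
SELECT COUNT(*) AS DistinctVals
      ,-SUM(p * LOG(p) / LOG(2.0)) AS Actual
      ,LOG(COUNT(*) * 1.0) / LOG(2.0) AS Maximal
      ,-SUM(p * LOG(p) / LOG(2.0)) /
        (LOG(COUNT(*) * 1.0) / LOG(2.0)) AS Normalized
FROM ProbCTE;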
              DistinctVals  Actual             Maximal           Normalized
H(Education)  5             2.14581759656851   2.32192809488736  0.924153336743447
H(Region)     3             1.54525421415102   1.58496250072116  0.974946860539558
H(Gender)     2             0.999841673765157  1                 0.999841673765157
From the results, we can see that we have the most information in the Education column. However, the Gender column has the most equal distribution.
For continuous attributes, we can calculate the entropy in the same way if we discretize them. We have to discretize them into classes (or bins) of equal width, in order to preserve the shape of the original distribution function as much as possible. We can try with different numbers of bins.
SSAS Cubes
SSAS cubes are typically built on a star schema, using multiple dimensions and fact tables, with business measures like quantities and amounts. Nevertheless, we can create an SSAS cube based on a single table, and define part of the columns for a dimension and part for a fact table. In fact, this way we are creating a star schema inside the Analysis Services cube, even though we have a flattened (single-table) model in our data warehouse.
For data profiling, the only measure we need is the count of rows. We can then use all the other attributes for dimensions, and use them for analyzing the counts, i.e. frequencies, of the different states of discrete attributes.
We are going to create a cube based on TargetMailMining view we used for Decision Trees
analysis. In order to do this, you will need to complete the following steps:
1. If you closed the BI project from this chapter, reopen it in BIDS.
2. In Solution Explorer, right-click on the Cubes folder, and select New Cube.
3. Use the existing tables in the Select Creation Method window.
4. Use Adventure Works data source view. Select TargetMailMining for the measure
group table.
Figure 9: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM SSAS CUBE
21. Play with other distributions. Try also to group over columns.
22. When finished, close Excel. Do not close BIDS yet.
As you can see, SSAS together with Excel is a powerful data profiling tool. This was an example of using a tool for something other than its intended purpose. In the next section, we are going to introduce a tool that is intended for data profiling: the SSIS Data Profiling task.
Figure 12: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM POWERPIVOT
21. Play with other distributions. Try also to group over columns.
Clean-Up
As this was the last profiling option introduced in this chapter, we can clean up the
AdventureWorks2008R2 database with the following code:
USE AdventureWorks2008R2;
IF OBJECT_ID(N'dbo.ValueIsNULL', N'FN') IS NOT NULL
DROP FUNCTION dbo.ValueIsNULL;
IF OBJECT_ID(N'dbo.ProductsMining',N'V') IS NOT NULL
DROP VIEW dbo.ProductsMining;
IF OBJECT_ID(N'dbo.vTargetMailDirty',N'V') IS NOT NULL
DROP VIEW dbo.vTargetMailDirty;
IF OBJECT_ID(N'dbo.IsRegExMatch', N'FS') IS NOT NULL
DROP FUNCTION dbo.IsRegExMatch;
DROP ASSEMBLY MDSBook_Ch03_CLR;
IF OBJECT_ID(N'dbo.TargetMailMining',N'V') IS NOT NULL
DROP VIEW dbo.TargetMailMining;
GO
Summary
The old rule garbage in garbage out is absolutely valid for MDM projects as well. Before
starting implementing a centralized MDM solution, like SQL Server Master Data Services
solution, we should have in-depth comprehension of the quality of our data. In addition, we
should find root causes for bad data.
We have shown in this chapter how we can use tools from SQL Server 2008 R2 suite for data
profiling and for finding the root cause. We have used Transact-SQL queries. We have used
XQuery and CLR code for controlling data quality of XML data and strings. We used SQL Server
Analysis Services intensively. The Unified Dimensional Model, or, if we prefer this expression,
OLAP cube, is a nice way for a quick, graphical overview of the data. In addition, PowerPivot for
Excel 2010 gives us opportunity to achieve the same graphical overview even without Analysis
Services. Data Mining helps us find interesting patterns, and thus this is a useful tool for finding
root causes for bad data. With Office Data Mining Add-Ins, Excel 2007 and 2010 became a
powerful data mining tool as well. SQL Server Integration Services Data Profiling task is another
quick tool for finding bad or suspicious data.
One of the most challenging tasks in preparing and maintaining master data is merging it from
multiple sources when we do not have the same identifier in all sources. This means we have to
do the merging based on similarities of column values, typically on similarity of string columns
like names and addresses. Even if we have a single source of master data, we can have duplicate
rows for the same entity, like duplicate rows for the same customer. In the next chapter, we are
going to tackle these two problems, merging and de-duplicating.
References
Erik Veerman, Teo Lachev, Dejan Sarka: MCTS Self-Paced Training Kit (Exam 70-448):
Microsoft SQL Server 2008-Business Intelligence Development and Maintenance
(Microsoft Press, 2009)
Thomas C. Redman: Data Quality - The Field Guide (Digital Press, 2001)
Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning (John
Wiley & Sons, 2003)
Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass: Inside Microsoft SQL Server 2008: T-SQL Querying (Microsoft Press, 2009)
Itzik Ben-Gan, Dejan Sarka, Roger Wolter, Greg Low, Ed Katibah, Isaac Kunen: Inside
Microsoft SQL Server 2008: T-SQL Programming (Microsoft Press, 2010)
Marco Russo, Alberto Ferrari: PowerPivot for Excel 2010 (Microsoft Press, 2011)
De-duplicating
Identity Mapping
The problem with identity mapping arises when data is merged from multiple sources that can
update data independently. Each source has its own way of entity identification, or its own keys.
There is no common key to make simple joins. Data merging has to be done based on
similarities of strings, using names, addresses, e-mail addresses, and similar attributes. Figure 1
shows the problem: we have three rows in the first table in the top left corner, and two rows in
the second table in the top right corner. Keys of the rows from the left table (Id1 column) are
different from keys of the rows in the right table (Id2 column). The big table in the bottom
shows the result of approximate string matching. Note that each row from the left table is
matched to each row from the right table; similarities are different for different pairs of rows.
Problems
Many problems arise with identity mapping. First, there is no way to get a 100 percent accurate
match programmatically; if you need a 100 percent accurate match, you must match entities
manually. But even with manual matching, you cannot guarantee 100 percent accurate matches
at all times, such as when you are matching people. For example, in a database table, you might
have two rows for people named John Smith, living at the same address; we cannot know
whether this is a single person or two people, maybe a father and son. Nevertheless, when you
perform the merging programmatically, you would like to get the best matching possible. You
must learn which method to use and how to use it in order to get the best possible results for
your data. In addition, you might even decide to use manual matching on the remaining
unmatched rows after programmatic matching is done. Later in this chapter, we compare a
couple of public algorithms that are shipped in SQL Server 2008 R2, in the Master Data Services
(MDS) database; we also add SSIS Fuzzy Lookup transformation to the analysis.
The next problem is performance. For approximate merging, any row from one side, from one
source table, can be matched to any row from the other side. This creates a cross join of two
tables. Even small data sets can produce huge performance problems, because cross join is an
algorithm with quadratic complexity. For example, cross join of 18,484 rows with 18,484 rows of
the AdventureWorksDW2008R2 vTargetMail view to itself (we will use this view for testing),
means dealing with 341,658,256 rows after the cross join! We discuss techniques for optimizing
this matching (i.e., search space reduction techniques) later in this chapter.
Another problem related to identity mapping is de-duplicating. We deal briefly with this
problem at the end of the chapter.
The mdq.Similarity function in the MDS database implements four string similarity algorithms:
Levenshtein (edit) distance
Jaccard index
Jaro-Winkler distance
Simil (longest common substring)
All of these algorithms are well known and publicly documented (e.g., on Wikipedia). They are implemented through a CLR function. Note that MDS comes only with the SQL Server 2008 R2 Enterprise and Datacenter 64-bit editions. If you are not running either of these editions, you can use any edition of SQL Server 2005 or later and implement these algorithms in CLR functions. Anastasios Yalanopoulos's "Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server" (http://anastasiosyal.com/archive/2009/01/11/18.aspx#soundex) provides a link to a publicly available library of CLR string matching functions for SQL Server.
Levenshtein (edit) distance measures the minimum number of edits needed to transform one string into the other. It is the total number of character insertions, deletions, or substitutions that it takes to convert one string to another. For example, the distance between kitten and sitting is 3:
1. kitten -> sitten (substitute s for k)
2. sitten -> sittin (substitute i for e)
3. sittin -> sitting (insert g at the end)
The Jaro-Winkler distance is a variant of Jaro string similarity metrics. Jaro distance combines
matches and transpositions for two strings s1 and s2:
d_j = 1/3 * ( m/|s1| + m/|s2| + (m - t)/m )
The symbol m means the number of matching characters and the symbol t is the number of
transpositions, while |s| denotes the length of a string. In order to define characters as
matching, their position in the strings must be close together, which is defined by character
position distance CPD; they should be no farther than the following formula calculates:
floor( max(|s1|, |s2|) / 2 ) - 1
Jaro-Winkler distance uses a prefix scale p, which gives more favorable ratings to strings that
match from the beginning.
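In its standard definition, the Jaro-Winkler distance builds on the Jaro distance as

d_w = d_j + l * p * (1 - d_j)

where l is the length of the common prefix at the start of both strings (counted up to a maximum of four characters), and: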
p is a scaling factor for common prefixes; p should not exceed 0.25, otherwise the
distance can become larger than 1 (usually p is equal to 0.1)
Finally, the Simil algorithm looks for the longest common substring in two strings. Then it
removes this substring from the original strings. After that, it searches for the next longest
common substring in remainders of the two original strings from the left and the right. It
continues this process recursively until no common substrings of a defined length (e.g., two
characters) are found. Finally, the algorithm calculates a coefficient between 0 and 1 by dividing
the sum of the lengths of the substrings by the lengths of the strings themselves.
Preparing the Data
In order to compare the functions' efficiency, we need some data. The code in Listing 1 prepares
sample data. First, it prepares two tables. The CustomersMaster table will be the master table,
which is a table with keys that we want to keep and transfer to the target table. The
CustomersTarget table is the target of identity mapping; it will receive the keys from the
CustomersMaster table. We fill both tables from the
AdventureWorksDW2008R2.dbo.vTargetMail view. In the target table, we keep the original
keys, multiplied by -1, in order to control the efficiency of merging. The target table also has an
empty column to store the key from the master table (MasterCustomerId) and one additional
column (Updated) that will be used only in the code that produces errors in the data. Besides
common columns (Fullname, StreetAddress, and CityRegion), the two tables also have some
different columns, to show that identical structures are not necessary. All we need is at least
one character column in common, a column that is used to compare values.
Listing 1: Code to Prepare Sample Data
-- Assuming that MDSBook database exists
USE MDSBook;
GO
We can perform matching based on FullName, StreetAddress, and CityRegion and get 100
percent accurate results. Of course, we have the same data in both tables. In order to test the
functions, we have to perform some updates in the target table and produce some errors in the data.
After the update, we can check how many rows are different in the target table for the original
rows, which are still available in the master table. The maximum number of updates per row is
9; the more times a row was updated, the more common attribute values differ from the
original (and correct) ones. The probability that a row is updated many times drops quite quickly
with higher numbers of updates. The query in Listing 3 compares full names and addresses after
the update, selecting only rows with some changes and sorting them by the number of updates
in descending order, so we get the rows with maximal number of updates on the top.
Listing 3: Query to Compare Full Names and Addresses after Update
SELECT
m.FullName
,t.FullName
,m.StreetAddress
,t.StreetAddress
,m.CityRegion
,t.CityRegion
,t.Updated
FROM dbo.CustomersMaster AS m
INNER JOIN dbo.CustomersTarget AS t
ON m.CustomerId = t.CustomerId * (-1)
WHERE m.FullName <> t.FullName
OR m.StreetAddress <> t.StreetAddress
OR m.CityRegion <> t.CityRegion
ORDER BY t.Updated DESC;
The partial result in Figure 2 shows that three rows were updated 7 times (note that in order to
fit the number of updates of a row into the figure, CityRegion columns before and after the
update are omitted from the figure). Altogether, 7,790 rows were updated in this test. You
should get different results every time you run this test, because the updates are done
randomly (or better, with a controlled randomness, no matter how paradoxical this sounds).
You can also see that the values in the rows that were updated many times differ quite a lot
from the original values; therefore, our string matching algorithms are going to have a hard time
finding similarities.
SELECT
 m.CustomerId
,m.FullName
,t.FullName
,DIFFERENCE(m.FullName, t.Fullname) AS SoundexDifference
,SOUNDEX(m.FullName) AS SoundexMaster
,SOUNDEX(t.FullName) AS SoundexTarget
FROM dbo.CustomersMaster AS m
INNER JOIN dbo.CustomersTarget AS t
ON m.CustomerId = t.CustomerId * (-1)
ORDER BY SoundexDifference;
The results in Figure 3 show that the DIFFERENCE() function based on the SOUNDEX() code did
not find any similarity for quite a few full names. However, as you can see from the highlighted
row, the name Zoe Rogers was not changed much; there should be some similarity in the strings
Zoe Rogers and rZoeRogers. This proves that the two functions included in T-SQL are not
efficient enough for a successful identity mapping.
Figure 3: RESULTS OF TESTING THE SOUNDEX AND DIFFERENCE FUNCTIONS ON FULL NAMES
You could continue with checking the two T-SQL functions using the street address and city and
region strings.
All of the four algorithms implemented in the mdq.Similarity function in the MDS database
return similarity as a number between zero and one. A higher number means better similarity.
The query in Listing 5 checks how algorithms perform on full names. Again, because we retained
the original key (although multiplied by -1) in the target table, we can make an exact join,
compare the original (master) and changed (target) names, and visually evaluate which
algorithm gives the highest score.
Listing 5: Code to Check Algorithm Performance on Full Names
SELECT
 m.CustomerId
 ,m.FullName
 ,t.FullName
 ,mdq.Similarity(m.FullName, t.Fullname, 0, 0.85, 0.00) AS Levenshtein
 ,mdq.Similarity(m.FullName, t.Fullname, 1, 0.85, 0.00) AS Jaccard
 ,mdq.Similarity(m.FullName, t.Fullname, 2, 0.85, 0.00) AS JaroWinkler
 ,mdq.Similarity(m.FullName, t.Fullname, 3, 0.85, 0.00) AS Simil
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
ORDER BY Levenshtein;
For more information about the mdq.Similarity parameters, see SQL Server Books Online.
Figure 4: RESULTS OF CHECKING ALGORITHM PERFORMANCE ON FULL NAMES FOR THE MDQ.SIMILARITY FUNCTION
From the results, we can see that the Jaro-Winkler algorithm gives the highest scores in our
example. You should check the algorithms on street address and city and region strings as well.
Although we cannot say that the Jaro-Winkler algorithm would always perform the best, and
although you should always check how algorithms perform on your data, we can say that this is
not a surprise. Jaro-Winkler is one of the most advanced public algorithms for string matching.
In addition, it forces higher scores for strings with the same characters in the beginning of the
string. From experience, we noticed that errors in data, produced by humans, typically do not
appear in the first few characters. Therefore, it seems that the Jaro-Winkler algorithm is the
winner in this case, and we will use it to do the real matching. However, before we do the
match, we need to discuss optimizing the matching by trying to avoid a full cross join.
Optimizing Mapping with Partitioning
As mentioned, identity matching is a quadratic problem. The search space dimension is equal to the cardinality of the Cartesian product A x B of the two sets included in the match. There are multiple search space reduction techniques.
A partitioning or blocking technique partitions at least one set involved in the matching into blocks. For example, take the target rows in batches and match each batch against the full master table.
In the next step, we select (nearly) randomly 1,000 rows from the target table and perform a
match with all rows from the master table. We measure the efficiency of the partitioning
technique and of the Jaro-Winkler string similarity algorithm. We select rows from the target
(nearly) randomly in order to prevent potential bias that we might get by selecting, for example,
rows based on CustomerId. We use the NEWID() function to simulate randomness. The values
The next CTE, which Listing 8 shows, performs a cross join between the 1,000-row block from the target table and the full master table. It also compares the strings and adds the Jaro-Winkler similarity coefficient to the output. In addition, it adds to the output the row number, sorted by the Jaro-Winkler similarity coefficient in descending order and partitioned by the target table key. A row number equal to 1 means the highest Jaro-Winkler similarity coefficient for a specific target table key. This way, we mark the master table row that, according to the Jaro-Winkler algorithm, is the most similar to the target row. The last part of the batch is the UPDATE statement itself, which assigns to each target row the key of the master row marked with row number 1.
Finally, in the last statements of the batch, we use the fact that we know what key from the
master table we should receive in the target table. We are just counting how many rows got a
wrong key and comparing this number to the number of rows we updated in this pass. In
addition, we are also measuring the time needed to execute this update, as Listing 9 shows.
Listing 9: Code to Measure Efficiency
-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
,100.0 * @RowsUpdated / 7790 AS PctUpdated
,COUNT(*) AS NumErrors
,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);
After we execute the update, we can harvest the results of the efficiency measurement, as
Figure 5 shows.
We do not show the code and results of checking the content of this table here. Let's directly create the nGrams frequency table, and store the frequencies there, with the code in Listing 11.
Listing 11: Code to Create the nGrams Frequency Table
CREATE TABLE dbo.CustomersMasterNGramsFrequency
(
Token char(4) NOT NULL,
Cnt int NOT NULL,
The following query's results display which nGrams values are the most frequent, as Figure 6 shows.
SELECT *
FROM dbo.CustomersMasterNGramsFrequency
ORDER BY RowNo DESC;
sp_spaceused 'dbo.CustomersMaster';
sp_spaceused 'dbo.CustomersTarget';
sp_spaceused 'dbo.CustomersMasterNGrams';
sp_spaceused 'dbo.CustomersMasterNGramsFrequency';
Now we can filter out exact matches in all of the following steps. If we have a large number of
rows, we can also use only a batch of still unmatched rows of the target table combined with
nGrams pre-selecting. This means we could actually combine the nGrams filtering with the
partitioning technique. However, because the number of rows to match is already small enough
in this proof of concept project, we do not split the unmatched rows in the target table in
batches here. We will measure the efficiency of the matching on the fly, while we are doing the
match. The first part of the code, as in the partitioning technique, declares two variables to
store the number of rows updated in this pass and the start time.
The actual merging query uses four CTEs. In the first one, we are tokenizing the target table
rows into 4Grams on the fly. Because we are using 4Grams, the n parameter is equal to 4 in this
pass. This part of code is shown in Listing 13.
Listing 13: Variable Declaration and the First CTE
-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Tokenize target table rows
WITH CustomersTargetNGrams AS
(
SELECT t.CustomerId AS TCid
,g.Token AS TToken
,t.MasterCustomerId
FROM dbo.CustomersTarget AS t
CROSS APPLY (SELECT Token, Sequence
FROM mdq.NGrams(UPPER(t.FullName +
t.StreetAddress +
t.CityRegion), 4, 0)) AS g
WHERE t.MasterCustomerId IS NULL
AND CHARINDEX(' ', g.Token) = 0
AND CHARINDEX('0', g.Token) = 0
),
The next CTE selects only target rows with 4Grams, with absolute frequency less than or equal
to 20. Thus, the p parameter is 20 in this case. The code is shown in Listing 14.
Listing 14: The Second CTE
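This CTE is given here only as a sketch; the column names of the dbo.CustomersMasterNGrams table (CustomerId and Token) are our assumptions.

-- Pairs of target and master rows that share a 4Gram whose frequency is at most 20
NGramsMatch1 AS
(
SELECT t.TCid
      ,m.CustomerId AS MCid
FROM CustomersTargetNGrams AS t
 INNER JOIN dbo.CustomersMasterNGramsFrequency AS f
  ON t.TToken = f.Token
 INNER JOIN dbo.CustomersMasterNGrams AS m
  ON t.TToken = m.Token
WHERE f.Cnt <= 20
),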
In the third CTE, we are selecting only matches that have in common at least three less frequent
4Grams. Thus, the m parameter is 3 in this example, as shown in Listing 15.
Listing 15: The Third CTE
-- Matches that have in common at least three less frequent 4Grams
NGramsMatch2 AS
(
SELECT TCid
,MCid
,COUNT(*) AS NMatches
FROM NGramsMatch1
GROUP BY TCid, MCid
HAVING COUNT(*) >= 3
),
The last CTE then compares the strings and adds the Jaro-Winkler similarity coefficient to the
output. In addition, it adds the row number sorted by Jaro-Winkler similarity coefficient, in
descending order and partitioned by the target table key, to the output. Row number equal to 1
means the row with the highest Jaro-Winkler similarity coefficient for a specific target table key.
This way, we marked the master table row that, according to Jaro-Winkler algorithms, is the
most similar to the target row. The code for the fourth CTE that calculates the Jaro-Winkler
coefficient and the row number is shown in Listing 16. Fortunately, this is the last CTE in this
long query; after this CTE, the only thing left to do is the actual UPDATE. It updates the target
table rows with the key from the master table row with the highest similarity. The UPDATE
statement is also shown in Listing 16.
Listing 16: The Fourth CTE and the UPDATE Statement
We use the code in Listing 17 to measure the efficiency, and show the results in Figure 8.
Listing 17: Measuring the Efficiency
-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
,100.0 * @RowsUpdated / 7790 AS PctUpdated
,COUNT(*) AS NumErrors
,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);
Method            n    p    m    Rows to match  Rows matched  Percent matched  Number of errors  Percent of errors  Elapsed time (s)
Partitions        NA   NA   NA   7790           1000          12.83            63                6.30               133
Partitions        NA   NA   NA   7790           2000          25.67            93                4.65               262
nGrams filtering  ...  20   ...  7790           7391          94.88            232               3.14               223
nGrams filtering  3    20   2    7790           7717          99.06            322               4.17               264
nGrams filtering  ...  50   ...  7790           7705          96.39            48                0.64               11
nGrams filtering  ...  50   ...  7790           7716          99.05            106               1.37               14
nGrams filtering  ...  20   ...  7790           6847          87.89            33                0.48               ...
nGrams filtering  ...  20   ...  7790           7442          95.53            145               1.95               ...
nGrams filtering  ...  10   ...  7790           6575          84.40            188               2.86               ...
nGrams filtering  ...  10   ...  7790           5284          67.83            28                0.53               ...
(n, p, and m denote the nGram length, the token frequency threshold, and the minimum number of common nGrams.)
As you can see from the table, it is important to choose the correct values for the parameters of
the nGrams filtering method. For example, when using two common 3Grams with absolute
frequency less than or equal to 20, the performance (elapsed time 264s) and the accuracy (4.17
percent of errors) were not really shining. Nevertheless, the table proves that the nGrams
filtering method works. In all cases in the table, it is more efficient than the partitioning method.
With proper values for the parameters, it gives quite astonishing results.
In addition, note that in a real example, not as many errors would be present in the target table
rows, and the results could be even better. We should also try matching with other algorithms,
to check which one is the most suitable for our data. In addition, we could start with more strict
values for the three parameters and have even fewer errors in the first pass. So, did we find the
best possible method for identity mapping? Before making such a strong conclusion, let's try the
last option for matching we have in SQL Server out of the box, the SSIS Fuzzy Lookup data flow
transformation.
Of course, even Fuzzy Lookup cannot perform magic. If we try to match every row from the left
table with every row from the right table, we still get a full cross join. Therefore, it is wise to do
some optimization here as well.
In the SSIS package, we start with exact matches again, using the Lookup transformation. The
target table is the source, and the master table is the lookup table for the Lookup
transformation. The rows that did not match (i.e., the Lookup no Match Output) are then
Figure 12: THE ADVANCED TAB OF THE FUZZY LOOKUP TRANSFORMATION EDITOR
Similarity Threshold
The closer the value of the similarity threshold is to 1, the closer the resemblance of the
lookup value to the source value must be to qualify as a match. Increasing the threshold
can improve the speed of matching because fewer candidate records need to be
considered. You can optimize the matching by having multiple Fuzzy Lookups. First you
match rows with higher similarity, then you lower the similarity threshold, then lower
again, and so on. Because you need to match fewer rows each time, you can control the performance.
17. Drag and drop the Union All transformation to the working area. Connect it with
the Lookup Match Output from the Lookup transformation and with the default
output from the Fuzzy Lookup transformation.
From this number, it is easy to calculate the number of matched rows. In my case, the number of unmatched rows was 608. Because I had 7,790 rows to match, this means that Fuzzy Lookup actually matched 92.20 percent of them. With the following query we can measure the number of errors:
number of errors:
SELECT *
FROM dbo.FuzzyLookupMatches
WHERE CustomerId <> MCid * (-1);
In my case, the result was fantastic: no errors at all. At first glance, it looked too good to be true. Therefore, I ran many additional tests, and only here and there got an error. Out of more than 20 tests, I got two errors only once!
What about the execution time? After you execute the package, you can read execution time
using the Execution Results tab of the SSIS Designer. In my case, execution (or elapsed) time was
15 seconds. However, this time included the time needed to read the data, to perform exact
matches with the Lookup transformation, to run the Fuzzy Lookup transformation, union the
data, and write the data to the destination.
De-Duplicating
De-duplicating is a very similar problem to identity mapping; more accurately, it is actually the same problem. SSIS has a separate Fuzzy Grouping transformation for this task. It groups rows based on string similarities. Let's see how we can use either of the two Fuzzy transformations for both problems.
Let's start with using Fuzzy Grouping for identity mapping. This is a very simple task. We can just make a non-distinct union (in T-SQL, use UNION ALL rather than the UNION operator) of all rows from both the master and target tables. Then we can perform the grouping on this union. The result is the same as we would get with identity mapping.
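For example, the input for the Fuzzy Grouping transformation could be prepared with a query like the following sketch, keeping only the common columns:

-- Non-distinct union of master and target rows as Fuzzy Grouping input
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersMaster
UNION ALL
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersTarget;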
To turn the problem around: How could we perform de-duplication with Fuzzy Lookup or
another string similarity merging? We should use the same table twice, once as the master and
once as the target table. We would immediately exclude exact matches for all the character
columns and with the same keys (i.e., matches of a row with itself). In the next step, we would
perform the identity mapping of exact matches of character columns of rows with different
keys. Then we would perform approximate matching on the rest of the rows. Finally, we could
delete all the rows that got a match (i.e., the same identification) except one. The Excel Fuzzy
Lookup add-in can actually be used for de-duplicating as well. For the left and the right table for
matching, you can define the same Excel table.
Nevertheless, you might find it easier to perform the de-duplicating by using the SSIS Fuzzy Grouping transformation. Therefore, let's test it. Before testing, let's prepare some demo data. We will add all the rows from the CustomersTarget table to the CustomersMaster table and then try to de-duplicate them.
We can add the rows with an INSERT query, as shown in Listing 19.
Listing 19: Adding Duplicate Rows to the CustomersMaster Table
There will be some additional work after Fuzzy Grouping finishes. The transformation adds the
following columns to the output:
_key_in, a column that uniquely identifies each row for the transformation;
_key_out, a column that holds the value of the _key_in column of the canonical data row. The canonical row is the row that Fuzzy Grouping identified as the most plausible correct row and used for comparison (i.e., the row used for standardizing the data). Rows with the same value in _key_out are therefore part of the same group. We could then keep only the canonical row from each group;
_score, a value between 0 and 1 that indicates the similarity of the input row to the
canonical row. For the canonical row, the _score has a value of 1.
In addition, Fuzzy Grouping adds columns used for approximate string comparison with clean
values. Clean values are the values from the canonical row. In our example, these columns are
the FullName_clean, StreetAddress_clean, and CityRegion_clean columns. Finally, the
transformation adds columns with similarity scores for each character column used for
approximate string comparison. In our example, these are the _Similarity_FullName,
_Similarity_StreetAddress, and _Similarity_CityRegion columns.
SELECT
 _key_out
,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
ORDER BY NumberOfDuplicates DESC;
Figure 21: RESULTS OF CHECKING THE ROWS WITH HIGH NUMBER OF DUPLICATES
From the results, we can see that the canonical row was not identified properly for all the rows.
For example, the canonical row for the sixth row (i.e., the row with CustomerId -25321) should
be the second row (i.e., the row with CustomerId 25321). Many correct rows with positive
CustomerId values were identified incorrectly as duplicates of the first row (the row with
CustomerId equal to 25320), which was identified as the canonical row for this group.
Apparently, we would have to perform more manual work in order to finish de-duplicating by
using the Fuzzy Grouping than by using the Fuzzy Lookup transformation. Of course, we should
play more with Fuzzy Grouping with different similarity threshold settings. We could perform a
consecutive procedure using de-duplicating with a high similarity threshold first, then lower it a
bit, and then lower it more, and so on. Nevertheless, it seems that the Fuzzy Lookup
transformation could be more suitable for de-duplicating than Fuzzy Grouping. Not only did it
give us better results, but it also easily managed to outperform Fuzzy Grouping.
Clean-Up
To clean up the MDSBook database, use the code from Listing 22.
Listing 22: Clean-Up Code
USE MDSBook;
Summary
As you can see in this chapter, identity mapping and de-duplicating are not simple. We
developed a custom algorithm for identity mapping using the functions from the MDS database.
We tested it on data from the AdventureWorksDW2008R2 demo database. We made errors in
the data in a controllable way, so we could measure the results of the tests throughout the
chapter. Through the tests, we realized that the quality of the results and the performance of
our de-duplicating algorithm are highly dependent on proper selection of parameters.
After testing the manual procedure, we used the SSIS Fuzzy Lookup transformation for de-duplicating. We also introduced the Fuzzy Lookup add-in for Excel, which brings the power of
this transformation to advanced users on their desktops. We showed that de-duplicating is
actually the same problem as identity mapping. Nevertheless, we also tested the SSIS Fuzzy
Grouping transformation.
According to the tests in this chapter, the Fuzzy Lookup transformation is a clear winner. It gave us better results than any other option, including Fuzzy Grouping, and it did so with quite astonishing performance. Nevertheless, this does not mean you should always use Fuzzy Lookup for identity mapping and de-duplicating. You should test the other possibilities on your data as well. The tests presented in this chapter are quite exhaustive, so they should take much of the heavy work of identity mapping and de-duplicating off your shoulders.
This is the last chapter in this version of the book. However, stay tuned; there are many exciting new features coming with the next release of SQL Server, code-named Denali. Microsoft has already announced a rewritten Master Data Services, improved Integration Services, and a completely new application called Data Quality Services. We will update this book to incorporate these new technologies when they become available.