Chapter 6

knowledge developments
Copyright
© Attribution Non-Commercial (BY-NC)
Chapter 6 Basics of Data Integration

Fundamentals of Business Analytics RN Prasad and Seema Acharya

Learning Objectives and Learning Outcomes


Learning Objectives
1. Concepts of data integration
2. Needs and advantages of using data integration
3. Introduction to common data integration approaches
4. Metadata types and sources
5. Introduction to data quality
6. Data profiling concepts and applications

Learning Outcomes
(a) To realize the importance of metadata
(b) To understand data quality
(c) To be able to perform scrubbing/cleaning of data
(d) To be able to apply de-duplication
(e) To be able to enhance the quality of data


Fundamentals of Business Analytics RN Prasad and Seema Acharya Copyright 2011 Wiley India Pvt. Ltd. All rights reserved.

Session Plan

Lecture time: 90 minutes
Q/A: 15 minutes

Agenda
- BI: the process
- What is Data Integration?
- Challenges in Data Integration
- Technologies in Data Integration
- ETL: Extract, Transform, Load
- Various stages in ETL
- Need for Data Integration
- Advantages of using Data Integration
- Common approaches to Data Integration
- Metadata and its types
- Data Quality and Data Profiling concepts

BI The Process

Data Integration

Data Analysis

Reporting

What Is Data Integration?

Process of coherent merging of data from various data sources and presenting a cohesive/consolidated view to the user

Involves combining data residing at different sources and providing users with a unified view of the data.

Significant in a variety of situations, both:
- Commercial (e.g., two similar companies trying to merge their databases)
- Scientific (e.g., combining research results from different bioinformatics research repositories)

Answer a Quick Question

According to your understanding, what are the problems faced in data integration?

Challenges in Data Integration


Development challenges
- Translation of relational database to object-oriented applications
- Consistent and inconsistent metadata
- Handling redundant and missing data
- Normalization of data from different sources

Technological challenges
- Various formats of data
- Structured and unstructured data
- Huge volumes of data

Organizational challenges
- Unavailability of data
- Manual integration risk, failure

Technologies in Data Integration


Integration is divided into two main approaches:
- Schema integration: reconciles schema elements
- Instance integration: matches tuples and attribute values

The technologies that are used for data integration include:
- Data interchange
- Object brokering
- Modeling techniques
  - Entity-relationship modeling
  - Dimensional modeling
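The two approaches can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the source systems, field names and the choice of email address as the matching key are all hypothetical and do not come from the slides.

```python
# Hypothetical sketch: all field names and records are invented for illustration.

# Schema integration: reconcile differently named schema elements
# by mapping each source field onto one shared target schema.
CRM_TO_TARGET = {"cust_name": "name", "mail": "email"}
BILLING_TO_TARGET = {"full_name": "name", "email_address": "email"}


def to_target_schema(record, field_map):
    """Rename the fields of one source record to the shared target names."""
    return {field_map[field]: value for field, value in record.items()}


def integrate_instances(records):
    """Instance integration: match tuples that describe the same entity.

    Records are matched on a normalized email address (an assumption);
    attribute values from later records fill gaps left by earlier ones.
    """
    merged = {}
    for record in records:
        key = record["email"].strip().lower()
        entity = merged.setdefault(key, {})
        for field, value in record.items():
            if not entity.get(field):
                entity[field] = value
    return list(merged.values())


crm = [{"cust_name": "Asha Rao", "mail": "Asha@Example.com"}]
billing = [{"full_name": "", "email_address": "asha@example.com "}]

unified = integrate_instances(
    [to_target_schema(r, CRM_TO_TARGET) for r in crm]
    + [to_target_schema(r, BILLING_TO_TARGET) for r in billing]
)
```

Here the two source records collapse into a single entity because their email addresses agree once case and whitespace are normalized.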

Various Stages in ETL


- Cycle initiation
- Build reference data
- Extract (actual data)
- Validate
- Transform (clean, apply business rules)
- Stage (load into staging tables)
- Audit reports (success/failure log)
- Publish (load into target tables)
- Archive
- Clean up

Diagram labels: Data Mapping, Data Staging

Various Stages in ETL


[Figure: ETL stages (Reference, Extract, Validate, Transform, Stage, Audit Reports, Publish, Archive), with the labels Data Mapping and Data Staging]

Extract, Transform and Load


What is ETL?
Extract, transform, and load (ETL) in database usage (and especially in data warehousing) involves:
- Extracting data from different sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

ETL makes it possible to create efficient and consistent databases.
Although ETL is usually discussed in the context of a data warehouse, the term in fact refers to a process that loads any database.
ETL implementations usually store an audit trail of successful and failed process runs.
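As a rough illustration of the three steps and the audit trail, here is a minimal Python sketch using only the standard library. The CSV content, the table names `sales` and `etl_audit`, and the validation rule are assumptions made for this example, not part of any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from files or source systems.
SOURCE_CSV = "id,amount\n1,10.50\n2,not-a-number\n3,7.25\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and set aside rows that fail validation."""
    good, rejected = [], []
    for row in rows:
        try:
            good.append((int(row["id"]), float(row["amount"])))
        except ValueError:
            rejected.append(row)
    return good, rejected


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


def run_etl(conn, text):
    """Run one ETL cycle and record its outcome in an audit table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_audit "
        "(status TEXT, loaded INTEGER, rejected INTEGER)"
    )
    try:
        good, rejected = transform(extract(text))
        load(conn, good)
        conn.execute(
            "INSERT INTO etl_audit VALUES ('success', ?, ?)",
            (len(good), len(rejected)),
        )
    except sqlite3.Error:
        # Undo any partial load, then log the failed run.
        conn.rollback()
        conn.execute("INSERT INTO etl_audit VALUES ('failure', 0, 0)")
    conn.commit()


conn = sqlite3.connect(":memory:")
run_etl(conn, SOURCE_CSV)
audit = conn.execute("SELECT status, loaded, rejected FROM etl_audit").fetchall()
```

The audit table receives one row per run, so both successful and failed runs leave a trace.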

Data Mapping
Data mapping is the process of creating data element mappings between two distinct data models.
It is used as the first step towards a wide variety of data integration tasks.
The various methods of data mapping are:
- Hand-coded, graphical manual: graphical tools that allow a user to draw lines from fields in one set of data to fields in another
- Data-driven mapping: evaluating actual data values in two data sources using heuristics and statistics to automatically discover complex mappings
- Semantic mapping: a metadata registry can be consulted to look up data element synonyms
  - If the destination column does not match the source column, the mapping is still made if these data elements are listed as synonyms in the metadata registry
  - It is only able to discover exact matches between columns of data and will not discover any transformation logic or exceptions between columns
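Semantic mapping can be sketched in a few lines of Python. The registry contents and column names here are hypothetical; a real metadata registry would be a managed service or repository rather than an in-memory list.

```python
# Hypothetical metadata registry: each set lists names treated as synonyms.
SYNONYM_REGISTRY = [
    {"customer_id", "cust_id", "client_id"},
    {"zip", "zipcode", "postal_code"},
]


def are_synonyms(source_col, target_col):
    """Return True if the registry lists both names in the same synonym set."""
    return any(
        source_col in group and target_col in group for group in SYNONYM_REGISTRY
    )


def semantic_mapping(source_cols, target_cols):
    """Map each source column to a target column by exact name or by synonym."""
    mapping = {}
    for source in source_cols:
        for target in target_cols:
            if source == target or are_synonyms(source, target):
                mapping[source] = target
                break
    return mapping


mapping = semantic_mapping(
    ["cust_id", "zip", "first_name"],
    ["customer_id", "postal_code", "full_name"],
)
```

Note that `first_name` stays unmapped: relating it to `full_name` would need transformation logic, which this approach cannot discover.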

Data Staging
A data staging area is an intermediate storage area between the sources of information and the Data Warehouse (DW) or Data Mart (DM).
A staging area can be used for any of the following purposes:
- Gather data from different sources at different times
- Load information from the operational database
- Find changes against current DW/DM values
- Data cleansing
- Pre-calculate aggregates
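One of these purposes, finding changes against current DW/DM values, might look like the following Python sketch. The records and the idea of keying rows by an integer ID are assumptions made purely for illustration.

```python
def find_changes(staged, warehouse):
    """Compare staged rows with current warehouse rows, both keyed by ID."""
    inserts = {k: v for k, v in staged.items() if k not in warehouse}
    updates = {
        k: v for k, v in staged.items() if k in warehouse and warehouse[k] != v
    }
    unchanged = [k for k in staged if k in warehouse and warehouse[k] == staged[k]]
    return inserts, updates, unchanged


# Hypothetical rows: ID -> (name, city).
warehouse = {1: ("Asha", "Pune"), 2: ("Vikram", "Delhi")}
staged = {1: ("Asha", "Pune"), 2: ("Vikram", "Mumbai"), 3: ("Meera", "Chennai")}

inserts, updates, unchanged = find_changes(staged, warehouse)
```

Only the rows in `inserts` and `updates` would then need to be carried forward to the warehouse.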

Data Extraction
Extraction is the operation of extracting data from the source system for further use in a data warehouse environment.
This is the first step in the ETL process.
Designing this process means making decisions about the following main aspects:
- Which extraction method should I choose?
- How do I provide the extracted data for further processing?

Data Extraction (cont)


The data has to be extracted both logically and physically.

The logical extraction methods:
- Full extraction
- Incremental extraction

The physical extraction methods:
- Online extraction
- Offline extraction
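The difference between the two logical methods can be sketched with Python's built-in `sqlite3` module. The `orders` table and its `modified_at` column are hypothetical; using a last-modified timestamp as a watermark is one common way to do incremental extraction, not the only one.

```python
import sqlite3

# Hypothetical source table with a last-modified column used as a watermark.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, total REAL, modified_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 20.0, "2011-01-01"),
        (2, 35.5, "2011-01-02"),
        (3, 12.0, "2011-01-03"),
    ],
)


def full_extraction(conn):
    """Full extraction: pull every row, regardless of earlier runs."""
    return conn.execute(
        "SELECT id, total, modified_at FROM orders ORDER BY id"
    ).fetchall()


def incremental_extraction(conn, last_run):
    """Incremental extraction: pull only rows changed after the previous run."""
    return conn.execute(
        "SELECT id, total, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY id",
        (last_run,),
    ).fetchall()


all_rows = full_extraction(source)
new_rows = incremental_extraction(source, "2011-01-01")
```

The comparison works on plain strings here because ISO-formatted dates sort in chronological order.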

Data Transformation
Transformation is the most complex and, in terms of production, the most costly part of the ETL process.
Transformations can range from simple data conversion to extreme data scrubbing techniques.
From an architectural perspective, transformations can be performed in two ways:
- Multistage data transformation
- Pipelined data transformation

Data Transformation

[Figure: Multistage transformation]
Load into staging table -> STAGE_01 TABLE -> Validate customer keys -> STAGE_02 TABLE -> Convert source keys to warehouse keys -> STAGE_03 TABLE -> Insert into warehouse table -> TARGET TABLE

Data Transformation

[Figure: Pipelined transformation]
EXTERNAL TABLE -> Validate customer keys -> Convert source keys to warehouse keys -> Insert into warehouse table -> TARGET TABLE
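The contrast between the two architectures can be approximated in Python. This is a sketch under assumptions: lists stand in for staging tables, generators stand in for a streaming pipeline, and the key map and rows are invented.

```python
# Hypothetical key lookup: source customer keys mapped to warehouse keys.
KEY_MAP = {"C-1": 101, "C-2": 102}
ROWS = [("C-1", 50.0), ("C-9", 10.0), ("C-2", 75.0)]


def multistage(rows):
    """Multistage: each step materializes a full intermediate table."""
    stage_01 = list(rows)  # load into staging table
    stage_02 = [r for r in stage_01 if r[0] in KEY_MAP]  # validate customer keys
    stage_03 = [(KEY_MAP[key], amount) for key, amount in stage_02]  # convert keys
    return stage_03  # insert into warehouse table


def validate(rows):
    """Yield only rows whose customer key is known."""
    for row in rows:
        if row[0] in KEY_MAP:
            yield row


def convert(rows):
    """Yield rows with the source key replaced by the warehouse key."""
    for key, amount in rows:
        yield (KEY_MAP[key], amount)


def pipelined(rows):
    """Pipelined: rows stream through all steps with no intermediate tables."""
    return list(convert(validate(rows)))


target_multistage = multistage(ROWS)
target_pipelined = pipelined(ROWS)
```

Both variants produce the same target rows; they differ in whether intermediate results are stored between steps.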

Data Loading
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. The timing and scope to replace or append into the DW are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.
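The replace-or-append choice and a simple load history can be sketched as follows. The table names `dw_sales` and `load_history` and the `mode` parameter are hypothetical names chosen for this example.

```python
import sqlite3


def load(conn, rows, mode):
    """Load rows into the target; 'replace' clears it first, 'append' keeps it."""
    if mode not in ("replace", "append"):
        raise ValueError("mode must be 'replace' or 'append'")
    if mode == "replace":
        conn.execute("DELETE FROM dw_sales")
    conn.executemany("INSERT INTO dw_sales (id, amount) VALUES (?, ?)", rows)
    # Record every load so that the history of changes can be audited later.
    conn.execute(
        "INSERT INTO load_history (mode, row_count) VALUES (?, ?)",
        (mode, len(rows)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE load_history (mode TEXT, row_count INTEGER)")

load(conn, [(1, 10.0), (2, 20.0)], "replace")
load(conn, [(3, 30.0)], "append")
load(conn, [(4, 40.0)], "replace")

row_count = conn.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
history = conn.execute("SELECT mode, row_count FROM load_history").fetchall()
```

After the final replace the target holds only the latest batch, while the history table still records all three loads.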


Answer a Quick Question

According to your understanding, what is the need for data integration in the corporate world?

Need for Data Integration


- Integration is done to provide data in a specific view as requested by users, applications, etc.
- The bigger the organization gets, the more data there is and the more data needs integration.
- The need for integration increases with the need for data sharing.

What does it mean?
[Figure: DB2, Oracle and SQL sources combined into a unified view of data]

Advantages of Using Data Integration


- Of benefit to decision-makers, who have access to important information from past studies
- Reduces cost, overlaps and redundancies; reduces exposure to risks
- Helps to monitor key variables like trends and consumer behaviour, etc.

What does it mean?
[Figure: DB2, Oracle and SQL sources combined into a unified view of data]

Common Approaches to Data Integration


Data Integration Approaches


There are currently various methods for performing data integration. The most popular ones are:
- Federated databases
- Memory-mapped data structure
- Data warehousing

Data Integration Approaches


Federated database (virtual database):
- A type of meta-database management system which transparently integrates multiple autonomous databases into a single federated database
- The constituent databases are interconnected via a computer network and may be geographically decentralized
- A federated database is the fully integrated, logical composite of all constituent databases in a federated database management system

Memory-mapped data structure:
- Useful when in-memory data manipulation is needed and the data structure is large
- It is mainly used on the .NET platform, typically with C# or VB.NET
- It is a much faster way of accessing the data than using MemoryStream
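The slide describes memory mapping on the .NET platform. As a language-neutral illustration of the same idea, here is a small sketch using Python's standard `mmap` module instead; the temporary file, offsets and helper names are all assumptions made for the example.

```python
import mmap
import os
import tempfile


def write_sample_file(size):
    """Create a temporary file of `size` zero bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"\x00" * size)
    return path


def patch_in_place(path, offset, data):
    """Map the file into memory and overwrite a slice of it in place."""
    with open(path, "r+b") as handle:
        with mmap.mmap(handle.fileno(), 0) as view:
            view[offset:offset + len(data)] = data
            view.flush()


def read_slice(path, offset, length):
    """Read a slice of the file through a read-only memory map."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


path = write_sample_file(1024)
patch_in_place(path, 512, b"hello")
result = read_slice(path, 512, 5)
os.remove(path)
```

The mapped view behaves like a byte array, so a large file can be read or patched at arbitrary offsets without loading it whole.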

Data Integration Approaches


Data Warehousing
The primary concepts used in data warehousing are:
- ETL (Extract, Transform, Load)
- Component-based (Data Mart)
- Dimensional Models and Schemas
- Metadata driven

Answer a Quick Question

According to your understanding, what are the advantages and limitations of a data warehouse?

Data Warehouse Advantage and Limitations


ADVANTAGES
- Integration at the lowest level, eliminating the need for integration queries
- Runtime schematic cleaning is not needed; it is performed in the data staging environment
- Independent of the original data source
- Query optimization is possible

LIMITATIONS
- The process takes a considerable amount of time and effort
- Requires an understanding of the domain
- More scalable when accompanied by a metadata repository (increased load)
- Tightly coupled architecture

Metadata and Its Types


Metadata and Its Types


- Business metadata (WHAT): data definitions, metrics definitions, subject models, data models, business rules, data rules, data owners/stewards, etc.
- Process metadata (HOW): source/target maps, transformation rules, data cleansing rules, extract audit trail, transform audit trail, load audit trail, data quality audit, etc.
- Technical metadata (TYPE): data locations, data formats, technical names, data sizes, data types, indexing, data structures, etc.
- Application metadata (WHO, WHEN): data access history (Who is accessing? Frequency of access? When accessed? How accessed?), etc.

Data Quality and Data Profiling


Building Blocks of Data Quality Management


Analyze, Improve and Control
This methodology is used to encompass people, processes and technology.
It is achieved through five methodological building blocks, namely:
- Profiling
- Quality
- Integration
- Enrichment
- Monitoring

Data Profiling
Data improvement efforts start with knowing where to begin.
Data profiling (sometimes called data discovery or data quality analysis) helps to gain a clear perspective on the current integrity of data. It helps to:
- Discover the quality, characteristics and potential problems
- Reduce the time and resources spent in finding problematic data
- Gain more control over the maintenance and management of data
- Catalog and analyze metadata

The various steps in profiling include:
- Metadata analysis
- Outlier detection
- Data validation
- Pattern analysis
- Relationship discovery
- Statistical analysis
- Business rule validation

Data Profiling (cont)


Metadata profiling
Typical types of metadata profiling are:
- Domain: conformance of data in a column to the defined value or range
- Type: alphabetic or numeric
- Pattern: the proper pattern
- Frequency counts
- Interdependencies:
  - Within a table
  - Between tables

Data profiling analysis
- Column profiling
- Dependency profiling
- Redundancy profiling
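A minimal column-profiling sketch in Python shows how the type, pattern, frequency and domain checks might be computed for one column. The function name, the report keys and the sample postal codes are assumptions for illustration only.

```python
import re
from collections import Counter


def profile_column(values, pattern, allowed=None):
    """Profile one column: inferred type, pattern matches, frequencies, domain."""
    non_empty = [v for v in values if v != ""]
    return {
        # Type: "numeric" if every non-empty value consists of digits, else "text".
        "type": "numeric" if all(v.isdigit() for v in non_empty) else "text",
        # Pattern: values that do not match the expected format.
        "pattern_violations": [
            v for v in non_empty if not re.fullmatch(pattern, v)
        ],
        # Frequency counts, most common first.
        "frequencies": Counter(non_empty).most_common(),
        # Domain: values outside the allowed set, if one is given.
        "domain_violations": [
            v for v in non_empty if allowed is not None and v not in allowed
        ],
        # Number of empty values in the column.
        "missing": len(values) - len(non_empty),
    }


# Hypothetical column of postal codes expected to be six digits.
pin_codes = ["560001", "560001", "56001", "", "110001"]
report = profile_column(pin_codes, r"\d{6}")
```

Dependency and redundancy profiling would extend the same idea across columns and tables rather than within one column.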

Answer a Quick Question

According to your understanding, what is data quality and why is it important?

Data Quality
- Correcting, standardizing and validating the information
- Creating business rules to correct, standardize and validate your data
- High-quality data is essential to successful business operations

Data Quality (cont)


Data quality helps you to:
- Plan and prioritize data
- Parse data
- Standardize, correct and normalize data
- Verify and validate data accuracy
- Apply business rules

Standardize and Transform Data
The three components that ensure the quality and integrity of the data:
- Data rationalization
- Data standardization
- Data transformation
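Standardization followed by de-duplication can be sketched as below. The record layout, the standardization rules and the choice of email address as the duplicate key are all assumptions; real matching rules are usually more elaborate.

```python
def standardize(record):
    """Standardize one customer record: trim, fix case, keep digits in phone."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }


def deduplicate(records):
    """Keep the first record seen for each standardized email address."""
    seen = set()
    unique = []
    for record in map(standardize, records):
        if record["email"] not in seen:
            seen.add(record["email"])
            unique.append(record)
    return unique


# Hypothetical raw records with inconsistent formatting.
raw = [
    {"name": "  asha   rao", "email": "Asha@Example.com ", "phone": "080-1234 5678"},
    {"name": "Asha Rao", "email": "asha@example.com", "phone": "08012345678"},
    {"name": "vikram shah", "email": "vikram@example.com", "phone": "(080) 8765-4321"},
]
clean = deduplicate(raw)
```

The first two records only become recognizable as duplicates after standardization, which is why the order of the two steps matters.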

Answer a Quick Question

What do you think are the major causes of bad data quality?


Causes of Bad Data Quality

[Figure: causes of bad data quality, arranged around the label "Effect of Bad Quality"]

During the process of extraction:
- Initial conversion of data
- Consolidation of systems
- Manual data entry
- Batch feeds
- Real-time interfaces

Data decay during loading and archiving:
- Changes not captured
- System upgrades
- Use of new data
- Loss of expertise

During data transformations:
- Processing data
- Data scrubbing
- Automation process
- Data purging

Data Quality in Data Integration


Building a unified view of the database from the information.
An effective data integration strategy can lower costs and improve productivity by ensuring the consistency, accuracy and reliability of data.
Data integration makes it possible to:
- Match, link and consolidate multiple data sources
- Gain access to the right data sources at the right time
- Deliver high-quality information
- Increase the quality of information

Data Quality in Data Integration


Understand Corporate Information Anywhere in the Enterprise
Data integration involves combining processes and technology to ensure that effective use can be made of the data.
Data integration can include:
- Data movement
- Data linking and matching
- Data householding

Popular ETL Tools


ETL Tools
An ETL process can be created using a programming language.

Open source ETL framework tools:
- Clover.ETL
- Enhydra Octopus
- Pentaho Data Integration (also known as Kettle)
- Talend Open Studio

Popular ETL tools:
- Ab Initio
- Business Objects Data Integrator
- Informatica
- SQL Server 2005/08 Integration Services

Summary please

Ask a few participants of the learning program to summarize the lecture.
