
W3: Schema Matching and Mapping


[Figure: Information Integration (II) architecture, the virtualization-layer approach]
[Figure: Information Integration (II) architecture, the data-warehousing approach]
Information Integration Key Challenges
• Managing different platforms:
– Identifying relevant information from multiple data sources
– Logical specification of the data desired
– Handling dynamic arrival and departure of data sources
• Automated data transformations:
– Data curation
– Defining and working with data quality
• Schema and data heterogeneity:
– Integrating diverse information from the recorded state of the business within cost and skill constraints
– Schema mapping, data mapping, information discovery, …
• Uniform (or source-specific) query access to data sources
• Distributed query processing and optimization
• Consolidating, transforming, and mining data for analysis

Can AI Help in the Information Integration Cycle?

• Reduce the effort needed to set up data curation and integration tasks.
• Enable the system to perform gracefully under uncertainty (e.g., on the web, with noisy sources, …).
Steps Involved in Data Integration:
Data Source Selection, Data Acquisition, Understanding, Cleansing, Transforming
Data Source Selection
 Once you have the set of questions for querying the integrated data, the next step is to identify the data sources.
 You must evaluate the data sources to know which to select:
– Is it complete? In terms of coverage? In terms of missing data?
– Is it correct?
– Is it clean?
– Is it structured? If not, can you extract structured data out of it easily and accurately, or expose the text document as an attribute?
– Is it up to date? How quickly is it changing?
– Some of these questions are hard to answer until you have acquired some data from the sources.
Data Acquisition
 Then you need to acquire the data from the sources.
– This is highly non-trivial.
– Types of data sources:
 Structured data sources: relational databases, XML files, …
 Textual data: emails, memos, documents, etc.
 Web data: need to crawl; maybe you can get a dump; an API may exist
 Other types of data: Excel, PDF, images, etc.
– Being able to extract data from such sources is non-trivial and time-consuming.
– Build connectors, wrappers, …


Data Acquisition
 Then you need to acquire the data from the sources.
– Some of the data comes from within the company:
 Need to talk with the data owner, convince him/her, and get help.
 Acquisition can take months due to legal and compliance reasons.
– Some of the data comes from outside the company:
 Public data, open-source data, paid data
 Pros: clean, quick. Cons: untrustworthy, noisy, expensive.
Understanding, Cleaning, & Transformation
Do These for Each Source, then Integrate
 For data from each source:
– Identify data problems:
 missing values
 incorrect values, illegal values, outliers
 synonyms
 misspellings
 conflicting data (e.g., age and birth year)
 wrong value formats
 variations of values
 duplicate tuples
– Understand the current vs. the ideal schema/data:
 attribute-value profiling, relationships between attributes, integrity constraints, …
 tools exist for data profiling, relationship discovery, … (a minimal sketch follows this list)
– Compare the two and identify possible problems:
 violations of constraints for the ideal schema/data
– Clean and transform
– Possibly enrich/enhance
 Integrate data from the multiple sources:
– schema matching/merging, data matching/merging
– misspelled names
– violated constraints (key, uniqueness, foreign key, etc.)
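As a rough illustration of attribute-value profiling, the sketch below computes a few per-attribute counts that flag the problems listed above. It uses pandas; the sample relation, its column names, and the reference year are hypothetical, not from the lecture.

import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One summary row per attribute, flagging common data problems."""
    return pd.DataFrame({
        "missing_values": df.isna().sum(),    # missing values
        "distinct_values": df.nunique(),      # variations / candidate keys
    })

# Hypothetical source data with a missing value, a duplicate tuple,
# and conflicting age / birth-year information.
people = pd.DataFrame({
    "age": [34, None, 34, 50],
    "birth_year": [1990, 1985, 1990, 2010],
})
print(profile(people))
print("duplicate tuples:", people.duplicated().sum())
# Conflicting data check (age vs. birth year), with a one-year tolerance.
print(((2024 - people["birth_year"]) - people["age"]).abs() > 1)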
The Schema Matching Problem

Given two input schemas in any data model and, optionally, auxiliary information, compute a mapping between schema elements of the two input schemas that passes user validation.

Source-1:
Books(ISBN char(15) key, Title varchar(100), Author varchar(50), MarkedPrice float)

Source-2:
BookInfo(ID char(15) primary key, AuthorID integer references AuthorInfo, BookTitle varchar(150), ListPrice float, DiscountPrice float)
AuthorInfo(AuthorID integer key, LastName varchar(25), FirstName varchar(25))

Matches, expressed over the two schemas:
πISBN,Title,MarkedPrice(Books) = πID,BookTitle,ListPrice(BookInfo)
πAuthor(Books) = πFirstName+LastName(AuthorInfo)
Books = πID,BookTitle,FirstName+LastName,ListPrice(BookInfo ⋈ AuthorInfo)
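To make the last correspondence concrete, here is a minimal sketch of it as an executable transformation in pandas; the row values are invented for illustration.

import pandas as pd

# Hypothetical instances of the Source-2 relations.
book_info = pd.DataFrame(
    [("0262033844", 1, "Introduction to Algorithms", 99.0, 79.0)],
    columns=["ID", "AuthorID", "BookTitle", "ListPrice", "DiscountPrice"])
author_info = pd.DataFrame(
    [(1, "Cormen", "Thomas")],
    columns=["AuthorID", "LastName", "FirstName"])

# Books = πID,BookTitle,FirstName+LastName,ListPrice(BookInfo ⋈ AuthorInfo)
joined = book_info.merge(author_info, on="AuthorID")   # the natural join
books = pd.DataFrame({
    "ISBN": joined["ID"],
    "Title": joined["BookTitle"],
    "Author": joined["FirstName"] + " " + joined["LastName"],
    "MarkedPrice": joined["ListPrice"],
})
print(books)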


Inputs to Matching Technique

• Schema structure
• Attribute names
• Constraints: data type, keys, nullability
• Synonyms:
◦ Code = Id = Num = No
◦ Zip = PIN = Postal [code]
◦ Node = Server
• Acronyms:
◦ PO = Purchase Order
◦ UOM = Unit of Measure
◦ SS# = Social Security Number
• Data instances: attributes match if they have similar instances or value distributions (see the sketch below)
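The instance-based signal can be approximated very simply: two attributes whose value sets overlap strongly are match candidates. The sketch below uses Jaccard similarity over value sets; the sample values are illustrative assumptions.

def jaccard(values_a, values_b) -> float:
    """Overlap of two attributes' value sets (1.0 = identical sets)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical instances of a "Zip" column and a "PIN" column.
zip_values = ["110001", "400001", "560001"]
pin_values = ["110001", "560001", "700001"]
print(jaccard(zip_values, pin_values))   # 0.5: plausible match candidates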
Schema-Based Hybrid Matching Algorithm
Based on combining multiple approaches that use only the schema (no instances).

Input: two schema graphs
Output: similarity matrix and candidate mapping

• Linguistic matching: compare attributes based on names, data types, etc. Use a thesaurus to help match names by identifying short forms (Qty for Quantity), acronyms (UoM for UnitOfMeasure), and synonyms (Bill and Invoice). The result is a linguistic similarity coefficient, Lsim, between each pair of elements.

• Structure matching: compare elements based on the similarity of their contexts or vicinities. The result is a structural similarity coefficient, Ssim, for each pair of elements.

• Compute the weighted similarity: Wsim = w * Lsim + (1 – w) * Ssim

• Mapping generation: a mapping is created by choosing pairs of schema elements with maximal weighted similarity.
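A minimal sketch of the last two steps, assuming Lsim and Ssim have already been computed as dictionaries keyed by (source element, target element); the weight w and the acceptance threshold are illustrative choices, not prescribed by the lecture.

def generate_mapping(lsim, ssim, w=0.5, threshold=0.6):
    """Combine coefficients (Wsim = w*Lsim + (1-w)*Ssim), then greedily
    pick, for each source element, the target with maximal Wsim."""
    wsim = {pair: w * lsim[pair] + (1 - w) * ssim[pair] for pair in lsim}
    mapping = {}
    for (src, tgt), score in sorted(wsim.items(), key=lambda kv: -kv[1]):
        if score >= threshold and src not in mapping:
            mapping[src] = tgt
    return mapping

lsim = {("Qty", "Quantity"): 0.9, ("Qty", "UnitOfMeasure"): 0.2}
ssim = {("Qty", "Quantity"): 0.8, ("Qty", "UnitOfMeasure"): 0.4}
print(generate_mapping(lsim, ssim))   # {'Qty': 'Quantity'}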
Example

[Figure: two purchase-order schema trees with corresponding elements: PO ↔ PurchaseOrder, POLines ↔ Items, POShipTo ↔ DeliverTo, Item ↔ Item, Line ↔ ItemNumber, UoM ↔ UnitOfMeasure, Qty ↔ Quantity, plus address elements (Name, City, Street) on both sides.]
Linguistic Matching

• Tokenization of names
◦ PurchaseOrder → purchase + order
• Expansion of acronyms
◦ UOM → unit + of + measure
• Clustering based on keywords and data types
◦ Street, City, POAddress → Address
• Linguistic similarity (sketched below)
◦ Pair-wise comparison of elements that belong to the same cluster
◦ Token similarity = f(string matching, synonym score)
◦ Token-set similarity = average (best-matching token similarity)
• Thesaurus: acronyms, synonyms, stop words, and categories
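A small sketch of the tokenization and token-set similarity steps. The thesaurus entries are the examples named above; using difflib for string matching is an assumption, not part of the lecture.

import re
from difflib import SequenceMatcher

# Tiny thesaurus of acronym expansions (from the examples above).
ACRONYMS = {"uom": ["unit", "of", "measure"], "po": ["purchase", "order"]}

def tokenize(name: str) -> list[str]:
    """PurchaseOrder -> ['purchase', 'order']; known acronyms expand."""
    if name.lower() in ACRONYMS:
        return list(ACRONYMS[name.lower()])
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)]

def token_set_sim(a: str, b: str) -> float:
    """Average, over a's tokens, of the best-matching token in b."""
    ta, tb = tokenize(a), tokenize(b)
    return sum(max(SequenceMatcher(None, x, y).ratio() for y in tb) for x in ta) / len(ta)

print(token_set_sim("UoM", "UnitOfMeasure"))   # 1.0 via acronym expansion
print(token_set_sim("Qty", "Quantity"))        # partial string match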


Structure Matching

[Figure: the same pair of purchase-order schema trees as in the example above.]

Tree Match Algorithm (Bottom-up Traversal)

• Atomic elements (leaves) are similar if
◦ they are linguistically and data-type similar, and
◦ their contexts, i.e., their ancestors, are similar.
• Compound elements (non-leaves) are similar if
◦ they are linguistically similar, and
◦ the elements in their contexts, i.e., the subtrees rooted at the elements, are similar.
• This is a mutually dependent formulation:
◦ leaves determine internal-node similarity, and
◦ similarity of internal nodes leads to an increase in leaf similarity.
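A bottom-up sketch of one direction of this: an internal node's similarity is derived from the leaves beneath it, with leaf-level linguistic similarity (lsim) assumed precomputed. The mutual refinement back from internal nodes to leaves is omitted, and the tree fragments and 0.8 cut-off are illustrative assumptions.

def subtree_leaves(node, children):
    """All leaf elements in the subtree rooted at `node`."""
    kids = children.get(node, [])
    return [node] if not kids else [l for k in kids for l in subtree_leaves(k, children)]

def ssim(n1, tree1, n2, tree2, lsim):
    """Fraction of n1's leaves with a strong linguistic counterpart in n2's subtree."""
    l1, l2 = subtree_leaves(n1, tree1), subtree_leaves(n2, tree2)
    return sum(1 for a in l1 if max(lsim(a, b) for b in l2) > 0.8) / len(l1)

# Hypothetical fragments of the two purchase-order schemas above.
tree1 = {"POShipTo": ["City", "Street"]}
tree2 = {"DeliverTo": ["Address"], "Address": ["City", "Street"]}
lsim = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
print(ssim("POShipTo", tree1, "DeliverTo", tree2, lsim))   # 1.0: contexts match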
Collective Schema Matching

[He+, SIGMOD'03]: Build a mediated schema for a domain by clustering elements across multiple schemas.

[Figure: elements from source schemas such as allcars.com and craigslist auto clustered into mediated-schema attributes.]
Schema Mapping

• Global schema defined in terms of the sources (global-schema-centric, or Global-As-View (GAV)):
– Query reformulation is easier.
– Any change in the sources requires a change in the global schema.
– Global relations cannot model any information not present in at least one source.

• Sources defined in terms of the global schema (source-centric, or Local-As-View (LAV)):
– High modularity and extensibility (if the global schema is well designed, when a source changes, only its definition is affected).
– Query reformulation is complex.
– It allows adding a source to the system independently of other sources.
Example

Example taken from Dr. Subbarao Kambhampati’s lecture notes.


Global-as-View (GAV)
Q1: BookStore.com is an aggregator that provides a single-stop solution to its clients for buying books. The business has agreements with three suppliers, and the schema of each supplier is given below. Note that each supplier may have a different price for the same book. The primary key in each schema is underlined.

Supplier1(ISBN, Book_Name, Book_Price)
Supplier2(ISBN, BTitle, BPrice)
Supplier3(ISBN, Book_Title, Price)

a. Design the global schema of BookStore.com where (1) the client can create his/her profile (Name, Home#, St-Name, Area, City, PIN, Mobile), (2) the client can view/search all his/her current and past transactions (Order#, Date, ISBN, Book-Title, Price, Supplier-Name), and (3) the client can search book availability by <title> and/or <ISBN> and can also view the supplier name. (3)

b. Write the schema mapping between the global schema (obtained in part (a)) and the local schemas using GAV. (2)

c. BookStore.com decides to show the supplier name selling the book at the lowest price among all, that is, one record per ISBN unless more than one supplier has the same lowest price. In this case, it displays the information of all such suppliers. For this constraint, design the schema and its mapping to the global schema (obtained in part (a)).
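As a hedged sketch of the shape of part (b), not a model answer, assume a simplified global relation Catalog(ISBN, Title, Price, Supplier-Name). Under GAV the global relation is defined as a query, here a union, over the sources; the lowest-price requirement of part (c) then becomes a selection over it.

def catalog(supplier1, supplier2, supplier3):
    """GAV: the global Catalog relation is defined as a union over
    Supplier1(ISBN, Book_Name, Book_Price), Supplier2(ISBN, BTitle, BPrice),
    and Supplier3(ISBN, Book_Title, Price)."""
    return ([(i, t, p, "Supplier1") for i, t, p in supplier1]
            + [(i, t, p, "Supplier2") for i, t, p in supplier2]
            + [(i, t, p, "Supplier3") for i, t, p in supplier3])

def best_offers(cat):
    """Part (c): keep only the lowest-priced offer(s) per ISBN; ties keep all."""
    lowest = {}
    for isbn, _title, price, _supplier in cat:
        lowest[isbn] = min(lowest.get(isbn, price), price)
    return [row for row in cat if row[2] == lowest[row[0]]]

cat = catalog([("1", "DB", 50.0)], [("1", "DB", 45.0)], [("1", "DB", 45.0)])
print(best_offers(cat))   # Supplier2 and Supplier3 tie at the lowest price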
Local-as-View (LAV)
GAV vs. LAV

GAV:
• Not modular: the addition of new sources changes the mediated schema.
• Can be awkward to write the mediated schema without loss of information.
• Query reformulation is easy: it reduces to view unfolding (polynomial).
• Can build hierarchies of mediated schemas.
• Best when there are few, stable data sources that are well known to the mediator (e.g., corporate integration).
• Systems: Garlic, TSIMMIS, HERMES.

LAV:
• Modular: adding new sources is easy.
• Very flexible: the power of the entire query language is available to describe sources.
• Reformulation is hard: it involves answering queries using only views.
• Best when there are many, relatively unknown data sources, with the possibility of addition/deletion of sources.
• Systems: Information Manifold, InfoMaster, Emerac, Havasu.
Clio: Schema Discovery and Mapping for Integration

Clio Grows Up: From Research Prototype to Industrial Tool

[Figure: CLIO architecture; it generates transformation queries in SQL/XQuery/XSLT/….]
