Week-3 Schema Matching and Mapping
– Hard to answer some of these questions until you have acquired some data
from the sources
Data Acquisition
Next, the data must be acquired from the sources
– This is highly non-trivial
Global Schema
Schema matching: given two input schemas in any data model and, optionally,
auxiliary information, compute a mapping between schema elements of the two
input schemas that passes user validation.
Source-1
Books
  ISBN char(15) key
  Title varchar(100)
  Author varchar(50)
  MarkedPrice float
Source-2
BookInfo
  ID char(15) Primary Key
  AuthorID integer references AuthorInfo
  BookTitle varchar(150)
  ListPrice float
  DiscountPrice float
AuthorInfo
  AuthorID integer key
  LastName varchar(25)
  FirstName varchar(25)
Mappings
  π_{ISBN, Title, MarkedPrice}(Books) = π_{ID, BookTitle, ListPrice}(BookInfo)
  π_{Author}(Books) = π_{FirstName+LastName}(AuthorInfo)
• Attribute names
  ◦ Synonyms: Code = Id = Num = No
  ◦ Acronyms: PO = Purchase Order
• Data instances
  ◦ Attributes match if they have similar instances or value distributions
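The two kinds of evidence above (attribute names and data instances) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym and acronym tables are hypothetical and hold just the examples from the slide, and Jaccard overlap of value sets is one simple stand-in for "similar instances", not a method prescribed here.

```python
# Hypothetical lookup tables, seeded only with the examples from the slide.
SYNONYMS = [{"code", "id", "num", "no"}]
ACRONYMS = {"po": "purchase order"}


def normalize(name):
    """Lower-case a name and expand it if it is a known acronym."""
    name = name.strip().lower()
    return ACRONYMS.get(name, name)


def names_match(a, b):
    """Name-based evidence: equal after normalization, or listed as synonyms."""
    a, b = normalize(a), normalize(b)
    return a == b or any(a in group and b in group for group in SYNONYMS)


def instance_overlap(values_a, values_b):
    """Instance-based evidence: Jaccard overlap of the two value sets.

    Returns 0.0 when both columns are empty.
    """
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

For example, `names_match("Code", "Id")` holds through the synonym table, while two key columns such as Books.ISBN and BookInfo.ID would only be proposed as a match if their sampled values overlap strongly.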
Schema-based hybrid matching algorithm
Based on combining multiple approaches that use only schema (no instances)
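[Figure: two purchase-order schemas to be matched; element names as laid out in the figure, left schema vs. right schema]
PO          PurchaseOrder
POLines     Items
POShipTo    DeliverTo
Item        Item
Line        ItemNumber
UoM         UnitOfMeasure
Qty         Quantity
Other elements shown: Name, Address, Street, City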
Linguistic Matching
• Tokenization of names
  • PurchaseOrder → purchase + order
• Expansion of acronyms
  • UOM → unit + of + measure
• Linguistic similarity
  • Pair-wise comparison of elements that belong to the same cluster
  • Token similarity = f(string matching, synonyms score)
  • Token set similarity = average(best matching token similarity)
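The linguistic matching steps can be sketched as follows. This is a minimal illustration, not the algorithm's reference implementation: the acronym and synonym tables are hypothetical, the choice of f as the maximum of a `difflib` string ratio and a 0/1 synonym score is an assumption, and the symmetric averaging is one possible reading of "average (best matching token similarity)".

```python
import re
from difflib import SequenceMatcher

# Hypothetical tables for illustration only.
ACRONYMS = {
    "uom": ["unit", "of", "measure"],
    "po": ["purchase", "order"],
    "qty": ["quantity"],
}
SYNONYMS = {frozenset({"ship", "deliver"})}

# Splits CamelCase names; keeps runs of capitals such as "PO" together.
TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def tokenize(name):
    """Tokenize an element name and expand known acronyms."""
    if name.lower() in ACRONYMS:          # whole name is an acronym, e.g. UoM
        return list(ACRONYMS[name.lower()])
    tokens = []
    for tok in TOKEN_RE.findall(name):
        tok = tok.lower()
        tokens.extend(ACRONYMS.get(tok, [tok]))
    return tokens


def token_similarity(a, b):
    """f(string matching, synonym score), taken here as the maximum of the two."""
    synonym_score = 1.0 if a == b or frozenset({a, b}) in SYNONYMS else 0.0
    return max(SequenceMatcher(None, a, b).ratio(), synonym_score)


def name_similarity(name_a, name_b):
    """Token set similarity: average of each token's best match on the other side."""
    ta, tb = tokenize(name_a), tokenize(name_b)
    if not ta or not tb:
        return 0.0
    best_a = [max(token_similarity(x, y) for y in tb) for x in ta]
    best_b = [max(token_similarity(y, x) for x in ta) for y in tb]
    return (sum(best_a) + sum(best_b)) / (len(ta) + len(tb))
```

With these tables, `UoM` and `UnitOfMeasure` tokenize to the same three tokens and score 1.0, while `POShipTo` scores higher against `DeliverTo` than against `Quantity`.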
[He+, SIGMOD’03]: Build a mediated schema for a domain by clustering elements in
multiple schemas
Example sources: allcars.com, craigslist auto
Schema Mapping
• Global schema is defined in terms of the sources
(global-schema-centric, or Global-As-View (GAV))
• Query reformulation is easier
• Any change in the sources requires a change in the global schema
• Global relations cannot model any information
not present in at least one source
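As a rough illustration of GAV, a global relation can be written as a view (a query) over the source relations, so that a query on the global schema is answered by unfolding the view. The sketch below uses Python's built-in `sqlite3` with the Books, BookInfo and AuthorInfo schemas from the earlier example; the global relation `GlobalBook`, the sample rows, and the use of string concatenation for FirstName+LastName are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source-1
    CREATE TABLE Books (ISBN TEXT PRIMARY KEY, Title TEXT,
                        Author TEXT, MarkedPrice REAL);
    -- Source-2
    CREATE TABLE AuthorInfo (AuthorID INTEGER PRIMARY KEY,
                             LastName TEXT, FirstName TEXT);
    CREATE TABLE BookInfo (ID TEXT PRIMARY KEY,
                           AuthorID INTEGER REFERENCES AuthorInfo,
                           BookTitle TEXT, ListPrice REAL, DiscountPrice REAL);

    -- Made-up sample rows.
    INSERT INTO Books VALUES ('111', 'Data Integration', 'Ann Lee', 40.0);
    INSERT INTO AuthorInfo VALUES (1, 'Roy', 'Bo');
    INSERT INTO BookInfo VALUES ('222', 1, 'Schema Matching', 55.0, 50.0);

    -- GAV: the (hypothetical) global relation is defined over the sources.
    CREATE VIEW GlobalBook AS
        SELECT ISBN AS ISBN, Title AS Title,
               Author AS Author, MarkedPrice AS Price
        FROM Books
        UNION
        SELECT b.ID, b.BookTitle, a.FirstName || ' ' || a.LastName, b.ListPrice
        FROM BookInfo b JOIN AuthorInfo a ON a.AuthorID = b.AuthorID;
""")

# A query over the global schema; the engine unfolds the view into source queries.
rows = conn.execute(
    "SELECT ISBN, Author FROM GlobalBook WHERE Price > 30 ORDER BY ISBN"
).fetchall()
print(rows)
```

Note that adding a third source would mean editing the definition of `GlobalBook`, which is the maintenance cost listed in the bullets above.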
a. Design the global schema of BookStore.com where (1) the client can create his/her
profile (Name, Home#, St-Name, Area, City, PIN, Mobile), (2) the client can view/search all
his/her current and past transactions (Order#, Date, ISBN, Book-Title, Price, Supplier-Name),
and (3) the client can search book availability by <title> and/or <ISBN> and can
also view the supplier name. (3)
b. Write the schema mapping between the global schema (obtained in part (a)) and the local
schemas using GAV. (2)
c. BookStore.com decides to show the name of the supplier selling the book at the lowest price
among all, that is, 1 record per ISBN unless more than one supplier has the same lowest price.
In that case, it displays the information of all such suppliers. For this constraint, design the schema and its
mapping to the global schema (obtained in part (a)).
Local-as-View (LAV)
GAV vs. LAV
GAV
• Not modular
  – Addition of new sources changes the mediated schema
• Can be awkward to write mediated schema without loss of information
LAV
• Modular: adding new sources is easy
• Very flexible: power of the entire query language available to describe
  sources (SQL/XQuery/XSLT/…)
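The modularity of LAV can be illustrated with a toy sketch in which each source is described independently as a view over a global relation. Everything here is hypothetical: the global relation `Book`, the `SourceDescription` class, the source named `NewCatalog`, and the attribute-coverage check, which is a deliberately naive stand-in for answering queries using views rather than a real rewriting algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDescription:
    """LAV-style description: the source is a projection of one global relation."""
    name: str
    global_relation: str
    attributes: frozenset  # global attributes the source exposes


def relevant_sources(descriptions, relation, wanted):
    """Names of sources whose description covers every requested attribute."""
    wanted = frozenset(wanted)
    return [
        d.name
        for d in descriptions
        if d.global_relation == relation and wanted <= d.attributes
    ]


descriptions = [
    SourceDescription("Books", "Book", frozenset({"ISBN", "Title", "Author", "Price"})),
    SourceDescription("BookInfo", "Book", frozenset({"ISBN", "Title", "Price"})),
]
before = relevant_sources(descriptions, "Book", {"ISBN", "Title"})

# Adding a source is one new description; the existing ones and the
# global schema are left untouched.
descriptions.append(SourceDescription("NewCatalog", "Book", frozenset({"ISBN", "Title"})))
after = relevant_sources(descriptions, "Book", {"ISBN", "Title"})
```

In GAV the same change would require rewriting the definition of the global relation; here only the list of descriptions grows, at the price of a harder reformulation step in a real system.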