How to Clean data Less through
Linked (Open Data) Approach
Andrea Wei-Ching Huang
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Dec. 7 2015 @ IIS R101
1. Data Quality: data, metadata, linked data
2. The case of 840,000 cc licensed data
3. How can the LOD approach improve data quality?
1. Data Quality:
data, metadata, linked data
Information Quality Data Quality Metadata Quality Linked Data Quality Data Quality Vocabulary
Stvilia et al.(2007):
22 dimensions
Batini et al. (2009):
28 dimensions
Tani et al. (2013):
10 parameters
Zaveri et al. (2016):
18 dimensions
W3C (2015):
10 dimensions
Naturalness (I) Interoperability (RP) Statistics
Accessibility (R) Accessibility Accessibility Availability (A) Availability
Accuracy (R) Accuracy Accuracy (S) Semantic Accuracy (I) Accuracy
Accuracy/Validity (I) Applicability Pertinence Syntactic Validity (I)
Appropriate amount of data
Complexity (R) Clarity
Precision/Completeness(R) Completeness Completeness(S) Completeness (I) Completeness
Informativeness/Redundancy(R) Comprehensiveness Understandability (C)
Informativeness/Redundancy(I) Conciseness Conciseness (I)
Structural Consistency (I) Consistency Similarity Consistency (I) Consistency
Convenience
Structural Consistency(R) Correctness
Verifiability (R) Credibility Trustworthiness (C) Credibility
Currency (I) Currency
Semantic Consistency(I) Derivation Integrity
Ease of operation Processability
Naturalness (R) Interactivity Conformance(S) Interlinking (A) Conformance
Semantic Consistency(R) Interpretability Interpretability (RP)
Precision/Completeness(I) Maintainability Preservability
Complexity(I) Objectivity
Relevance/ Aboutness(R) Relevancy Relevance Relevancy (C) Relevance
Authority (Reputational) Reputation
Security(R) Security Security (A)
Speed Performance (A)
Timeliness Timeliness Timeliness (C) Timeliness
Traceability RP Conciseness (RP)
Cohesiveness (I) Uniqueness Significance
Usability Licensing (A)
Volatility(R) Volatility
Versatility (RP)
(I): Intrinsic; (R): Relational; (S): Metadata Spec.; (RP): Representational; (A):Accessibility; (C): Contextual
1. Accessibility/Availability (可取得性)
2. Accuracy (正確性)
3. Completeness (完整性)
4. Consistency (一致性)
5. Credibility/Trustworthiness (可信度)
6. Relevance (相關性)
7. Timeliness (適時性)
7 dimensions/parameters are common ground
Quantitative and qualitative methodologies are used together.
Metadata Quality: Problems & Solutions (1)
Record Problems
Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62
Metadata Quality: Problems & Solutions (2)
Dublin Core Semantic Problems
Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of
information professionals." Journal of Information Science 35.6 (2009): 727-739.
• Type is a subjective value.
• Source is a confusing field. It is difficult to apply it consistently.
• Creator can be very varied and it can be tricky determining exactly who the
creator is.
• The information from the publisher is vague.
• Cannot define the different roles of a contributor.
• There is often great ambiguity in terms of Type and Relation.
• between Format and Type.
• between Creator, Publisher, and Contributor.
• between Source and Relation.
• The high degree of difficulty (55.3%) engendered by the Relation field
o discernment of the dynamic and interrelated nature of
information objects presents challenges in using the Relation
element.
Metadata Quality: Problems & Solutions (3)
Current Solutions
Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts."
Information Processing & Management 49.6 (2013): 1194-1205.
Tani et al. (2013): Summary of metadata quality approaches.
-------------------------------------------------------------------------------------------------------------------------------------------
Metadata guidelines, standards and Application Profiles
• Pros: potentially effective; if shared among organizations, they promote cross-organization interoperability
• Cons: challenging to agree between different organizations; often end up being complex combinations of features
reflecting the interests of many disparate parties; they infringe the autonomy of the entities adopting them
Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented)
• Pros: helpful to identify specific problems
• Cons: based on community-specific criteria
Semi-automatic metadata generation approaches
• Pros: helpful to deal with the data deluge
• Cons: human assessment
Metadata cleaning, enhancement, augmentation approaches
• Pros: fundamental to enable cross-community exploitation of metadata
• Cons: information loss; information inconsistency
-------------------------------------------------------------------------------------------------------------------------------------------
2. The case of 840,000 cc licensed data
In Union Catalogue of Digital Archives Taiwan
“Fitness for Use” is the Key:
Data Quality (DQ) Definition for Digital Data
• Nicholas R. Chrisman (1986):
“Digital data can adapt to a broader range of
uses with a broader range of special demand,
…The root of data abuse is not in the quality
of the data, but in the awareness and
understanding of the quality of the data. By
converting to the fitness for use approach,
the problem of data abuse is moved from
producer to consumer (data user).”
• W3C Data Quality Vocabulary (2015):
“...quality lies in the eye of the beholder; that
there is no objective, ideal definition of it.
Some datasets will be judged as low-quality
resources by some data consumers, while
they will perfectly fit others' needs.”
• Quality from the perspectives of the supply
and demand sides:
e.g., Data Publishers, Certification
Agencies, Data Aggregators and Data
Consumers.
• Pragmatic
• User-specific
• Context-dependent
physical object digital object digital collection digital aggregation & publication reusing & semantic representation
Creation Conversion 1 Conversion 2 Conversion 3 Conversion 4 Clean & Enrich Conversion 5
Local Curation (90 projects)
DC 15 elements as the requirement
for Union Catalog
Locally developed schemes
Digital Archive Curation (1 portal)
XML HTMLTEXT/Image XLSX/Table/HTML CSV CSV RDF/Turtle
Linked Open Data (globally linked & semantically represented)
Globally linked, machine accessible
semantics & domain knowledge vocabularies
are needed for LOD.
CONTEXT I CONTEXT II CONTEXT III
“Fitness for Use” in different contexts:
physical object digital object digital collection digital aggregation & publication reusing & semantic representation
Digital Archive Curation (1 portal)
XML HTMLTEXT/Image XLSX/Table/HTML CSV CSV RDF/Turtle
Linked Open Data (globally linked & semantically represented)
Provide metadata guidelines
& standard (DC 15)
Metadata Generation
Local Curation (90 projects)
Data
Quality
Data
Quality
Data
Quality
Linked Data Generation
Metadata evaluation
approaches
Semi-automatic metadata
generation approaches
Metadata
cleaning,
enhancement,
augmentation
approaches
Information Loss ?
Interpretation Problems?
Time & Resource Cost?
Problems identified in the case of 840,000 cc data
1. Confusion of Dublin Core (DC 定義混淆)
2. Name Ambiguity (名稱模糊)
3. Inconsistent Encoding (編碼不一致)
4. Semantic Overlaps (語意超載)
5. Duplicate Records (資料重複)
6. Insufficient Element Usage (語意缺漏)
7. Errors / Mistakes / Others (其它錯誤)
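Several of the problems listed above (duplicate records, inconsistent encoding, insufficient element usage) can be flagged mechanically before anyone decides how to fix them. The following is a minimal sketch, not the project's actual tooling: the sample rows, the column names and the choice of ISO 8601 as the expected date encoding are all assumptions made for illustration. It only reports problems and never changes the rows.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample in a Dublin Core-like layout; not taken from the catalogue.
SAMPLE_CSV = """id,title,creator,date,coverage
1,Fish specimen,Lin,2012-03-01,
2,Fish specimen,Lin,2012-03-01,
3,Temple photo,,1998/5/7,Tainan
"""

# Assumed target encoding for dates (ISO 8601 calendar date).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def audit(rows, key_fields=("title", "creator", "date")):
    """Report duplicates, non-ISO dates and empty elements without changing rows."""
    report = {"duplicates": [], "non_iso_dates": [], "empty_elements": Counter()}
    seen = {}
    for row in rows:
        # Duplicate Records: same normalized values in the key fields.
        key = tuple(row[f].strip().lower() for f in key_fields)
        if key in seen:
            report["duplicates"].append((seen[key], row["id"]))
        else:
            seen[key] = row["id"]
        # Inconsistent Encoding: a date is present but not in the assumed format.
        if row["date"] and not ISO_DATE.match(row["date"]):
            report["non_iso_dates"].append(row["id"])
        # Insufficient Element Usage: count empty values per element.
        for field, value in row.items():
            if not value.strip():
                report["empty_elements"][field] += 1
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit(rows)
print(report)
```

Problems such as name ambiguity or confusion between Dublin Core elements need human or reference-data judgment and are outside what this kind of check can detect.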
Considerations in the case of 840,000 cc data for LOD
1. We are not data creators. Can we
clean/revise the data “correctly”?
• Keep original CSV data open.
• Publish revised/cleaned data as diff/mapping files.
2. How can we prevent “information loss”?
• Mapping activities often result in information loss.
• Reconsider the value of broken links.
3. Limited resources and time to handle the
cleaning tasks.
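The idea of keeping the raw data open and publishing corrections as separate diff/mapping files can be sketched as an overlay: the raw records are never edited, and each correction names the value it expects to replace. This is an illustrative sketch only. The mapping-file layout (`id`, `field`, `old`, `new`, `by`) and the record values are hypothetical; only the item number and species name appear in the slides.

```python
import copy

# Raw records as published; treated as read-only. Values are hypothetical.
RAW = {
    "6277845": {"title": "Parapercis kentingensis", "date": "2012/3", "coverage": ""},
}

# A separate mapping file: one entry per proposed correction, with who proposed it.
MAPPINGS = [
    {"id": "6277845", "field": "date", "old": "2012/3", "new": "2012-03", "by": "curator-A"},
    {"id": "6277845", "field": "title", "old": "wrong value", "new": "x", "by": "curator-B"},
]


def apply_mappings(raw, mappings):
    """Return a cleaned view plus the mappings that do not match the raw data."""
    view = copy.deepcopy(raw)  # the raw data itself is never modified
    stale = []
    for m in mappings:
        record = view.get(m["id"])
        # Apply a correction only if the raw value is what the mapping expects.
        if record is not None and raw[m["id"]].get(m["field"]) == m["old"]:
            record[m["field"]] = m["new"]
        else:
            stale.append(m)
    return view, stale


cleaned, stale = apply_mappings(RAW, MAPPINGS)
print(cleaned["6277845"]["date"], RAW["6277845"]["date"], len(stale))
```

Because each correction is a separate, attributed statement, a disputed "fix" can be dropped or replaced later without any loss of the original value.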
3. How can the Linked (Open Data) approach
improve data quality?
1. Raw data and new data (cleaned data, semantically refined
data) can benefit from the Open Data approach:
• Creation of new data based on combining data.
• External quality checks of data (validation).
• Sustainability of data (no data loss).
• The ability to merge, integrate and mesh public
and private data.
Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government."
Information Systems Management 29.4 (2012): 258-268.
2. Using SPARQL queries to identify problems:
Identify DQ problems before RDF is generated:
• Use the W3C mapping language R2RML and the RDF validation
framework RDFUnit on the mapping definitions, allowing
publishers to catch and correct violations before they even
happen (Dimou et al., 2015).
Identify DQ problems after RDF is generated:
• Use SPARQL and publicly shared LOD resources (e.g., DBpedia,
GeoNames) as references to identify problems (Fürber and
Hepp, 2010).
Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge
Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic
Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
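The "after RDF is generated" check, comparing values against a trusted external reference, can be illustrated without a triple store. The sketch below uses plain Python tuples and a small pattern matcher as a stand-in for a SPARQL query, so it is not SPARQL itself and not the method of the cited papers. All item identifiers, place names and the tiny reference list are hypothetical.

```python
# Triples as (subject, predicate, object) tuples; all values are hypothetical.
DATA = [
    ("item:1", "dc:coverage", "Kenting"),
    ("item:2", "dc:coverage", "Kentin"),  # misspelled place name
    ("item:3", "dc:coverage", "Tainan"),
]

# Stand-in for a trusted external source such as a gazetteer.
REFERENCE = [
    ("place:kenting", "rdfs:label", "Kenting"),
    ("place:tainan", "rdfs:label", "Tainan"),
]


def match(triples, s=None, p=None, o=None):
    """Yield triples matching a pattern; None is a wildcard, like a SPARQL variable."""
    for t in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), t)):
            yield t


def unknown_places(data, reference):
    """Roughly: select items whose coverage value has no label in the reference."""
    known = {o for _, _, o in match(reference, p="rdfs:label")}
    return [(s, o) for s, _, o in match(data, p="dc:coverage") if o not in known]


violations = unknown_places(DATA, REFERENCE)
print(violations)
```

An exact-match lookup like this only flags candidates. A value missing from the reference may be an error in the data or a gap in the reference, which is why the result is a report rather than an automatic correction.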
The following five points are summarized from Fürber and Hepp (2013): "Using Semantic Web
Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin
Heidelberg, 2013. 141-161.
• Collaborative representation and use of quality-relevant knowledge
• Automatic identification of conflicting data requirements
• Semantic definition of data
• Use of Semantic Web data as trusted reference data
• Content integration with ontologies
3. Use vocabularies, ontologies & LOD knowledge bases:
To improve data quality at every step of a
dataset's lifecycle (e.g., W3C Data Quality Vocabulary).
To enrich data semantics and increase the value
of reused and refined data.
http://www.w3.org/TR/vocab-dqv/
The importance of
provenance and
metadata quality
(Carata, Lucian, et al.
2014).
The Story of A Fish
http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html
Parapercis kentingensis
http://URI of this Fish/6277845
2012
2015 2016
TEXT/Image
XLSX/Table/HTML
XML/HTML
CSV: (raw data published as open data)
6277845
(1) 12/15 triples (statements)
Metadata (DC 15)
(2) 12/15 triples (statements)
Provenance
wikidata
err
+ one “diff” triple
new
(3) Mapping replaces cleaning
+ one “time mapping” triple
time
new
err
At this stage, place information is described in Description rather
than in Coverage. It should be cleaned and mapped to
external resources such as GeoNames and TaiwanPlaceName, either by us
or by others, when time and resources are available.
(5) Once the raw CSV and the triples represented in DC 15
(DC 15 Version) are published, others can easily
detect errors in them, and reuse and enrich them according to
their own fitness for use and interpretations. Even
if there are errors from the beginning, the community
can thus generate more statements about this fish (6277845)
according to its own interests.
(4) Refined Version:
semantically enriched by using domain
vocabularies like Darwin Core Terms
prov r4r schema cc odw
1. Keep original CSV data open.
2. Clean less and map more: publish revised/cleaned data as
diff/mapping files.
3. Publish the original DC 15 statements as 15 triples and
provide provenance information.
4. Assign each item resource a URI.
5. Use domain vocabularies to enrich the resource (e.g., dwc).
6. Map and link to external databases to enrich
statements (GeoNames, TaiwanPlaceNames,
Encyclopedia of Life).
7. Third parties and crowdsourcing will state more errors
or meanings according to their own interests.
How we clean data less through Linked (Open Data) Approach
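Steps 3 and 4 above (publish each original statement as a triple, give the item a URI, and record where the statements came from) can be sketched by serializing one record as N-Triples. This is an illustration under stated assumptions, not the project's converter: the base URI, the source URL and the record values are hypothetical, and `prov:wasDerivedFrom` from the W3C PROV vocabulary is one possible way to express provenance. Empty elements are skipped, so a record with fewer filled elements yields fewer triples.

```python
# Hypothetical base URI; the namespaces are the Dublin Core elements and W3C PROV ones.
BASE = "http://example.org/item/"
DC = "http://purl.org/dc/elements/1.1/"
PROV = "http://www.w3.org/ns/prov#"


def literal(value):
    """Escape a string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def record_to_ntriples(item_id, record, source_url):
    """Emit one triple per non-empty element, plus one provenance triple."""
    subject = f"<{BASE}{item_id}>"
    lines = [
        f"{subject} <{DC}{element}> {literal(value)} ."
        for element, value in record.items()
        if value  # elements left empty in the raw data produce no statement
    ]
    lines.append(f"{subject} <{PROV}wasDerivedFrom> <{source_url}> .")
    return lines


# Hypothetical record values; only the item number and species name come from the slides.
record = {"title": "Parapercis kentingensis", "type": "specimen", "coverage": ""}
triples = record_to_ntriples("6277845", record, "http://example.org/raw.csv")
print("\n".join(triples))
```

Keeping the original statements as they are, errors included, leaves later corrections and enrichments free to be added as further triples about the same URI.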
REFERENCE
1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015.
Springer International Publishing, 2015. 133-149.
5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and
Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
6. Fürber and Hepp (2013): "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin
Heidelberg, 2013. 141-161.
7. Hooland, Seth van, and Ruben Verborgh. Linked data for libraries, archives and museums. (2014).
8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open
government." Information Systems Management 29.4 (2012): 258-268.
9. Manus, Susan, The Value of a Broken Link (2012): http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals."
Journal of Information Science 35.6 (2009): 727-739.
11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and
Technology 58.12 (2007): 1720-1733.
12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information
Processing & Management 49.6 (2013): 1194-1205.
13. W3C, Data Quality Vocabulary (2015), http://www.w3.org/TR/vocab-dqv/
14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).
Merry Christmas
Happy New Year
We will release the DC 15 Versions and the Refined Version (Biology) shortly.

How to clean data less through Linked (Open Data) approach?

  • 1. How to Clean data Less through Linked (Open Data) Approach Andrea Wei-Ching Huang Institute of Information Science, Academia Sinica, Taipei, Taiwan Dec. 7 2015 @ IIS R101 1. Data Quality: data, metadata, linked data 2. The case of 840,000 cc licensed data 3. How LOD approach can improve data quality?
  • 2. 1. Data Quality: data, metadata, linked data
  • 3. Information Quality Data Quality Metadata Quality Linked Data Quality Data Quality Vocabulary Stvilia et al.(2007): 22 dimensions Batini et al. (2009): 28 dimensions Tani et al. (2013): 10 parameters Zaveri et al. (2016): 18 dimensions W3C (2015): 10 dimensions Naturalness (I) Interoperability (RP) Statistics Accessibility (R) Accessibility Accessibility Availability (A) Availability Accuracy (R) Accuracy Accuracy (S) Semantic Accuracy (I) Accuracy Accuracy/Validity (I) Applicability Pertinence Syntactic Validity (I) Appropriate amount of data Complexity (R) Clarity Precision/Completeness(R) Completeness Completeness(S) Completeness (I) Completeness Informativeness/Redundancy(R) Comprehensiveness Understandability (C) Informativeness/Redundancy(I) Conciseness Conciseness (I) Structural Consistency (I) Consistency Similarity Consistency (I) Consistency Convenience Structural Consistency(R) Correctness Verifiability (R) Credibility Trustworthiness (C) Credibility Currency (I) Currency Semantic Consistency(I) Derivation Integrity Ease of operation Processability Naturalness (R) Interactivity Conformance(S) Interlinking (A) Conformance Semantic Consistency(R) Interpretability Interpretability (RP) Precision/Completeness(I) Maintainability Preservability Complexity(I) Objectivity Relevance/ Aboutness(R) Relevancy Relevance Relevancy (C) Relevance Authority (Reputational) Reputation Security(R) Security Security (A) Speed Performance (A) Timeliness Timeliness Timeliness (C) Timeliness Traceability RP Conciseness (RP) Cohesiveness (I) Uniqueness Significance Usability Licensing (A) Volatility(R) Volatility Versatility (RP) (I): Intrinsic; (R): Relational; (S): Metadata Spec.; (RP): Representational; (A):Accessibility; (C): Contextual
  • 4. 1. Accessibility/Availability (可取得性) 2. Accuracy (正確性) 3. Completeness (完整性) 4. Consistency (一致性) 5. Credibility/Trustworthiness (可信度) 6. Relevance (相關性) 7. Timeliness (適時性) 7 dimensions/parameters are common ground Quantitative/ Qualitative Methodologies are mutual utilized.
  • 5. Metadata Quality: Problems & Solutions (1) Record Problems Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62
  • 6. Metadata Quality: Problems & Solutions (2) Dublin Core Semantic Problems Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739. • Type is a subjective value. • Source is a confusing field. It is difficult to apply it consistently. • Creator can be very varied and it can be tricky determining exactly who the creator is. • The information from the publisher is vague. • Can’t define different role of contributor. • There is often great ambiguity in terms of Type and Relation. • between Format and Type. • between Creator, Publisher, and Contributor. • between Source and Relation. • The high degree of difficulty (55.3%) engendered by the Relation field o discernment of the dynamic and interrelated nature of information objects presents challenges in using the Relation element.
  • 7. Metadata Quality: Problems & Solutions (3) Current Solutions Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.2 Tani et al. (2013): Summary of metadata quality approaches . ------------------------------------------------------------------------------------------------------------------------------------------- Metadata guidelines, standard and Application Profiles  Pros: potentially effective; if shared among organizations, they promote cross organization interoperability  Cons: challenging to agree between different organizations; often end-up being complex combinations of features reflecting the interests of many disparate parties; they infringe autonomy of the entities adopting them Metadata evaluation approaches (analytic-oriented and crowdsourcing-oriented)  Pros: helpful to identify specific problems  Cons: based on community specific criteria Semi-automatic metadata generation approaches  Pros: helpful to deal with the data deluge  Cons: human assessment Metadata cleaning, enhancement, augmentation approaches Pros: fundamental to enable cross-community exploitation of metadata Cons: information loss; information inconsistency -------------------------------------------------------------------------------------------------------------------------------------------
  • 8. 2. The case of 840,000 cc licensed data In Union Catalogue of Digital Archives Taiwan
  • 9. “Fitness for Use” is the Key: Data Quality (DQ) Definition for Digital Data  Nicholas R. Chrisman (1986): “Digital data can adapt to a broader range of uses with a broader range of special demand, …The root of data abuse is not in the quality of the data, but in the awareness and understanding of the quality of the data. By converting to the fitness for use approach, the problem of data abuse is moved from producer to consumer (data user).  W3C Data Quality Vocabulary (2015) : “...quality lies in the eye of the beholder; that there is no objective, ideal definition of it. Some datasets will be judged as low-quality resources by some data consumers, while they will perfectly fit others' needs.  Quality from perspectives of supply and demand sides: ex. Data Publishers, Certification Agencies, Data Aggregators and Data Consumers.  Pragmatic  User-specific  Context-dependent
  • 10. “Fitness for Use” in different contexts (diagram):
      Stages: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation.
      Steps: Creation, Conversion 1, Conversion 2, Conversion 3, Conversion 4, Clean & Enrich, Conversion 5.
      Formats: XML, HTML, TEXT/Image, XLSX/Table/HTML, CSV, CSV, RDF/Turtle.
      Curation levels: Local Curation (90 projects); Digital Archive Curation (1 portal); Linked Open Data (globally linked & semantically represented).
      Metadata schemes: locally developed schemes; DC 15 elements as the requirement for the Union Catalog.
      Note: globally linked, machine-accessible semantics and domain knowledge vocabularies are needed for LOD.
      Contexts: Context I, Context II, Context III.
  • 11. Metadata quality approaches along the data pipeline (diagram):
      Stages: physical object → digital object → digital collection → digital aggregation & publication → reusing & semantic representation.
      Formats: XML, HTML, TEXT/Image, XLSX/Table/HTML, CSV, CSV, RDF/Turtle.
      Curation levels: Local Curation (90 projects); Digital Archive Curation (1 portal); Linked Open Data (globally linked & semantically represented).
      Activities: provide metadata guidelines & standard (DC 15); Metadata Generation; Linked Data Generation.
      Data quality approaches: metadata evaluation approaches; semi-automatic metadata generation approaches; metadata cleaning, enhancement, augmentation approaches.
      Open questions: Information loss? Interpretation problems? Time & resource cost?
  • 12. Problems identified in the case of 840,000 cc data
      1. Confusion over Dublin Core element definitions
      2. Name ambiguity
      3. Inconsistent encoding
      4. Semantic overlaps (overloaded elements)
      5. Duplicate records
      6. Insufficient element usage (missing semantics)
      7. Errors / mistakes / others
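Two of the problems listed above, duplicate records and insufficient element usage, lend themselves to a simple automated profile. The following is a minimal sketch, assuming the records are available as CSV; the column names and sample rows are hypothetical and are not taken from the catalogue's actual export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real export's columns and values may differ.
RAW_CSV = """id,title,creator,date,coverage
1,Parapercis kentingensis,,2012-05-01,
2,Parapercis kentingensis,,2012-05-01,
3,Temple photo,Lin,1998/3/2,Tainan
"""


def profile(rows, id_field="id"):
    """Return (duplicate_groups, empty_counts) for a list of dict rows."""

    def signature(row):
        # Two records count as duplicates if every field except the id matches.
        return tuple(
            (key, value.strip())
            for key, value in sorted(row.items())
            if key != id_field
        )

    groups = {}
    for row in rows:
        groups.setdefault(signature(row), []).append(row[id_field])
    duplicates = [ids for ids in groups.values() if len(ids) > 1]

    # Count how often each element is left empty (insufficient element usage).
    empty = Counter(
        key for row in rows for key, value in row.items() if not value.strip()
    )
    return duplicates, dict(empty)


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
duplicates, empty_counts = profile(rows)
```

A profile like this only flags candidates. Whether two records are true duplicates, or whether an empty element is an error, still depends on the context of use.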
  • 13. Considerations in the case of 840,000 cc data for LOD
      1. We are not the data creators. Can we clean/revise the data “correctly”?
          Keep the original CSV data open.
          Publish revised/cleaned data as diff/mapping files.
      2. How can we prevent “information loss”?
          Mapping activities often result in information loss.
          Reconsider the value of broken links.
      3. Limited resources and time to handle the cleaning tasks.
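The idea of keeping the original CSV open and publishing corrections as separate diff/mapping files can be sketched as follows. This is an illustration under assumptions: the diff file layout (`id`, `field`, `old`, `new`, `note`) and the sample values are hypothetical, not the project's actual format.

```python
import csv
import io

# Hypothetical raw export and diff file; the layouts are assumptions.
RAW_CSV = "id,title,date\n6277845,Parapercis kentingensis,2012/5/1\n"
DIFF_CSV = (
    "id,field,old,new,note\n"
    "6277845,date,2012/5/1,2012-05-01,normalized to ISO 8601\n"
)


def apply_diffs(raw_rows, diff_rows):
    """Return (corrected_rows, skipped_diffs); raw_rows is left untouched."""
    corrected = [dict(row) for row in raw_rows]  # work on copies only
    by_id = {row["id"]: row for row in corrected}
    skipped = []
    for diff in diff_rows:
        row = by_id.get(diff["id"])
        # Apply a correction only if the raw value still matches what the
        # diff expects; otherwise keep it aside for human review.
        if row is not None and row.get(diff["field"]) == diff["old"]:
            row[diff["field"]] = diff["new"]
        else:
            skipped.append(diff)
    return corrected, skipped


raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
diff_rows = list(csv.DictReader(io.StringIO(DIFF_CSV)))
corrected_rows, skipped_diffs = apply_diffs(raw_rows, diff_rows)
```

Because the raw rows are never overwritten, a disputed correction can be dropped from the diff file without any loss of the original statement.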
  • 14. 3. How can the Linked (Open Data) approach improve data quality?
  • 15. 1. Raw data and new data (cleaned data, semantically refined data) can benefit from the Open Data approach:
      Creation of new data based on combining data.
      External quality checks of data (validation).
      Sustainability of data (no data loss).
      The ability to merge, integrate and mesh public and private data.
      Source: Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
  • 16. 2. Using SPARQL queries to identify problems:
      Identify DQ problems before the RDF is generated: use the W3C mapping language R2RML and an RDF validation framework (RDFUnit) on the mapping definitions, so that publishers can catch and correct violations before they even occur. (Dimou et al., 2015)
      Identify DQ problems after the RDF is generated: use SPARQL and publicly shared LOD resources (e.g. DBpedia, GeoNames) as a reference to identify problems. (Fürber and Hepp, 2010)
      Sources: Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
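The "after RDF generation" check can be illustrated without a triple store. The sketch below is a plain-Python stand-in, not a SPARQL implementation: triples are held as tuples, and a small hand-made list plays the role of a shared reference resource such as GeoNames. All URIs under `example.org` and all sample values are hypothetical.

```python
DC_COVERAGE = "http://purl.org/dc/elements/1.1/coverage"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical catalogue triples (subject, predicate, object).
data = [
    ("http://example.org/item/1", DC_COVERAGE, "Kenting"),
    ("http://example.org/item/2", DC_COVERAGE, "Kentin"),
]

# Hypothetical reference triples standing in for an external LOD resource.
reference = [
    ("http://example.org/place/kenting", RDFS_LABEL, "Kenting"),
]


def unmatched_values(data, reference, predicate):
    """List (subject, value) pairs whose value has no label in the reference.

    Rough SPARQL analogue, assuming both graphs were loaded in one store:
      SELECT ?s ?place WHERE {
        ?s dc:coverage ?place .
        FILTER NOT EXISTS { ?ref rdfs:label ?place }
      }
    """
    known = {obj for _, pred, obj in reference if pred == RDFS_LABEL}
    return [
        (subj, obj)
        for subj, pred, obj in data
        if pred == predicate and obj not in known
    ]


suspects = unmatched_values(data, reference, DC_COVERAGE)
```

A value missing from the reference is only a candidate problem: it may be a typo, or it may be a legitimate name that the reference does not yet cover.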
  • 17. 3. Use vocabularies, ontologies and LOD knowledge bases:
      To improve data quality at every step of a dataset's lifecycle (e.g. W3C Data Quality Vocabulary).
      To enrich data semantics and increase the value of reused and refined data.
      Five points summarized from Fürber and Hepp (2013), "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161:
        Collaborative representation and use of quality-relevant knowledge.
        Automatic identification of conflicting data requirements.
        Semantic definition of data.
        Use of Semantic Web data as trusted reference data.
        Content integration with ontologies.
  • 19. The Story of A Fish http://catalog.digitalarchives.tw/item/00/5f/ca/d5.html Parapercis kentingensis
  • 20. The fish record 6277845 across 2012, 2015 and 2016 (diagram). URI: http://URI of this Fish/6277845. Formats: TEXT/Image, XLSX/Table/HTML, XML/HTML, CSV (raw data published as open data).
      (1) Metadata (DC 15): 12/15 triples (statements).
      (2) 12/15 triples (statements) with provenance, plus one new “diff” triple.
      (3) Mapping replaces cleaning: plus one new “time mapping” triple. At this stage, place information is described in Description rather than in Coverage. It should be cleaned and mapped to external resources such as GeoNames and TaiwanPlaceName by us, or by others, when time and resources are available.
      (4) Refined Version: semantically enriched by using domain vocabularies such as Darwin Core Terms.
      (5) Once the raw CSV and the DC 15 triples (DC 15 Version) are published, it is easy for others to detect errors and to reuse and enrich the data according to their own fitness for use and interpretations. Even if there are errors from the beginning, more statements about this fish (6277845) can be generated according to the interests of the community.
      Vocabulary prefixes shown: prov, r4r, schema, cc, odw. Other labels: wikidata, err, new, time.
  • 21. How we clean data less through the Linked (Open Data) approach
      1. Keep the original CSV data open.
      2. Clean less and map more: publish revised/cleaned data as diff/mapping files.
      3. Publish the original DC 15 statements as 15 triples and provide provenance information.
      4. Assign each item resource a URI.
      5. Use a domain vocabulary to enrich the resource (e.g. dwc).
      6. Map and link to external databases to enrich the statements (GeoNames, TaiwanPlaceNames, Encyclopedia of Life).
      7. Third parties and crowdsourcing will state more errors or meanings according to their own interests.
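Steps 3 and 4 of the list above (DC statements as triples, one URI per item, provenance attached) can be sketched as a small converter that writes N-Triples. This is a minimal sketch under assumptions: the base URI and source-file URI are hypothetical placeholders, the sample row is illustrative, and `prov:wasDerivedFrom` is just one possible way to record provenance.

```python
DC = "http://purl.org/dc/elements/1.1/"
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"
BASE = "http://example.org/item/"               # hypothetical item URI base
SOURCE = "http://example.org/raw/catalog.csv"   # hypothetical raw CSV location


def escape(text):
    """Escape a string for use as an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row, id_field="id"):
    """Turn one record into N-Triples lines, skipping empty DC elements."""
    subject = f"<{BASE}{row[id_field]}>"
    lines = [
        f'{subject} <{DC}{element}> "{escape(value)}" .'
        for element, value in row.items()
        if element != id_field and value.strip()
    ]
    # One provenance statement pointing back to the untouched raw file.
    lines.append(f"{subject} <{PROV_DERIVED}> <{SOURCE}> .")
    return lines


# Illustrative row: only the id and species name appear in the slides.
row = {
    "id": "6277845",
    "title": "Parapercis kentingensis",
    "creator": "",
    "description": 'Collected at "Kenting"',
}
triples = row_to_ntriples(row)
```

Empty elements produce no triple, which is one way a record with 15 possible DC elements ends up with fewer statements, as with the 12/15 triples of the fish record.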
  • 22. References
      1. Batini, Carlo, et al. "Methodologies for data quality assessment and improvement." ACM Computing Surveys (CSUR) 41.3 (2009): 16.
      2. Chrisman, Nicholas R. "Obtaining information on quality of digital data." Proc. AutoCarto London. Vol. 1. 1986.
      3. Carata, Lucian, et al. "A primer on provenance." Communications of the ACM 57.5 (2014): 52-60.
      4. Dimou, Anastasia, et al. "Assessing and Refining Mappings to RDF to Improve Dataset Quality." The Semantic Web-ISWC 2015. Springer International Publishing, 2015. 133-149.
      5. Fürber, Christian, and Martin Hepp. "Using semantic web resources for data quality management." Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg, 2010. 211-225.
      6. Fürber, Christian, and Martin Hepp. "Using Semantic Web Technologies for Data Quality Management." Handbook of Data Quality. Springer Berlin Heidelberg, 2013. 141-161.
      7. Hooland, Seth van, and Ruben Verborgh. Linked data for libraries, archives and museums. (2014).
      8. Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. "Benefits, adoption barriers and myths of open data and open government." Information Systems Management 29.4 (2012): 258-268.
      9. Manus, Susan. The Value of a Broken Link (2012): http://blogs.loc.gov/digitalpreservation/2012/03/the-value-of-a-broken-link/
      10. Park, Jung-ran, and Eric Childress. "Dublin Core metadata semantics: An analysis of the perspectives of information professionals." Journal of Information Science 35.6 (2009): 727-739.
      11. Stvilia, Besiki, et al. "A framework for information quality assessment." Journal of the American Society for Information Science and Technology 58.12 (2007): 1720-1733.
      12. Tani, Alice, Leonardo Candela, and Donatella Castelli. "Dealing with metadata quality: The legacy of digital library efforts." Information Processing & Management 49.6 (2013): 1194-1205.
      13. W3C, Data Quality Vocabulary (2015), http://www.w3.org/TR/vocab-dqv/
      14. Yasser, Chuttur M. "An analysis of problems in metadata records." Journal of Library Metadata 11.2 (2011): 51-62.
      15. Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Semantic Web 7.1 (2016).
  • 23. Merry Christmas and Happy New Year. We will release the DC 15 Versions and the Refined Version (Biology) shortly.