Data Quality
Standards and Application to Open Data
February 21, 2018 – Brunel University, UK
Marco Torchiano
marco.torchiano@polito.it
Version 1.1.0
© Marco Torchiano, 2018
About me
 Marco Torchiano
 Associate Professor, Politecnico di Torino
 Senior Member IEEE
 Faculty Fellow – Nexa Center for Internet
and Society
 Member UNI CT504–Software Engineering
 Contacts:
– mailto:marco.torchiano@polito.it
– http://softeng.polito.it/torchiano/
– Twitter: @mtorchiano
Current Research Interests
 Mobile UI Automated Testing
 PhD student working on fragility
 (Open)Data Quality
 PhD student working on KB quality
 Software Energy Consumption
 Several collaborations
 Also: MDD, Survey methodology, code
obfuscation, SE education, …
Acknowledgments
 Antonio Vetrò
 The counterpart for
this line of research
 Many other people
 L.Canova, R.Iemma, F.Iuliano, F.Morando,
C.Orozco Minotas, G.Procaccianti,
R.Rashid
OPEN DATA QUALITY
Open Coesione
 Portal about the fulfilment of investments using the 2007-2013 European Cohesion funds
 Interactive interface
 Downloadable .csv datasets
 ~100 billion euros are being tracked, ~100K projects
 http://www.opencoesione.gov.it/
Errors in data
43 !
* extraction, transformation, and loading
Accuracy
» Always refer to the raw data
» If that is not possible, estimate the accuracy of the analysis (e.g., about 5% in the example above)
Missing data
Outliers
» Outliers can point to interesting facts
» … or to something that deserves a second look
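One common way to pick out the values that deserve a second look is Tukey's interquartile-range rule. The sketch below is only an illustration: the function name `flag_outliers`, the 1.5 multiplier, and the sample amounts are assumptions and do not come from the slides.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical project amounts (thousands of euros), not real Open Coesione data.
amounts = [10, 12, 11, 13, 12, 11, 100]
suspects = flag_outliers(amounts)
```

A flagged value is only a candidate: it may be an interesting fact or a data error, and telling the two apart still requires going back to the source.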
Value
pcvc = percentage of cells with correct value
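A minimal sketch of how pcvc could be computed, assuming a trusted reference copy of the table (for instance the raw data) is available for comparison. The function name, the row-of-lists layout, and the sample values are illustrative assumptions.

```python
def pcvc(published, reference):
    """Percentage of cells whose published value equals the reference value.

    Both arguments are lists of rows (lists of cell values) with the same shape.
    """
    total = correct = 0
    for pub_row, ref_row in zip(published, reference):
        for pub, ref in zip(pub_row, ref_row):
            total += 1
            correct += pub == ref
    return 100.0 * correct / total if total else 0.0


# Hypothetical two-row table: one cell was altered on the way to publication.
reference = [["A", "120"], ["B", "17"]]
published = [["A", "210"], ["B", "17"]]
score = pcvc(published, reference)
```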
ISO DATA QUALITY STANDARDS
ISO - SQuaRE
Family of standards
 2500x Quality Management
 2501x Quality Model
 2502x Quality Measurement
 2503x Quality Requirements
 2504x Quality Evaluation
ISO SQuaRE
 Internal Quality
 Values, formats, relation
 External Quality
 Technological environment
 Quality in Use
 Context of use of the data user
ISO 25012
Data Quality Model
Roles
 Data Quality evaluator
 Data Producer
 Data Acquirer
 Data User
Data evaluator
 Defines or adapts a quality model
 Evaluates and acts through
 Data correction
 Technological adjustments
 Organizational measures
Model structure
 Characteristic
 A main aspect, e.g., Usability
 Sub-characteristic (optional)
 A detailed aspect of a characteristic, e.g., Understandability
 Metric
 A set of rules to assign and interpret a (numerical) evaluation of a specific (sub-)characteristic
Characteristics
 Accuracy
 Completeness
 Consistency
 Credibility
 Currentness
 Accessibility
 Compliance
 Confidentiality
 Efficiency
 Precision
 Traceability
 Understandability
 Availability
 Portability
 Recoverability
Characteristics
 Accuracy
 Correspondence between data and reality
(syntactic and semantic)
 Completeness
 Computer: presence of all necessary
values
 User: how much the data is able to satisfy
the needs
 Consistency
 Absence of contradictions in the data
Characteristics
 Credibility
 The extent to which data are regarded as true and credible by users
 Currentness
 The extent to which data are up to date
 Accessibility
 The capability of data to be accessed, particularly by people who need supporting technology or special configuration because of some disability
Characteristics
 Regulatory compliance
 The capability of data to adhere to standards,
conventions or regulations in force and similar
rules relating to data quality
 Confidentiality
 The capability of the data to be accessed and
interpreted only by authorized users
 Efficiency
 The capability of data to be processed (accessed,
acquired, updated, etc) and to provide
appropriate levels of performance using the
appropriate amounts and types of resources
under stated conditions
Characteristics
 Precision
 Capability of the value assigned to an
attribute to provide the degree of
information needed in a stated context of
use
 Traceability
 Presence of attributes providing an audit trail
of access and changes made to data
 Understandability
 The extent to which data can be read and
interpreted by users
Characteristics
 Availability
 The capability of data to be always
retrievable.
 Recoverability
 The capability to preserve a specified level of
operations and its physical and logical
integrity, even in the event of failure
 Portability
 The capability of data to be moved to
another platform preserving quality
Perspectives
 Inherent: facts (data)
 Accuracy, Completeness, Consistency, Credibility, Currentness
 Inherent and system dependent
 Accessibility, Understandability (HCI support), Compliance, Confidentiality, Efficiency, Precision, Traceability
 System dependent: artefacts (D+Hw+Sw+Sys)
 Availability, Portability, Recoverability
ISO 25024
Measurement of Data Quality
Relationships among standards
[Diagram, source: ISO/IEC 25024]
 Quality models: ISO/IEC 25010 (System and Software Product Quality), ISO/IEC 25012 (Data Quality)
 Composed of quality characteristics and quality sub-characteristics
 Quality measures: ISO/IEC 25022, 25023, 25024
 Defined through a measurement function, composed of quality measure elements (QME)
 Quality measure elements: ISO/IEC 25021
 Measurement method, property to quantify, target entity
Data Life Cycle: examples
 Data design
 Data collection
 Data integration
 External data acquisition
 Data processing
 Presentation
 Other use
 Data store
 Delete
Source: ISO/IEC 25024
Data design: target entities
 Architecture
 Contextual schema
 Data models (conceptual, logical,
physical)
 Data dictionary
 Document
Data design: properties
 Attribute
 Element
 Information
 Metadata
 Vocabulary
Other stages: target entities
 Data file
 DBMS
 RDBMS
 Form
 Presentation device
Properties
 Data format
 Data item
 Data value
 Information item
 Information item content
 Data record
Metrics definition
a) ID: abbreviated code of the quality characteristic + (I/D) + serial number;
b) Name: quality measure (QM) name related to data;
c) Description;
d) Measurement function: formula showing how the quality measure elements (QMEs) are combined to produce the QM;
e) DLC, Target entities, Properties: the stages of the data life cycle (DLC) where the data QMEs are applicable, the target entities, and the properties of the target entities;
f) Note: additional information such as an acceptable range of values, references to other standards, explanations or interpretation criteria, measurement method used to obtain the
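The ID scheme in item a) can be sketched as a small parser for identifiers such as Acc-I-1 or Tra-D-2. The regular expression and the `MeasureId` class are illustrative assumptions. The sketch also assumes that I and D stand for the inherent and system-dependent perspectives, and that characteristic codes are three letters long, as in the codes used in these slides.

```python
import re
from dataclasses import dataclass

# Assumed shape: three-letter characteristic code, I or D, serial number.
ID_PATTERN = re.compile(r"^([A-Z][a-z]{2})-([ID])-(\d+)$")


@dataclass(frozen=True)
class MeasureId:
    characteristic: str  # abbreviated characteristic code, e.g. "Acc"
    view: str            # "I" (inherent) or "D" (system dependent)
    serial: int


def parse_measure_id(text):
    """Split an identifier such as 'Acc-I-1' into its three parts."""
    match = ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a measure ID: {text!r}")
    code, view, serial = match.groups()
    return MeasureId(code, view, int(serial))


acc = parse_measure_id("Acc-I-1")
tra = parse_measure_id("Tra-D-2")
```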
ACCURACY (Acc-I-1)
Copyright: ISO/IEC 25024
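The standard's definition of Acc-I-1 is not reproduced in this transcript. As a rough sketch, assume the measure is the ratio between the values that satisfy a syntactic rule and the values checked, in line with the "percentage of syntactically accurate cells" used in the case studies. The function name, the choice to skip missing values, and the ISO date rule are illustrative assumptions.

```python
import re


def syntactic_accuracy(values, is_valid):
    """Ratio A/B: A = values accepted by is_valid, B = values checked.

    Missing values (None) are skipped here, on the assumption that they are
    a completeness problem rather than an accuracy one.
    """
    checked = [v for v in values if v is not None]
    if not checked:
        return None
    return sum(1 for v in checked if is_valid(v)) / len(checked)


# Hypothetical syntactic rule: dates written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
dates = ["2014-09-01", "2014-11-30", "30/11/2014", "2014-13"]
ratio = syntactic_accuracy(dates, lambda v: ISO_DATE.fullmatch(v) is not None)
```

The rule only checks the shape of a value: a well-formed but wrong date still counts as syntactically accurate.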
CASE STUDIES
Open Government Data
Open Government Data
OD: open data, i.e., data that can be
 Used
 Reused
 Redistributed
 By anyone and for any purpose
G: government, i.e., data produced or commissioned by a government or by an institutional entity controlled by the government
http://opengovernmentdata.org
Why OGD ?
 Transparency
 Social and commercial value
 Participation
Case 1: Open Coesione
 Published data
 Structured
 Open data format
OpenCoesione
Statistical data from municipalities
 Residents
 Weddings
 Commercial activities
Datasets analyzed
Orchestrated disclosure
● Open Coesione
● Portal about the fulfilment of investments using the 2007-2013 European Cohesion funds
● 85 billion euros are being tracked, 850K projects
Decentralized disclosure
● Municipal datasets; table columns: Torino, Roma, Milano, Firenze, Bologna
● Residents: X X X X X
● Weddings: X X X
● Business Activities: X X X
Measures
Characteristic     Description                                       ISO name
Completeness       Percentage of complete cells                      Com-I-1 (cell)
                   Percentage of complete rows                       Com-I-1 (row)
Accuracy           Percentage of syntactically accurate cells        Acc-I-1
Traceability       Track of creation                                 Tra-D-2 (c)
                   Track of update                                   Tra-D-2 (u)
Currentness        Percentage of current rows                        Cur-I-2
                   Delay in publication                              ~Cur-I-1
Compliance         e-GMS compliance                                  Cmp-D-1
                   Five stars open data                              Cmp-D-1
Understandability  Percentage of columns with metadata               Und-I-3
                   Percentage of columns in comprehensible format    Und-I-4
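The two completeness measures in the table, by cell and by row, can be sketched over a CSV file as follows. This is an illustration under stated assumptions: the set of spellings treated as missing, the function name, and the sample figures are hypothetical and are not taken from the datasets analyzed.

```python
import csv
import io

# Assumption: these spellings count as "missing"; adapt them to the dataset.
MISSING = {"", "NA", "N/A", "null"}


def completeness(rows):
    """Return (share of non-missing cells, share of rows with no missing cell)."""
    cells = filled_cells = full_rows = 0
    for row in rows:
        filled = [(value or "").strip() not in MISSING for value in row.values()]
        cells += len(filled)
        filled_cells += sum(filled)
        full_rows += all(filled)
    cell_ratio = filled_cells / cells if cells else 0.0
    row_ratio = full_rows / len(rows) if rows else 0.0
    return cell_ratio, row_ratio


# Hypothetical municipal data with two missing cells.
sample = io.StringIO(
    "city,residents,weddings\n"
    "Torino,900000,1500\n"
    "Roma,2800000,\n"
    "Milano,NA,2000\n"
    "Firenze,380000,900\n"
)
cell_ratio, row_ratio = completeness(list(csv.DictReader(sample)))
```

The row-level score is never higher than the cell-level one, because a single missing cell makes the whole row incomplete.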
e-GMS
UK e-Government Metadata Standard
1. Accessibility (mandatory if applicable)
2. Addressee (optional)
3. Aggregation (optional)
4. Audience (optional)
5. Contributor (optional)
6. Coverage (recommended)
7. Creator (mandatory)
8. Date (mandatory)
9. Description (optional)
10. Digital signature (optional)
11. Disposal (optional)
12. Format (optional)
13. Identifier (mandatory if applicable)
14. Language (recommended)
15. Location (optional)
16. Mandate (optional)
17. Preservation (optional)
18. Publisher (mandatory if applicable)
19. Relation (optional)
20. Rights (optional)
21. Source (optional)
22. Status (optional)
23. Subject (mandatory)
24. Title (mandatory)
25. Type (optional)
https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
Results – Open Coesione
[Bar chart, scale 0.00 to 1.00, one bar per measure from Com-I-1 (cell) to Und-I-4]
Chart annotations:
 Null/zero values: domain uncertain
 Track of updates missing
 Missing metadata
 Data not linked
Results – Municipality data
[Bar chart, scale 0 to 1, one bar per measure from Com-I-1 (cell) to Und-I-4]
Chart annotations:
 Discrepancies of values with domain
 No info on updates
 Missing metadata
Findings
 The disclosure strategy affects data quality
 Centralized vs. decentralized
 Traceability is generally lacking
 Proposal: use software configuration management tools
 Metadata is often missing or incomplete
Case 2: Public Contracts
 Published data
 Structured
 Open format
Data on public contracts under Art. 37 of the Transparency Decree, plus ANAC prescriptions
Public contracts
 Transparency Decree (14 March 2013, n. 33)
 Public contracts (Art. 37 and Art. 9)
 Open data publication
 XML standard format (ANAC)
 Selected administrations: Italian universities
Data Structure
XML
METADATA
DATA
LOTS
PARTICIPANTS
WINNER
<lotto>
  <!-- cig: unique tender identifier -->
  <cig>4421574E47</cig>
  <!-- strutturaProponente: contracting authority (fiscal code and name) -->
  <strutturaProponente>
    <codiceFiscaleProp>00518460019</codiceFiscaleProp>
    <denominazione>Politecnico di Torino</denominazione>
  </strutturaProponente>
  <!-- oggetto: subject of the contract -->
  <oggetto>
    Procedura di cottimo fiduciario per affidamento servizio di manutenzione e
    assistenza di primo livello stazioni self-service
  </oggetto>
  <!-- sceltaContraente: procedure used to choose the contractor -->
  <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
  <!-- partecipanti: participants; the ampersand must be escaped as &amp; -->
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <!-- aggiudicatari: winners -->
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </aggiudicatario>
  </aggiudicatari>
  <!-- importoAggiudicazione: awarded amount -->
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <!-- tempiCompletamento: start and completion dates -->
  <tempiCompletamento>
    <dataInizio>2014-09-01</dataInizio>
    <dataUltimazione>2014-11-30</dataUltimazione>
  </tempiCompletamento>
  <!-- importoSommeLiquidate: amount paid -->
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
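A lot in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below assumes the document is well-formed XML, which means an ampersand inside a company name has to be written as `&amp;`. The `read_lot` helper and the dictionary keys are illustrative names, and only a subset of the elements is extracted.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the lot shown above, with the ampersand escaped.
LOT_XML = """
<lotto>
  <cig>4421574E47</cig>
  <partecipanti>
    <partecipante>
      <codiceFiscale>06267040019</codiceFiscale>
      <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
    </partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario>
      <codiceFiscale>06267040019</codiceFiscale>
    </aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>
"""


def fiscal_codes(lot, path):
    """Collect the text of every codiceFiscale element found under path."""
    return [(e.text or "").strip() for e in lot.iterfind(path + "/codiceFiscale")]


def read_lot(xml_text):
    """Extract the fields used by the quality checks from one <lotto> element."""
    lot = ET.fromstring(xml_text.strip())
    return {
        "cig": lot.findtext("cig", default="").strip(),
        "participants": fiscal_codes(lot, "partecipanti/partecipante"),
        "winners": fiscal_codes(lot, "aggiudicatari/aggiudicatario"),
        "awarded": float(lot.findtext("importoAggiudicazione")),
        "paid": float(lot.findtext("importoSommeLiquidate")),
    }


lot = read_lot(LOT_XML)
```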
Quality Evaluation Framework
Intrinsic dimensions
Dimension      Measure
Accuracy       Percentage of elements with correct values.
Completeness   Percentage of complete elements.
               Percentage of complete aggregate elements.
Domain-dependent dimensions
Dimension      Measure
Consistency    Percentage of lots that meet the intrarelational and interrelational integrity constraints.
Duplication    Number of duplicates.
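The two domain-dependent measures can be sketched as follows. The specific rules are assumptions chosen for illustration: the intrarelational constraint checked here is "every winner also appears among the participants", and a duplicate is a lot whose CIG was already seen. The framework's actual constraint set is not spelled out in the slides, and the sample lots are hypothetical.

```python
from collections import Counter


def winners_are_participants(lot):
    """Assumed intrarelational constraint: winners are a subset of participants."""
    return set(lot["winners"]) <= set(lot["participants"])


def count_duplicates(lots):
    """Number of lots that repeat a CIG already present in the dataset."""
    counts = Counter(lot["cig"] for lot in lots)
    return sum(n - 1 for n in counts.values() if n > 1)


# Hypothetical lots: the second has a winner but no participants,
# the third repeats the first.
lots = [
    {"cig": "AAA0000001", "participants": ["1", "2"], "winners": ["1"]},
    {"cig": "AAA0000002", "participants": [], "winners": ["3"]},
    {"cig": "AAA0000001", "participants": ["1", "2"], "winners": ["1"]},
]
consistent_share = sum(map(winners_are_participants, lots)) / len(lots)
duplicates = count_duplicates(lots)
```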
Identification of datasets
 The first 25 universities in the 2014 overall ranking published by the newspaper Il Sole 24 Ore.
 Only 12 universities provide summary tables in XML format.
Total number of assessed lots: 123702
Average number of published lots: 10308.5
 The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
CIG (Unique Tender Identifier)
The University of Torino publishes summary tables with 100% cig completeness, that is, 100% of the lots have the cig element, but about 32% of the values are out of domain.
[Bar chart, axis from 0.60 to 1.00. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
[Series 1: 1, 0.94, 0.9999, 0.999, 0.67, 1, 0.99, 1, 0.998, 0.997, 0.9998, 0.99]
[Series 2: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
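The gap between a CIG being present and a CIG being in its domain can be sketched as two separate scores. The domain rule used here is a deliberate simplification and an assumption: exactly 10 alphanumeric characters that are not all zeros. The real CIG validation rules are not given in the slides, and the names `cig_in_domain` and `cig_scores` are hypothetical.

```python
import re

# Assumed, simplified domain rule: 10 uppercase letters or digits.
CIG_SHAPE = re.compile(r"[0-9A-Z]{10}")


def cig_in_domain(value):
    """True when the value has the assumed CIG shape and is not a zero filler."""
    text = value.strip().upper()
    return CIG_SHAPE.fullmatch(text) is not None and set(text) != {"0"}


def cig_scores(cigs):
    """Return (completeness, accuracy) for a list of cig values."""
    present = [c for c in cigs if c and c.strip()]
    completeness = len(present) / len(cigs) if cigs else 0.0
    accuracy = sum(cig_in_domain(c) for c in present) / len(present) if present else 0.0
    return completeness, accuracy


# Every lot carries a value, but only the first one is a plausible CIG.
completeness, accuracy = cig_scores(["4421574E47", "0000000000", "N/A"])
```

Keeping the two scores apart is what makes a case like the one above visible: full completeness can coexist with a large share of out-of-domain values.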
A lot of “00000000000”. The element is present for each lot but it is always empty.
Choice of contracting party
[Bar chart, axis from 0.00 to 1.00. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
[Series 1: 0.9999, 0.998, 0.9999, 0, 1, 1, 1, 0.9991, 1, 1, 1, 1]
[Series 2: 1, 1, 1, 1, 1, 1, 1, 0.999, 1, 1, 1, 1]
All the lots published by the University of Milano have a winner but no information about the participants.
Fiscal Code
[Bar chart, axis from 0.90 to 1.00. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
[Extracted values: 1, 0.97, 0.99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.974, 1, 1, 0.996, 1, 0.951]
In 14% of lots the amount paid is greater than the awarded amount.
Amount paid vs. Total paid
[Bar chart "Paid less than or equal to awarded", axis from 0.80 to 1.00. Universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm]
[Values: 0.87, 0.97, 0.96, 0.9999, 0.998, 0.99, 0.999, 0.93, 0.995, 0.9999, 0.98, 0.98]
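The check behind this chart, paid amount less than or equal to awarded amount, can be sketched as a share of compliant lots. Amounts are parsed with `Decimal` to avoid floating-point surprises when comparing money. The function name and the sample amounts are illustrative assumptions.

```python
from decimal import Decimal


def share_paid_within_awarded(lots):
    """Share of lots whose paid amount does not exceed the awarded amount.

    lots is an iterable of (paid, awarded) pairs given as strings.
    """
    pairs = [(Decimal(paid), Decimal(awarded)) for paid, awarded in lots]
    if not pairs:
        return None
    ok = sum(1 for paid, awarded in pairs if paid <= awarded)
    return ok / len(pairs)


# Hypothetical lots: the second one was paid more than it was awarded.
sample = [
    ("7500.00", "7500.00"),
    ("1200.50", "1000.00"),
    ("0", "300.00"),
    ("99.99", "100.00"),
]
share = share_paid_within_awarded(sample)
```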
Final considerations
 ISO standard provides several
predefined measures
 Must be adapted to the case at hand
 Can be aggregated in different ways
 Possibility to define new measures
 ISO standard is intended for
structured data
 What about semantic knowledge bases?
References
 ISO/IEC 25012:2008, Software engineering — Software
product Quality Requirements and Evaluation (SQuaRE) —
Data quality model
 ISO 25024:2015, Software engineering — Software product
Quality Requirements and Evaluation (SQuaRE) —
Measurement of data quality
 Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico. “Open Data Quality Measurement Framework: Definition and Application to Open Government Data”, Government Information Quarterly, Vol. 33, pp. 325-337, ISSN: 0740-624X
 Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca. “Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study”, in IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)
dacripapanjaitan
 
deep_learning_presentation related to llm
deep_learning_presentation related to llm
sayedabdussalam11
 
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
SQL-Demystified-A-Beginners-Guide-to-Database-Mastery.pptx
bhavaniteacher99
 

Data Quality - Standards and Application to Open Data

  • 1. Data Quality Standards and Application to Open Data February 21, 2018 – Brunel University, UK Marco Torchiano marco.torchiano@polito.it Version 1.1.0 © Marco Torchiano, 2018
  • 2. About me  Marco Torchiano  Associate Professor, Politecnico di Torino  Senior Member IEEE  Faculty Fellow – Nexa Center for Internet and Society  Member UNI CT504–Software Engineering  Contacts: – mailto:marco.torchiano@polito.it – http://softeng.polito.it/torchiano/ – Twitter: @mtorchiano
  • 3. Current Research Interests  Mobile UI Automated Testing  PhD student working on fragility  (Open)Data Quality  PhD student working on KB quality  Software Energy Consumption  Several collaborations  Also: MDD, Survey methodology, code obfuscation, SE education, … 4
  • 4. Acknowledgments  Antonio Vetrò  The counterpart for this line of research  Many other people  L.Canova, R.Iemma, F.Iuliano, F.Morando, C.Orozco Minotas, G.Procaccianti, R.Rashid 5
  • 6. Open Coesione  portal about the fulfilment of investments using the 2007-2013 European Cohesion funds  Interactive Interface  Downloadable .csv datasets  ~100 billion Euros are being tracked, ~100K projects  http://www.opencoesione.gov.it/
  • 9. [Figure omitted; annotation: "43 !"] * extraction, transformation, and loading
  • 11. » Always refer to raw data » If that is not possible, estimate the accuracy of the analysis (e.g., about 5% in the example above)
  • 14. Outliers » Outliers can point to interesting facts
  • 15. Outliers » … or to something which deserves a second look
  • 16. [Chart omitted; axis: Value] pcvc = percentage of cells with correct value
  • 18. ISO - SQuaRE: family of standards  2500x Quality Management  2501x Quality Model  2502x Quality Measurement  2503x Quality Requirements  2504x Quality Evaluation
  • 19. ISO SQuaRE  Internal Quality  Values, formats, relation  External Quality  Technological environment  Quality in Use  Context of use of the data user 21
  • 20. ISO 25012 Data Quality Model [diagram of the SQuaRE family repeated: 2500x Quality Management, 2501x Quality Model, 2502x Quality Measurement, 2503x Quality Requirements, 2504x Quality Evaluation]
  • 21. Roles  Data Quality evaluator  Data Producer  Data Acquirer  Data User 23
  • 22. Data evaluator  Defines/adapts a quality model  Evaluate and act  Data correction  Technological adjustments  Organizational measures 24
  • 23. Model structure  Characteristic  Main aspects, e.g., usability  Sub-Characteristic (optional)  A detailed aspect of a characteristic, e.g., understandability  Metric  A set of rules to assign and interpret a (numerical) evaluation to a specific (sub)-characteristic
  • 24. Characteristics  Accuracy  Completeness  Consistency  Credibility  Currentness  Accessibility  Compliance  Confidentiality  Efficiency  Precision  Traceability  Understandability  Availability  Portability  Recoverability 26
  • 25. Characteristics  Accuracy  Correspondence between data and reality (syntactic and semantic)  Completeness  Computer: presence of all necessary values  User: how much the data is able to satisfy the needs  Consistency  Absence of contradictions in the data 27
  • 26. Characteristics  Credibility  The extent to which data are regarded as true and credible by users  Currentness  the extent to which data is up-to-date  Accessibility  The capability of data to be accessed, particularly by people who need supporting technology or special configuration because of some disability 28
  • 27. Characteristics  Regulatory compliance  The capability of data to adhere to standards, conventions or regulations in force and similar rules relating to data quality  Confidentiality  The capability of the data to be accessed and interpreted only by authorized users  Efficiency  The capability of data to be processed (accessed, acquired, updated, etc) and to provide appropriate levels of performance using the appropriate amounts and types of resources under stated conditions 29
  • 28. Characteristics  Precision  Capability of the value assigned to an attribute to provide the degree of information needed in a stated context of use  Traceability  Presence of attributes providing an audit trail of access and changes made to data  Understandability  The extent to which data can be read and interpreted by users 30
  • 29. Characteristics  Availability  The capability of data to be always retrievable.  Recoverability  The capability to preserve a specified level of operations and its physical and logical integrity, even in the event of failure  Portability  The capability of data to be moved to another platform preserving quality 31
  • 31. ISO 25024 Measurement of Data Quality [diagram of the SQuaRE family repeated: 2500x Quality Management, 2501x Quality Model, 2502x Quality Measurement, 2503x Quality Requirements, 2504x Quality Evaluation]
  • 32. Relationships among standards [diagram]  ISO/IEC 25010 System and Software Product Quality  ISO/IEC 25012 Data Quality  composed of Quality characteristics  composed of Quality sub-characteristics  Quality Measure (ISO/IEC 25022, 25023, 25024)  Measurement function  composed of Quality Measure Elements (QME, ISO/IEC 25021)  Measurement method  Property to quantify  Target Entity  Source: ISO/IEC 25024
  • 33. Data Life Cycle: examples  Data design  Data collection  External data acquisition  Data integration  Data processing  Presentation  Other use  Data store  Delete  Source: ISO/IEC 25024
  • 34. Data design: target entities  Architecture  Contextual schema  Data models (conceptual, logical, physical)  Data dictionary  Document 36
  • 35. Data design: properties  Attribute  Element  Information  Metadata  Vocabulary 37
  • 36. Other stages: target entities  Data file  DBMS  RDBMS  Form  Presentation device 38
  • 37. Properties  Data format  Data item  Data value  Information item  Information item content  Data record 39
  • 38. Metrics definition a) ID: abbreviated code of the quality characteristic + (I/D) + serial number; b) Name: QM name related to data; c) Description; d) Measurement function: formula showing how the QMEs are combined to produce the QM; e) DLC, Target entities, Properties: the stages of the data life cycle (DLC) where the data QMEs are applicable, the target entities, and the properties of the target entities; f) Note: additional information such as an acceptable range of values, references to other standards, explanations or interpretation criteria, measurement method used to obtain the …
  • 41. Open Government Data OD: open data, data that can be  Used  Reused  Redistributed  By anyone and with any goal G: Government produced or commissioned by a government or an institutional entity controlled by the government http://opengovernmentdata.org 51
  • 42. Why OGD ?  Transparency  Social and commercial value  Participation 52
  • 43. Case 1: Open Coesione  Published data  Structured  Open data format OpenCoesione Statistical data from municipalities  Residents  Weddings  Commercial activities 60
  • 44. Datasets analyzed
      Orchestrated disclosure: Open Coesione, portal about the fulfilment of investments using the 2007-2013 European Cohesion funds; 85 billion Euros are being tracked, 850K projects
      Decentralized disclosure: municipality datasets (table columns: Torino, Roma, Milano, Firenze, Bologna)
        Residents: X X X X X
        Weddings: X X X
        Business Activities: X X X
  • 46. Measures (Characteristic | Description | ISO name)
      Completeness | Percentage of complete cells | Com-I-1 (cell)
      Completeness | Percentage of complete rows | Com-I-1 (row)
      Accuracy | Percentage of syntactically accurate cells | Acc-I-1
      Traceability | Track of creation | Tra-D-2 (c)
      Traceability | Track of update | Tra-D-2 (u)
      Currentness | Percentage of current rows | Cur-I-2
      Currentness | Delay in publication | ~Cur-I-1
      Compliance | eGMS compliance | Cmp-D-1
      Compliance | Five stars open data | Cmp-D-1
      Understandability | Percentage of columns with metadata | Und-I-3
      Understandability | Percentage of columns in comprehensible format | Und-I-4
  • 47. e-GMS: UK e-Government Metadata Standard 1. Accessibility (mandatory if applicable) 2. Addressee (optional) 3. Aggregation (optional) 4. Audience (optional) 5. Contributor (optional) 6. Coverage (recommended) 7. Creator (mandatory) 8. Date (mandatory) 9. Description (optional) 10. Digital signature (optional) 11. Disposal (optional) 12. Format (optional) 13. Identifier (mandatory if applicable) 14. Language (recommended) 15. Location (optional) 16. Mandate (optional) 17. Preservation (optional) 18. Publisher (mandatory if applicable) 19. Relation (optional) 20. Rights (optional) 21. Source (optional) 22. Status (optional) 23. Subject (mandatory) 24. Title (mandatory) 25. Type (optional) https://www.oasis-open.org/committees/download.php/7271/eGMS%20version%203.pdf
  • 48. Results – Open Coesione [bar chart, scale 0.00 to 1.00, one bar per measure: Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4] Annotations: null/zero values, domain uncertain; track of updates missing; missing metadata; data not linked
  • 49. Results – Municipality data [bar chart, scale 0 to 1, one bar per measure: Com-I-1 (cell), Com-I-1 (row), Acc-I-1, Tra-D-2 (c), Tra-D-2 (u), Cur-I-2, ~Cur-I-1, Cmp-D-1, Cmp-D-1, Und-I-3, Und-I-4] Annotations: discrepancies of values with domain; no info on updates; missing metadata
  • 50. Findings  Disclosure strategy implies different data quality  Centralized vs.  Decentralized  Traceability is generally lacking  Proposals to use software configuration management tools  Metadata is often missing or incomplete
  • 51. Case 2: Public Contracts  Published data  Structured  Open format  Data on public contracts as per Art. 37 of the Transparency Decree, plus ANAC prescriptions
  • 52. Public contracts  Transparency Decree (14 March 2013, n. 33)  Public contracts (Art. 37 and Art. 9)  Open Data Publication  XML Standard Format (ANAC)  Selected administrations: Italian universities
  • 54.
      <lotto>
        <cig>4421574E47</cig>
        <strutturaProponente>
          <codiceFiscaleProp>00518460019</codiceFiscaleProp>
          <denominazione>Politecnico di Torino</denominazione>
        </strutturaProponente>
        <oggetto>Procedura di cottimo fiduciario per affidamento servizio di manutenzione e assistenza di primo livello stazioni self-service</oggetto>
        <sceltaContraente>08-AFFIDAMENTO IN ECONOMIA - COTTIMO FIDUCIARIO</sceltaContraente>
        <partecipanti>
          <partecipante>
            <codiceFiscale>06267040019</codiceFiscale>
            <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
          </partecipante>
        </partecipanti>
        <aggiudicatari>
          <aggiudicatario>
            <codiceFiscale>06267040019</codiceFiscale>
            <ragioneSociale>OVER S.A.S. DI VERGNANO CARLO &amp; C.</ragioneSociale>
          </aggiudicatario>
        </aggiudicatari>
        <importoAggiudicazione>7500.00</importoAggiudicazione>
        <tempiCompletamento>
          <dataInizio>2014-09-01</dataInizio>
          <dataUltimazione>2014-11-30</dataUltimazione>
        </tempiCompletamento>
        <importoSommeLiquidate>7500.00</importoSommeLiquidate>
      </lotto>
  • 55. Quality Evaluation Framework (Dimension | Measure)
      Intrinsic dimensions
        Accuracy | Percentage of elements with correct values
        Completeness | Percentage of complete elements; percentage of complete aggregate elements
      Domain-dependent dimensions
        Consistency | Percentage of lots that meet the intrarelational and interrelational integrity constraints
        Duplication | Number of duplicates
  • 56. Identification of datasets  First 25 universities of the overall 2014 ranking provided by the newspaper Il Sole 24 Ore.  Only 12 universities provide summary tables in XML format. Total number of assessed lots: 123702. Average number of published lots: 10308.5.  The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
  • 57. CIG: Unique Tender Identifier. The University of Torino publishes summary tables that have 100% cig completeness, that is, 100% of the lots have the cig element, but about 32% of them are out of domain. Many values are “00000000000”. [Bar chart, axis 0.60 to 1.00; universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm; first series: 1, 0.94, 0.9999, 0.999, 0.67, 1, 0.99, 1, 0.998, 0.997, 0.9998, 0.99; second series: 1 for all twelve]
  • 58. Choice of contracting part. The element is present for each lot but it is always empty. [Bar chart, axis 0.00 to 1.00; universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm; first series: 0.9999, 0.998, 0.9999, 0, 1, 1, 1, 0.9991, 1, 1, 1, 1; second series: 1, 1, 1, 1, 1, 1, 1, 0.999, 1, 1, 1, 1]
  • 59. Fiscal Code. All the lots published by the University of Milano have a winner but no information about the participants. [Bar chart, axis 0.90 to 1.00; axis labels: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm; values as extracted: 1 0.97 0.99 1 1 1 1 1 1 1 1 1 1 1 1 1 0.974 1 1 0.996 1 0.951]
  • 60. Amount paid vs. Total paid. In 14% of lots the amount paid is greater than the awarded amount. [Bar chart "Paid less than or equal to awarded", axis 0.80 to 1.00; universities: UniBo, PoliMi, PoliTo, UniMi, UniTo, UniVe, UniUpo, UniFe, UniMib, UniPv, UniSa, UnivPm; values: 0.87, 0.97, 0.96, 0.9999, 0.998, 0.99, 0.999, 0.93, 0.995, 0.9999, 0.98, 0.98]
  • 61. Final considerations  ISO standard provides several predefined measures  Must be adapted to the case at hand  Can be aggregated in different ways  Possibility to define new measures  ISO standard is intended for structured data  What about semantic knowledge bases? 79
  • 62. References  ISO/IEC 25012:2008, Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Data quality model  ISO 25024:2015, Software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Measurement of data quality  Vetrò, Antonio; Canova, Lorenzo; Torchiano, Marco; Orozco Minotas, Camilo; Iemma, Raimondo; Morando, Federico, “Open Data Quality Measurement Framework: Definition and Application to Open Government Data”, Government Information Quarterly, Vol. 33, pp. 325-337, ISSN: 0740-624X  Torchiano, Marco; Vetrò, Antonio; Iuliano, Francesca, “Preserving the Benefits of Open Government Data by Measuring and Improving Their Quality: An Empirical Study”, in IEEE 41st Annual Computer Software and Applications Conference (COMPSAC 2017)
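
The completeness and accuracy measures named in the Measures slide (Com-I-1 per cell, Com-I-1 per row, Acc-I-1) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the rule that a blank or missing cell is "incomplete", the choice to compute accuracy only over non-empty cells, and the two example validators are all hypothetical.

```python
from datetime import date
from typing import Callable, Optional, Sequence

Cell = Optional[str]
Row = Sequence[Cell]


def is_complete(cell: Cell) -> bool:
    """Assumption: a cell is complete when it holds a non-blank value."""
    return cell is not None and cell.strip() != ""


def complete_cells_ratio(rows: Sequence[Row]) -> float:
    """Com-I-1 (cell): share of cells that are complete."""
    cells = [cell for row in rows for cell in row]
    return sum(is_complete(c) for c in cells) / len(cells) if cells else 0.0


def complete_rows_ratio(rows: Sequence[Row]) -> float:
    """Com-I-1 (row): share of rows in which every cell is complete."""
    full = sum(all(is_complete(c) for c in row) for row in rows)
    return full / len(rows) if rows else 0.0


def accurate_cells_ratio(
    rows: Sequence[Row], validators: Sequence[Callable[[str], bool]]
) -> float:
    """Acc-I-1: share of non-empty cells accepted by their column validator.

    Assumption: empty cells are left to the completeness measure and are
    not counted here.
    """
    checked = accurate = 0
    for row in rows:
        for cell, is_valid in zip(row, validators):
            if is_complete(cell):
                checked += 1
                accurate += is_valid(cell.strip())
    return accurate / checked if checked else 0.0


def is_iso_date(text: str) -> bool:
    """Hypothetical syntactic check: the value parses as YYYY-MM-DD."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def is_amount(text: str) -> bool:
    """Hypothetical syntactic check: the value parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    sample = [
        ["Torino", "2014-09-01", "7500.00"],
        ["Milano", "", "abc"],
        ["Roma", "2014-11-30", None],
    ]
    checks = [lambda _value: True, is_iso_date, is_amount]
    print(complete_cells_ratio(sample))
    print(complete_rows_ratio(sample))
    print(accurate_cells_ratio(sample, checks))
```

Keeping the three ratios as separate functions mirrors the slide's point that the predefined measures can be adapted and aggregated in different ways; a different definition of "complete" or "syntactically accurate" only requires swapping `is_complete` or the validators.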

Editor's Notes

  • #53: Transparency isn’t just about access, it is also about sharing and reuse — often, to understand material it needs to be analyzed and visualized and this requires that the material be open so that it can be freely used and reused.
  • #73: To assess the quality we consider different dimensions: intrinsic dimensions, which do not depend on the type of data, and domain-dependent dimensions. As intrinsic dimensions we evaluate Accuracy, computed as the percentage of elements with correct values, and Completeness, computed as the percentage of complete elements and the percentage of complete aggregate elements, where an element is considered incorrect or incomplete if it does not meet the specification of its domain or the number of occurrences specified in the XML schema. For the domain-dependent dimensions we evaluate Consistency by defining a set of integrity constraints that strictly depend on the public contracts domain, for example that the amount paid must be less than or equal to the award amount, or that if a public contract does not have a successful tenderer the amount paid must be equal to zero.
  • #74: To conduct the evaluation we selected the first 25 universities of the general 2014 ranking provided by the newspaper Il Sole 24 Ore. Only 12 of them provide summary tables in XML format, for a total of 123702 assessed lots. The remaining 13 universities either do not provide the summary tables or provide them in a format other than XML.
  • #75: Accuracy and completeness were computed for all elements, but we show only the most interesting ones, and only some of the integrity constraints defined to assess consistency. The cig is the unique identifier of a lot. The University of Torino has a cig completeness of 100%, which means that the cig element is present in all analysed lots, but in 32% of cases it is out of domain.
  • #76: The sceltaContraente is one of the most important elements because it specifies the procedure for selecting the contractor, and it can be used by the authorities to detect illegal awards of contracts. High accuracy and completeness improve the transparency of contracts. The completeness of the sceltaContraente element for the University of Milano is 100%, but the percentage of correct elements is 0: the element is present in all the lots provided by the University of Milano, but its value is always empty.
  • #77: The codiceFiscale is the unique identifier of a participant. An interesting aspect is that the University of Milano is not classified, because none of the summary tables it provides contains information about the participants.
  • #78: This result is highlighted by two interrelational constraints: "lot has participant" and "successful tenderer is participant". The first computes the percentage of lots that have a successful tenderer and at least one participant, while the second computes the percentage of cells in which the successful tenderer of a lot is a participant for the same lot. In both cases the percentage for the University of Milano is zero, because the analysed files contain no information on participants. For the University of Milano-Bicocca the percentage for the "lot has participant" constraint is slightly higher than for "successful tenderer is participant", which means that although some lots have participants, the successful tenderer is not one of them.
  • #79: The first intrarelational constraint computes the percentage of lots in which the amount paid is less than or equal to the award amount. We can see that 14% of the lots of the University of Bologna have an amount paid greater than the award amount, which shows that more public money than requested is spent. The successfulTenderer_amountPaid constraint computes the percentage of cells in which there is no information about the successful tenderer but the amount paid is different from zero. For 40% of the lots of the University of Bologna there is no information about the successful tenderer but an amount of money is distributed, so it is not possible to track the money, that is, it is not known who receives it.
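
The integrity constraints described in notes #73, #78 and #79 can be sketched against the lot structure shown in the XML example slide. This is a hedged illustration in Python, not the framework used in the study: the element names are taken from that example, while the function names, the dictionary keys, the decision to treat a missing amount as zero, and the shortened sample lot are assumptions introduced here.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Shortened version of the example lot; only the elements the checks need.
LOT_XML = """<lotto>
  <cig>4421574E47</cig>
  <partecipanti>
    <partecipante><codiceFiscale>06267040019</codiceFiscale></partecipante>
  </partecipanti>
  <aggiudicatari>
    <aggiudicatario><codiceFiscale>06267040019</codiceFiscale></aggiudicatario>
  </aggiudicatari>
  <importoAggiudicazione>7500.00</importoAggiudicazione>
  <importoSommeLiquidate>7500.00</importoSommeLiquidate>
</lotto>"""


def amount(lot: ET.Element, tag: str) -> Decimal:
    """Read a monetary element; assumption: missing or blank means zero."""
    text = (lot.findtext(tag) or "").strip()
    return Decimal(text) if text else Decimal("0")


def fiscal_codes(lot: ET.Element, path: str) -> set:
    """Collect the non-blank codiceFiscale values found under `path`."""
    return {el.text.strip() for el in lot.findall(path) if el.text and el.text.strip()}


def check_lot(lot: ET.Element) -> dict:
    """Evaluate each constraint for one lot (True means the lot satisfies it)."""
    awarded = amount(lot, "importoAggiudicazione")
    paid = amount(lot, "importoSommeLiquidate")
    participants = fiscal_codes(lot, "partecipanti/partecipante/codiceFiscale")
    winners = fiscal_codes(lot, "aggiudicatari/aggiudicatario/codiceFiscale")
    return {
        # Intrarelational: amount paid must not exceed the award amount.
        "paid_le_awarded": paid <= awarded,
        # Intrarelational: without a successful tenderer nothing may be paid.
        "no_winner_implies_zero_paid": bool(winners) or paid == 0,
        # Interrelational: a lot with a winner has at least one participant.
        "lot_has_participant": not winners or bool(participants),
        # Interrelational: every winner is also listed as a participant.
        "winner_is_participant": winners <= participants,
    }


def constraint_ratios(lots) -> dict:
    """Share of lots satisfying each constraint, as plotted per university."""
    results = [check_lot(lot) for lot in lots]
    if not results:
        return {}
    return {name: sum(r[name] for r in results) / len(results) for name in results[0]}


if __name__ == "__main__":
    print(constraint_ratios([ET.fromstring(LOT_XML)]))
```

`Decimal` is used instead of `float` so that comparisons such as 7500.00 against 7500.00 are exact; a real evaluation would also need to handle malformed amounts, which this sketch does not attempt.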