Towards Linked Open
Government Data
Vlad Posea
vlad.posea@cs.pub.ro
about me
• bachelor and PhD from Politehnica University of Bucharest
• Master in Data Mining from University Lumiere of Lyon
• research on competence management, semantic web, e-learning
• business on career management and recruiting
• now a fellow of the Romanian American Foundation at the University of
Rochester (Fulbright scholarship starting in 2017), working on developing
entrepreneurship in Romania
Linked Data and Open Data
• linked data = a way to connect data on the web using URIs and RDF,
the most successful result of the Semantic Web initiative
• open data = data that can be freely used, re-used and
redistributed by anyone - subject only, at most, to the requirement to
attribute and sharealike.
• open government data = data regarding public institutions, published
on governmental sites
Very Short Intro on RDF
• data represented as
statements
• statements contain
• subject
• predicate
• object
• subject and predicate are URIs; the object is
either a URI or a literal value
• URIs are used to uniquely
identify entities or properties
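A statement can be sketched in plain Python as a three-element tuple. This is a minimal illustration, not part of the original slides: the resource URI and the label are made up, and the N-Triples serializer is deliberately simplified (no blank nodes, language tags or typed literals). A library such as rdflib would normally handle this.

```python
# A statement (triple) is modelled as (subject, predicate, object).
# URIs are wrapped in a tiny class so they can be told apart from literals.

class URI(str):
    """Marks a string as a URI rather than a plain literal."""

RDF_TYPE = URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
RDFS_LABEL = URI("http://www.w3.org/2000/01/rdf-schema#label")

# Hypothetical resource URI, used only for illustration.
museum = URI("http://opendata.cs.pub.ro/resource/Example_Museum_Bucuresti")

triples = [
    (museum, RDF_TYPE, URI("http://dbpedia.org/ontology/Museum")),  # object is a URI
    (museum, RDFS_LABEL, "Example Museum"),                          # object is a literal
]

def to_ntriples(statement):
    """Serialize one triple as an N-Triples line (simplified)."""
    s, p, o = statement
    if isinstance(o, URI):
        obj = f"<{o}>"
    else:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{s}> <{p}> {obj} ."

for t in triples:
    print(to_ntriples(t))
```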
Linked Data
http://lod-cloud.net/
Why Do We Need Open Data?
• transparency
• how the government spends money
• fuel innovation and entrepreneurship
• https://www.youtube.com/watch?v=sUqY5ySylXg (Todd Park discussing
benefits of Open Government Data)
• opening weather data and GPS data allowed people to build businesses
• “last year alone civilian and commercial access to GPS created 90 billion $
worth of value” (2013)
• participatory governance
• citizens enabled in decision making
• “making a full read/write society” (http://opengovernmentdata.org/why/)
Open Data Quality
• five stars of open data proposed by Tim Berners-Lee
• (1) be available on the Web under an open licence,
• (2) be in the form of structured data,
• (3) be in a non-proprietary file format,
• (4) use URIs as its identifiers (see also RDF),
• (5) include links to other data sources (see linked data).
http://opendatahandbook.org/glossary/en/terms/five-stars-of-open-data/
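The star scheme is cumulative: a dataset earns a star only if it also meets every earlier criterion. A minimal sketch of that rule follows; the function name and the per-format profiles are assumptions made for illustration, not an official scoring tool.

```python
def open_data_stars(open_licence, structured, non_proprietary, uses_uris, linked):
    """Return the star rating (0-5); each star requires all previous ones."""
    stars = 0
    for criterion in (open_licence, structured, non_proprietary, uses_uris, linked):
        if not criterion:
            break
        stars += 1
    return stars

# Illustrative profiles for common file formats published under an open licence.
FORMAT_PROFILES = {
    "pdf": (True, False, False, False, False),  # on the web, but unstructured
    "xls": (True, True, False, False, False),   # structured, proprietary format
    "csv": (True, True, True, False, False),    # structured, open format
    "rdf": (True, True, True, True, False),     # URIs, but no outgoing links yet
}

for fmt, profile in FORMAT_PROFILES.items():
    print(fmt, "*" * open_data_stars(*profile))
```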
Open Data in the World
• The Global Open Data Index measures how governments publish open
data
• evaluates if a country posted data on
• national statistics
• government budget
• government spending
• legislation
• election results
• national map
• pollution
• companies
• location datasets
• government procurement
• water quality
• weather forecast
• land ownership
• transport timetables
• health performance
• also evaluates the quality of the posted data
Global Open Data Index
• relevant progress has been made in terms of opening data
• scores would be much lower if 5-star data carried more weight
http://index.okfn.org/place/
http://index.okfn.org/methodology/
Open Data in the US
• data.gov – 190k datasets
• HTML is the most common format (70k)
• RDF accounts for less than 5% of the datasets
• more than a quarter are PDF, JPG or TIFF files
• relevant steps
• data.gov launched in 2009
• Open Government Partnership 2011 (http://www.opengovpartnership.org/)
• Digital Accountability and Transparency Act (2014)
• creating publishing standards for public spending data
https://max.gov/maxportal/assets/public/offm/DataStandardsFinal.htm
Open Data in Saint Louis
• https://www.stlouis-mo.gov/data/ - list of data sets
• most of them HTML or PDF
• some publishers confuse open data with reports
Open Data in Romania
• Data.gov.ro
• National portal where public institutions publish their data
• Types of resources published: CSV (***), PDF (*), XLS (**)
• There is no connection between files (no files rated 4 or 5 stars)
• September 2016:
• 72 public institutions
• 8185 files
• Each file can have its own structure
• uses CKAN (http://ckan.org/)
Why do we need Linked Open Data
• classic workflow when working with open data:
• analyze CSV files
• define own data model
• import data from CSV files into data model
• solve import problems (naming differences, character encoding issues)
• identify entities and link them to other entities existing in the model
• link data from different CSV files in a common model
• extract relevant information
• write the program logic to exploit the data
Why Do We Need Linked Open Data
• classic workflow when working with linked open data:
• analyze models
• write query to extract relevant information
• write the program logic to exploit the data
• can use more than one dataset directly by performing “joins” in the
queries
• much faster to develop an application
• much easier to reuse data
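The "join" idea can be sketched in a few lines of Python. When two datasets identify the same town with the same URI, combining them is a plain lookup with no import or reconciliation step. All identifiers below (the `ex:` names and both property names) are invented for illustration; a real application would express the same join in a SPARQL query.

```python
# Two small "datasets" of (subject, predicate, object) triples.
# All identifiers and values below are invented for illustration.
museums = [
    ("ex:Museum_A", "ex:locatedIn", "ex:Town_X"),
    ("ex:Museum_B", "ex:locatedIn", "ex:Town_Y"),
]
towns = [
    ("ex:Town_X", "ex:county", "ex:County_1"),
    ("ex:Town_Y", "ex:county", "ex:County_2"),
]

def museums_by_county(museum_triples, town_triples):
    """Join the two datasets on the shared town URI."""
    county_of = {s: o for s, p, o in town_triples if p == "ex:county"}
    return [
        (museum, county_of[town])
        for museum, p, town in museum_triples
        if p == "ex:locatedIn" and town in county_of
    ]

print(museums_by_county(museums, towns))
```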
Linked Open Data in Romania
• Our goal is to transform open data from Romania into Linked Open
Data.
• Transform data into RDF triples (Subject, Predicate, Object)
• Link entities with existing online resources, especially from dbpedia.org and
Geonames
• Create a platform where each published file is transformed into RDF
• Create rich applications using SPARQL queries
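Turning a tabular file into triples can be sketched as below. This is an illustrative outline rather than the project's converter (which, per the technologies slide, is written in Java): the column names, the sample row and the column-to-predicate mapping are assumptions.

```python
import csv
import io

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
VCARD = "http://www.w3.org/2006/vcard/ns#"

# Hypothetical mapping from CSV column names to predicate URIs.
COLUMN_TO_PREDICATE = {
    "locality": VCARD + "locality",
    "lat": GEO + "lat",
    "long": GEO + "long",
}

def rows_to_triples(csv_text, id_column, base="http://opendata.cs.pub.ro/resource/"):
    """Turn each CSV row into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = base + row[id_column]
        for column, predicate in COLUMN_TO_PREDICATE.items():
            value = (row.get(column) or "").strip()
            if value:  # skip empty or missing cells
                triples.append((subject, predicate, value))
    return triples

# Made-up sample row, for illustration only.
sample = "id,locality,lat,long\nExample_Pharmacy,Example Town,44.43,26.10\n"
for triple in rows_to_triples(sample, "id"):
    print(triple)
```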
Vision
• create tools and workflows to allow non-technical users to add Linked
Data to the government website
• offer an API for developers who want to create apps based on open
government data
• integrate the software into CKAN (the open data portal used by most
governments) to allow every government to create linked data
Stages
1. modeling data
2. massively transforming data
3. linking data to external data sets
4. embedding into CKAN
First Stage – Modeling Data
• Identify the most common ontologies used
• Create naming rules so that the same resource always gets the same
URI
• Identify the most common properties of the open data and the
ontological properties associated with them
• Identify the most common naming problems and write hacks to solve them
• different encodings
• different spelling
• different lexicalization of the same concepts
Open Data Types
• Numerical data:
• Different budgets or revenues
• Different statistical data: number of cars per type and year, number of beds per hospital
• Etc.
• “Plain” data:
• Information about entities:
• Lawyers, Schools, Pharmacies, Museums, Archeological sites
• Etc.
• Found in tabular files, such as CSV or XLS
Most Common Vocabularies for Open Data
https://lov.okfn.org/dataset/lov/
Most Common Vocabularies for Open Data
• Dublin Core (DCTerms, DCE) – describes metadata terms
(http://dublincore.org/schemas/)
• SKOS – Simple Knowledge Organization System – representing
taxonomies
• FOAF – Friend of a Friend – representing people and the relations
between them
• CC – Creative Commons – copyright information
• GEO – Geonames – data about locations
• VANN – data about vocabularies
• DBPedia –
Ontologies used
• We mainly used OWL classes defined by DBpedia, such as:
• http://dbpedia.org/ontology/Location
• http://dbpedia.org/ontology/Place
• http://dbpedia.org/ontology/PopulatedPlace
• http://dbpedia.org/ontology/Museum
• http://dbpedia.org/ontology/Hospital
• http://dbpedia.org/ontology/EducationalInstitution
• http://dbpedia.org/class/yago/CitiesInRomania
• Other OWL classes used:
• http://umbel.org/umbel/rc/Village
• https://schema.org/PostalAddress
Naming rules for creating URIs
• Each URI has the prefix: http://opendata.cs.pub.ro/resource
• Our goal is to make the URI of each resource as easy to understand as
possible for humans
• Our principle: once you read the URI, you know what it is about
• For example:
• Locality: http://opendata.cs.pub.ro/resource/<localityName>_judet_<localityCounty>
• Hospital: http://opendata.cs.pub.ro/resource/<hospitalName>_hospital_<localityCounty>
Most common properties
• Most used properties were taken from well-known vocabularies such
as:
• VCARD: vcard:region, vcard:locality
• FOAF: foaf:mbox, foaf:fax
• GEO: geo:lat, geo:long
• Other properties were taken from those defined by dbpedia.org:
• http://dbpedia.org/property/postcode
• http://dbpedia.org/property/phonenumber
• We also defined properties in our own namespace:
• http://opendata.cs.pub.ro/property
Most common naming problems and how we
solve them
• Resource names written with diacritics
• Replace each diacritic with the corresponding plain letter
• Resource names containing non-alphanumeric characters, such as
space, hyphen, comma
• Replace them with underscores
• After choosing the initial naming convention, we found that conflicts
can occur
• For example: we initially built museum URIs from the museum name alone,
but museums in different towns can share a name, so we added the museum’s
town to the URI
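The naming rules above can be sketched as a small helper. This is an assumed implementation, not the project's code: the function names are hypothetical, and collapsing a run of several non-alphanumeric characters into a single underscore is a choice made here rather than something the slides specify. In the URI pattern, "judet" is Romanian for "county".

```python
import re
import unicodedata

BASE = "http://opendata.cs.pub.ro/resource/"

def slug(text):
    """Replace diacritics with plain letters and other characters with '_'."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by decomposition (e.g. the comma under 's').
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_")

def locality_uri(name, county):
    """Build a locality URI following <localityName>_judet_<localityCounty>."""
    return f"{BASE}{slug(name)}_judet_{slug(county)}"

print(locality_uri("Târgu Mureș", "Mureș"))
```

Two localities with the same name in different counties get different URIs because the county is part of the identifier, which is the same reasoning the slides give for adding the town to museum URIs.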
Linked Open Data in Romania
Stage 2: massively transforming data
• experiment with 20 students from a master’s class
• groups of 2 asked to choose 2 datasets and transform them
• + greatly increased the amount of data transformed
• - a large amount of work to correct the errors introduced by students
• have to involve a larger number of volunteers
• students will be asked to offer expert support
• 2016 result: 10 new datasets, more than 500,000 triples added
Biggest problem: PDF files
• Unfortunately, there are tabular
data hidden in scanned PDF files
• We created an algorithm to extract
only the tables from these scanned
files
• This way, we transform the
unmanageable scanned files into
tabular ones
• We want to improve existing tools
using contextual information
regarding the type of document
Stage 3 Linking to external data sets
• most important datasets:
• dbpedia
• people
• events
• places
• geonames
• all the places
How to link?
• query and disambiguate
• sometimes really difficult
• disambiguation
• by type
• by context
• not always possible automatically
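A minimal sketch of disambiguation by type and then by context might look like this. The candidate records, their field names and the use of the county as context are assumptions made for illustration; real candidates would come from a DBpedia or GeoNames lookup.

```python
def disambiguate(name, expected_type, context, candidates):
    """Pick the single candidate matching name, type and context, else None."""
    matches = [
        c for c in candidates
        if c["name"].lower() == name.lower() and c["type"] == expected_type
    ]
    if len(matches) > 1:
        # Still ambiguous: narrow down by context (here, the county).
        matches = [c for c in matches if c.get("county") == context]
    # Return a link only when exactly one candidate is left.
    return matches[0]["uri"] if len(matches) == 1 else None

# Made-up candidates, as an external lookup might return them.
candidates = [
    {"uri": "ex:1", "name": "Example Town", "type": "PopulatedPlace", "county": "County A"},
    {"uri": "ex:2", "name": "Example Town", "type": "PopulatedPlace", "county": "County B"},
    {"uri": "ex:3", "name": "Example Town", "type": "Museum", "county": "County A"},
]

print(disambiguate("example town", "PopulatedPlace", "County B", candidates))
```

Returning `None` when no single candidate remains reflects the last bullet: cases that cannot be resolved automatically are left for manual review rather than linked wrongly.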
Relevant tool: SILK
• http://silkframework.org/
• Generating links between related data items within different Linked Data
sources.
• Linked Data publishers can use Silk to set RDF links from their data sources to
other data sources on the Web.
• Applying data transformations to structured data sources.
or write some code
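import rdflib
from rdflib.namespace import OWL, RDF
from urllib.parse import quote
gn = rdflib.Namespace("http://www.geonames.org/ontology#")
# loc (a raw bytes line), locURI, fullGraph and strip_accents come from the rest of the script
geoloc = strip_accents(loc.decode("utf8")[:-1])  # drop the trailing newline
query = "http://api.geonames.org/search?q=%s&maxRows=1&type=rdf&username=vladposea" % quote(geoloc)
g = rdflib.Graph()
g.parse(query, format="xml")
# link the local resource to the GeoNames feature that was returned
for s, p, o in g.triples((None, RDF.type, gn.Feature)):
    fullGraph.add((locURI, OWL.sameAs, s))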
<rdf:RDF >
<gn:Feature rdf:about="http://sws.geonames.org/686254/">
...
Stage 4 – Embed into CKAN
• CKAN - http://ckan.org/
• tool for publishing data
• aimed at governments and other public organizations
• specially designed for open data
• used internationally
• not built for linked data
• we envision developing a plugin to semi-automatically construct
linked data from the open data published
What do we have so far?
• Our focus was on “plain” data:
• Cities dataset published in RDF and linked with geonames.org
• Each created resource has a <owl:sameAs> property that links to geonames.org
• Schools dataset published in RDF
• Pharmacies dataset published in RDF
• Museums dataset published in RDF and linked with dbpedia.org
• Churches dataset published in RDF and linked with dbpedia.org
• 207,382 URIs with 2,683,968 RDF triples overall
How did we transform the data?
• For each dataset:
• We identified what vocabulary should be used for each property
• We identified what additional properties should be created for each resource
• Each physical entity has an address; using the Google Geocode service we obtained the
geographical coordinates for that address
• We created one unique URI for each resource
• We generated the URIs by putting a lot of information inside them; for example, the URI
for one school is: http://opendata.cs.pub.ro/resource/<school_name>_<city>
• We opted for this encoding scheme to create descriptive URIs, not just hashes
• We linked each possible resource using online semantic repositories, such as
dbpedia.org and geonames.org
• The linking is done by searching entities with the same type and name
How can someone access the resources?
• We have published all RDF triples in a semantic repository:
• http://opendata.cs.pub.ro/repo
• It supports SPARQL queries
• http://opendata.cs.pub.ro/repo/sparql
• We document all published datasets in:
• Blog: http://opendata.cs.pub.ro/blog
• Wiki: http://opendata.cs.pub.ro/wiki
SPARQL queries
Towns where there are no schools
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?loc
WHERE {
  ?loc rdf:type <http://dbpedia.org/ontology/Settlement> .
  FILTER NOT EXISTS {
    ?x <http://opendata.cs.pub.ro/property/institutie_in_localitate> ?loc .
  }
}
Find the museums linked with dbpedia.org
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?MusRO ?MusDB
WHERE {
  ?MusRO rdf:type <http://dbpedia.org/ontology/Museum> .
  ?MusRO owl:sameAs ?MusDB .
}
ORDER BY ?MusRO
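Queries like these can be sent from a program over the standard SPARQL protocol. The sketch below uses only the Python standard library and assumes the endpoint listed in the slides accepts GET requests with a `query` parameter and returns SPARQL JSON results; that behaviour has not been verified against the live server, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://opendata.cs.pub.ro/repo/sparql"  # endpoint named in the slides

MUSEUM_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?MusRO ?MusDB
WHERE {
  ?MusRO rdf:type <http://dbpedia.org/ontology/Museum> .
  ?MusRO owl:sameAs ?MusDB .
}
ORDER BY ?MusRO
"""

def build_request(query, endpoint=ENDPOINT):
    """Create a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

def parse_bindings(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]

def run_query(query, endpoint=ENDPOINT):
    """Send the query over the network and return the parsed rows."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling `run_query(MUSEUM_QUERY)` would return one dictionary per museum, with the keys `MusRO` and `MusDB`, provided the endpoint is reachable.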
Example application
• All physical entities have an address and we obtained the
geographical coordinates of this address.
• We put all these entities on a map, so users can see the museums,
hospitals or pharmacies nearest to their location
• The app is online:
• http://opendata.cs.pub.ro:3000
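Finding the nearest entities from stored coordinates can be sketched with the haversine great-circle distance. This is an assumed approach for illustration; the slides do not say how the application ranks results, and the entity records below are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest(entities, lat, lon, limit=3):
    """Return the `limit` entities closest to the given position."""
    return sorted(
        entities,
        key=lambda e: haversine_km(lat, lon, e["lat"], e["long"]),
    )[:limit]

# Made-up entities with geo:lat / geo:long style coordinates.
entities = [
    {"name": "A", "lat": 44.0, "long": 26.0},
    {"name": "B", "lat": 45.0, "long": 26.0},
    {"name": "C", "lat": 44.1, "long": 26.0},
]

print([e["name"] for e in nearest(entities, 44.0, 26.0, limit=2)])
```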
Linked Open Data in Romania
Technologies used in this project
• Storage layer:
• Apache Marmotta HEAD version
• Processing layer:
• Java using Apache POI for reading tabular data and Apache Jena for
converting data to RDF
• C with OpenCV and Tesseract for extracting tabular data from scanned PDF
files
• Visualization layer:
• Backend: node.js using sparql-client module for SPARQL queries
• Frontend: angular.js
Alternative Technologies
• OpenRefine
• http://openrefine.org/
• formerly Google Refine
• allows you to
• explore data in various formats
• clean and transform data (clustering, easy or scripted transformations)
• reconcile and match data
• supports external web services
Karma
• semantic mapping tool http://labs.europeana.eu/apps/karma
• imports data in various formats
• transforms it to semantic data
• links it to DBpedia or GeoNames
• no features for statistical data integration
• no features for parsing pdf files
Named Entity Recognition
• Named Entity Recognition – identify entities in texts, apply tags, link
to permanent entities
• Open Calais – up to 5k free requests/day
• http://www.opencalais.com/
• Alchemy – made by IBM
• http://www.alchemyapi.com/
• 1k/day free
Apache Marmotta
• http://marmotta.apache.org/
• read-write Linked Data server
• open implementation of W3C’s
Linked Data Platform
Recommendation
https://www.w3.org/TR/ldp/
• repository
• SPARQL 1.1 engine
RDF Data Cube Vocabulary
• statistical data can’t be expressed using just subject, predicate and
object
• RDF – graph
• statistical data – hypergraph
• RDF Data Cube https://www.w3.org/TR/vocab-data-cube -
recommendation for a vocabulary to describe multi-dimensional data
• compatible with Statistical Data and Metadata eXchange - SDMX
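A value such as "number of beds in a hospital in a given year" depends on several dimensions at once, so it does not fit in a single triple. The Data Cube vocabulary handles this by introducing an observation node that carries the dimensions and the measure. The sketch below shows the shape of such an observation as plain tuples; `qb:Observation` and `qb:dataSet` are terms from the vocabulary, while the dataset, the dimension and measure properties and the value are hypothetical.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://opendata.cs.pub.ro/"

def observation_triples(obs_id, dataset, dimensions, measure, value):
    """Describe one statistical value as a qb:Observation node."""
    obs = f"{BASE}resource/{obs_id}"
    triples = [
        (obs, RDF_TYPE, QB + "Observation"),
        (obs, QB + "dataSet", dataset),
    ]
    # One triple per dimension (e.g. which hospital, which year).
    triples += [(obs, prop, val) for prop, val in dimensions.items()]
    # The measured value itself.
    triples.append((obs, measure, value))
    return triples

# Hypothetical dataset, properties and value, for illustration only.
example = observation_triples(
    obs_id="beds_Example_Hospital_2015",
    dataset=BASE + "resource/hospital_beds",
    dimensions={
        BASE + "property/hospital": BASE + "resource/Example_Hospital",
        BASE + "property/year": "2015",
    },
    measure=BASE + "property/number_of_beds",
    value=120,
)

for triple in example:
    print(triple)
```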
Plan for the future
• Develop an automated way to choose the vocabulary for one dataset
• Focus on statistical data and publish them using RDF Data Cube
vocabulary
• Develop a more accurate method of linking resources
• Create more applications that use the published data
Papers
• LODRo: Using cultural Romanian open data to build new learning
applications
Octavian Rinciog, Vlad Posea, The International Scientific Conference eLearning and
Software for Education, Bucharest, 2016
• Publishing Romanian public health data as Linked Open Data
Octavian Rinciog, Vlad Posea, E-Health and Bioengineering Conference (EHB), Iasi,
2015
• The Semantic Representation of Open Data Regarding the Romanian
Companies
Marian Spoiala, Octavian Rinciog, Vlad Posea, RoEDU Conference, Bucharest, 2016
• GovLOD: Towards a Linked Open Data Portal
Octavian Rinciog, Vlad Posea, Poster in ISWC Conference, Tokyo, 2016
instead of references
http://www.ted.com/talks/tim_berners_lee_on_the_next_web

Linked Open Data in Romania

  • 2. about me • bachelor and PhD from Politehnica University of Bucharest • Master in Data Mining from University Lumiere of Lyon • research on competence management, semantic web, e-learning • business on career management and recruiting • now fellow of the Romanian American Foundation at the University of Rochester (Fulbright scholarship starting with 2017) for developing entrepreneurship in Romania
  • 3. Linked Data and Open Data • linked data = a way to connect data on the web using URIs and RDF, the most successful result of the Semantic Web initiative • open data = Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike. • open government data = data regarding public institutions, published on governmental sites
  • 4. Very Short Intro on RDF • data represented as statements • statements contain • subject • predicate • object • subject, predicate and sometimes objects are URIs • URIs are used to uniquely identify entities or properties
  • 6. Why Do We Need Open Data? • transparency • how does the government spend money • fuel innovation and entrepreneurship • https://www.youtube.com/watch?v=sUqY5ySylXg (Todd Park discussing benefits of Open Government Data) • opening weather data and GPS data allowed people to build businesses • “last year alone civilian and commercial access to GPS created 90 billion $ worth of value” (2013) • participatory governance • citizens enabled in decision making • “making a full read/write society” (http://opengovernmentdata.org/why/)
  • 7. Open Data Quality • five stars of open data proposed by Tim Berners Lee • (1) be available on the Web under an open licence, • (2) be in the form of structured data, • (3) be in a non-proprietary file format, • (4) use URIs as itsidentifiers (see also RDF), • (5) include links to other data sources (see linked data). http://opendatahandbook.org/glossary/en/terms/five-stars-of-open-data/
  • 8. Open Data in the World • Global Open Data Quality measures how governments implement Open Data • evaluates if a country posted data on • national statistics • government budget • government spending • legislation • election results • national map • pollution • also evaluates the quality of the posted data • companies • location datasets • government procurement • water quality • weather forecast • land ownership • transport timetables • health performance
  • 9. Global Open Data Quality relevant progress has been made in terms of opening data scores would be much lower if 5 star data would have a bigger weight http://index.okfn.org/place/ http://index.okfn.org/methodology/
  • 10. Open Data in the US • data.gov – 190k datasets • mostly html (70k) • RDF below 5% of the total number of datasets • more than a quarter are either pdf, jpg, tiff • relevant steps • data.gov launched in 2009 • Open Government Partnership 2011 (http://www.opengovpartnership.org/) • Digital Accountability and Transparency Act (2014) • creating publishing standards for public spending data https://max.gov/maxportal/assets/public/offm/DataStandardsFinal.htm
  • 11. Open Data in Saint Louis • https://www.stlouis-mo.gov/data/ - list of data sets • most of them html or pdf • some confuse open data with reports
  • 12. Open Data in Romania • Data.gov.ro • National portal where public institutions put all the data • Types of resources published: CSV (***), PDF(*), XLS (**) • There is no connection between files (zero files with 4 or 5 *) • September 2016: • 72 public institutions • 8185 files • Each file can have its own structure • uses CKAN (http://ckan.org/)
  • 13. Why do we need Linked Open Data • classic workflow when working with open data: • analyze CSV files • define own data model • import data from CSV files into data model • solve import problems (naming differences, character encoding issues) • identify entities and link them to other entities existing in the model • link data from different CSV files in a common model • extract relevant information • write the program logic to exploit the data
  • 14. Why Do We Need Linked Open Data • classic workflow when working with linked open data: • analyze models • write query to extract relevant information • write the program logic to exploit the data • can directly use more than one dataset by performing “joins” in the queries • much faster to develop an application • much easier to reuse data
  • 15. Linked Open Data in Romania • Our goal is to transform open data from Romania into Linked Open Data. • Transform data into RDF triples (Subject, Predicate, Object) • Link entities with existing online resources, especially from dbpedia.org and Geonames • Create a platform where each published file is transformed into RDF • Create rich applications using SPARQL queries
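A minimal sketch of the "transform data into RDF triples" step, using only the Python standard library and emitting N-Triples. The CSV columns (`name`, `county`), the property names under `http://opendata.cs.pub.ro/property/` and the sample row are assumptions made for illustration; they are not the project's actual schema.

```python
import csv
import io

BASE = "http://opendata.cs.pub.ro/resource/"
PROP = "http://opendata.cs.pub.ro/property/"  # property names below are hypothetical
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MUSEUM = "http://dbpedia.org/ontology/Museum"


def literal(value):
    """Escape a string as a plain N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def row_to_triples(row):
    """Turn one CSV row into a list of N-Triples lines."""
    subject = "<%s%s>" % (BASE, row["name"].replace(" ", "_"))
    return [
        "%s <%s> <%s> ." % (subject, RDF_TYPE, MUSEUM),
        "%s <%s> %s ." % (subject, PROP + "name", literal(row["name"])),
        "%s <%s> %s ." % (subject, PROP + "county", literal(row["county"])),
    ]


def csv_to_ntriples(text):
    """Convert CSV text with a header row into N-Triples text."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        lines.extend(row_to_triples(row))
    return "\n".join(lines)


sample = "name,county\nMuzeul Satului,Bucuresti\n"
print(csv_to_ntriples(sample))
```

Each row yields one subject URI, one `rdf:type` statement and one statement per column, which is the (Subject, Predicate, Object) shape described on the slide.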
  • 16. Vision • create tools and workflows to allow non-technical users to add Linked Data to the government website • offer an API for developers who want to create apps based on open government data • integrate the software into CKAN (the open data portal used by most governments) to allow every government to create linked data
  • 17. Stages 1. modeling data 2. massively transforming data 3. linking data to external data sets 4. embedding into CKAN
  • 18. First Stage – Modeling Data • Identify the most common ontologies used • Create naming rules so that the same resource always gets the same URI • Identify the most common properties of the open data and the ontological properties associated with them • Identify the most common naming problems and write hacks to solve them • different encodings • different spelling • different lexicalization of the same concepts
  • 19. Open Data Types • Numerical data: • Different budgets or revenues • Different statistical data: number of cars/type/year, number of beds/hospital • Etc. • “Plain” data: • Information about entities: • Lawyers, Schools, Pharmacies, Museums, Archeological sites • Etc. • Found in tabular files, such as CSV or XLS
  • 20. Most Common Vocabularies for Open Data https://lov.okfn.org/dataset/lov/
  • 21. Most Common Vocabularies for Open Data • Dublin Core (DCTerms, DCE) – describes metadata terms (http://dublincore.org/schemas/) • SKOS – Simple Knowledge Organization System – representing taxonomies • FOAF – Friend of a Friend – representing people and the relations between them • CC – Creative Commons – copyright information • GEO – Geonames – data about locations • VANN – data about vocabularies • DBPedia –
  • 22. Ontologies used • We mainly used OWL classes defined by dbpedia, such as: • http://dbpedia.org/ontology/Location • http://dbpedia.org/ontology/Place • http://dbpedia.org/ontology/PopulatedPlace • http://dbpedia.org/ontology/Museum • http://dbpedia.org/ontology/Hospital • http://dbpedia.org/ontology/EducationalInstitution • http://dbpedia.org/class/yago/CitiesInRomania • Other OWL classes used: • http://umbel.org/umbel/rc/Village • https://schema.org/PostalAddress
  • 23. Naming rules for creating URIs • Each URI has the prefix: http://opendata.cs.pub.ro/resource • Our goal is to make the URI of each resource as easy as possible for humans to understand • Our principle is: once you read the URI, you know what it is about • For example: • Locality: http://opendata.cs.pub.ro/resource/<localityName>_judet_<localityCounty> • Hospital: http://opendata.cs.pub.ro/resource/<hospitalName>_hospital_<localityCounty>
  • 24. Most common properties • Most used properties were taken from well-known vocabularies such as: • VCARD: vcard:region, vcard:locality • FOAF: foaf:mbox, foaf:fax • GEO: geo:lat, geo:long • Other properties were taken from those defined by dbpedia.org: • http://dbpedia.org/property/postcode • http://dbpedia.org/property/phonenumber • We also defined properties in our own namespace: • http://opendata.cs.pub.ro/property
  • 25. Most common naming problems and how we solve them • Resource’s name was written with diacritics • Replace diacritics with the corresponding plain letters • Resource’s name was written with non-alphanumeric characters, such as: space, hyphen, comma • Replace them with underscores • After initially choosing the naming convention, we saw that there can be some conflicts • For example: we initially built the URI for a museum from the museum’s name only, but museums with the same name can exist in multiple towns, so we also added the museum’s town to the URI
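The two replacement rules and the museum fix can be sketched as follows. This is an illustration under stated assumptions, not the project's code: collapsing runs of separators into a single underscore and the exact `<name>_<town>` pattern for museums are guesses, and the example names are made up.

```python
import re
import unicodedata

PREFIX = "http://opendata.cs.pub.ro/resource/"


def slug(name):
    """Replace diacritics with plain letters and other characters with '_'."""
    decomposed = unicodedata.normalize("NFD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: a run of non-alphanumeric characters becomes one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", plain).strip("_")


def museum_uri(name, town):
    """Include the town so that same-named museums do not collide."""
    return PREFIX + slug(name) + "_" + slug(town)


print(museum_uri("Muzeul de Art\u0103", "Bra\u0219ov"))
print(museum_uri("Muzeul de Art\u0103", "Cluj-Napoca"))
```

Two museums with the same name now receive different URIs because the town is part of the identifier.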
  • 28. Stage 2: massively transforming data • experiment with 20 students from a master’s class • groups of 2 asked to choose 2 datasets and transform them • + greatly increased the amount of data transformed • - large amount of work to correct the errors introduced by students • have to involve a larger number of volunteers • students will be asked to offer expert support • 2016 result: 10 new datasets, more than 500,000 triples added
  • 29. Biggest problem: PDF files • Unfortunately, there is tabular data hidden in scanned PDF files • We created an algorithm to extract only the tables from these scanned files • This way, we transform the unmanageable scanned files into tabular ones • We want to improve existing tools using contextual information regarding the type of document
  • 30. Stage 3: Linking to external data sets • most important datasets: • dbpedia • people • events • places • geonames • all the places
  • 31. How to link? • query and disambiguate • sometimes really difficult • disambiguation • by type • by context • not always possible automatically
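A rough sketch of "disambiguate by type, then by context". The candidate records, their field names and the use of the county as context are hypothetical; a real candidate list would come from a name search against dbpedia or geonames.

```python
def disambiguate(candidates, wanted_type, context_county):
    """Return the single matching URI, or None if the choice is ambiguous."""
    by_type = [c for c in candidates if c["type"] == wanted_type]
    by_context = [c for c in by_type if c["county"] == context_county]
    if len(by_context) == 1:
        return by_context[0]["uri"]
    return None  # zero or several matches: leave for manual review


# Hypothetical candidates returned by a name search
candidates = [
    {"uri": "http://example.org/place/1", "type": "PopulatedPlace", "county": "Cluj"},
    {"uri": "http://example.org/place/2", "type": "PopulatedPlace", "county": "Iasi"},
    {"uri": "http://example.org/river/3", "type": "River", "county": "Cluj"},
]
print(disambiguate(candidates, "PopulatedPlace", "Cluj"))
```

Returning `None` instead of guessing reflects the slide's point that automatic disambiguation is not always possible.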
  • 32. Relevant tool: SILK • http://silkframework.org/ • Generating links between related data items within different Linked Data sources. • Linked Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. • Applying data transformations to structured data sources.
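  • 33. or write some code
      import unicodedata
      from urllib.parse import quote
      import rdflib
      from rdflib.namespace import OWL, RDF

      gn = rdflib.Namespace("http://www.geonames.org/ontology#")

      def strip_accents(text):  # assumed helper: "Brașov" -> "Brasov"
          return "".join(c for c in unicodedata.normalize("NFD", text) if not unicodedata.combining(c))

      # loc (place name read from the input), locURI and fullGraph come from the surrounding script
      geoloc = strip_accents(loc.rstrip("\n"))
      query = "http://api.geonames.org/search?q=%s&maxRows=1&type=rdf&username=vladposea" % quote(geoloc)
      g = rdflib.Graph()
      g.parse(query, format="xml")  # GeoNames answers with RDF/XML
      # link our resource to every gn:Feature returned (at most one, since maxRows=1)
      for s, p, o in g.triples((None, RDF.type, gn.Feature)):
          fullGraph.add((locURI, OWL.sameAs, s))

      Sample response:
      <rdf:RDF>
        <gn:Feature rdf:about="http://sws.geonames.org/686254/"> ...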
  • 34. Stage 4 – Embed into CKAN • CKAN - http://ckan.org/ • tool for publishing data • aimed at governments and other public organizations • specially designed for open data • used internationally • not built for linked data • we envision developing a plugin to semi-automatically construct linked data from the open data published
  • 35. What do we have so far? • Our focus was on “plain” data: • Cities dataset published in RDF and linked with geonames.org • Each created resource has an <owl:sameAs> property that links to geonames.org • Schools dataset published in RDF • Pharmacies dataset published in RDF • Museums dataset published in RDF and linked with dbpedia.org • Churches dataset published in RDF and linked with dbpedia.org • 207,382 URIs with 2,683,968 RDF triples overall
  • 36. How did we transform the data? • For each dataset: • We identified what vocabulary should be used for each property • We identified what additional properties should be created for each resource • Each physical entity has an address, and we obtained the geographical coordinates for that address using the Google Geocode service • We created one unique URI for each resource • We generated the URIs by putting a lot of information inside them; for example, the URI for one school is: http://opendata.cs.pub.ro/resource/<school_name>_<city> • We opted for this encoding scheme to create more verbose URIs, not just hashes • We linked each possible resource using online semantic repositories, such as dbpedia.org and geonames.org • The linking is done by searching for entities with the same type and name
  • 37. How can someone access the resources? • We have published all RDF triples in a semantic repository: • http://opendata.cs.pub.ro/repo • It supports SPARQL queries • http://opendata.cs.pub.ro/repo/sparql • We document all published datasets on: • Blog: http://opendata.cs.pub.ro/blog • Wiki: http://opendata.cs.pub.ro/wiki
  • 38. SPARQL queries
      Towns where there are no schools:
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT ?loc WHERE {
          ?loc rdf:type <http://dbpedia.org/ontology/Settlement> .
          FILTER NOT EXISTS {
            ?x <http://opendata.cs.pub.ro/property/institutie_in_localitate> ?loc .
          }
        }
      Find the museums linked with dbpedia.org:
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?MusRO ?MusDB WHERE {
          ?MusRO rdf:type <http://dbpedia.org/ontology/Museum> .
          ?MusRO owl:sameAs ?MusDB .
        }
        ORDER BY ?MusRO
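A sketch of how a client might send the museum query over HTTP, following the SPARQL 1.1 Protocol convention of passing the query text in a `query` parameter. It uses only the standard library. The endpoint URL is the one given in the slides; whether it is still online and returns JSON results is an assumption, so the example only builds the request and leaves `run_query` uncalled.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://opendata.cs.pub.ro/repo/sparql"  # from the slides; availability not verified

MUSEUMS_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?MusRO ?MusDB WHERE {
  ?MusRO rdf:type <http://dbpedia.org/ontology/Museum> .
  ?MusRO owl:sameAs ?MusDB .
} ORDER BY ?MusRO
"""


def build_request(endpoint, query):
    """Build a GET request that asks the endpoint for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint, query):
    """Send the query and return the list of result bindings (needs network)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(ENDPOINT, MUSEUMS_QUERY)
print(request.full_url[:60])
```

In the standard JSON results format, each binding returned by `run_query` maps a variable name such as `MusRO` to a dictionary whose `value` key holds the URI.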
  • 39. Example application • All physical entities have an address and we obtained the geographical coordinates of this address. • We put all these entities on a map, so users can see the museums, hospitals or pharmacies nearest to their location • The app is online: • http://opendata.cs.pub.ro:3000
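The "nearest entities" lookup can be approximated with the haversine formula over the stored coordinates. This is only a sketch: the entity records and coordinates are invented, the `lat`/`long` keys simply mirror the `geo:lat` and `geo:long` properties, and the actual application may compute distances differently.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def nearest(entities, lat, lon, limit=3):
    """Return the `limit` entities closest to (lat, lon)."""
    ranked = sorted(entities, key=lambda e: haversine_km(lat, lon, e["lat"], e["long"]))
    return ranked[:limit]


# Hypothetical entities with made-up coordinates
entities = [
    {"name": "Museum A", "lat": 44.43, "long": 26.10},
    {"name": "Pharmacy B", "lat": 46.77, "long": 23.59},
    {"name": "Hospital C", "lat": 44.45, "long": 26.08},
]
for entity in nearest(entities, 44.44, 26.10, limit=2):
    print(entity["name"])
```

Sorting the whole list is fine for a few thousand entities; a spatial index would be a reasonable next step for larger datasets.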
  • 41. Technologies used in this project • Storage layer: • Apache Marmotta HEAD version • Processing layer: • Java using Apache POI for reading tabular data and Apache Jena for converting data to RDF • C with OpenCV and Tesseract for extracting tabular data from scanned PDF files • Visualization layer: • Backend: node.js using sparql-client module for SPARQL queries • Frontend: angular.js
  • 42. Alternative Technologies • Open Refine • http://openrefine.org/ • formerly Google Refine • allows users to • explore data in various formats • clean and transform data (clustering, easy or scripted transformations) • reconcile and match data • supports external web services
  • 43. Karma • semantic mapping tool http://labs.europeana.eu/apps/karma • imports data in various formats • transforms it to semantic data • links it to DBPedia or GeoNames • no features for statistical data integration • no features for parsing pdf files
  • 44. Named Entity Recognition • Named Entity Recognition – identify entities in texts, apply tags, link to permanent entities • Open Calais – up to 5k free requests/day • http://www.opencalais.com/ • Alchemy – made by IBM • http://www.alchemyapi.com/ • 1k/day free
  • 45. Apache Marmotta • http://marmotta.apache.org/ • read-write linked data server • open implementation of W3C’s Linked Data Platform Recommendation https://www.w3.org/TR/ldp/ • repository • SPARQL 1.1 engine
  • 46. RDF Data Cube Vocabulary • statistical data can’t be expressed using just subject, predicate and object • RDF – graph • statistical data – hypergraph • RDF Data Cube https://www.w3.org/TR/vocab-data-cube - recommendation for a vocabulary to describe multi-dimensional data • compatible with Statistical Data and Metadata eXchange - SDMX
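To illustrate why one statistical value needs more than one triple, the sketch below writes a single Data Cube observation as N-Triples. The `qb:` and SDMX namespaces are those of the W3C recommendation; the dataset name, the area, the year and the value are invented, and a complete cube would also need a `qb:DataStructureDefinition`, which is omitted here.

```python
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://opendata.cs.pub.ro/resource/"


def observation(dataset, area, year, value):
    """Return N-Triples lines for one observation of `dataset` in `area` for `year`."""
    obs = "<%s%s_%s_%d>" % (BASE, dataset, area, year)
    return [
        "%s <%s> <%sObservation> ." % (obs, RDF_TYPE, QB),
        "%s <%sdataSet> <%s%s> ." % (obs, QB, BASE, dataset),
        "%s <%srefArea> <%s%s> ." % (obs, SDMX_DIM, BASE, area),
        '%s <%srefPeriod> "%d"^^<%sgYear> .' % (obs, SDMX_DIM, year, XSD),
        '%s <%sobsValue> "%d"^^<%sinteger> .' % (obs, SDMX_MEASURE, value, XSD),
    ]


# Hypothetical figures: 1000 hospital beds in Cluj in 2015
print("\n".join(observation("hospital_beds", "Cluj", 2015, 1000)))
```

The observation node ties the area, the period and the measured value together, which is the role a hyperedge would play in the slide's "hypergraph" view.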
  • 47. Plan for the future • Develop an automated way to choose the vocabulary for a dataset • Focus on statistical data and publish it using the RDF Data Cube vocabulary • Develop a more accurate method of linking resources • Create more applications that use the published data
  • 48. Papers • LODRo: Using cultural Romanian open data to build new learning applications. Octavian Rinciog, Vlad Posea. The International Scientific Conference eLearning and Software for Education, Bucharest, 2016 • Publishing Romanian public health data as Linked Open Data. Octavian Rinciog, Vlad Posea. E-Health and Bioengineering Conference (EHB), Iasi, 2015 • The Semantic Representation of Open Data Regarding the Romanian Companies. Marian Spoiala, Octavian Rinciog, Vlad Posea. RoEDU Conference, Bucharest, 2016 • GovLOD: Towards a Linked Open Data Portal. Octavian Rinciog, Vlad Posea. Poster in ISWC Conference, Tokyo, 2016