Hooking up Semantic MediaWiki with external tools via SPARQL - Samuel Lampa
This document discusses integrating Semantic MediaWiki (SMW) with external tools using the RDFIO extension. It describes the motivation for RDFIO as allowing manual schema exploration, automated data generation, and community collaboration. RDFIO solves problems with SMW by allowing the choice of wiki page titles for RDF entities and exporting RDF in the original import format. Real-world uses of RDFIO include visualizing data on SMW pages and pulling data from R into SMW using SPARQL queries. The integration of SMW and Bioclipse is also discussed.
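As an illustration of the kind of integration discussed, here is a minimal Go sketch of querying a SPARQL endpoint over HTTP, following the standard SPARQL protocol. The endpoint URL and query are placeholder assumptions for illustration, not RDFIO specifics.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder endpoint URL; adjust to your wiki's actual SPARQL endpoint.
	endpoint := "http://localhost/wiki/Special:SPARQLEndpoint"
	query := "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

	// The SPARQL protocol accepts the query as a URL parameter on GET requests.
	req, err := http.NewRequest("GET", endpoint+"?query="+url.QueryEscape(query), nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask for results in the standard SPARQL JSON results format.
	req.Header.Set("Accept", "application/sparql-results+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}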
SciPipe - A light-weight workflow library inspired by flow-based programming - Samuel Lampa
A presentation of the SciPipe workflow library, written in Go (Golang), inspired by Flow-based programming, at an internal workshop at Uppsala University, Department of Pharmaceutical Biosciences.
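For a flavor of the library, here is a minimal hello-world workflow modeled on SciPipe's public documentation; the exact API (NewWorkflow, NewProc, SetOut) is assumed from those docs and may differ between versions.

package main

import sp "github.com/scipipe/scipipe"

func main() {
	// Create a workflow running at most 4 tasks concurrently.
	wf := sp.NewWorkflow("hello_world", 4)

	// A process is defined by a shell command pattern;
	// {o:out} declares an output port named "out".
	hello := wf.NewProc("hello", "echo 'Hello World!' > {o:out}")
	hello.SetOut("out", "hello.txt") // File path for the "out" output.

	wf.Run()
}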
Build your own discovery index of scholarly e-resources - Martin Czygan
Providing discovery systems for e-resources is essential for library services today. Commercial search engine indices have been a widely used solution in recent years. In contrast, running your own discovery service is undoubtedly a challenging task, but it promises full control over data processing, enrichment, performance, and quality. Building your own aggregated index of e-resources includes gathering the right mix of data sources, clearing licensing issues, and negotiating data availability. Technically, these tasks are taken on by data harvesters, filters, and workflow orchestration tools.
The document discusses log aggregation and analysis using the Elastic Stack. It describes how the Elastic Stack collects logs from various sources using lightweight data shippers called Beats. The logs are then processed and structured by Logstash before being stored in Elasticsearch for exploration and visualization using Kibana. Demos are provided showing how the Elastic Stack can parse nginx logs, capture logs from a Django application, and monitor node metrics.
This document discusses logs aggregation and analysis using the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It describes problems with traditional logging like inconsistent formats and high server loads. It then explains how each tool in the ELK stack addresses these issues. Elasticsearch provides centralized storage and search. Logstash collects, parses, and filters logs from multiple sources. Kibana enables visualization and dashboarding for log analysis. Additional tools like Marvel and plugins are also discussed. Overall, the ELK stack provides a scalable logging solution with consistent structure, centralized management, and interactive analytics dashboards.
RDF.rb is a Ruby library for working with RDF and SPARQL. It allows querying RDF stores using SPARQL and supports various RDF serialization formats and storage backends. SPARQL queries can be run against SPARQL endpoints using the sparql-client gem. RDF.rb is pure Ruby, open source under the UNLICENSE, and supports CRuby and JRuby.
The document discusses using Sphinx to index and search data. Sphinx allows indexing data from MySQL, PostgreSQL, and XMLPipe2 formats. It supports searching via its native protocol or MySQL protocol. The document provides examples of indexing over 2 million rows from a MariaDB database, filtering search results using hashes, transforming data to the XMLPipe2 format, and dynamically configuring indexes when data partitions change over time using the Sphinx::Config::Builder module. Real-world use cases discussed include providing a search interface for auction data and indexing multiple data partitions that can change.
The document discusses setting up a centralized log collection system to collect, parse, index, and analyze log events from multiple sources using tools like Splunk or Logstash. It provides details on using Logstash to ship logs from agents to an indexer, which then parses and indexes the logs before storing them in Elasticsearch for searching. The log collection system allows for real-time log analysis, visualization of metrics, and alerting on key events.
The document discusses solr-fusion, an open source project developed by a company for the SE-project finc. Solr-fusion allows querying across multiple Solr servers and schemas, translating queries, collecting results, recalculating scores, and merging results into a single response. It addresses the need for unified search across heterogeneous datasets and schemas. The author invites others to join in the development work.
The document discusses Reactive Slick, a new version of the Slick database access library for Scala that provides reactive capabilities. It allows parallel database execution and streaming of large query results using Reactive Streams. Reactive Slick is suitable for composite database tasks, combining async tasks, and processing large datasets through reactive streams.
This document introduces the (B)ELK stack, which consists of Beats, Elasticsearch, Logstash, and Kibana. It describes each component and how they work together. Beats are lightweight data shippers that collect data from logs and systems. Logstash processes and transforms data from inputs like Beats. Elasticsearch stores and indexes the data. Kibana provides visualization and analytics capabilities. The document provides examples of using each tool and tips for working with the ELK stack.
Pharo 4 was released in Spring 2015, containing many improvements and updates since Pharo 3 in April 2014. These include improved refactorings, a dark theme, GT tools replacing old tools, first class variables, advanced reflection capabilities, Epicea replacing .changes, and a new GC called Spur. Future plans include a 64-bit COG VM, an optimizer called Sista, updated windowing with SDL2, a Block redo in Morphic, 3D with Woden, and a virtual GPU. Pharo remains very active with ongoing development and a welcoming community.
Garage RDBMS
First name: Esteban
Last name: Lorenzano
Type: talk
Video: https://www.youtube.com/watch?v=_kuyAUt5AMw
Abstract: Access to RDBMSs is key to building successful businesses, and Pharo has improved its support for them in recent years, but there is still a lot of work to do. DBXTalk is the umbrella project in which we are grouping all of our relational persistence strategy: it contains low-level database drivers and high-level object mappers.
This talk proposes a review of the state of the art in relational persistence support.
Bio: Esteban Lorenzano, 43 years old. He studied (and left unfinished) Computer Science at Universidad de Buenos Aires, and has worked since 1994 with several object-oriented technologies (Delphi, C++, Java), scaling from “Junior Programmer” to “Senior Architect”. In 2007 he and two friends began a new start-up, Smallworks, a company for agile development centered on Smalltalk. Currently, he is working in the RMoD INRIA team in Lille, France, as a core developer for Pharo.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 - Sujit Pal
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpaCy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
This document summarizes the speaker's experience with StackStorm, including presentations given on the topic from 2011 to 2018. It also outlines some technical aspects of StackStorm such as using journald for logging, addressing timeouts in actions, running components, using CronTimer for scheduling, handling Unicode errors, the React-based web UI, optimizing remote executions, and future plans for StackStorm including moving to Python 3 and integrating Orquesta/Mistral workflow engines.
Presto is an open source distributed SQL query engine originally developed by Facebook. It allows querying of data across multiple data sources including HDFS, S3, MySQL, PostgreSQL and more. Presto has seen significant growth and adoption since its initial release, with over 100 releases and contributions from over 100 developers. It is used in production by Facebook and Netflix on very large datasets and clusters. Teradata has joined the Presto community and aims to enhance enterprise features and provide commercial support through its certified Presto distribution.
This document discusses using Fluentd and AWS together. It provides an overview of how Treasure Data uses Fluentd to collect log data from applications on AWS and forwards it to various AWS services like S3, DynamoDB, and Redshift for storage and analysis. It also describes how Fluentd can be used to collect logs from EC2 instances to monitor them and address issues. The document highlights Fluentd's pluggable architecture and some of its core plugins for buffering, routing, and input/output of log data.
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space? - ArangoDB Database
View the video of this webinar here: https://www.arangodb.com/arangodb-events/gvisor-kata-containers-firecracker-docker/
Containers* have revolutionized the IT landscape, and for a long time Docker seemed to be the default whenever people were talking about containerization technologies**. But traditional container technologies might not be suitable if strong isolation guarantees are required. So recently, new technologies such as gVisor, Kata Containers, or Firecracker have been introduced to close the gap between the strong isolation of virtual machines and the small resource footprint of containers.
In this talk, we will provide an overview of the different containerization technologies, discuss their tradeoffs, and provide guidance for different use cases.
* We will define the term container in more detail during the talk
** and yes we will also cover some of the pre-docker container space!
«Scrapy internals» - Alexander Sibiryakov, Scrapinghub - it-people
- Scrapy is a framework for web scraping that allows for extraction of structured data from HTML/XML through selectors like CSS and XPath. It provides features like an interactive shell, feed exports, encoding support, and more.
- Scrapy is built on top of the Twisted asynchronous networking framework, which provides an event loop and deferreds. It handles protocols and transports like TCP, HTTP, and more across platforms.
- Scrapy architecture includes components like the downloader, scraper, and item pipelines that communicate internally. Flow control is needed between these to limit memory usage and scheduling through techniques like concurrent item limits, memory limits, and delays between calls.
Kickstart journey of Golang with Hello World - Gopherlabs - Sangam Biradar
This document summarizes key concepts in Go programming including packages, functions, parameters vs arguments, and more. It discusses how every Go file begins with a package name, and the "main" package is the entry point for a program. Functions need to be capitalized to be accessible outside a package. It also provides review questions and references for further reading on Go.
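A minimal sketch of these points: the main function in the "main" package is the program's entry point, and capitalization controls whether an identifier is visible outside its package.

package main

import "fmt"

// Greet starts with a capital letter, so it is exported:
// code in other packages can call it. A lowercase name
// (e.g. greet) would only be visible inside this package.
func Greet(name string) string {
	return "Hello, " + name + "!"
}

// main in package main is where execution starts.
func main() {
	fmt.Println(Greet("world"))
}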
This document introduces Minoru Osuka and provides information about ManifoldCF and Solr. It discusses that Minoru is a committer and PMC member of ManifoldCF at Apache Software Foundation and a senior consultant. It then provides an overview of what ManifoldCF is, its project status, architecture, use cases, resources, books, and demonstration. It concludes by announcing that Minoru's company is now hiring.
ESUG 2014, Cambridge
Wed, August 20, 11:00am – 11:45am
Video:
Part1: https://www.youtube.com/watch?v=_Mv7SX-8Vlk
Part2: https://www.youtube.com/watch?v=qdZq2IZBm4k
Description
Abstract: In this talk we will present the advances and new features in Pharo 3.0. We will present the current work on Pharo 4.0 and beyond.
Decision making - for loop, nested loop, if-else statements, switch in goph... - Sangam Biradar
This document discusses decision making in Golang. It provides an overview of loops, including Go's for loop (which also serves as a while loop), break, continue, and nested loops. It also covers conditionals such as if, else if, else, switch statements, and logical operators. Code examples are provided for each concept via links to an online Golang playground. The author is identified as Sangam Biradar, a Docker community leader who writes tutorials on Golang.
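A short sketch of the constructs the deck covers, in plain Go (note that Go has no separate while keyword; for fills that role):

package main

import "fmt"

func main() {
	// With only a condition, "for" acts as a while loop.
	n := 0
	for n < 3 {
		n++
	}

	// Classic three-part for loop with continue and break.
	for i := 0; i < 10; i++ {
		if i%2 == 0 {
			continue // skip even numbers
		}
		if i > 7 {
			break // stop once i exceeds 7
		}
		fmt.Println("odd:", i)
	}

	// if / else if / else with logical operators.
	if n > 5 && n < 10 {
		fmt.Println("n is between 6 and 9")
	} else if n == 3 || n == 4 {
		fmt.Println("n is 3 or 4")
	} else {
		fmt.Println("something else")
	}

	// switch: cases do not fall through by default in Go.
	switch n {
	case 3:
		fmt.Println("three")
	case 4:
		fmt.Println("four")
	default:
		fmt.Println("other")
	}
}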
PharoDAYS 2015: Pharo Status - by Markus Denker - Pharo
Pharo 4 will be released in Spring 2015 and contains many improvements and updates over previous versions. It has already seen over 1200 issues closed and is very stable. Small changes include improved refactorings, a smaller 6MB deployment image, and ifTrue: working on non-Booleans. Larger ongoing projects include first class variables, replacing the .changes system with Epicea, advanced reflection work, and VM improvements.
This document introduces RethinkDB and Horizon.js for building real-time web applications. It discusses that RethinkDB is an open-source NoSQL database written in C++ that stores JSON and uses its own query language called ReQL. Horizon.js is a JavaScript framework built on RethinkDB and Node.js that allows applications to subscribe to state changes using RxJS. The document provides code samples of using the Horizon class and Collection class to perform operations like storing, watching and querying data in a reactive way.
- Craigslist is a classified advertising website with over 500 cities worldwide and handles over 20 billion pageviews and 50 million users per month. It allows users to post free classified ads for jobs, housing, items for sale, and other services.
- The technical challenges for Craigslist include high ad churn rate, growth in traffic volume, need for data archiving and search capabilities, and maintaining the system with a small team.
- Craigslist uses open source technologies like MySQL, memcached, Apache, and Sphinx to power its infrastructure while keeping it simple, efficient and low cost. It employs techniques like vertical and horizontal data partitioning and incremental indexing to handle its scale.
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie... - Uwe Korn
In the space of building products with data, either by dealing with huge amounts of data or by applying machine learning, many different ecosystems meet. Larger volumes of data have to be passed between these systems. Handling the data is not only a matter of splitting work between systems written in Java that need to pass it on to a machine learning model in Python; when you take into account that you want to integrate with the existing business infrastructure, you also need to cater for legacy systems, as well as bring the large volumes of data to the user via UIs.
Vagrant, Ansible and Docker - How they fit together for productive flexible d... - Samuel Lampa
A very quick overview of how Vagrant, Ansible and Docker fit nicely together as a very productive and flexible solution for creating automated development environments.
3rd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse - Samuel Lampa
This document summarizes Samuel Lampa's 2010 degree project on integrating SWI-Prolog for semantic reasoning in Bioclipse. It compares SWI-Prolog to other semantic tools like Jena and Pellet in terms of speed and expressiveness when querying biochemical data. Prolog code is presented for querying NMR spectrum data that finds molecules with peak values near a search value. SPARQL queries for the same use case are also shown. Observations indicate Prolog is fastest while SPARQL is easier to understand but Prolog allows easier parameter changes and logic reuse. A final presentation was planned for April 28, 2010.
Samuel Lampa presented his MSc thesis on integrating SWI-Prolog as a semantic querying tool in Bioclipse. He demonstrated [1] how SWI-Prolog can be used for semantic querying of biological data in RDF format within Bioclipse, [2] examples of SPARQL and Prolog code used to perform semantic queries, and [3] benchmarking of Prolog's performance as a semantic querying tool. The work adds new semantic querying functionality to Bioclipse using SWI-Prolog and demonstrates its ability to efficiently query biological data.
This document discusses using Vagrant, Ansible, and Docker together to build portable infrastructure that avoids dependency issues, allows consistent workflows, and reduces risk. Vagrant is used to create and manage virtual environments from a configuration file. Ansible then provides configuration management through push-based execution of tasks without a client. Docker adds portability by allowing applications to run in lightweight isolated containers across machines. A sample project demonstrates Vagrant starting a VM, Ansible provisioning it by starting a Docker container, and an application running within the container.
Reproducibility in Scientific Data Analysis - BioScience Seminar - Samuel Lampa
Slides for a talk held at BioScience Seminar at Dept. of Pharmaceutical BioSciences at Uppsala University on December 16, 2016.
The event webpage: http://www.farmbio.uu.se/calendar/kalendarium-detaljsida/?eventId=22496
Structure of the talk:
Reproducibility in Scientific Data Analysis ...
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
POC Conference 2015
Virtual Appliances have become very prevalent these days, as virtualization is ubiquitous and hypervisors commonplace. More and more of the major vendors are providing literally virtual clones for many of their once physical-only products. Like IoT and the CAN bus, it's early in the game and vendors are late as usual. One thing that is catching these vendors off guard is the huge additional attack surface, ripe with vulnerabilities, added in the process. Also, many vendors see software appliances as an opportunity for the customer to easily evaluate the product before buying the physical one, making these editions more accessible and debuggable by utilizing features of the platform on which they run. During this talk, I will provide real case studies of various vulnerabilities created by mistakes that many of the major players made when shipping their appliances. You'll learn how to find these bugs yourself and how the vendors went about fixing them, if at all. By the end of this talk, you should have a firm grasp of how one goes about getting remote root on appliances.
RDFIO is an RDF import and query extension for MediaWiki. It allows users to import RDF triples into MediaWiki and query the triples using SPARQL. The architecture includes an in-memory RDF store to hold the triples and a SPARQL endpoint for querying. Future plans include enhancing editing capabilities via templates and importing triples on a per-page basis. Samuel Lampa presented RDFIO and is looking for additional ideas to improve the extension.
This document provides an outline for a tutorial on importing and editing data in Stata. It discusses importing comma-separated value files, generating new variables, saving and reopening datasets, creating do-files, using Stata's help system, and presents a challenge involving creating a dummy variable using imported data. The tutorial materials are based on examples from an introductory econometrics textbook and the datasets can be downloaded from Stata's website.
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse - Samuel Lampa
Contains a short background on the semantic web, and shows how Prolog is intended to be used from inside the Bioclipse research software for RDF data handling.
This document provides a cheat sheet overview of key concepts in the IRODS rule language, including numeric and string literals, arithmetic and comparison operators, functions for strings, lists, tuples, if/else statements, foreach loops, defining functions and rules, handling errors, and inductive data types. It describes syntax for defining data types using constructors, and using pattern matching to define functions over data types.
Ready to leverage the power of a graph database to bring your application to the next level, but all the data is still stuck in a legacy relational database?
Fortunately, Neo4j offers several ways to quickly and efficiently import relational data into a suitable graph model. It's as simple as exporting the subset of the data you want to import and ingesting it, either with an initial loader in seconds or minutes, or by applying Cypher's power to put your relational data transactionally into the right places of your graph model.
In this webinar, Michael will also demonstrate a simple tool that can load relational data directly into Neo4j, automatically transforming it into a graph representation of your normalized entity-relationship model.
Producing, publishing and consuming linked data - CSHALS 2013 - François Belleau
This document discusses lessons learned from the Bio2RDF project for producing, publishing, and consuming linked data. It outlines three key lessons: 1) How to efficiently produce RDF using existing ETL tools like Talend to transform data formats into RDF triples; 2) How to publish linked data by designing URI patterns, offering SPARQL endpoints and associated tools, and registering data in public registries; 3) How to consume SPARQL endpoints by building semantic mashups using workflows to integrate data from multiple endpoints and then querying the mashup to answer questions.
This document proposes a content model and API to unify access to different types of content like wikis, RDF, binaries, and more. It aims to be used in projects like NEPOMUK, WAVES, and WIF. The model represents content at different levels of granularity from words to documents. Content can be annotated with semantic statements and metadata. All content is addressable and versioned. The API provides functions for basic CRUD operations as well as fulltext search and auto-completion support through a keyword index.
Build an application upon Semantic Web models. Brief overview of Apache Jena and OWL-API.
Semantic Web course
e-Lite group (https://elite.polito.it)
Politecnico di Torino, 2017
OpenEvent is a Drupal distribution that represents an Event Open Data Model and publishes event data through a self-documented API. It aims to be a generic foundation for cultural organizations to manage and publish their events online. The distribution includes Drupal 7, the Open Data Model, Schema.org mappings, and features like a read-only API. Future plans include moving it to Drupal.org, improving documentation, refactoring custom code into reusable modules, and attending to the issue queue. Lessons learned include benefits of open source like higher developer motivation and easier code sharing.
As of Drupal 7 we'll have RDFa markup in core. In this session I will:
- explain what the implications of this are and why this matters
- give a short introduction to the Semantic Web, RDF, RDFa and SPARQL in human language
- give a short overview of the RDF modules that are available in contrib
- talk about some of the potential use cases of all these magical technologies
This document discusses tools for improving reproducibility in research, including hosting data in GigaDB, sharing images using OMERO, implementing workflows using Galaxy and executable documents, and sharing virtual machines. It emphasizes the need for publishers to host and curate research objects like data, code, and workflows and provide citations for reproducible research. Key tools highlighted are GigaDB for data hosting, OMERO for image hosting, Galaxy for implementing workflows, and virtual machines for sharing full computational environments.
Big data, just an introduction to Hadoop and Scripting Languages - Corley S.r.l.
This document provides an introduction to Big Data and Apache Hadoop. It defines Big Data as large and complex datasets that are difficult to process using traditional database tools. It describes how Hadoop uses MapReduce and HDFS to provide scalable storage and parallel processing of Big Data. It provides examples of companies using Hadoop to analyze exabytes of data and common Hadoop use cases like log analysis. Finally, it summarizes some popular Hadoop ecosystem projects like Hive, Pig, and Zookeeper that provide SQL-like querying, data flows, and coordination.
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4th generation (4G) of Big Data analytics frameworks, providing the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
2. How does Apache Flink integrate with Hadoop and other open source tools for data input and output as well as deployment?
3. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Drupal 7 will use RDFa markup in core. In this session I will:
- explain what the implications of this are and why this matters
- give a short introduction to the Semantic Web, RDF, RDFa and SPARQL in human language
- give a short overview of the RDF modules that are available in contrib
- talk about some of the potential use cases of all these magical technologies
This is a talk from the Drupal track at Fosdem 2010.
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion - Flink Forward
This document introduces Okkam, an Italian company that uses Apache Flink for large-scale data integration and semantic technologies. It discusses Okkam's use of Flink for domain reasoning, RDF data processing, duplicate detection, entity linkage, and telemetry analysis. The document also provides lessons learned from Okkam's Flink experiences and suggestions for improving Flink.
Linked data enhanced publishing for special collections (with Drupal) - Joachim Neubert
This document discusses using Drupal 7 as a content management system for publishing special collections as linked open data. It provides an overview of how Drupal allows customizing content types and fields for mapping to RDF properties. While Drupal 7 provides basic RDFa support out of the box, there are some limitations around nested RDF structures and multiple entities per page that may require custom code. The document outlines some additional linked data modules for Drupal 7 and highlights improved RDF support anticipated in Drupal 8.
Knowledge graph construction with a façade - The SPARQL Anything Project - Enrico Daga
The document discusses a project called "SPARQL Anything" which aims to simplify knowledge graph construction by using SPARQL as the single language for representing and transforming diverse data formats into RDF. It presents an approach called "Facade-X" which defines a common RDF structure that can be applied over different formats like CSV, JSON, HTML, etc. This facade focuses on the RDF meta-model and aims to apply minimal ontological commitments. The document outlines how Facade-X can be used to represent different formats and provides examples of using SPARQL to transform sample data into RDF without committing to a specific domain ontology.
Intro to DefectDojo at OWASP Switzerland - Matt Tesauro
This document introduces Fred Blaise and provides information about OWASP DefectDojo. DefectDojo is an open-source application vulnerability correlation and security orchestration tool that consolidates findings from multiple tools, tracks vulnerabilities, and enables automation through its REST API. It can ingest reports from many common security tools and helps automate previously manual processes to improve security and allow small teams to manage large application security programs. The document demonstrates how DefectDojo can be deployed in various environments and discusses its features, community, and recent improvements.
My presentation on RDFauthor at EKAW2010, Lisbon. For more information on RDFauthor visit http://aksw.org/Projects/RDFauthor; for the code visit http://code.google.com/p/rdfauthor/.
The document discusses tools and methods for improving reproducibility in research, including open data and open source tools. It summarizes that less than 30% of published studies are reproducible due to lack of sharing of data, code, and workflows. It promotes hosting research objects like data, images, and workflows to make them accessible and citable. Specific tools mentioned include GigaDB for data, OMERO for images, Galaxy and executable documents for workflows, and virtual machines for replicable computational environments.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Apache Marmotta is a linked data platform that provides a linked data server, SPARQL server, and development environment for building linked data applications. It uses modular components including a triplestore backend, SPARQL endpoint, LDCache for remote data access, and an optional reasoner. Marmotta is implemented as a Java web application and uses services, dependency injection, and REST APIs.
Redfish is an IPMI replacement standardized by the DMTF. It provides a RESTful API for server out of band management and a lightweight data model specification that is scalable, discoverable and extensible. (Cf: http://www.dmtf.org/standards/redfish). This presentation will start by detailing its role and the features it provides with examples. It will demonstrate the benefits it provides to system administrator by providing a standardized open interface for multiple servers, and also storage systems.
We will then cover various tools such as the DMTF ones and the python-redfish library (Cf: https://github.com/openstack/python-redfish) offering Redfish abstractions.
"Xapi-lang For declarative code generation" By James NelsonGWTcon
Xapi-lang is a Java parser enhanced with an XML-like syntax that can be used for code generation, templating, and creating domain-specific languages. It works by parsing code into an abstract syntax tree and then applying visitors to analyze and transform the AST to produce output. Examples shown include class templating, API generation from templates, and UI component generation. The document also discusses best practices for code generation and outlines plans for rebuilding the GWT toolchain to support GWT 3 and J2CL. It promotes a new company, Vertispan, for GWT support and consulting and introduces another project called We The Internet for building tools to improve political systems using distributed democracy.
SciCommander - Provenance reports for outputs of ad-hoc analyses - Samuel Lampa
There exist a multitude of pipeline tools for bioinformatics [1]. As using a pipeline tool is more complex than just writing shell scripts [2], a lot of bioinformatics work happens in a more ad-hoc fashion, with individual shell commands executed to run analyses. This makes it much harder to keep a full audit log of the analyses, since it is easy to miss documenting some steps. It is also often not clear afterwards which output files were created by which command. Additionally, shell scripts lack functionality to re-use already finished intermediary output files in order to resume cancelled runs.

SciCommander is a tool that addresses these limitations: it tracks the produced output files of almost any shell command and avoids re-running already executed commands, requiring only slight changes to the commands, namely prefixing the command itself and marking its inputs, outputs and parameters with special markers. Using this information, SciCommander checks whether any of the output files already exist, and if so skips that command. Secondly, it produces an audit log for each output file in JSON format, and includes a command to convert this file into an HTML report with a graphical visualization of all the steps needed to produce that particular file. All in all, this functionality provides full provenance at the individual file level, and allows resuming interrupted runs even for simple shell commands. SciCommander is open source and can be installed via the Python Package Index, or from GitHub: https://github.com/samuell/scicommander
Using Flow-based programming to write tools and workflows for Scientific Comp... - Samuel Lampa
The document summarizes Samuel Lampa's talk on using flow-based programming for scientific computing. It provides biographical information on Samuel Lampa, including his background in pharmaceutical bioinformatics and current work. It then gives an overview of flow-based programming, describing it as using black box processes connected by data flows, with connections specified separately from processes. Benefits mentioned include easy testing, monitoring, and changing connections without rewriting components. Examples of using FBP in Go are also presented.
Linked Data for improved organization of research data - Samuel Lampa
Slides for a talk at a Farmbio BioScience Seminar on May 18, 2018 (http://farmbio.uu.se), introducing Linked Data as a way to manage research data that better keeps track of provenance, makes its semantics more explicit, and makes the data more easily integrated with other data and consumed by others, both humans and machines.
How to document computational research projects - Samuel Lampa
These slides are from an internal meeting at pharmb.io where we discussed ways to improve documentation of our internal computational research projects. The winning solution turns out to be markdown files, versioned with git. The slides explain a little bit about why.
AddisDev Meetup ii: Golang and Flow-based Programming - Samuel Lampa
The document discusses flow-based programming (FBP), its history and concepts. FBP defines applications as networks of processes that exchange data through message passing over predefined connections. This allows the processes to be reconnected without changing their code. The document provides examples of FBP networks and components implemented in various languages like Go, Java and JavaScript. It also discusses the benefits of FBP and its growing popularity with implementations like NoFlo.
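The FBP idea maps naturally onto Go's goroutines and channels. Below is a minimal framework-free sketch of the concept (not GoFlow's actual API): each process only knows its own in- and out-ports, and the wiring lives separately in main, so connections can be changed without touching process code.

package main

import (
	"fmt"
	"strings"
)

// generate is a source process: it emits words on its out-port, then closes it.
func generate(out chan<- string, words ...string) {
	for _, w := range words {
		out <- w
	}
	close(out)
}

// upperCase is a transform process: it only sees its in- and out-ports.
func upperCase(in <-chan string, out chan<- string) {
	for w := range in {
		out <- strings.ToUpper(w)
	}
	close(out)
}

// printer is a sink process that signals completion when its input is drained.
func printer(in <-chan string, done chan<- struct{}) {
	for w := range in {
		fmt.Println(w)
	}
	done <- struct{}{}
}

func main() {
	// The network definition: connections declared separately from processes.
	c1 := make(chan string)
	c2 := make(chan string)
	done := make(chan struct{})

	go generate(c1, "hello", "flow", "based")
	go upperCase(c1, c2)
	go printer(c2, done)

	<-done
}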
First encounter with Elixir - Some random things - Samuel Lampa
The document discusses Samuel Lampa's first encounter with the programming language Elixir. It covers calculating GC ratios in DNA sequences, provides a DNA sequence example file, and compares Elixir processes to Go channels, noting that Elixir processes are named mailboxes tied to a process, while Go channels are anonymous and separate from goroutines. The document is authored by Samuel Lampa from Uppsala University.
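A small sketch of that contrast from the Go side: a channel is a first-class value not owned by any goroutine, so any number of goroutines can share it.

package main

import "fmt"

func main() {
	// A channel is a standalone value, not tied to any particular goroutine
	// (unlike an Elixir mailbox, which belongs to one process).
	ch := make(chan int)

	// Several goroutines can send on the same channel...
	for i := 1; i <= 3; i++ {
		go func(n int) { ch <- n }(i)
	}

	// ...and any goroutine holding a reference can receive from it.
	sum := 0
	for i := 0; i < 3; i++ {
		sum += <-ch
	}
	fmt.Println("sum:", sum) // prints: sum: 6
}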
Profiling Go code - a beginner's tutorial - Samuel Lampa
This document summarizes a presentation on profiling Go code. It introduces profiling with pprof, along with Dave Cheney's profile package that makes profiling Go code easier. It demonstrates this by profiling a string processing program and showing the performance improvements from various optimizations. It recommends resources for learning more about profiling Go programs with pprof and about high performance Go programming.
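A minimal CPU-profiling sketch using Go's standard runtime/pprof package, shown here as a generic illustration rather than the deck's exact approach:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a stand-in for the code you actually want to profile.
func busyWork() int {
	sum := 0
	for i := 0; i < 100_000_000; i++ {
		sum += i
	}
	return sum
}

func main() {
	// Write CPU profile data to a file for later inspection.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	busyWork()
	// Afterwards, analyze the profile with: go tool pprof cpu.pprof
}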
This document provides an overview of flow-based programming (FBP). FBP is a programming paradigm where applications are defined as networks of black box processes that exchange data through predefined connections. These connections can be redefined without changing the internal processes, allowing for endless reconfiguration. FBP was invented in the 1960s and has seen a resurgence of interest with tools like NoFlo that allow building distributed applications as connected processes. The document discusses several open source FBP implementations and frameworks and provides examples of how FBP has been used to build applications and bioinformatics libraries.
My lightning talk at Go Stockholm meetup Aug 6th 2013 - Samuel Lampa
This document discusses flow-based programming, an approach to programming invented in the 1970s that emphasizes the flow of data between components. It was successfully used in several domains including data analysis, banking software, and digital signal processing. New implementations of flow-based programming include NoFlo for Node.js and GoFlow, an open-source implementation in Go. More information on flow-based programming can be found on the listed websites.
Mastering Advanced Window Functions in SQL - Spiral Mantra
How well do you really know SQL?📊
If PARTITION BY and ROW_NUMBER() sound familiar but still confuse you, it's time to upgrade your knowledge.
And you can schedule a 1:1 call with our industry experts: https://spiralmantra.com/contact-us/ or drop us a mail at [email protected]
Social Media App Development Company - EmizenTech - Steve Jonas
EmizenTech is a trusted Social Media App Development Company with 11+ years of experience in building engaging and feature-rich social platforms. Our team of skilled developers delivers custom social media apps tailored to your business goals and user expectations. We integrate real-time chat, video sharing, content feeds, notifications, and robust security features to ensure seamless user experiences. Whether you're creating a new platform or enhancing an existing one, we offer scalable solutions that support high performance and future growth. EmizenTech empowers businesses to connect users globally, boost engagement, and stay competitive in the digital social landscape.
Unlocking the Power of IVR: A Comprehensive Guide - vikasascentbpo
Streamline customer service and reduce costs with an IVR solution. Learn how interactive voice response systems automate call handling, improve efficiency, and enhance customer experience.
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
Vaibhav Gupta, BAML: AI workflows without Hallucinations - john409870
Shipping Agents
Vaibhav Gupta
Cofounder @ Boundary
in/vaigup
boundaryml/baml
Imagine if every API call you made failed only 5% of the time.
Imagine if every LLM call you made failed only 5% of the time.
Fault tolerant systems are hard, but now everything must be fault tolerant. We need to change how we think about these systems.
Aaron Villalpando, Cofounder @ Boundary
We used to write websites like this:
But now we do this:
Problems web dev had:
Strings. Strings everywhere.
State management was impossible.
Dynamic components? forget about it.
Reuse components? Good luck.
Iteration loops took minutes.
Low engineering rigor
React added engineering rigor. The syntax we use changes how we think about problems.
We used to write agents like this:
Problems agents have:
Strings. Strings everywhere.
Context management is impossible.
Changing one thing breaks another.
New models come out all the time.
Iteration loops take minutes.
Low engineering rigor
Agents need the expressiveness of English, but the structure of code.
F*** You, Show Me The Prompt.
<show don’t tell>
Less prompting + More engineering = Reliability + Maintainability
BAML team: Sam, Greg, Antonio, Chris. Backgrounds include turning down OpenAI to join, an ex-founder who was one of the earliest BAML users, an MIT PhD, 20+ years in compilers, and a member who made his own database with 400k+ YouTube views.
Vaibhav Gupta
in/vaigup
[email protected]
boundaryml/baml
Thank you!
Semantic Cultivators: The Critical Future Role to Enable AI - artmondano
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
Generative Artificial Intelligence (GenAI) in Business - Dr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business: benefits, opportunities and limitations. I also discussed how my research on the Theory of Cognitive Chasms helps address some of these issues.
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc
Most consumers believe they’re making informed decisions about their personal data—adjusting privacy settings, blocking trackers, and opting out where they can. However, our new research reveals that while awareness is high, taking meaningful action is still lacking. On the corporate side, many organizations report strong policies for managing third-party data and consumer consent yet fall short when it comes to consistency, accountability and transparency.
This session will explore the research findings from TrustArc’s Privacy Pulse Survey, examining consumer attitudes toward personal data collection and practical suggestions for corporate practices around purchasing third-party data.
Attendees will learn:
- Consumer awareness around data brokers and what consumers are doing to limit data collection
- How businesses assess third-party vendors and their consent management operations
- Where business preparedness needs improvement
- What these trends mean for the future of privacy governance and public trust
This discussion is essential for privacy, risk, and compliance professionals who want to ground their strategies in current data and prepare for what’s next in the privacy landscape.
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxshyamraj55
We’re bringing the TDX energy to our community with 2 power-packed sessions:
🛠️ Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
📄 Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
The Evolution of Meme Coins: A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility that elevate them into a new era of cryptocurrency.
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up to date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understanding the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Batch import of large RDF datasets into Semantic MediaWiki
1. Batch import of large RDF datasets using RDFIO or the new rdf2smw tool
Samuel Lampa - @smllmp
PhD Student in Pharmaceutical Bioinformatics @ pharmb.io
with Assoc. Prof. Ola Spjuth - @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University
Semantic MediaWiki Conference Fall 2016, Frankfurt am Main
3. Research interests
● Large datasets
● Automation
● Scientific workflows
● Machine Learning
● Semantic data
● Reasoning
● Query systems
● Something user friendly
● … and hopefully usable
● “Answer ALL the research questionz”
6. What’s the problem?
● … but not really any (proper) RDF import (as in: plain triples → wiki syntax in articles)
7. RDFIO What?!
● SMW extension
● Import plain RDF triples
● No need for an ontology
● RDF URIs → Wiki titles
● Retains original URIs
● Translates back to original URIs on export
● Round-trip SMW ↔ RDF
● tinyurl.com/getrdfio
8. Turning RDF Triples into Wiki Pages
<http://ex.org/Stockholm> <http://ex.org/onto/LocatedIn> <http://ex.org/Sweden> .
<http://ex.org/Stockholm> <http://ex.org/onto/Population> "789024"^^xsd:integer .
<http://ex.org/Frankfurt> <http://ex.org/onto/LocatedIn> <http://ex.org/Germany> .
<http://ex.org/Frankfurt> <http://ex.org/onto/Population> "731095"^^xsd:integer .
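To make the mapping concrete, here is a rough sketch (not from the slides) of what the two Stockholm triples could become on a wiki page titled Stockholm. The property page names follow the last segment of the property URIs, and the “Original URI” property name is an assumption standing in for however RDFIO actually records the source URI:

[[LocatedIn::Sweden]]
[[Population::789024]]
[[Original URI::http://ex.org/Stockholm]]

Because the original URIs are retained on the page, an RDF export can translate the wiki-local names back and reproduce the input triples, which is the round-trip property from the previous slide.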
17. RDFIO – Current Status
● SMW 2.3 support – with some hacks (Ali working on the last minor issues)
● See the Vagrant box for a working automated setup with MW 1.26.4 + SMW 2.3.1: github.com/rdfio/rdfio-vagrantbox
● Some known minor issues
21. The new rdf2smw tool
● Convert RDF → MediaWiki XML (Really fast!)
● Import via MediaWiki XML import (Still slow...)
● But: Can now preview before the XML import!
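As a hedged sketch of this two-step flow (the rdf2smw flag names and file names are illustrative assumptions, not confirmed against its documentation, while importDump.php is MediaWiki’s standard XML import script):

$ rdf2smw --in dataset.nt --out pages.xml    # fast: convert RDF triples to MediaWiki XML
$ php maintenance/importDump.php pages.xml   # slow: standard MediaWiki XML import

The intermediate pages.xml is what enables previewing: it can be inspected, or even edited, before anything is written to the wiki.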
22. More rdf2smw facts:
● Written in Go for compiled, multi-core performance
● Very pluggable architecture
● Easy to install: Just download and run!
● Get it: github.com/samuell/rdf2smw
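Assuming a Go toolchain, building from source should be a single command in the GOPATH style that was current in 2016 (newer Go versions would use go install with a version suffix instead); otherwise a prebuilt binary can be downloaded from the repository’s releases page:

$ go get github.com/samuell/rdf2smw    # fetches, compiles and installs the binary into $GOPATH/bin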
25. Future outlook
● How to make RDFIO more maintainable, for developers with too little time?
● Drastically simplify?
● Break out well-defined sub-modules? (SPARQL endpoint, RDF import, etc.)
● Integrate with the MW REST API instead of a dedicated Special page – as per Denny’s original idea with SMWWriter?
● Re-use core SMW functionality more? (Or not?)
● Your ideas?
27. The new Vagrant box: Set up MW + SMW + RDFIO in 7 steps
1) Install dependencies (Vagrant and VirtualBox)
2) $ git clone https://github.com/rdfio/rdfio-vagrantbox.git
3) $ cd rdfio-vagrantbox
4) $ vagrant up
5) Surf in on localhost:8080/w/index.php/Special:RDFIOAdmin
6) Log in with username Admin and password changethis
7) Click “Setup”
Done!
28. Acknowledgements
● Denny Vrandečić (@vrandezo) - Basically had the same idea for an extension already when the (eventually accepted) GSoC proposal was submitted in 2010, and supported the project with valuable ideas and through mentoring the GSoC 2010 project.
● Ali King (@ali_king) - Has done great work updating the extension to the latest standards and versions, and added the new template editing functionality, as part of an OPW 2014 project.
● Joel Sachs (@xjsachs) - Championed the addition of the template editing functionality, provided valuable encouragement and mentored Ali King’s FOSS OPW project.
● Egon Willighagen (@egonwillighagen) - Has supported the project with valuable testing, constructive feedback, encouragement and new ideas.
● Ola Spjuth (@ola_spjuth) - Has provided constructive feedback and encouragement, as well as financed parts of the further development of the project.
● Google Inc. - Supported the initial development through its Summer of Code program (GSoC) in 2010.
● Gnome Foundation - Supported further development as part of its Outreach Program for Women (OPW) in 2014.