Hooking up Semantic MediaWiki with external tools via SPARQL - Samuel Lampa
This document discusses integrating Semantic MediaWiki (SMW) with external tools using the RDFIO extension. It describes the motivation for RDFIO as allowing manual schema exploration, automated data generation, and community collaboration. RDFIO solves problems with SMW by allowing the choice of wiki page titles for RDF entities and exporting RDF in the original import format. Real-world uses of RDFIO include visualizing data on SMW pages and pulling data from R into SMW using SPARQL queries. The integration of SMW and Bioclipse is also discussed.
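As an illustration of the kind of integration discussed, here is a minimal Go sketch of querying a SPARQL endpoint over HTTP, following the standard SPARQL protocol. The endpoint URL and query are placeholder assumptions for illustration, not RDFIO specifics.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder endpoint URL; adjust to your wiki's actual SPARQL endpoint.
	endpoint := "http://localhost/wiki/Special:SPARQLEndpoint"
	query := "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

	// The SPARQL protocol accepts the query as a URL parameter on GET requests.
	req, err := http.NewRequest("GET", endpoint+"?query="+url.QueryEscape(query), nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask for results in the standard SPARQL JSON results format.
	req.Header.Set("Accept", "application/sparql-results+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}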
SciPipe - A light-weight workflow library inspired by flow-based programming - Samuel Lampa
A presentation of the SciPipe workflow library, written in Go (Golang), inspired by Flow-based programming, at an internal workshop at Uppsala University, Department of Pharmaceutical Biosciences.
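For a flavor of the library, here is a minimal hello-world workflow modeled on SciPipe's public documentation; the exact API (NewWorkflow, NewProc, SetOut) is assumed from those docs and may differ between versions.

package main

import sp "github.com/scipipe/scipipe"

func main() {
	// Create a workflow running at most 4 tasks concurrently.
	wf := sp.NewWorkflow("hello_world", 4)

	// A process is defined by a shell command pattern;
	// {o:out} declares an output port named "out".
	hello := wf.NewProc("hello", "echo 'Hello World!' > {o:out}")
	hello.SetOut("out", "hello.txt") // File path for the "out" output.

	wf.Run()
}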
Build your own discovery index of scholarly e-resources - Martin Czygan
Providing discovery systems for e-resources is essential for library services today. Commercial search engine indices have been a widely used solution in recent years. In contrast, running your own discovery service is undoubtedly a challenging task, but it promises full control over data processing, enrichment, performance, and quality. Building your own aggregated index of e-resources includes gathering the right mix of data sources, clearing licensing issues, and negotiating data availability. Technically, these tasks are taken on by data harvesters, filters, and workflow orchestration tools.
The document discusses log aggregation and analysis using the Elastic Stack. It describes how the Elastic Stack collects logs from various sources using lightweight data shippers called Beats. The logs are then processed and structured by Logstash before being stored in Elasticsearch for exploration and visualization using Kibana. Demos are provided showing how the Elastic Stack can parse nginx logs, capture logs from a Django application, and monitor node metrics.
This document discusses logs aggregation and analysis using the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It describes problems with traditional logging like inconsistent formats and high server loads. It then explains how each tool in the ELK stack addresses these issues. Elasticsearch provides centralized storage and search. Logstash collects, parses, and filters logs from multiple sources. Kibana enables visualization and dashboarding for log analysis. Additional tools like Marvel and plugins are also discussed. Overall, the ELK stack provides a scalable logging solution with consistent structure, centralized management, and interactive analytics dashboards.
RDF.rb is a Ruby library for working with RDF and SPARQL. It allows querying RDF stores using SPARQL and supports various RDF serialization formats and storage backends. SPARQL queries can be run against SPARQL endpoints using the sparql-client gem. RDF.rb is pure Ruby, open source under the UNLICENSE, and supports CRuby and JRuby.
The document discusses using Sphinx to index and search data. Sphinx allows indexing data from MySQL, PostgreSQL, and XMLPipe2 formats. It supports searching via its native protocol or MySQL protocol. The document provides examples of indexing over 2 million rows from a MariaDB database, filtering search results using hashes, transforming data to the XMLPipe2 format, and dynamically configuring indexes when data partitions change over time using the Sphinx::Config::Builder module. Real-world use cases discussed include providing a search interface for auction data and indexing multiple data partitions that can change.
The document discusses setting up a centralized log collection system to collect, parse, index, and analyze log events from multiple sources using tools like Splunk or Logstash. It provides details on using Logstash to ship logs from agents to an indexer, which then parses and indexes the logs before storing them in Elasticsearch for searching. The log collection system allows for real-time log analysis, visualization of metrics, and alerting on key events.
The document discusses solr-fusion, an open source project developed by a company for the SE-project finc. Solr-fusion allows querying across multiple Solr servers and schemas, translating queries, collecting results, recalculating scores, and merging results into a single response. It addresses the need for unified search across heterogeneous datasets and schemas. The author invites others to join in the development work.
The document discusses Reactive Slick, a new version of the Slick database access library for Scala that provides reactive capabilities. It allows parallel database execution and streaming of large query results using Reactive Streams. Reactive Slick is suitable for composite database tasks, combining async tasks, and processing large datasets through reactive streams.
This document introduces the (B)ELK stack, which consists of Beats, Elasticsearch, Logstash, and Kibana. It describes each component and how they work together. Beats are lightweight data shippers that collect data from logs and systems. Logstash processes and transforms data from inputs like Beats. Elasticsearch stores and indexes the data. Kibana provides visualization and analytics capabilities. The document provides examples of using each tool and tips for working with the ELK stack.
Pharo 4 was released in Spring 2015, containing many improvements and updates since Pharo 3 in April 2014. These include improved refactorings, a dark theme, GT tools replacing old tools, first class variables, advanced reflection capabilities, Epicea replacing .changes, and a new GC called Spur. Future plans include a 64-bit COG VM, an optimizer called Sista, updated windowing with SDL2, a Block redo in Morphic, 3D with Woden, and a virtual GPU. Pharo remains very active with ongoing development and a welcoming community.
Garage RDBMS
First name: Esteban
Last name: Lorenzano
Type: talk
Video: https://www.youtube.com/watch?v=_kuyAUt5AMw
Abstract: Access to RDBMSs is key to building successful businesses, and Pharo has improved its support for them in recent years, but there is still a lot of work to do. DBXTalk is the umbrella project in which we are grouping all of our relational persistence strategy: it contains low-level database drivers and high-level object mappers.
This talk proposes a review of the state of the art in relational persistence support.
Bio: Esteban Lorenzano, 43 years old. He studied (and left unfinished) Computer Science at Universidad de Buenos Aires, and has worked since 1994 with several object-oriented technologies (Delphi, C++, Java), scaling from “Junior Programmer” to “Senior Architect”. In 2007 he and two friends began a new start-up, Smallworks, a company for agile development centered on Smalltalk. Currently, he is working in the RMoD INRIA team in Lille, France, as a core developer for Pharo.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 - Sujit Pal
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpaCy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
This document summarizes the speaker's experience with StackStorm, including presentations given on the topic from 2011 to 2018. It also outlines some technical aspects of StackStorm such as using journald for logging, addressing timeouts in actions, running components, using CronTimer for scheduling, handling Unicode errors, the React-based web UI, optimizing remote executions, and future plans for StackStorm including moving to Python 3 and integrating Orquesta/Mistral workflow engines.
Presto is an open source distributed SQL query engine originally developed by Facebook. It allows querying of data across multiple data sources including HDFS, S3, MySQL, PostgreSQL and more. Presto has seen significant growth and adoption since its initial release, with over 100 releases and contributions from over 100 developers. It is used in production by Facebook and Netflix on very large datasets and clusters. Teradata has joined the Presto community and aims to enhance enterprise features and provide commercial support through its certified Presto distribution.
This document discusses using Fluentd and AWS together. It provides an overview of how Treasure Data uses Fluentd to collect log data from applications on AWS and forwards it to various AWS services like S3, DynamoDB, and Redshift for storage and analysis. It also describes how Fluentd can be used to collect logs from EC2 instances to monitor them and address issues. The document highlights Fluentd's pluggable architecture and some of its core plugins for buffering, routing, and input/output of log data.
gVisor, Kata Containers, Firecracker, Docker: Who is Who in the Container Space? - ArangoDB Database
View the video of this webinar here: https://www.arangodb.com/arangodb-events/gvisor-kata-containers-firecracker-docker/
Containers* have revolutionized the IT landscape, and for a long time Docker seemed to be the default whenever people were talking about containerization technologies**. But traditional container technologies might not be suitable if strong isolation guarantees are required. So recently, new technologies such as gVisor, Kata Containers, or Firecracker have been introduced to close the gap between the strong isolation of virtual machines and the small resource footprint of containers.
In this talk, we will provide an overview of the different containerization technologies, discuss their tradeoffs, and provide guidance for different use cases.
* We will define the term container in more detail during the talk
** and yes we will also cover some of the pre-docker container space!
«Scrapy internals» - Alexander Sibiryakov, Scrapinghub - it-people
- Scrapy is a framework for web scraping that allows for extraction of structured data from HTML/XML through selectors like CSS and XPath. It provides features like an interactive shell, feed exports, encoding support, and more.
- Scrapy is built on top of the Twisted asynchronous networking framework, which provides an event loop and deferreds. It handles protocols and transports like TCP, HTTP, and more across platforms.
- Scrapy architecture includes components like the downloader, scraper, and item pipelines that communicate internally. Flow control is needed between these to limit memory usage and scheduling through techniques like concurrent item limits, memory limits, and delays between calls.
Kickstart journey of Golang with Hello World - Gopherlabs - Sangam Biradar
This document summarizes key concepts in Go programming including packages, functions, parameters vs arguments, and more. It discusses how every Go file begins with a package name, and the "main" package is the entry point for a program. Functions need to be capitalized to be accessible outside a package. It also provides review questions and references for further reading on Go.
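A minimal sketch of these points: the main function in the "main" package is the program's entry point, and capitalization controls whether an identifier is visible outside its package.

package main

import "fmt"

// Greet starts with a capital letter, so it is exported:
// code in other packages can call it. A lowercase name
// (e.g. greet) would only be visible inside this package.
func Greet(name string) string {
	return "Hello, " + name + "!"
}

// main in package main is where execution starts.
func main() {
	fmt.Println(Greet("world"))
}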
This document introduces Minoru Osuka and provides information about ManifoldCF and Solr. It discusses that Minoru is a committer and PMC member of ManifoldCF at Apache Software Foundation and a senior consultant. It then provides an overview of what ManifoldCF is, its project status, architecture, use cases, resources, books, and demonstration. It concludes by announcing that Minoru's company is now hiring.
ESUG 2014, Cambridge
Wed, August 20, 11:00am – 11:45am
Video:
Part1: https://www.youtube.com/watch?v=_Mv7SX-8Vlk
Part2: https://www.youtube.com/watch?v=qdZq2IZBm4k
Description
Abstract: In this talk we will present the advances and new features in Pharo 3.0. We will present the current work on Pharo 4.0 and beyond.
Decision making - for loop, nested loop, if-else statements, switch in goph... - Sangam Biradar
This document discusses decision making in Golang. It provides an overview of loops, including Go's for loop (which also serves as a while loop), break, continue, and nested loops. It also covers conditionals such as if, else if, else, switch statements, and logical operators. Code examples are provided for each concept via links to an online Golang playground. The author is identified as Sangam Biradar, a Docker community leader who writes tutorials on Golang.
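A short sketch of the constructs the deck covers, in plain Go (note that Go has no separate while keyword; for fills that role):

package main

import "fmt"

func main() {
	// With only a condition, "for" acts as a while loop.
	n := 0
	for n < 3 {
		n++
	}

	// Classic three-part for loop with continue and break.
	for i := 0; i < 10; i++ {
		if i%2 == 0 {
			continue // skip even numbers
		}
		if i > 7 {
			break // stop once i exceeds 7
		}
		fmt.Println("odd:", i)
	}

	// if / else if / else with logical operators.
	if n > 5 && n < 10 {
		fmt.Println("n is between 6 and 9")
	} else if n == 3 || n == 4 {
		fmt.Println("n is 3 or 4")
	} else {
		fmt.Println("something else")
	}

	// switch: cases do not fall through by default in Go.
	switch n {
	case 3:
		fmt.Println("three")
	case 4:
		fmt.Println("four")
	default:
		fmt.Println("other")
	}
}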
PharoDAYS 2015: Pharo Status - by Markus Denker - Pharo
Pharo 4 will be released in Spring 2015 and contains many improvements and updates over previous versions. It has already seen over 1200 issues closed and is very stable. Small changes include improved refactorings, a smaller 6MB deployment image, and ifTrue: working on non-Booleans. Larger ongoing projects include first class variables, replacing the .changes system with Epicea, advanced reflection work, and VM improvements.
This document introduces RethinkDB and Horizon.js for building real-time web applications. It discusses that RethinkDB is an open-source NoSQL database written in C++ that stores JSON and uses its own query language called ReQL. Horizon.js is a JavaScript framework built on RethinkDB and Node.js that allows applications to subscribe to state changes using RxJS. The document provides code samples of using the Horizon class and Collection class to perform operations like storing, watching and querying data in a reactive way.
- Craigslist is a classified advertising website with over 500 cities worldwide and handles over 20 billion pageviews and 50 million users per month. It allows users to post free classified ads for jobs, housing, items for sale, and other services.
- The technical challenges for Craigslist include high ad churn rate, growth in traffic volume, need for data archiving and search capabilities, and maintaining the system with a small team.
- Craigslist uses open source technologies like MySQL, memcached, Apache, and Sphinx to power its infrastructure while keeping it simple, efficient and low cost. It employs techniques like vertical and horizontal data partitioning and incremental indexing to handle its scale.
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie... - Uwe Korn
In the space of building products with data, either by dealing with huge amounts of data or by applying machine learning, many different ecosystems meet. Larger volumes of data have to be passed between these systems. Handling the data is not only a matter of splitting work between systems written in Java that need to pass it on to a machine learning model in Python; when you take into account that you want to integrate with the existing business infrastructure, you also need to cater for legacy systems, as well as bring the large volumes of data to the user via UIs.
Vagrant, Ansible and Docker - How they fit together for productive flexible d... - Samuel Lampa
A very quick overview of how Vagrant, Ansible and Docker fit nicely together as a very productive and flexible solution for creating automated development environments.
3rd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse - Samuel Lampa
This document summarizes Samuel Lampa's 2010 degree project on integrating SWI-Prolog for semantic reasoning in Bioclipse. It compares SWI-Prolog to other semantic tools like Jena and Pellet in terms of speed and expressiveness when querying biochemical data. Prolog code is presented for querying NMR spectrum data that finds molecules with peak values near a search value. SPARQL queries for the same use case are also shown. Observations indicate Prolog is fastest while SPARQL is easier to understand but Prolog allows easier parameter changes and logic reuse. A final presentation was planned for April 28, 2010.
Samuel Lampa presented his MSc thesis on integrating SWI-Prolog as a semantic querying tool in Bioclipse. He demonstrated [1] how SWI-Prolog can be used for semantic querying of biological data in RDF format within Bioclipse, [2] examples of SPARQL and Prolog code used to perform semantic queries, and [3] benchmarking of Prolog's performance as a semantic querying tool. The work adds new semantic querying functionality to Bioclipse using SWI-Prolog and demonstrates its ability to efficiently query biological data.
This document discusses using Vagrant, Ansible, and Docker together to build portable infrastructure that avoids dependency issues, allows consistent workflows, and reduces risk. Vagrant is used to create and manage virtual environments from a configuration file. Ansible then provides configuration management through push-based execution of tasks without a client. Docker adds portability by allowing applications to run in lightweight isolated containers across machines. A sample project demonstrates Vagrant starting a VM, Ansible provisioning it by starting a Docker container, and an application running within the container.
Reproducibility in Scientific Data Analysis - BioScience Seminar - Samuel Lampa
Slides for a talk held at BioScience Seminar at Dept. of Pharmaceutical BioSciences at Uppsala University on December 16, 2016.
The event webpage: http://www.farmbio.uu.se/calendar/kalendarium-detaljsida/?eventId=22496
Structure of the talk:
Reproducibility in Scientific Data Analysis ...
● What is it?
● Why is it important?
● Why is it a problem?
● What can we do about it?
● What does pharmb.io do about it?
POC Conference 2015
Virtual Appliances have become very prevalent these days, as virtualization is ubiquitous and hypervisors commonplace. More and more of the major vendors are providing literally virtual clones for many of their once physical-only products. Like IoT and the CAN bus, it's early in the game and vendors are late as usual. One thing that is catching these vendors off guard is the huge additional attack surface, ripe with vulnerabilities, added in the process. Also, many vendors see software appliances as an opportunity for the customer to easily evaluate the product before buying the physical one, making these editions more accessible and debuggable by utilizing features of the platform on which they run. During this talk, I will provide real case studies of various vulnerabilities created by mistakes that many of the major players made when shipping their appliances. You'll learn how to find these bugs yourself and how the vendors went about fixing them, if at all. By the end of this talk, you should have a firm grasp of how one goes about getting remote root on appliances.
RDFIO is an RDF import and query extension for MediaWiki. It allows users to import RDF triples into MediaWiki and query the triples using SPARQL. The architecture includes an in-memory RDF store to hold the triples and a SPARQL endpoint for querying. Future plans include enhancing editing capabilities via templates and importing triples on a per-page basis. Samuel Lampa presented RDFIO and is looking for additional ideas to improve the extension.
This document provides an outline for a tutorial on importing and editing data in Stata. It discusses importing comma-separated value files, generating new variables, saving and reopening datasets, creating do-files, using Stata's help system, and presents a challenge involving creating a dummy variable using imported data. The tutorial materials are based on examples from an introductory econometrics textbook and the datasets can be downloaded from Stata's website.
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse - Samuel Lampa
Contains a short background on the semantic web, and shows how Prolog is intended to be used from inside the Bioclipse research software for RDF data handling.
This document provides a cheat sheet overview of key concepts in the IRODS rule language, including numeric and string literals, arithmetic and comparison operators, functions for strings, lists, tuples, if/else statements, foreach loops, defining functions and rules, handling errors, and inductive data types. It describes syntax for defining data types using constructors, and using pattern matching to define functions over data types.
Ready to leverage the power of a graph database to bring your application to the next level, but all the data is still stuck in a legacy relational database?
Fortunately, Neo4j offers several ways to quickly and efficiently import relational data into a suitable graph model. It's as simple as exporting the subset of the data you want to import and ingesting it, either with an initial loader in seconds or minutes, or by applying Cypher's power to put your relational data transactionally into the right places of your graph model.
In this webinar, Michael will also demonstrate a simple tool that can load relational data directly into Neo4j, automatically transforming it into a graph representation of your normalized entity-relationship model.
Producing, publishing and consuming linked data - CSHALS 2013 - François Belleau
This document discusses lessons learned from the Bio2RDF project for producing, publishing, and consuming linked data. It outlines three key lessons: 1) How to efficiently produce RDF using existing ETL tools like Talend to transform data formats into RDF triples; 2) How to publish linked data by designing URI patterns, offering SPARQL endpoints and associated tools, and registering data in public registries; 3) How to consume SPARQL endpoints by building semantic mashups using workflows to integrate data from multiple endpoints and then querying the mashup to answer questions.
This document proposes a content model and API to unify access to different types of content like wikis, RDF, binaries, and more. It aims to be used in projects like NEPOMUK, WAVES, and WIF. The model represents content at different levels of granularity from words to documents. Content can be annotated with semantic statements and metadata. All content is addressable and versioned. The API provides functions for basic CRUD operations as well as fulltext search and auto-completion support through a keyword index.
Build an application upon Semantic Web models. Brief overview of Apache Jena and OWL-API.
Semantic Web course
e-Lite group (https://elite.polito.it)
Politecnico di Torino, 2017
OpenEvent is a Drupal distribution that represents an Event Open Data Model and publishes event data through a self-documented API. It aims to be a generic foundation for cultural organizations to manage and publish their events online. The distribution includes Drupal 7, the Open Data Model, Schema.org mappings, and features like a read-only API. Future plans include moving it to Drupal.org, improving documentation, refactoring custom code into reusable modules, and attending to the issue queue. Lessons learned include benefits of open source like higher developer motivation and easier code sharing.
As of Drupal 7 we'll have RDFa markup in core. In this session I will:
- explain what the implications of this are and why this matters
- give a short introduction to the Semantic Web, RDF, RDFa and SPARQL in human language
- give a short overview of the RDF modules that are available in contrib
- talk about some of the potential use cases of all these magical technologies
This document discusses tools for improving reproducibility in research, including hosting data in GigaDB, sharing images using OMERO, implementing workflows using Galaxy and executable documents, and sharing virtual machines. It emphasizes the need for publishers to host and curate research objects like data, code, and workflows and provide citations for reproducible research. Key tools highlighted are GigaDB for data hosting, OMERO for image hosting, Galaxy for implementing workflows, and virtual machines for sharing full computational environments.
Big data, just an introduction to Hadoop and Scripting Languages - Corley S.r.l.
This document provides an introduction to Big Data and Apache Hadoop. It defines Big Data as large and complex datasets that are difficult to process using traditional database tools. It describes how Hadoop uses MapReduce and HDFS to provide scalable storage and parallel processing of Big Data. It provides examples of companies using Hadoop to analyze exabytes of data and common Hadoop use cases like log analysis. Finally, it summarizes some popular Hadoop ecosystem projects like Hive, Pig, and Zookeeper that provide SQL-like querying, data flows, and coordination.
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4th generation (4G) of Big Data analytics frameworks, providing the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
2. How does Apache Flink integrate with Hadoop and other open source tools for data input and output as well as deployment?
3. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Drupal 7 will use RDFa markup in core. In this session I will:
- explain what the implications of this are and why this matters
- give a short introduction to the Semantic Web, RDF, RDFa and SPARQL in human language
- give a short overview of the RDF modules that are available in contrib
- talk about some of the potential use cases of all these magical technologies
This is a talk from the Drupal track at Fosdem 2010.
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion - Flink Forward
This document introduces Okkam, an Italian company that uses Apache Flink for large-scale data integration and semantic technologies. It discusses Okkam's use of Flink for domain reasoning, RDF data processing, duplicate detection, entity linkage, and telemetry analysis. The document also provides lessons learned from Okkam's Flink experiences and suggestions for improving Flink.
Linked data enhanced publishing for special collections (with Drupal) - Joachim Neubert
This document discusses using Drupal 7 as a content management system for publishing special collections as linked open data. It provides an overview of how Drupal allows customizing content types and fields for mapping to RDF properties. While Drupal 7 provides basic RDFa support out of the box, there are some limitations around nested RDF structures and multiple entities per page that may require custom code. The document outlines some additional linked data modules for Drupal 7 and highlights improved RDF support anticipated in Drupal 8.
Knowledge graph construction with a façade - The SPARQL Anything Project - Enrico Daga
The document discusses a project called "SPARQL Anything" which aims to simplify knowledge graph construction by using SPARQL as the single language for representing and transforming diverse data formats into RDF. It presents an approach called "Facade-X" which defines a common RDF structure that can be applied over different formats like CSV, JSON, HTML, etc. This facade focuses on the RDF meta-model and aims to apply minimal ontological commitments. The document outlines how Facade-X can be used to represent different formats and provides examples of using SPARQL to transform sample data into RDF without committing to a specific domain ontology.
Intro to DefectDojo at OWASP Switzerland - Matt Tesauro
This document introduces Fred Blaise and provides information about OWASP DefectDojo. DefectDojo is an open-source application vulnerability correlation and security orchestration tool that consolidates findings from multiple tools, tracks vulnerabilities, and enables automation through its REST API. It can ingest reports from many common security tools and helps automate previously manual processes to improve security and allow small teams to manage large application security programs. The document demonstrates how DefectDojo can be deployed in various environments and discusses its features, community, and recent improvements.
My presentation on RDFauthor at EKAW2010, Lisbon. For more information on RDFauthor visit http://aksw.org/Projects/RDFauthor; for the code visit http://code.google.com/p/rdfauthor/.
The document discusses tools and methods for improving reproducibility in research, including open data and open source tools. It summarizes that less than 30% of published studies are reproducible due to lack of sharing of data, code, and workflows. It promotes hosting research objects like data, images, and workflows to make them accessible and citable. Specific tools mentioned include GigaDB for data, OMERO for images, Galaxy and executable documents for workflows, and virtual machines for replicable computational environments.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Apache Marmotta is a linked data platform that provides a linked data server, SPARQL server, and development environment for building linked data applications. It uses modular components including a triplestore backend, SPARQL endpoint, LDCache for remote data access, and an optional reasoner. Marmotta is implemented as a Java web application and uses services, dependency injection, and REST APIs.
Redfish is an IPMI replacement standardized by the DMTF. It provides a RESTful API for server out of band management and a lightweight data model specification that is scalable, discoverable and extensible. (Cf: http://www.dmtf.org/standards/redfish). This presentation will start by detailing its role and the features it provides with examples. It will demonstrate the benefits it provides to system administrator by providing a standardized open interface for multiple servers, and also storage systems.
We will then cover various tools such as the DMTF ones and the python-redfish library (Cf: https://github.com/openstack/python-redfish) offering Redfish abstractions.
"Xapi-lang For declarative code generation" By James NelsonGWTcon
Xapi-lang is a Java parser enhanced with an XML-like syntax that can be used for code generation, templating, and creating domain-specific languages. It works by parsing code into an abstract syntax tree and then applying visitors to analyze and transform the AST to produce output. Examples shown include class templating, API generation from templates, and UI component generation. The document also discusses best practices for code generation and outlines plans for rebuilding the GWT toolchain to support GWT 3 and J2CL. It promotes a new company, Vertispan, for GWT support and consulting and introduces another project called We The Internet for building tools to improve political systems using distributed democracy.
SciCommander - Provenance reports for outputs of ad-hoc analyses - Samuel Lampa
There exist a multitude of pipeline tools for bioinformatics [1]. As using a pipeline tool is more complex than just writing shell scripts [2], a lot of bioinformatics work happens in a more ad-hoc fashion, with individual shell commands executed to run analyses. This makes it much harder to keep a full audit log of the analyses, since it is easy to miss documenting some steps. It is also often not clear afterwards which output files were created by which command. Additionally, shell scripts lack functionality to re-use already finished intermediary output files in order to resume cancelled runs.

SciCommander is a tool that addresses these limitations: it tracks the produced output files of almost any shell command and avoids re-running already executed commands, requiring only slight changes to the commands, namely prefixing the command itself and marking its inputs, outputs and parameters with special markers. Using this information, SciCommander checks whether any of the output files already exist, and if so skips that command. Secondly, it produces an audit log for each output file in JSON format, and includes a command to convert this file into an HTML report with a graphical visualization of all the steps needed to produce that particular file. All in all, this functionality provides full provenance at the individual file level, and allows resuming interrupted runs even for simple shell commands. SciCommander is open source and can be installed via the Python Package Index, or from GitHub: https://github.com/samuell/scicommander
Using Flow-based programming to write tools and workflows for Scientific Comp... - Samuel Lampa
The document summarizes Samuel Lampa's talk on using flow-based programming for scientific computing. It provides biographical information on Samuel Lampa, including his background in pharmaceutical bioinformatics and current work. It then gives an overview of flow-based programming, describing it as using black box processes connected by data flows, with connections specified separately from processes. Benefits mentioned include easy testing, monitoring, and changing connections without rewriting components. Examples of using FBP in Go are also presented.
Linked Data for improved organization of research data - Samuel Lampa
Slides for a talk at a Farmbio BioScience Seminar on May 18, 2018 (http://farmbio.uu.se), introducing Linked Data as a way to manage research data that better keeps track of provenance, makes its semantics more explicit, and makes the data more easily integrated with other data and consumed by others, both humans and machines.
How to document computational research projects - Samuel Lampa
These slides are from an internal meeting at pharmb.io where we discussed ways to improve documentation of our internal computational research projects. The winning solution turns out to be markdown files, versioned with git. The slides explain a little bit about why.
AddisDev Meetup ii: Golang and Flow-based Programming - Samuel Lampa
The document discusses flow-based programming (FBP), its history and concepts. FBP defines applications as networks of processes that exchange data through message passing over predefined connections. This allows the processes to be reconnected without changing their code. The document provides examples of FBP networks and components implemented in various languages like Go, Java and JavaScript. It also discusses the benefits of FBP and its growing popularity with implementations like NoFlo.
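The FBP idea maps naturally onto Go's goroutines and channels. Below is a minimal framework-free sketch of the concept (not GoFlow's actual API): each process only knows its own in- and out-ports, and the wiring lives separately in main, so connections can be changed without touching process code.

package main

import (
	"fmt"
	"strings"
)

// generate is a source process: it emits words on its out-port, then closes it.
func generate(out chan<- string, words ...string) {
	for _, w := range words {
		out <- w
	}
	close(out)
}

// upperCase is a transform process: it only sees its in- and out-ports.
func upperCase(in <-chan string, out chan<- string) {
	for w := range in {
		out <- strings.ToUpper(w)
	}
	close(out)
}

// printer is a sink process that signals completion when its input is drained.
func printer(in <-chan string, done chan<- struct{}) {
	for w := range in {
		fmt.Println(w)
	}
	done <- struct{}{}
}

func main() {
	// The network definition: connections declared separately from processes.
	c1 := make(chan string)
	c2 := make(chan string)
	done := make(chan struct{})

	go generate(c1, "hello", "flow", "based")
	go upperCase(c1, c2)
	go printer(c2, done)

	<-done
}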
First encounter with Elixir - Some random things - Samuel Lampa
The document discusses Samuel Lampa's first encounter with the programming language Elixir. It covers calculating GC ratios in DNA sequences, provides a DNA sequence example file, and compares Elixir processes to Go channels, noting that Elixir processes are named mailboxes tied to a process, while Go channels are anonymous and separate from goroutines. The document is authored by Samuel Lampa from Uppsala University.
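A small sketch of that contrast from the Go side: a channel is a first-class value not owned by any goroutine, so any number of goroutines can share it.

package main

import "fmt"

func main() {
	// A channel is a standalone value, not tied to any particular goroutine
	// (unlike an Elixir mailbox, which belongs to one process).
	ch := make(chan int)

	// Several goroutines can send on the same channel...
	for i := 1; i <= 3; i++ {
		go func(n int) { ch <- n }(i)
	}

	// ...and any goroutine holding a reference can receive from it.
	sum := 0
	for i := 0; i < 3; i++ {
		sum += <-ch
	}
	fmt.Println("sum:", sum) // prints: sum: 6
}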
Profiling Go code - a beginner's tutorial - Samuel Lampa
This document summarizes a presentation on profiling Go code. It introduces profiling with pprof, along with Dave Cheney's profile package that makes profiling Go code easier. It demonstrates this by profiling a string processing program and showing the performance improvements from various optimizations. It recommends resources for learning more about profiling Go programs with pprof and about high performance Go programming.
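A minimal CPU-profiling sketch using Go's standard runtime/pprof package, shown here as a generic illustration rather than the deck's exact approach:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// busyWork is a stand-in for the code you actually want to profile.
func busyWork() int {
	sum := 0
	for i := 0; i < 100_000_000; i++ {
		sum += i
	}
	return sum
}

func main() {
	// Write CPU profile data to a file for later inspection.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	busyWork()
	// Afterwards, analyze the profile with: go tool pprof cpu.pprof
}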
This document provides an overview of flow-based programming (FBP). FBP is a programming paradigm where applications are defined as networks of black box processes that exchange data through predefined connections. These connections can be redefined without changing the internal processes, allowing for endless reconfiguration. FBP was invented in the 1960s and has seen a resurgence of interest with tools like NoFlo that allow building distributed applications as connected processes. The document discusses several open source FBP implementations and frameworks and provides examples of how FBP has been used to build applications and bioinformatics libraries.
My lightning talk at Go Stockholm meetup Aug 6th 2013 - Samuel Lampa
This document discusses flow-based programming, an approach to programming invented in the 1970s that emphasizes the flow of data between components. It was successfully used in several domains including data analysis, banking software, and digital signal processing. New implementations of flow-based programming include NoFlo for Node.js and GoFlow, an open-source implementation in Go. More information on flow-based programming can be found on the listed websites.
Mastering Advanced Window Functions in SQL - Spiral Mantra
How well do you really know SQL?📊
If PARTITION BY and ROW_NUMBER() sound familiar but still confuse you, it's time to upgrade your knowledge.
And you can schedule a 1:1 call with our industry experts: https://spiralmantra.com/contact-us/ or drop us a mail at [email protected]
Social Media App Development Company - EmizenTech - Steve Jonas
EmizenTech is a trusted Social Media App Development Company with 11+ years of experience in building engaging and feature-rich social platforms. Our team of skilled developers delivers custom social media apps tailored to your business goals and user expectations. We integrate real-time chat, video sharing, content feeds, notifications, and robust security features to ensure seamless user experiences. Whether you're creating a new platform or enhancing an existing one, we offer scalable solutions that support high performance and future growth. EmizenTech empowers businesses to connect users globally, boost engagement, and stay competitive in the digital social landscape.
Unlocking the Power of IVR: A Comprehensive Guide - vikasascentbpo
Streamline customer service and reduce costs with an IVR solution. Learn how interactive voice response systems automate call handling, improve efficiency, and enhance customer experience.
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, presentation slides, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
This is the keynote of the Into the Box conference, highlighting the release of the BoxLang JVM language, its key enhancements, and its vision for the future.
Vaibhav Gupta, BAML: AI workflows without Hallucinations - john409870
Shipping Agents
Vaibhav Gupta
Cofounder @ Boundary
in/vaigup
boundaryml/baml
Imagine if every API call you made failed only 5% of the time.
Imagine if every LLM call you made failed only 5% of the time.
Fault tolerant systems are hard, but now everything must be fault tolerant. We need to change how we think about these systems.
Aaron Villalpando, Cofounder @ Boundary
We used to write websites like this:
But now we do this:
Problems web dev had:
Strings. Strings everywhere.
State management was impossible.
Dynamic components? forget about it.
Reuse components? Good luck.
Iteration loops took minutes.
Low engineering rigor
React added engineering rigor. The syntax we use changes how we think about problems.
We used to write agents like this:
Problems agents have:
Strings. Strings everywhere.
Context management is impossible.
Changing one thing breaks another.
New models come out all the time.
Iteration loops take minutes.
Low engineering rigor
Agents need the expressiveness of English, but the structure of code.
F*** You, Show Me The Prompt.
<show don’t tell>
Less prompting + More engineering = Reliability + Maintainability
BAML team: Sam, Greg, Antonio, Chris. Backgrounds include turning down OpenAI to join, an ex-founder who was one of the earliest BAML users, an MIT PhD, 20+ years in compilers, and a member who made his own database with 400k+ YouTube views.
Vaibhav Gupta
in/vaigup
[email protected]
boundaryml/baml
Thank you!
Semantic Cultivators: The Critical Future Role to Enable AI - artmondano
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
Generative Artificial Intelligence (GenAI) in Business - Dr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business: benefits, opportunities and limitations. I also discussed how my research on the Theory of Cognitive Chasms helps address some of these issues.
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc
Most consumers believe they’re making informed decisions about their personal data—adjusting privacy settings, blocking trackers, and opting out where they can. However, our new research reveals that while awareness is high, taking meaningful action is still lacking. On the corporate side, many organizations report strong policies for managing third-party data and consumer consent yet fall short when it comes to consistency, accountability and transparency.
This session will explore the research findings from TrustArc’s Privacy Pulse Survey, examining consumer attitudes toward personal data collection and practical suggestions for corporate practices around purchasing third-party data.
Attendees will learn:
- Consumer awareness around data brokers and what consumers are doing to limit data collection
- How businesses assess third-party vendors and their consent management operations
- Where business preparedness needs improvement
- What these trends mean for the future of privacy governance and public trust
This discussion is essential for privacy, risk, and compliance professionals who want to ground their strategies in current data and prepare for what’s next in the privacy landscape.
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxshyamraj55
We’re bringing the TDX energy to our community with 2 power-packed sessions:
🛠️ Workshop: MuleSoft for Agentforce
Explore the new version of our hands-on workshop featuring the latest Topic Center and API Catalog updates.
📄 Talk: Power Up Document Processing
Dive into smart automation with MuleSoft IDP, NLP, and Einstein AI for intelligent document workflows.
Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
The Evolution of Meme Coins: A New Era for Digital Currency ppt.pdfAbi john
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility that elevate them into a new era of cryptocurrency.
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up to date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
HCL Nomad Web – Best Practices and Managing Multiuser Environmentspanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-nomad-web-best-practices-and-managing-multiuser-environments/
HCL Nomad Web is heralded as the next generation of the HCL Notes client, offering numerous advantages such as eliminating the need for packaging, distribution, and installation. Nomad Web client upgrades will be installed “automatically” in the background. This significantly reduces the administrative footprint compared to traditional HCL Notes clients. However, troubleshooting issues in Nomad Web presents unique challenges compared to the Notes client.
Join Christoph and Marc as they demonstrate how to simplify the troubleshooting process in HCL Nomad Web, ensuring a smoother and more efficient user experience.
In this webinar, we will explore effective strategies for diagnosing and resolving common problems in HCL Nomad Web, including:
- Accessing the console
- Locating and interpreting log files
- Accessing the data folder within the browser’s cache (using OPFS)
- Understanding the difference between single- and multi-user scenarios
- Utilizing Client Clocking
Batch import of large RDF datasets into Semantic MediaWiki
1. Batch import of large RDF datasets using RDFIO or the new rdf2smw tool
Samuel Lampa - @smllmp
PhD Student in Pharmaceutical Bioinformatics @ pharmb.io
with Assoc. Prof. Ola Spjuth - @ola_spjuth
@ Dept. of Pharm. Biosci. / Uppsala University
Semantic MediaWiki Conference Fall 2016, Frankfurt am Main
3. Research interests
● Large datasets
● Automation
● Scientific workflows
● Machine Learning
● Semantic data
● Reasoning
● Query systems
● Something user friendly
● … and hopefully usable
● “Answer ALL the research questionz”
6. What’s the problem?
● … but not really any (proper) RDF import (as in: plain triples → wiki syntax in articles)
7. RDFIO What?!
● SMW extension
● Import plain RDF triples
● No need for an ontology
● RDF URIs → Wiki titles
● Retains original URIs
● Translates back to original URIs on export
● Round-trip SMW ↔ RDF
● tinyurl.com/getrdfio
8. Turning RDF Triples into Wiki Pages
<http://ex.org/Stockholm> <http://ex.org/onto/LocatedIn> <http://ex.org/Sweden> .
<http://ex.org/Stockholm> <http://ex.org/onto/Population> "789024"^^xsd:integer .
<http://ex.org/Frankfurt> <http://ex.org/onto/LocatedIn> <http://ex.org/Germany> .
<http://ex.org/Frankfurt> <http://ex.org/onto/Population> "731095"^^xsd:integer .
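To make the mapping concrete, here is a rough sketch (not from the slides) of what the two Stockholm triples could become on a wiki page titled Stockholm. The property page names follow the last segment of the property URIs, and the “Original URI” property name is an assumption standing in for however RDFIO actually records the source URI:

[[LocatedIn::Sweden]]
[[Population::789024]]
[[Original URI::http://ex.org/Stockholm]]

Because the original URIs are retained on the page, an RDF export can translate the wiki-local names back and reproduce the input triples, which is the round-trip property from the previous slide.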
17. RDFIO – Current Status
● SMW 2.3 support – with some hacks (Ali working on the last minor issues)
● See the Vagrant box for a working automated setup with MW 1.26.4 + SMW 2.3.1: github.com/rdfio/rdfio-vagrantbox
● Some known minor issues
21. The new rdf2smw tool
● Convert RDF → MediaWiki XML (Really fast!)
● Import via MediaWiki XML import (Still slow...)
● But: Can now preview before the XML import!
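As a hedged sketch of this two-step flow (the rdf2smw flag names and file names are illustrative assumptions, not confirmed against its documentation, while importDump.php is MediaWiki’s standard XML import script):

$ rdf2smw --in dataset.nt --out pages.xml    # fast: convert RDF triples to MediaWiki XML
$ php maintenance/importDump.php pages.xml   # slow: standard MediaWiki XML import

The intermediate pages.xml is what enables previewing: it can be inspected, or even edited, before anything is written to the wiki.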
22. More rdf2smw facts:
● Written in Go for compiled, multi-core performance
● Very pluggable architecture
● Easy to install: Just download and run!
● Get it: github.com/samuell/rdf2smw
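Assuming a Go toolchain, building from source should be a single command in the GOPATH style that was current in 2016 (newer Go versions would use go install with a version suffix instead); otherwise a prebuilt binary can be downloaded from the repository’s releases page:

$ go get github.com/samuell/rdf2smw    # fetches, compiles and installs the binary into $GOPATH/bin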
25. Future outlook
● How to make RDFIO more maintainable, for developers with too little time?
● Drastically simplify?
● Break out well-defined sub-modules? (SPARQL endpoint, RDF import, etc.)
● Integrate with the MW REST API instead of a dedicated Special page – as per Denny’s original idea with SMWWriter?
● Re-use core SMW functionality more? (Or not?)
● Your ideas?
27. The new Vagrant box: Set up MW + SMW + RDFIO in 7 steps
1) Install dependencies (Vagrant and VirtualBox)
2) $ git clone https://github.com/rdfio/rdfio-vagrantbox.git
3) $ cd rdfio-vagrantbox
4) $ vagrant up
5) Surf in on localhost:8080/w/index.php/Special:RDFIOAdmin
6) Log in with username Admin and password changethis
7) Click “Setup”
Done!
28. Acknowledgements
● Denny Vrandečić (@vrandezo) - Basically had the same idea for an extension already when the (eventually accepted) GSoC proposal was submitted in 2010, and supported the project with valuable ideas and through mentoring the GSoC 2010 project.
● Ali King (@ali_king) - Has done great work updating the extension to the latest standards and versions, and added the new template editing functionality, as part of an OPW 2014 project.
● Joel Sachs (@xjsachs) - Championed the addition of the template editing functionality, provided valuable encouragement and mentored Ali King’s FOSS OPW project.
● Egon Willighagen (@egonwillighagen) - Has supported the project with valuable testing, constructive feedback, encouragement and new ideas.
● Ola Spjuth (@ola_spjuth) - Has provided constructive feedback and encouragement, as well as financed parts of the further development of the project.
● Google Inc. - Supported the initial development through its Summer of Code program (GSoC) in 2010.
● Gnome Foundation - Supported further development as part of its Outreach Program for Women (OPW) in 2014.