RDF and the Hadoop Ecosystem 
Rob Vesse 
Twitter: @RobVesse 
Email: rvesse@apache.org

• Software Engineer at YarcData (part of Cray Inc)
• Working on big data analytics products
• Active open source contributor, primarily to RDF & SPARQL related projects
  • Apache Jena Committer and PMC Member
  • dotNetRDF Lead Developer
• Primarily interested in RDF, SPARQL and Big Data Analytics technologies

• What's missing in the Hadoop ecosystem?
• What's needed to fill the gap?
• What's already available?
  • Jena Hadoop RDF Tools
  • GraphBuilder
  • Other Projects
• Getting Involved
• Questions

Apache, the projects and their logos shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries

• No first class projects
• Some limited support in other projects
  • E.g. Giraph supports RDF by bridging through the TinkerPop stack
• Some existing external projects
  • Lots of academic proofs of concept
  • Some open source efforts but mostly task specific
  • E.g. Infovore, targeted at creating curated Freebase and DBpedia datasets

• Need to efficiently represent RDF concepts as Writable types
  • Nodes, Triples, Quads, Graphs, Datasets, Query Results etc.
  • What's the minimum viable subset?
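
To make "represent RDF concepts as Writable types" a little more concrete, here is a minimal sketch of a triple that serializes itself the way a Hadoop Writable does. It mimics the two methods of Hadoop's Writable interface, write(DataOutput) and readFields(DataInput), but uses only java.io so it runs without Hadoop on the classpath. The class name SimpleTripleWritable, the helper methods and the plain-string node encoding are all assumptions made for illustration; this is not how the Jena modules encode nodes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical class: mirrors the shape of Hadoop's Writable interface
// (write/readFields) without depending on Hadoop itself.
public class SimpleTripleWritable {
    private String subject = "";
    private String predicate = "";
    private String object = "";

    // Writables need a no-arg constructor so the framework can create
    // an empty instance and then populate it via readFields()
    public SimpleTripleWritable() { }

    public SimpleTripleWritable(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    // Serialize the three nodes as length-prefixed strings
    // (writeUTF is limited to 65535 encoded bytes per string)
    public void write(DataOutput out) throws IOException {
        out.writeUTF(subject);
        out.writeUTF(predicate);
        out.writeUTF(object);
    }

    // Repopulate this instance in place from the serialized form
    public void readFields(DataInput in) throws IOException {
        subject = in.readUTF();
        predicate = in.readUTF();
        object = in.readUTF();
    }

    // Demo helper (assumption): serialize to a byte array
    public static byte[] toBytes(SimpleTripleWritable t) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            t.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper (assumption): deserialize from a byte array
    public static SimpleTripleWritable fromBytes(byte[] data) {
        try {
            SimpleTripleWritable t = new SimpleTripleWritable();
            t.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return t;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String toString() {
        return subject + " " + predicate + " " + object + " .";
    }

    public static void main(String[] args) {
        SimpleTripleWritable original = new SimpleTripleWritable(
                "<http://example.org/a>", "<http://example.org/p>", "\"value\"");
        SimpleTripleWritable copy = fromBytes(toBytes(original));
        System.out.println(copy);
    }
}
```

A real implementation would also need to distinguish IRIs, blank nodes and literals (with datatypes and language tags), which is part of why the choice of the minimum viable subset matters.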

• Need to be able to get data in and out of RDF formats
  • Without this we can't use the power of the Hadoop ecosystem to do useful work
• Lots of serializations out there:
  • RDF/XML
  • Turtle
  • NTriples
  • NQuads
  • JSON-LD
  • etc.
• Also would like to be able to produce end results as RDF

• Map/Reduce building blocks
  • Common operations e.g. splitting
  • Enable developers to focus on their applications
• User friendly tooling
  • i.e. non-programmer tools
CC BY-SA 3.0 Wikimedia Commons

• Set of modules that are part of the Apache Jena project
  • Originally developed at Cray and donated to the project earlier this year
• Experimental modules on the hadoop-rdf branch
• Currently only available as development SNAPSHOT releases
  • Group ID: org.apache.jena
  • Artifact IDs:
    • jena-hadoop-rdf-common
    • jena-hadoop-rdf-io
    • jena-hadoop-rdf-mapreduce
  • Latest Version: 0.9.0-SNAPSHOT
• Aims to fulfill all the basic requirements for enabling RDF on Hadoop
• Built against Hadoop Map/Reduce 2.x APIs

• Provides the Writable types for RDF primitives
  • NodeWritable
  • TripleWritable
  • QuadWritable
  • NodeTupleWritable
• All backed by RDF Thrift
  • A compact binary serialization for RDF using Apache Thrift
  • See http://afs.github.io/rdf-thrift/
  • Extremely efficient to serialize and deserialize
  • Allows for efficient WritableComparator implementations that perform binary comparisons
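
The point about binary comparisons can be illustrated with a small sketch. Hadoop's WritableComparator exposes a raw compare(byte[], int, int, byte[], int, int) hook so that keys can be ordered during the sort and shuffle without deserializing them. The code below is a stand-alone approximation of that idea using only the JDK; the class name RawBytesComparison and the use of UTF-8 strings as the "serialized" form are assumptions for illustration (the real types use RDF Thrift).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: orders serialized keys by comparing their bytes
// directly, with no deserialization step.
public class RawBytesComparison {

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned
    public static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int common = Math.min(l1, l2);
        for (int i = 0; i < common; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter range sorts first
        return l1 - l2;
    }

    // Demo helper (assumption): stands in for "already serialized" keys
    public static int compareSerialized(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return compareBytes(x, 0, x.length, y, 0, y.length);
    }

    public static void main(String[] args) {
        System.out.println(compareSerialized("<http://example.org/a>", "<http://example.org/b>") < 0);
        System.out.println(compareSerialized("<http://example.org/a>", "<http://example.org/a>") == 0);
    }
}
```

The resulting order does not need to mean anything in RDF terms. It only has to be consistent, so that identical keys end up adjacent and reach the same reduce call.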

• Provides InputFormat and OutputFormat implementations
  • Supports most formats that Jena supports
  • Designed to be extendable with new formats
• Will split and parallelize inputs where the RDF serialization is amenable to this
• Also transparently handles compressed inputs and outputs
  • Note that compression blocks splitting
  • i.e. trade off between IO and parallelism
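
A rough sketch of why a line-based serialization such as NTriples is amenable to splitting: a file can be cut at arbitrary byte offsets, and each split then aligns itself to line boundaries. The code below follows the usual convention for line-oriented record readers, where a split that does not start at offset 0 discards its first (possibly partial) line, and every split keeps reading through the line that crosses its end. The class and method names are assumptions for illustration and this is not the Jena input format code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: reads line-aligned records from byte-range splits
public class LineSplitSketch {

    // Returns the lines owned by the split [start, start + length)
    public static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Discard the line containing 'start'; the previous split reads it
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Own every line that begins at or before 'end'
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    // Demo helper (assumption): cut the text into fixed-size splits and read each
    public static List<String> readAllSplits(String text, int splitSize) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            all.addAll(readSplit(data, start, splitSize));
        }
        return all;
    }

    public static void main(String[] args) {
        String text = "<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n"
                + "<http://example.org/b> <http://example.org/p> <http://example.org/c> .\n"
                + "<http://example.org/c> <http://example.org/p> \"done\" .\n";
        // Every triple is read exactly once, whatever the split size
        for (String line : readAllSplits(text, 25)) {
            System.out.println(line);
        }
    }
}
```

The same trick does not carry over to a compressed stream that cannot be entered at an arbitrary offset, which is the IO versus parallelism trade off noted above. Formats whose statements span lines or depend on earlier prefix declarations, such as Turtle or RDF/XML, generally cannot be cut this way either.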

• Various reusable building block Mapper and Reducer implementations:
  • Counting
  • Filtering
  • Grouping
  • Splitting
  • Transforming
• Can be used as-is to do some basic Hadoop tasks or used as building blocks for more complex tasks
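
To give a feel for how such building blocks compose, the sketch below chains a filtering step and a grouping step over an in-memory list of triples. It is plain Java with no Hadoop dependency; the class name, the method names and the use of String arrays holding placeholder node names such as ex:a are all assumptions for illustration, not the API of the Jena modules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class: in-memory stand-ins for filter and group steps
public class TripleBuildingBlocks {

    // Filtering: keep only triples (subject, predicate, object) with the given predicate
    public static List<String[]> filterByPredicate(List<String[]> triples, String predicate) {
        List<String[]> kept = new ArrayList<>();
        for (String[] triple : triples) {
            if (triple[1].equals(predicate)) {
                kept.add(triple);
            }
        }
        return kept;
    }

    // Grouping: collect triples under their subject, much as a shuffle
    // keyed on the subject would deliver them to a reducer
    public static Map<String, List<String[]>> groupBySubject(List<String[]> triples) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] triple : triples) {
            groups.computeIfAbsent(triple[0], k -> new ArrayList<>()).add(triple);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[] { "ex:a", "ex:knows", "ex:b" },
                new String[] { "ex:a", "ex:name", "\"Alice\"" },
                new String[] { "ex:a", "ex:knows", "ex:c" },
                new String[] { "ex:b", "ex:knows", "ex:c" });
        // Compose the two steps: filter first, then group what is left
        Map<String, List<String[]>> grouped = groupBySubject(filterByPredicate(triples, "ex:knows"));
        for (Map.Entry<String, List<String[]>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " triple(s)");
        }
    }
}
```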

• For NTriples inputs, compared the performance of a Text based node count versus an RDF based node count
• Performance as good (within 10%) and sometimes significantly better
  • Heavily dataset dependent
  • Varies considerably with cluster setup
  • Also depends on how the input is processed
  • YMMV!
• For other RDF formats you would struggle to implement this at all
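
For context, a Text based node count over NTriples amounts to string slicing on each line, roughly as sketched below. This is an illustrative reconstruction rather than the code that was benchmarked, and the class name and the parsing shortcuts are assumptions: it expects exactly one space between terms and treats everything between the predicate and the final dot as the object.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class: counts node usage by slicing NTriples lines as text
public class TextNodeCount {

    public static Map<String, Long> countNodes(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int firstSpace = trimmed.indexOf(' ');
            int secondSpace = trimmed.indexOf(' ', firstSpace + 1);
            int lastDot = trimmed.lastIndexOf('.');
            if (firstSpace < 0 || secondSpace < 0 || lastDot < secondSpace) {
                continue; // not something this naive slicer understands
            }
            String subject = trimmed.substring(0, firstSpace);
            String predicate = trimmed.substring(firstSpace + 1, secondSpace);
            // The object may contain spaces (literals), so take the rest of the line
            String object = trimmed.substring(secondSpace + 1, lastDot).trim();
            for (String node : new String[] { subject, predicate, object }) {
                counts.merge(node, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .",
                "<http://example.org/a> <http://example.org/name> \"Alice Smith\" .");
        for (Map.Entry<String, Long> entry : countNodes(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The shortcuts are what make the text approach fragile: tabs or repeated spaces between terms, or equivalent literals written in different forms, are not handled. A multi-line format like Turtle or RDF/XML offers no per-line unit to slice at all.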

• Originally developed by Intel
  • Some contributions by Cray - awaiting merging at time of writing
• Open source under Apache License
  • https://github.com/01org/graphbuilder/tree/2.0.alpha
  • 2.0.alpha is the Pig based branch
• Allows data to be transformed into graphs using Pig scripts
  • Provides a set of Pig UDFs for translating data to graph formats
  • Supports both property graphs and RDF graphs

-- Declare our mappings
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*,
    [ 'idBase' # 'http://example.org/instances/',
      'base' # 'http://example.org/ontology/',
      'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
      'propertyMap' # [ 'type' # 'a',
                        'name' # 'foaf:name',
                        'age' # 'foaf:age' ],
      'idProperty' # 'id' ]);

-- Convert to NTriples
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));

-- Write out NTriples
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();

• Uses a declarative mapping based on Pig primitives
  • Maps and Tuples
• Have to be explicitly joined to the data because Pig UDFs can only be called with String arguments
  • Has some benefits e.g. conditional mappings
• RDF Mappings operate on Property Graphs
  • Requires original data to be mapped to a property graph first
  • Direct mapping to RDF is a future enhancement that has yet to be implemented
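
The effect of such a declarative mapping can be approximated in a few lines of plain Java. The sketch below takes one vertex of a property graph plus the mapping entries from the Pig example (idBase, base, namespaces, propertyMap, idProperty) and emits NTriples. How GraphBuilder itself interprets these entries is not spelled out in the slides, so the rules used here are assumptions: 'a' is read as rdf:type, type values become IRIs under base, prefixed names are expanded through namespaces, and all other values become plain literals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: maps one property graph vertex to NTriples
public class PropertyGraphToNTriples {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Expand 'a', a prefixed name, or a bare name into a full IRI
    static String expand(String name, Map<String, String> namespaces, String base) {
        if (name.equals("a")) {
            return RDF_TYPE;
        }
        int colon = name.indexOf(':');
        if (colon > 0 && namespaces.containsKey(name.substring(0, colon))) {
            return namespaces.get(name.substring(0, colon)) + name.substring(colon + 1);
        }
        return base + name;
    }

    public static List<String> toNTriples(Map<String, String> vertex, String idProperty,
            String idBase, String base, Map<String, String> namespaces,
            Map<String, String> propertyMap) {
        String subject = "<" + idBase + vertex.get(idProperty) + ">";
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> property : vertex.entrySet()) {
            if (property.getKey().equals(idProperty)) {
                continue; // the ID only contributes to the subject IRI
            }
            String mapped = propertyMap.getOrDefault(property.getKey(), property.getKey());
            String predicate = expand(mapped, namespaces, base);
            String value = property.getValue();
            String object = predicate.equals(RDF_TYPE)
                    ? "<" + base + value + ">"
                    : "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
            triples.add(subject + " <" + predicate + "> " + object + " .");
        }
        return triples;
    }

    // Demo helper (assumption): the mapping from the Pig example applied to a sample vertex
    public static List<String> demo() {
        Map<String, String> namespaces = new LinkedHashMap<>();
        namespaces.put("foaf", "http://xmlns.com/foaf/0.1/");
        Map<String, String> propertyMap = new LinkedHashMap<>();
        propertyMap.put("type", "a");
        propertyMap.put("name", "foaf:name");
        propertyMap.put("age", "foaf:age");
        Map<String, String> vertex = new LinkedHashMap<>();
        vertex.put("id", "1");
        vertex.put("type", "Person");
        vertex.put("name", "Alice");
        vertex.put("age", "30");
        return toNTriples(vertex, "id", "http://example.org/instances/",
                "http://example.org/ontology/", namespaces, propertyMap);
    }

    public static void main(String[] args) {
        for (String triple : demo()) {
            System.out.println(triple);
        }
    }
}
```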

• Infovore - Paul Houle
  • https://github.com/paulhoule/infovore/wiki
  • Cleaned and curated Freebase datasets processed with Hadoop
• CumulusRDF - Institute of Applied Informatics and Formal Description Methods
  • https://code.google.com/p/cumulusrdf/
  • RDF store backed by Apache Cassandra

• Please start playing with these projects
• Please interact with the community:
  • dev@jena.apache.org
  • What works?
  • What is broken?
  • What is missing?
• Contribute
  • Apache projects are ultimately driven by the community
  • If there's a feature you want please suggest it
  • Or better still contribute it yourself!
Questions? 
Personal Email: rvesse@apache.org 
Jena Mailing List: dev@jena.apache.org

> bin/hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --input-type triples /user/input

• --node-count requests the Node Count statistics be calculated
• Assumes mixed quads and triples input if no --input-type is specified
  • Using this for triples only data can skew statistics
  • e.g. can result in high node counts for default graph node
  • Hence we explicitly specify input as triples
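
The skew described above is easy to reproduce in miniature. If every triple is read as a quad, it gains a graph field holding some default graph node, and that one node is then counted once per triple. The sketch below shows the effect on three triples; the class name and the DEFAULT_GRAPH placeholder IRI are assumptions for illustration, not the identifier or code the tool actually uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class: shows how reading triples as quads inflates one node's count
public class DefaultGraphSkew {
    // Placeholder for whatever node represents the default graph
    public static final String DEFAULT_GRAPH = "urn:example:default-graph";

    public static Map<String, Long> countNodes(List<String[]> triples, boolean treatAsQuads) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] triple : triples) {
            if (treatAsQuads) {
                // A triple promoted to a quad carries the default graph as its graph node
                counts.merge(DEFAULT_GRAPH, 1L, Long::sum);
            }
            for (String node : triple) {
                counts.merge(node, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Demo helper (assumption): a tiny sample dataset using placeholder node names
    public static List<String[]> sampleTriples() {
        return Arrays.asList(
                new String[] { "ex:a", "ex:knows", "ex:b" },
                new String[] { "ex:b", "ex:knows", "ex:c" },
                new String[] { "ex:c", "ex:knows", "ex:a" });
    }

    public static void main(String[] args) {
        System.out.println("As triples: " + countNodes(sampleTriples(), false));
        System.out.println("As quads:   " + countNodes(sampleTriples(), true));
    }
}
```

On real data the default graph node would be counted once per input triple, so it would dominate the statistics, which is why the command above passes --input-type triples.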

> ./pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000

• Running in local mode for this demo
• Output goes to /tmp/rdf_triples

public abstract class AbstractNodeTupleNodeCountMapper<TKey, TValue, T extends AbstractNodeTupleWritable<TValue>>
        extends Mapper<TKey, T, NodeWritable, LongWritable> {

    // Every node occurrence is emitted with a count of 1; the reducer sums these
    private LongWritable initialCount = new LongWritable(1);

    @Override
    protected void map(TKey key, T value, Context context) throws IOException, InterruptedException {
        NodeWritable[] ns = this.getNodes(value);
        for (NodeWritable n : ns) {
            context.write(n, this.initialCount);
        }
    }

    // Extracts the nodes to be counted from a tuple
    protected abstract NodeWritable[] getNodes(T tuple);
}

public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {

    @Override
    protected NodeWritable[] getNodes(TripleWritable tuple) {
        // Subject, predicate and object each count as one node occurrence
        Triple t = tuple.get();
        return new NodeWritable[] { new NodeWritable(t.getSubject()), new NodeWritable(t.getPredicate()),
                new NodeWritable(t.getObject()) };
    }
}

public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {

    @Override
    protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the per-occurrence counts emitted by the mapper for this node
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}

Job job = Job.getInstance(config);
job.setJarByClass(JobFactory.class);
job.setJobName("RDF Triples Node Usage Count");

// Map/Reduce classes
job.setMapperClass(TripleNodeCountMapper.class);
job.setMapOutputKeyClass(NodeWritable.class);
job.setMapOutputValueClass(LongWritable.class);
job.setReducerClass(NodeCountReducer.class);

// Input and Output
job.setInputFormatClass(TriplesInputFormat.class);
job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
FileInputFormat.setInputPaths(job, StringUtils.arrayToString(inputPaths));
FileOutputFormat.setOutputPath(job, new Path(outputPath));

return job;

• https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
F-Secure Freedome VPN 2025 Crack Plus Activation  New VersionF-Secure Freedome VPN 2025 Crack Plus Activation  New Version
F-Secure Freedome VPN 2025 Crack Plus Activation New Version
saimabibi60507
 
How to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud PerformanceHow to Optimize Your AWS Environment for Improved Cloud Performance
How to Optimize Your AWS Environment for Improved Cloud Performance
ThousandEyes
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025Adobe Master Collection CC Crack Advance Version 2025
Adobe Master Collection CC Crack Advance Version 2025
kashifyounis067
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...Explaining GitHub Actions Failures with Large Language Models Challenges, In...
Explaining GitHub Actions Failures with Large Language Models Challenges, In...
ssuserb14185
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 

Quadrupling your elephants - RDF and the Hadoop ecosystem

  • 1. RDF and the Hadoop Ecosystem
      • Rob Vesse
      • Twitter: @RobVesse
      • Email: rvesse@apache.org
  • 2. About the speaker
      • Software Engineer at YarcData (part of Cray Inc)
          • Working on big data analytics products
      • Active open source contributor, primarily to RDF & SPARQL related projects
          • Apache Jena Committer and PMC Member
          • dotNetRDF Lead Developer
      • Primarily interested in RDF, SPARQL and Big Data Analytics technologies
  • 3. Agenda
      • What's missing in the Hadoop ecosystem?
      • What's needed to fill the gap?
      • What's already available?
          • Jena Hadoop RDF Tools
          • GraphBuilder
          • Other Projects
      • Getting Involved
      • Questions
  • 5. Apache, the projects and their logos shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
  • 6. What's missing in the Hadoop ecosystem?
      • No first class projects
      • Some limited support in other projects
          • E.g. Giraph supports RDF by bridging through the TinkerPop stack
      • Some existing external projects
          • Lots of academic proofs of concept
          • Some open source efforts but mostly task specific
              • E.g. Infovore targeted at creating curated Freebase and DBpedia datasets
  • 8. Need to efficiently represent RDF concepts as Writable types
      • Nodes, Triples, Quads, Graphs, Datasets, Query Results etc
      • What's the minimum viable subset?
  • 9. Need to be able to get data in and out of RDF formats
      • Without this we can't use the power of the Hadoop ecosystem to do useful work
      • Lots of serializations out there:
          • RDF/XML
          • Turtle
          • NTriples
          • NQuads
          • JSON-LD
          • etc
      • Also would like to be able to produce end results as RDF
  • 10. Map/Reduce building blocks
      • Common operations e.g. splitting
      • Enable developers to focus on their applications
      • User friendly tooling
          • i.e. non-programmer tools
  • 12. CC BY-SA 3.0 Wikimedia Commons
  • 13. Jena Hadoop RDF Tools
      • Set of modules part of the Apache Jena project
          • Originally developed at Cray and donated to the project earlier this year
      • Experimental modules on the hadoop-rdf branch of our
      • Currently only available as development SNAPSHOT releases
          • Group ID: org.apache.jena
          • Artifact IDs:
              • jena-hadoop-rdf-common
              • jena-hadoop-rdf-io
              • jena-hadoop-rdf-mapreduce
          • Latest Version: 0.9.0-SNAPSHOT
      • Aims to fulfill all the basic requirements for enabling RDF on Hadoop
      • Built against Hadoop Map/Reduce 2.x APIs
  • 14. Provides the Writable types for RDF primitives
      • NodeWritable
      • TripleWritable
      • QuadWritable
      • NodeTupleWritable
      • All backed by RDF Thrift
          • A compact binary serialization for RDF using Apache Thrift
          • See http://afs.github.io/rdf-thrift/
          • Extremely efficient to serialize and deserialize
          • Allows for efficient WritableComparator implementations that perform binary comparisons
  • 15. Provides InputFormat and OutputFormat implementations
      • Supports most formats that Jena supports
      • Designed to be extendable with new formats
      • Will split and parallelize inputs where the RDF serialization is amenable to this
      • Also transparently handles compressed inputs and outputs
          • Note that compression blocks splitting
          • i.e. trade off between IO and parallelism
  • 16. Various reusable building block Mapper and Reducer implementations:
      • Counting
      • Filtering
      • Grouping
      • Splitting
      • Transforming
      • Can be used as-is to do some basic Hadoop tasks or used as building blocks for more complex tasks
  • 18. For NTriples inputs compared performance of a Text based node count versus RDF based node count
      • Performance as good (within 10%) and sometimes significantly better
          • Heavily dataset dependent
          • Varies considerably with cluster setup
          • Also depends on how the input is processed
          • YMMV!
      • For other RDF formats you would struggle to implement this at all
  • 19. GraphBuilder
      • Originally developed by Intel
          • Some contributions by Cray, awaiting merging at time of writing
      • Open source under Apache License
          • https://github.com/01org/graphbuilder/tree/2.0.alpha
          • 2.0.alpha is the Pig based branch
      • Allows data to be transformed into graphs using Pig scripts
          • Provides set of Pig UDFs for translating data to graph formats
          • Supports both property graphs and RDF graphs
  • 20. Pig script mapping a property graph to RDF:
      -- Declare our mappings
      propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*, [
          'idBase' # 'http://example.org/instances/',
          'base' # 'http://example.org/ontology/',
          'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ],
          'propertyMap' # [ 'type' # 'a', 'name' # 'foaf:name', 'age' # 'foaf:age' ],
          'idProperty' # 'id' ]);
      -- Convert to NTriples
      rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*));
      -- Write out NTriples
      STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
  • 21. Uses a declarative mapping based on Pig primitives
      • Maps and Tuples
      • Have to be explicitly joined to the data because Pig UDFs can only be called with String arguments
          • Has some benefits e.g. conditional mappings
      • RDF Mappings operate on Property Graphs
          • Requires original data to be mapped to a property graph first
          • Direct mapping to RDF is a future enhancement that has yet to be implemented
  • 23. Other Projects
      • Infovore - Paul Houle
          • https://github.com/paulhoule/infovore/wiki
          • Cleaned and curated Freebase datasets processed with Hadoop
      • CumulusRDF - Institute of Applied Informatics and Formal Description Methods
          • https://code.google.com/p/cumulusrdf/
          • RDF store backed by Apache Cassandra
  • 24. Getting Involved
      • Please start playing with these projects
      • Please interact with the community:
          • [email protected]
          • What works?
          • What is broken?
          • What is missing?
      • Contribute
          • Apache projects are ultimately driven by the community
          • If there's a feature you want please suggest it
          • Or better still contribute it yourself!
  • 25. Questions?
      • Personal Email: [email protected]
      • Jena Mailing List: [email protected]
  • 27. > bin/hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output /user/output --input-type triples /user/input
      • --node-count requests the Node Count statistics be calculated
      • Assumes mixed quads and triples input if no --input-type specified
          • Using this for triples only data can skew statistics
          • e.g. can result in high node counts for default graph node
          • Hence we explicitly specify input as triples
  • 33. Running the GraphBuilder example:
      > ./pig -x local examples/property_graphs_and_rdf.pig
      > cat /tmp/rdf_triples/part-m-00000
      • Running in local mode for this demo
      • Output goes to /tmp/rdf_triples
  • 37. Node count mapper:
      public abstract class AbstractNodeTupleNodeCountMapper<TKey, TValue, T extends AbstractNodeTupleWritable<TValue>>
              extends Mapper<TKey, T, NodeWritable, LongWritable> {

          private LongWritable initialCount = new LongWritable(1);

          @Override
          protected void map(TKey key, T value, Context context) throws IOException, InterruptedException {
              NodeWritable[] ns = this.getNodes(value);
              for (NodeWritable n : ns) {
                  context.write(n, this.initialCount);
              }
          }

          protected abstract NodeWritable[] getNodes(T tuple);
      }

      public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> {

          @Override
          protected NodeWritable[] getNodes(TripleWritable tuple) {
              Triple t = tuple.get();
              return new NodeWritable[] {
                  new NodeWritable(t.getSubject()),
                  new NodeWritable(t.getPredicate()),
                  new NodeWritable(t.getObject())
              };
          }
      }
  • 38. Node count reducer:
      public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> {

          @Override
          protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context)
                  throws IOException, InterruptedException {
              long count = 0;
              Iterator<LongWritable> iter = values.iterator();
              while (iter.hasNext()) {
                  count += iter.next().get();
              }
              context.write(key, new LongWritable(count));
          }
      }
  • 39. Job configuration:
      Job job = Job.getInstance(config);
      job.setJarByClass(JobFactory.class);
      job.setJobName("RDF Triples Node Usage Count");

      // Map/Reduce classes
      job.setMapperClass(TripleNodeCountMapper.class);
      job.setMapOutputKeyClass(NodeWritable.class);
      job.setMapOutputValueClass(LongWritable.class);
      job.setReducerClass(NodeCountReducer.class);

      // Input and Output
      job.setInputFormatClass(TriplesInputFormat.class);
      job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
      FileInputFormat.setInputPaths(job, StringUtils.arrayToString(inputPaths));
      FileOutputFormat.setOutputPath(job, new Path(outputPath));

      return job;
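
The mapper and reducer on slides 37 and 38 follow the familiar word count shape: emit (node, 1) for every node of every triple, then sum the counts per node. The following is a minimal sketch of that data flow in plain Java, with no Hadoop or Jena classes, so it can be run anywhere. The class name NodeCountSketch, its method names, and the use of plain strings for RDF nodes are all assumptions made for illustration; they are not part of the Jena Hadoop RDF Tools.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the Hadoop job on slides 37-39.
// Triples are modelled as String[3] {subject, predicate, object}.
final class NodeCountSketch {

    // "Map" phase: emit (node, 1) for each node of each triple,
    // mirroring what TripleNodeCountMapper writes to its context.
    static List<Map.Entry<String, Long>> map(List<String[]> triples) {
        List<Map.Entry<String, Long>> emitted = new ArrayList<>();
        for (String[] triple : triples) {
            for (String node : triple) {
                emitted.add(new AbstractMap.SimpleEntry<>(node, 1L));
            }
        }
        return emitted;
    }

    // "Reduce" phase: sum the emitted counts per distinct node,
    // mirroring the loop in NodeCountReducer.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> emitted) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : emitted) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    // Runs both phases in-process; Hadoop would shuffle and sort between them.
    static Map<String, Long> countNodes(List<String[]> triples) {
        return reduce(map(triples));
    }
}
```

Calling NodeCountSketch.countNodes on the two triples (ex:alice, foaf:knows, ex:bob) and (ex:bob, foaf:name, "Bob") yields a count of 2 for ex:bob and 1 for each of the other four nodes. In the real job the grouping is done by Hadoop's shuffle using the NodeWritable keys rather than by an in-memory map.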

Editor's Notes

  • #6: Tons of active projects: Accumulo, Ambari, Avro, Cassandra, Chukwa, Giraph, Hama, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper. And those are just off the top of my head (and ignoring Incubating projects). However, they are mostly focused on traditional data sources e.g. logs, relational databases, unstructured data.
  • #15: Highlight benefit of WritableComparator - significant speed up in reduce phase
  • #18: Project also provides a demo JAR which shows how to use the building blocks to perform common Hadoop tasks on RDF. Node Count is essentially the Word Count "Hello World" of Hadoop programming.
  • #21: Mention that Intel may not have yet merged our pull request that adds the declarative mapping approach
  • #38: ~20 lines of code (fewer if you remove unnecessary formatting)
  • #39: 11 lines of code
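
The note on binary comparison can be illustrated with a small sketch. Hadoop sorts map output keys before the reduce phase, and a comparator that works directly on the serialized bytes avoids deserializing every key just to order it. The code below assumes a made-up encoding (a 4-byte length followed by UTF-8 bytes) purely for illustration; it is not RDF Thrift, and RawNodeComparatorSketch and its methods are hypothetical names, not Jena or Hadoop APIs. It uses ByteBuffer rather than Hadoop's DataInput/DataOutput to stay self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of comparing serialized keys without deserializing them.
final class RawNodeComparatorSketch {

    // Serialize a node label as: 4-byte big-endian length, then UTF-8 payload.
    static byte[] write(String node) {
        byte[] utf8 = node.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length).putInt(utf8.length).put(utf8).array();
    }

    // Deserialize; this is the work a raw comparator lets the sort phase skip.
    static String read(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        byte[] utf8 = new byte[buffer.getInt()];
        buffer.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Compare two serialized nodes byte by byte (unsigned), skipping the
    // length prefix. Returns 0 only when the payloads are identical.
    static int compareRaw(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        for (int i = 4; i < limit; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // One payload is a prefix of the other: the shorter one sorts first.
        return a.length - b.length;
    }
}
```

The ordering produced by compareRaw is a consistent total order over the serialized form, which is all that sorting and grouping keys requires; it is not claimed to match String.compareTo in every case. An array of serialized keys could be sorted with java.util.Arrays.sort(keys, RawNodeComparatorSketch::compareRaw).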