SlideShare a Scribd company logo
Importing data quickly
and easily
Michael Hunger @mesirii
Mark Needham @markhneedham
The data set
The data set
‣ Stack Exchange API
‣ Stack Exchange Data Dump
Stack Exchange API
{ "items": [{
"question_id": 24620768,
"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
"title": "Neo4j cypher query: get last N elements",
"answer_count": 1,
"score": 1,
.....
"creation_date": 1404771217,
"body_markdown": "I have a graph....How can I do that?",
"tags": ["neo4j", "cypher"],
"owner": {
"reputation": 815,
"user_id": 1212067,
....
"link": "http://stackoverflow.com/users/1212067/"
},
"answers": [{
"owner": {
"reputation": 488,
"user_id": 737080,
"display_name": "Chris Leishman",
....
},
"answer_id": 24620959,
"share_link": "http://stackoverflow.com/a/24620959",
....
"body_markdown": "The simplest would be to use an ... some discussion on this here:...",
"title": "Neo4j cypher query: get last N elements"
}]
}
JSON to CSV
JSON ??? CSV
LOAD
CSV
Initial Model
{ "items": [{
"question_id": 24620768,
"link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements",
"title": "Neo4j cypher query: get last N elements",
"answer_count": 1,
"score": 1,
.....
"creation_date": 1404771217,
"body_markdown": "I have a graph....How can I do that?",
"tags": ["neo4j", "cypher"],
"owner": {
"reputation": 815,
"user_id": 1212067,
....
"link": "http://stackoverflow.com/users/1212067/"
},
"answers": [{
"owner": {
"reputation": 488,
"user_id": 737080,
"display_name": "Chris Leishman",
....
},
"answer_id": 24620959,
"share_link": "http://stackoverflow.com/a/24620959",
....
"body_markdown": "The simplest would be to use an ... some discussion on this here:...",
"title": "Neo4j cypher query: get last N elements"
}]
}
jq: Converting JSON to CSV
jq: Converting questions to CSV
jq -r '.[] | .items[] |
[.question_id,
.title,
.up_vote_count,
.down_vote_count,
.creation_date,
.last_activity_date,
.owner.user_id,
.owner.display_name,
(.tags | join(";"))] |
@csv ' so.json
jq: Converting questions to CSV
$ head -n5 questions.csv
question_id,title,up_vote_count,down_vote_count,creation_date,last_activity_date,
owner_user_id,owner_display_name,tags
33023306,"How to delete multiple nodes by specific ID using Cypher",
0,0,1444328760,1444332194,260511,"rayman","jdbc;neo4j;cypher;spring-data-neo4j"
33020796,"How do a general search across string properties in my nodes?",
1,0,1444320356,1444324015,1429542,"osazuwa","ruby-on-rails;neo4j;neo4j.rb"
33018818,"Neo4j match nodes related to all nodes in collection",
0,0,1444314877,1444332779,1212463,"lmazgon","neo4j;cypher"
33018084,"Problems upgrading to Spring Data Neo4j 4.0.0",
0,0,1444312993,1444312993,1528942,"Grégoire Colbert","neo4j;spring-data-neo4j"
jq: Converting answers to CSV
jq -r '.[] | .items[] |
{ question_id: .question_id, answer: .answers[]? } |
[.question_id,
.answer.answer_id,
.answer.title,
.answer.owner.user_id,
.answer.owner.display_name,
(.answer.tags | join(";")),
.answer.up_vote_count,
.answer.down_vote_count] |
@csv'
jq: Converting answers to CSV
$ head -n5 answers.csv
question_id,answer_id,answer_title,owner_id,owner_display_name,tags,up_vote_count,
down_vote_count
33023306,33024189,"How to delete multiple nodes by specific ID using Cypher",
3248864,"FylmTM","",0,0
33020796,33021958,"How do a general search across string properties in my nodes?",
2920686,"FrobberOfBits","",0,0
33018818,33020068,"Neo4j match nodes related to all nodes in collection",158701,"
Stefan Armbruster","",0,0
33018818,33024273,"Neo4j match nodes related to all nodes in collection",974731,"
cybersam","",0,0
Time to import into Neo4j...
Introducing Cypher
‣ The Graph Query Language
‣ Declarative language (think SQL) for graphs
‣ ASCII art based
‣ CREATE create a new pattern in the graph
Cypher primer
CREATE (user:User {name:"Michael Hunger"})
CREATE (question:Question {title: "..."})
CREATE (answer:Answer {text: "..."})
CREATE (user)-[:PROVIDED]->(answer)
CREATE (answer)-[:ANSWERS]->(question);
‣ CREATE create a new pattern in the graph
Cypher primer
CREATE (user:User {name:"Michael Hunger"})
CREATE (question:Question {title: "..."})
CREATE (answer:Answer {text: "..."})
CREATE (user)-[:PROVIDED]->(answer)
CREATE (answer)-[:ANSWERS]->(question);
CREATE (user:User {name:"Michael Hunger"})
Label PropertyNode
‣ CREATE create a new pattern in the graph
Cypher primer
CREATE (user:User {name:"Michael Hunger"})
CREATE (question:Question {title: "..."})
CREATE (answer:Answer {text: "..."})
CREATE (user)-[:PROVIDED]->(answer)
CREATE (answer)-[:ANSWERS]->(question);
CREATE (user)-[:PROVIDED]->(answer)
Relationship
‣ MATCH find a pattern in the graph
Cypher primer
MATCH (answer:Answer)<-[:PROVIDED]-(user:User),
(answer)-[:ANSWERS]->(question)
WHERE user.display_name = "Michael Hunger"
RETURN question, answer;
‣ MERGE find pattern if it exists,
create it if it doesn’t
MERGE (user:User {name:"Mark Needham"})
MERGE (question:Question {title: "..."})
MERGE (answer:Answer {text: "..."})
MERGE (user)-[:PROVIDED]->(answer)
MERGE (answer)-[:ANSWERS]->(question);
Cypher primer
Import using LOAD CSV
‣ LOAD CSV iterates CSV files applying the
provided query line by line
LOAD CSV [WITH HEADERS] FROM [URI/File path]
AS row
CREATE ...
MERGE ...
MATCH ...
LOAD CSV: The naive version
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
MERGE (question:Question {
id:row.question_id,
title: row.title,
up_vote_count: row.up_vote_count,
creation_date: row.creation_date})
MERGE (owner:User {id:row.owner_user_id, display_name: row.owner_display_name})
MERGE (owner)-[:ASKED]->(question)
FOREACH (tagName IN split(row.tags, ";") |
MERGE (tag:Tag {name:tagName})
MERGE (question)-[:TAGGED]->(tag));
Tip: Start with a sample
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
WITH row LIMIT 100
MERGE (question:Question {
id:row.question_id,
title: row.title,
up_vote_count: row.up_vote_count,
creation_date: row.creation_date})
MERGE (owner:User {id:row.owner_user_id, display_name: row.owner_display_name})
MERGE (owner)-[:ASKED]->(question)
FOREACH (tagName IN split(row.tags, ";") |
MERGE (tag:Tag {name:tagName})
MERGE (question)-[:TAGGED]->(tag));
Tip: MERGE on a key
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
WITH row LIMIT 100
MERGE (question:Question {id:row.question_id})
ON CREATE SET question.title = row.title,
question.up_vote_count = row.up_vote_count,
question.creation_date = row.creation_date
MERGE (owner:User {id:row.owner_user_id})
ON CREATE SET owner.display_name = row.owner_display_name
MERGE (owner)-[:ASKED]->(question)
FOREACH (tagName IN split(row.tags, ";") |
MERGE (tag:Tag {name:tagName})
MERGE (question)-[:TAGGED]->(tag));
Tip: Index/Constrain those keys
CREATE INDEX ON :Label(property);
CREATE CONSTRAINT ON (n:Label) ASSERT n.property IS UNIQUE;
Tip: Index those keys
CREATE INDEX ON :Label(property);
CREATE CONSTRAINT ON (n:Label) ASSERT n.property IS UNIQUE;
CREATE CONSTRAINT ON (q:Question) ASSERT q.id IS UNIQUE;
CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;
CREATE INDEX ON :Question(title);
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
WITH row LIMIT 100
MERGE (question:Question {id:row.question_id})
ON CREATE SET question.title = row.title,
question.up_vote_count = row.up_vote_count,
question.creation_date = row.creation_date;
Tip: One MERGE per statement
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
WITH row LIMIT 100
MERGE (owner:User {id:row.owner_user_id})
ON CREATE SET owner.display_name = row.owner_display_name;
Tip: One MERGE per statement
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
WITH row LIMIT 100
MATCH (question:Question {id:row.question_id})
MATCH (owner:User {id:row.owner_user_id})
MERGE (owner)-[:ASKED]->(question);
Tip: One MERGE per statement
Tip: Use DISTINCT
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
WITH row LIMIT 100
UNWIND split(row.tags, ";") AS tag
WITH distinct tag
MERGE (:Tag {name: tag});
Tip: Use periodic commit
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "questions.csv" AS row
MERGE (question:Question {id:row.question_id})
ON CREATE SET question.title = row.title,
question.up_vote_count = row.up_vote_count,
question.creation_date = row.creation_date;
Periodic commit
‣ Neo4j keeps all transaction state in memory
which is problematic for large CSV files
‣ USING PERIODIC COMMIT flushes the
transaction after a certain number of rows
‣ Default is 1000 rows but it’s configurable
‣ Currently only works with LOAD CSV
Tip: Script your import commands
Tip: Use neo4j-shell to load script
$ ./neo4j-enterprise-2.3.0/bin/neo4j-shell
-file import.cql [-path so.db]
LOAD CSV: Summary
‣ ETL power tool
‣ Built into Neo4J since version 2.1
‣ Can load data from any URL
‣ Good for medium size data
(up to 10M rows)
Bulk loading an initial data set
‣ Introducing the Neo4j Import Tool
‣ Find it in the bin folder of your Neo4j
download
‣ Used to large sized initial data sets
‣ Skips the transactional layer of Neo4j and
concurrently writes store files directly
Importing into Neo4j
:ID(Crime) :LABEL description
export NEO=neo4j-enterprise-2.3.0
$NEO/bin/neo4j-import 
--into stackoverflow.db 
--id-type string 
--nodes:Post extracted/Posts_header.csv,extracted/Posts.csv.gz 
--nodes:User extracted/Users_header.csv,extracted/Users.csv.gz 
--nodes:Tag extracted/Tags_header.csv,extracted/Tags.csv.gz 
--relationships:PARENT_OF extracted/PostsRels_header.csv,extracted/PostsRels.csv.gz 
--relationships:ANSWERS extracted/PostsAnswers_header.csv,extracted/PostsAnswers.csv.gz
--relationships:HAS_TAG extracted/TagsPosts_header.csv,extracted/TagsPosts.csv.gz 
--relationships:POSTED extracted/UsersPosts_header.csv,extracted/UsersPosts.csv.gz
Expects files in a certain format
:ID(Crime) :LABEL descriptionpostId:ID(Post) title body
Nodes
userId:ID(User) displayname views
Rels
:START_ID(User) :END_ID(Post)
<?xml version="1.0" encoding="utf-16"?>
<posts>
...
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667"
Score="358" ViewCount="24247" Body="..." OwnerUserId="8" LastEditorUserId="451518"
LastEditorDisplayName="Rich B" LastEditDate="2014-07-28T10:02:50.557" LastActivityDate="
2015-08-01T12:55:11.380" Title="When setting a form's opacity should I use a decimal or
double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;opacity&gt;"
AnswerCount="13" CommentCount="1" FavoriteCount="28" CommunityOwnedDate="2012-10-31T16:42:
47.213" />
...
</posts>
What do we have?
<posts>
...
<row Id="4" PostTypeId="1"
AcceptedAnswerId="7" CreationDate="
2008-07-31T21:42:52.667" Score="358"
ViewCount="24247" Body="..."
OwnerUserId="8" LastEditorUserId="
451518" LastEditorDisplayName="Rich
B" LastEditDate="2014-07-28T10:02:
50.557" LastActivityDate="2015-08-
01T12:55:11.380" Title="When setting
a form's opacity should I use a
decimal or double?"
XML to CSV
Java program
The generated files
$ cat extracted/Posts_header.csv
"postId:ID(Post)","title","postType:INT",
"createdAt","score:INT","views:INT","answers:INT",
"comments:INT","favorites:INT","updatedAt"
The generated files
$ gzcat extracted/Posts.csv.gz | head -n2
"4","When setting a forms opacity should I use a
decimal or double?","1","2008-07-31T21:42:52.667","
358","24247","13","1","28","2014-07-28T10:02:50.557"
"6","Why doesn’t the percentage width child in
absolutely positioned parent work?","1","2008-07-
31T22:08:08.620","156","11840","5","0","7","2015-04-
26T14:37:49.673"
The generated files
$ cat extracted/PostsRels_header.csv
":START_ID(Post)",":END_ID(Post)"
Importing into Neo4j
:ID(Crime) :LABEL description
export NEO=neo4j-enterprise-2.3.0
$NEO/bin/neo4j-import 
--into stackoverflow.db 
--id-type string 
--nodes:Post extracted/Posts_header.csv,extracted/Posts.csv.gz 
--nodes:User extracted/Users_header.csv,extracted/Users.csv.gz 
--nodes:Tag extracted/Tags_header.csv,extracted/Tags.csv.gz 
--relationships:PARENT_OF extracted/PostsRels_header.csv,extracted/PostsRels.csv.gz 
--relationships:ANSWERS extracted/PostsAnswers_header.csv,extracted/PostsAnswers.csv.gz
--relationships:HAS_TAG extracted/TagsPosts_header.csv,extracted/TagsPosts.csv.gz 
--relationships:POSTED extracted/UsersPosts_header.csv,extracted/UsersPosts.csv.gz 
IMPORT DONE in 3m 10s 661ms. Imported:
31138574 nodes
77930024 relationships
218106346 properties
Tip: Make sure your data is clean
:ID(Crime) :LABEL description
‣ Use a consistent line break style
‣ Ensure headers are consistent with data
‣ Quote Special characters
‣ Escape stray quotes
‣ Remove non-text characters
Even more tips
:ID(Crime) :LABEL description
‣ Get the fastest disk you can
‣ Use separate disk for input and output
‣ Compress your CSV files
‣ The more cores the better
‣ Separate headers from data
The End
‣ https://github.com/mdamien/stackoverflow-neo4j
‣ http://neo4j.com/blog/import-10m-stack-overflow-questions/
‣ http://neo4j.com/blog/cypher-load-json-from-url/
Michael Hunger @mesirii
Mark Needham @markhneedham

More Related Content

What's hot (20)

PPTX
Getting started with Elasticsearch and .NET
Tomas Jansson
 
PPTX
Php forum2015 tomas_final
Bertrand Matthelie
 
PPTX
MongoDB and Indexes - MUG Denver - 20160329
Douglas Duncan
 
PDF
The Ring programming language version 1.5.2 book - Part 44 of 181
Mahmoud Samir Fayed
 
PDF
Using Scala Slick at FortyTwo
Eishay Smith
 
ODP
Patterns for slick database applications
Skills Matter
 
PPTX
Indexing with MongoDB
MongoDB
 
PPTX
Web осень 2012 лекция 6
Technopark
 
PDF
The Ring programming language version 1.5.4 book - Part 43 of 185
Mahmoud Samir Fayed
 
PDF
NoSQL для PostgreSQL: Jsquery — язык запросов
CodeFest
 
PPT
Fast querying indexing for performance (4)
MongoDB
 
PDF
SunshinePHP 2017 - Making the most out of MySQL
Gabriela Ferrara
 
PPTX
Web весна 2013 лекция 6
Technopark
 
PDF
How to Use JSON in MySQL Wrong
Karwin Software Solutions LLC
 
ODP
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
Nikolay Samokhvalov
 
PDF
Getting Creative with WordPress Queries, Again
DrewAPicture
 
PPTX
Webinaire 2 de la série « Retour aux fondamentaux » : Votre première applicat...
MongoDB
 
PDF
MySQL 5.7 NF – JSON Datatype 활용
I Goo Lee
 
PDF
Ajax Performance Tuning and Best Practices
Doris Chen
 
Getting started with Elasticsearch and .NET
Tomas Jansson
 
Php forum2015 tomas_final
Bertrand Matthelie
 
MongoDB and Indexes - MUG Denver - 20160329
Douglas Duncan
 
The Ring programming language version 1.5.2 book - Part 44 of 181
Mahmoud Samir Fayed
 
Using Scala Slick at FortyTwo
Eishay Smith
 
Patterns for slick database applications
Skills Matter
 
Indexing with MongoDB
MongoDB
 
Web осень 2012 лекция 6
Technopark
 
The Ring programming language version 1.5.4 book - Part 43 of 185
Mahmoud Samir Fayed
 
NoSQL для PostgreSQL: Jsquery — язык запросов
CodeFest
 
Fast querying indexing for performance (4)
MongoDB
 
SunshinePHP 2017 - Making the most out of MySQL
Gabriela Ferrara
 
Web весна 2013 лекция 6
Technopark
 
How to Use JSON in MySQL Wrong
Karwin Software Solutions LLC
 
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
Nikolay Samokhvalov
 
Getting Creative with WordPress Queries, Again
DrewAPicture
 
Webinaire 2 de la série « Retour aux fondamentaux » : Votre première applicat...
MongoDB
 
MySQL 5.7 NF – JSON Datatype 활용
I Goo Lee
 
Ajax Performance Tuning and Best Practices
Doris Chen
 

Similar to Graph Connect: Importing data quickly and easily (16)

PDF
Importing Data into Neo4j quickly and easily - StackOverflow
Neo4j
 
PDF
GraphConnect Europe 2016 - Importing Data - Mark Needham, Michael Hunger
Neo4j
 
PDF
Working With a Real-World Dataset in Neo4j: Import and Modeling
Neo4j
 
PPTX
Relational to Graph - Import
Neo4j
 
PDF
Neo4j Introduction - Game of Thrones
Neo4j
 
PDF
There and Back Again, A Developer's Tale
Neo4j
 
PPTX
Graph databases for SQL Server profesionnals
MSDEVMTL
 
PDF
Leveraging the Power of Graph Databases in PHP
Jeremy Kendall
 
PPTX
The openCypher Project - An Open Graph Query Language
Neo4j
 
PDF
Intro to Cypher
Neo4j
 
PPTX
How to Import JSON Using Cypher and APOC
Neo4j
 
PPTX
Top 10 Cypher Tuning Tips & Tricks
Neo4j
 
PPTX
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
jexp
 
PPT
Neo4 j
Dr.Saranya K.G
 
PPTX
Graph Databases for SQL Server Professionals
Stéphane Fréchette
 
PDF
Leveraging the Power of Graph Databases in PHP
Jeremy Kendall
 
Importing Data into Neo4j quickly and easily - StackOverflow
Neo4j
 
GraphConnect Europe 2016 - Importing Data - Mark Needham, Michael Hunger
Neo4j
 
Working With a Real-World Dataset in Neo4j: Import and Modeling
Neo4j
 
Relational to Graph - Import
Neo4j
 
Neo4j Introduction - Game of Thrones
Neo4j
 
There and Back Again, A Developer's Tale
Neo4j
 
Graph databases for SQL Server profesionnals
MSDEVMTL
 
Leveraging the Power of Graph Databases in PHP
Jeremy Kendall
 
The openCypher Project - An Open Graph Query Language
Neo4j
 
Intro to Cypher
Neo4j
 
How to Import JSON Using Cypher and APOC
Neo4j
 
Top 10 Cypher Tuning Tips & Tricks
Neo4j
 
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx
jexp
 
Graph Databases for SQL Server Professionals
Stéphane Fréchette
 
Leveraging the Power of Graph Databases in PHP
Jeremy Kendall
 
Ad

More from Mark Needham (14)

PDF
Neo4j GraphTour: Utilizing Powerful Extensions for Analytics and Operations
Mark Needham
 
PDF
This week in Neo4j - 3rd February 2018
Mark Needham
 
PDF
Building a recommendation engine with python and neo4j
Mark Needham
 
PDF
Graph Connect: Tuning Cypher
Mark Needham
 
PDF
Graph Connect Europe: From Zero To Import
Mark Needham
 
PDF
Optimizing cypher queries in neo4j
Mark Needham
 
PPTX
Football graph - Neo4j and the Premier League
Mark Needham
 
PDF
The Football Graph - Neo4j and the Premier League
Mark Needham
 
PPTX
Scala: An experience report
Mark Needham
 
PPTX
Visualisations
Mark Needham
 
PPTX
Mixing functional programming approaches in an object oriented language
Mark Needham
 
PPT
Mixing functional and object oriented approaches to programming in C#
Mark Needham
 
PPT
Mixing functional and object oriented approaches to programming in C#
Mark Needham
 
PDF
F#: What I've learnt so far
Mark Needham
 
Neo4j GraphTour: Utilizing Powerful Extensions for Analytics and Operations
Mark Needham
 
This week in Neo4j - 3rd February 2018
Mark Needham
 
Building a recommendation engine with python and neo4j
Mark Needham
 
Graph Connect: Tuning Cypher
Mark Needham
 
Graph Connect Europe: From Zero To Import
Mark Needham
 
Optimizing cypher queries in neo4j
Mark Needham
 
Football graph - Neo4j and the Premier League
Mark Needham
 
The Football Graph - Neo4j and the Premier League
Mark Needham
 
Scala: An experience report
Mark Needham
 
Visualisations
Mark Needham
 
Mixing functional programming approaches in an object oriented language
Mark Needham
 
Mixing functional and object oriented approaches to programming in C#
Mark Needham
 
Mixing functional and object oriented approaches to programming in C#
Mark Needham
 
F#: What I've learnt so far
Mark Needham
 
Ad

Recently uploaded (20)

PPTX
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PDF
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
PDF
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
PDF
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
PDF
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
PDF
MiniTool Partition Wizard Free Crack + Full Free Download 2025
bashirkhan333g
 
PDF
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
PPTX
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
PPTX
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
PDF
Online Queue Management System for Public Service Offices in Nepal [Focused i...
Rishab Acharya
 
PPTX
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
PPTX
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
PDF
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
PDF
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
PDF
Revenue streams of the Wazirx clone script.pdf
aaronjeffray
 
PDF
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
PPTX
Human Resources Information System (HRIS)
Amity University, Patna
 
PDF
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
PDF
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Top Agile Project Management Tools for Teams in 2025
Orangescrum
 
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
유니티에서 Burst Compiler+ThreadedJobs+SIMD 적용사례
Seongdae Kim
 
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
MiniTool Partition Wizard Free Crack + Full Free Download 2025
bashirkhan333g
 
AI + DevOps = Smart Automation with devseccops.ai.pdf
Devseccops.ai
 
AEM User Group: India Chapter Kickoff Meeting
jennaf3
 
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
Online Queue Management System for Public Service Offices in Nepal [Focused i...
Rishab Acharya
 
Why Businesses Are Switching to Open Source Alternatives to Crystal Reports.pptx
Varsha Nayak
 
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
Revenue streams of the Wazirx clone script.pdf
aaronjeffray
 
Open Chain Q2 Steering Committee Meeting - 2025-06-25
Shane Coughlan
 
Human Resources Information System (HRIS)
Amity University, Patna
 
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 

Graph Connect: Importing data quickly and easily

  • 1. Importing data quickly and easily Michael Hunger @mesirii Mark Needham @markhneedham
  • 3. The data set ‣ Stack Exchange API ‣ Stack Exchange Data Dump
  • 4. Stack Exchange API { "items": [{ "question_id": 24620768, "link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements", "title": "Neo4j cypher query: get last N elements", "answer_count": 1, "score": 1, ..... "creation_date": 1404771217, "body_markdown": "I have a graph....How can I do that?", "tags": ["neo4j", "cypher"], "owner": { "reputation": 815, "user_id": 1212067, .... "link": "http://stackoverflow.com/users/1212067/" }, "answers": [{ "owner": { "reputation": 488, "user_id": 737080, "display_name": "Chris Leishman", .... }, "answer_id": 24620959, "share_link": "http://stackoverflow.com/a/24620959", .... "body_markdown": "The simplest would be to use an ... some discussion on this here:...", "title": "Neo4j cypher query: get last N elements" }] }
  • 5. JSON to CSV JSON ??? CSV LOAD CSV
  • 6. Initial Model { "items": [{ "question_id": 24620768, "link": "http://stackoverflow.com/questions/24620768/neo4j-cypher-query-get-last-n-elements", "title": "Neo4j cypher query: get last N elements", "answer_count": 1, "score": 1, ..... "creation_date": 1404771217, "body_markdown": "I have a graph....How can I do that?", "tags": ["neo4j", "cypher"], "owner": { "reputation": 815, "user_id": 1212067, .... "link": "http://stackoverflow.com/users/1212067/" }, "answers": [{ "owner": { "reputation": 488, "user_id": 737080, "display_name": "Chris Leishman", .... }, "answer_id": 24620959, "share_link": "http://stackoverflow.com/a/24620959", .... "body_markdown": "The simplest would be to use an ... some discussion on this here:...", "title": "Neo4j cypher query: get last N elements" }] }
  • 8. jq: Converting questions to CSV jq -r '.[] | .items[] | [.question_id, .title, .up_vote_count, .down_vote_count, .creation_date, .last_activity_date, .owner.user_id, .owner.display_name, (.tags | join(";"))] | @csv ' so.json
  • 9. jq: Converting questions to CSV $ head -n5 questions.csv question_id,title,up_vote_count,down_vote_count,creation_date,last_activity_date, owner_user_id,owner_display_name,tags 33023306,"How to delete multiple nodes by specific ID using Cypher", 0,0,1444328760,1444332194,260511,"rayman","jdbc;neo4j;cypher;spring-data-neo4j" 33020796,"How do a general search across string properties in my nodes?", 1,0,1444320356,1444324015,1429542,"osazuwa","ruby-on-rails;neo4j;neo4j.rb" 33018818,"Neo4j match nodes related to all nodes in collection", 0,0,1444314877,1444332779,1212463,"lmazgon","neo4j;cypher" 33018084,"Problems upgrading to Spring Data Neo4j 4.0.0", 0,0,1444312993,1444312993,1528942,"Gr&#233;goire Colbert","neo4j;spring-data-neo4j"
  • 10. jq: Converting answers to CSV jq -r '.[] | .items[] | { question_id: .question_id, answer: .answers[]? } | [.question_id, .answer.answer_id, .answer.title, .answer.owner.user_id, .answer.owner.display_name, (.answer.tags | join(";")), .answer.up_vote_count, .answer.down_vote_count] | @csv'
  • 11. jq: Converting answers to CSV $ head -n5 answers.csv question_id,answer_id,answer_title,owner_id,owner_display_name,tags,up_vote_count, down_vote_count 33023306,33024189,"How to delete multiple nodes by specific ID using Cypher", 3248864,"FylmTM","",0,0 33020796,33021958,"How do a general search across string properties in my nodes?", 2920686,"FrobberOfBits","",0,0 33018818,33020068,"Neo4j match nodes related to all nodes in collection",158701," Stefan Armbruster","",0,0 33018818,33024273,"Neo4j match nodes related to all nodes in collection",974731," cybersam","",0,0
  • 12. Time to import into Neo4j...
  • 13. Introducing Cypher ‣ The Graph Query Language ‣ Declarative language (think SQL) for graphs ‣ ASCII art based
  • 14. ‣ CREATE create a new pattern in the graph Cypher primer CREATE (user:User {name:"Michael Hunger"}) CREATE (question:Question {title: "..."}) CREATE (answer:Answer {text: "..."}) CREATE (user)-[:PROVIDED]->(answer) CREATE (answer)-[:ANSWERS]->(question);
  • 15. ‣ CREATE create a new pattern in the graph Cypher primer CREATE (user:User {name:"Michael Hunger"}) CREATE (question:Question {title: "..."}) CREATE (answer:Answer {text: "..."}) CREATE (user)-[:PROVIDED]->(answer) CREATE (answer)-[:ANSWERS]->(question); CREATE (user:User {name:"Michael Hunger"}) Label PropertyNode
  • 16. ‣ CREATE create a new pattern in the graph Cypher primer CREATE (user:User {name:"Michael Hunger"}) CREATE (question:Question {title: "..."}) CREATE (answer:Answer {text: "..."}) CREATE (user)-[:PROVIDED]->(answer) CREATE (answer)-[:ANSWERS]->(question); CREATE (user)-[:PROVIDED]->(answer) Relationship
  • 17. ‣ MATCH find a pattern in the graph Cypher primer MATCH (answer:Answer)<-[:PROVIDED]-(user:User), (answer)-[:ANSWERS]->(question) WHERE user.display_name = "Michael Hunger" RETURN question, answer;
  • 18. ‣ MERGE find pattern if it exists, create it if it doesn’t MERGE (user:User {name:"Mark Needham"}) MERGE (question:Question {title: "..."}) MERGE (answer:Answer {text: "..."}) MERGE (user)-[:PROVIDED]->(answer) MERGE (answer)-[:ANSWERS]->(question); Cypher primer
  • 19. Import using LOAD CSV ‣ LOAD CSV iterates CSV files applying the provided query line by line LOAD CSV [WITH HEADERS] FROM [URI/File path] AS row CREATE ... MERGE ... MATCH ...
  • 20. LOAD CSV: The naive version LOAD CSV WITH HEADERS FROM "questions.csv" AS row MERGE (question:Question { id:row.question_id, title: row.title, up_vote_count: row.up_vote_count, creation_date: row.creation_date}) MERGE (owner:User {id:row.owner_user_id, display_name: row.owner_display_name}) MERGE (owner)-[:ASKED]->(question) FOREACH (tagName IN split(row.tags, ";") | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag));
  • 21. Tip: Start with a sample LOAD CSV WITH HEADERS FROM "questions.csv" AS row WITH row LIMIT 100 MERGE (question:Question { id:row.question_id, title: row.title, up_vote_count: row.up_vote_count, creation_date: row.creation_date}) MERGE (owner:User {id:row.owner_user_id, display_name: row.owner_display_name}) MERGE (owner)-[:ASKED]->(question) FOREACH (tagName IN split(row.tags, ";") | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag));
  • 22. Tip: MERGE on a key LOAD CSV WITH HEADERS FROM "questions.csv" AS row WITH row LIMIT 100 MERGE (question:Question {id:row.question_id}) ON CREATE SET question.title = row.title, question.up_vote_count = row.up_vote_count, question.creation_date = row.creation_date MERGE (owner:User {id:row.owner_user_id}) ON CREATE SET owner.display_name = row.owner_display_name MERGE (owner)-[:ASKED]->(question) FOREACH (tagName IN split(row.tags, ";") | MERGE (tag:Tag {name:tagName}) MERGE (question)-[:TAGGED]->(tag));
  • 23. Tip: Index/Constrain those keys CREATE INDEX ON :Label(property); CREATE CONSTRAINT ON (n:Label) ASSERT n.property IS UNIQUE;
  • 24. Tip: Index those keys CREATE INDEX ON :Label(property); CREATE CONSTRAINT ON (n:Label) ASSERT n.property IS UNIQUE; CREATE CONSTRAINT ON (q:Question) ASSERT q.id IS UNIQUE; CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE; CREATE INDEX ON :Question(title);
  • 25. LOAD CSV WITH HEADERS FROM "questions.csv" AS row WITH row LIMIT 100 MERGE (question:Question {id:row.question_id}) ON CREATE SET question.title = row.title, question.up_vote_count = row.up_vote_count, question.creation_date = row.creation_date; Tip: One MERGE per statement
  • 26. LOAD CSV WITH HEADERS FROM "questions.csv" AS row WITH row LIMIT 100 MERGE (owner:User {id:row.owner_user_id}) ON CREATE SET owner.display_name = row.owner_display_name; Tip: One MERGE per statement
  • 27. LOAD CSV WITH HEADERS FROM "questions.csv" AS row WITH row LIMIT 100 MATCH (question:Question {id:row.question_id}) MATCH (owner:User {id:row.owner_user_id}) MERGE (owner)-[:ASKED]->(question); Tip: One MERGE per statement
  • 28. Tip: Use DISTINCT LOAD CSV WITH HEADERS FROM "questions.csv" AS row WITH row LIMIT 100 UNWIND split(row.tags, ";") AS tag WITH distinct tag MERGE (:Tag {name: tag});
  • 29. Tip: Use periodic commit USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "questions.csv" AS row MERGE (question:Question {id:row.question_id}) ON CREATE SET question.title = row.title, question.up_vote_count = row.up_vote_count, question.creation_date = row.creation_date;
  • 30. Periodic commit ‣ Neo4j keeps all transaction state in memory which is problematic for large CSV files ‣ USING PERIODIC COMMIT flushes the transaction after a certain number of rows ‣ Default is 1000 rows but it’s configurable ‣ Currently only works with LOAD CSV
  • 31. Tip: Script your import commands
  • 32. Tip: Use neo4j-shell to load script $ ./neo4j-enterprise-2.3.0/bin/neo4j-shell -file import.cql [-path so.db]
  • 33. LOAD CSV: Summary ‣ ETL power tool ‣ Built into Neo4J since version 2.1 ‣ Can load data from any URL ‣ Good for medium size data (up to 10M rows)
  • 34. Bulk loading an initial data set ‣ Introducing the Neo4j Import Tool ‣ Find it in the bin folder of your Neo4j download ‣ Used to large sized initial data sets ‣ Skips the transactional layer of Neo4j and concurrently writes store files directly
  • 35. Importing into Neo4j :ID(Crime) :LABEL description export NEO=neo4j-enterprise-2.3.0 $NEO/bin/neo4j-import --into stackoverflow.db --id-type string --nodes:Post extracted/Posts_header.csv,extracted/Posts.csv.gz --nodes:User extracted/Users_header.csv,extracted/Users.csv.gz --nodes:Tag extracted/Tags_header.csv,extracted/Tags.csv.gz --relationships:PARENT_OF extracted/PostsRels_header.csv,extracted/PostsRels.csv.gz --relationships:ANSWERS extracted/PostsAnswers_header.csv,extracted/PostsAnswers.csv.gz --relationships:HAS_TAG extracted/TagsPosts_header.csv,extracted/TagsPosts.csv.gz --relationships:POSTED extracted/UsersPosts_header.csv,extracted/UsersPosts.csv.gz
  • 36. Expects files in a certain format :ID(Crime) :LABEL descriptionpostId:ID(Post) title body Nodes userId:ID(User) displayname views Rels :START_ID(User) :END_ID(Post)
  • 37. <?xml version="1.0" encoding="utf-16"?> <posts> ... <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="358" ViewCount="24247" Body="..." OwnerUserId="8" LastEditorUserId="451518" LastEditorDisplayName="Rich B" LastEditDate="2014-07-28T10:02:50.557" LastActivityDate=" 2015-08-01T12:55:11.380" Title="When setting a form's opacity should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;opacity&gt;" AnswerCount="13" CommentCount="1" FavoriteCount="28" CommunityOwnedDate="2012-10-31T16:42: 47.213" /> ... </posts> What do we have?
  • 38. <posts> ... <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate=" 2008-07-31T21:42:52.667" Score="358" ViewCount="24247" Body="..." OwnerUserId="8" LastEditorUserId=" 451518" LastEditorDisplayName="Rich B" LastEditDate="2014-07-28T10:02: 50.557" LastActivityDate="2015-08- 01T12:55:11.380" Title="When setting a form's opacity should I use a decimal or double?" XML to CSV Java program
  • 39. The generated files $ cat extracted/Posts_header.csv "postId:ID(Post)","title","postType:INT", "createdAt","score:INT","views:INT","answers:INT", "comments:INT","favorites:INT","updatedAt"
  • 40. The generated files $ gzcat extracted/Posts.csv.gz | head -n2 "4","When setting a forms opacity should I use a decimal or double?","1","2008-07-31T21:42:52.667"," 358","24247","13","1","28","2014-07-28T10:02:50.557" "6","Why doesn’t the percentage width child in absolutely positioned parent work?","1","2008-07- 31T22:08:08.620","156","11840","5","0","7","2015-04- 26T14:37:49.673"
  • 41. The generated files $ cat extracted/PostsRels_header.csv ":START_ID(Post)",":END_ID(Post)"
  • 42. Importing into Neo4j :ID(Crime) :LABEL description export NEO=neo4j-enterprise-2.3.0 $NEO/bin/neo4j-import --into stackoverflow.db --id-type string --nodes:Post extracted/Posts_header.csv,extracted/Posts.csv.gz --nodes:User extracted/Users_header.csv,extracted/Users.csv.gz --nodes:Tag extracted/Tags_header.csv,extracted/Tags.csv.gz --relationships:PARENT_OF extracted/PostsRels_header.csv,extracted/PostsRels.csv.gz --relationships:ANSWERS extracted/PostsAnswers_header.csv,extracted/PostsAnswers.csv.gz --relationships:HAS_TAG extracted/TagsPosts_header.csv,extracted/TagsPosts.csv.gz --relationships:POSTED extracted/UsersPosts_header.csv,extracted/UsersPosts.csv.gz IMPORT DONE in 3m 10s 661ms. Imported: 31138574 nodes 77930024 relationships 218106346 properties
  • 43. Tip: Make sure your data is clean :ID(Crime) :LABEL description ‣ Use a consistent line break style ‣ Ensure headers are consistent with data ‣ Quote Special characters ‣ Escape stray quotes ‣ Remove non-text characters
  • 44. Even more tips :ID(Crime) :LABEL description ‣ Get the fastest disk you can ‣ Use separate disk for input and output ‣ Compress your CSV files ‣ The more cores the better ‣ Separate headers from data
  • 45. The End ‣ https://github.com/mdamien/stackoverflow-neo4j ‣ http://neo4j.com/blog/import-10m-stack-overflow-questions/ ‣ http://neo4j.com/blog/cypher-load-json-from-url/ Michael Hunger @mesirii Mark Needham @markhneedham