Neo4j Import Webinar

Mark Needham (@markhneedham)
30th July 2015

Neo Technology, Inc Confidential
#neo4j
Chicago Crime dataset

The goal
Chicago Crime CSV file -> imported into -> Neo4j

Exploring the data
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
RETURN row
LIMIT 1

Sketch a rough initial model

Import a sample: Crimes
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (crime:Crime {
id: row.ID,
description: row.Description,
caseNumber: row.`Case Number`,
arrest: row.Arrest,
domestic: row.Domestic});

Import a sample: Crimes
Show how to do this better by splitting up the attributes: MERGE on the id only and set the remaining properties separately
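
One way to read "splitting up the attributes" is to MERGE on the identifying property alone and write the other properties with ON CREATE SET, so that a row whose description differs slightly does not produce a second Crime node. The sketch below is a hypothetical helper, not code from the webinar: split_row, KEY_COLUMN and PROPERTY_COLUMNS are illustrative names, and the statement uses the same {param} placeholder style as the rest of the deck.

```python
# Hypothetical sketch: separate the MERGE key from the remaining attributes.
# Column names are the ones used in the LOAD CSV slides.
KEY_COLUMN = "ID"
PROPERTY_COLUMNS = {
    "Description": "description",
    "Case Number": "caseNumber",
    "Arrest": "arrest",
    "Domestic": "domestic",
}

# MERGE matches on the id only; the other properties are written on creation.
MERGE_CRIME = """
MERGE (crime:Crime {id: {id}})
ON CREATE SET crime += {props}
"""


def split_row(row):
    """Return (id, props) for one CSV row given as a dict."""
    props = {prop: row[column] for column, prop in PROPERTY_COLUMNS.items()}
    return row[KEY_COLUMN], props


# Made-up sample row, only to show the shape of the result.
crime_id, props = split_row({
    "ID": "1", "Description": "SIMPLE", "Case Number": "HX1",
    "Arrest": "false", "Domestic": "false",
})
print(crime_id, sorted(props))
```

The id and props values would then be passed as the query parameters for MERGE_CRIME.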

Import a sample: Crime Types
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (:CrimeType {
name: row.`Primary Type`});

Import a sample: Crimes -> Crime Types
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MATCH (crime:Crime {
id: row.ID,
description: row.Description})
MATCH (crimeType:CrimeType {
name: row.`Primary Type`})
MERGE (crime)-[:TYPE]->(crimeType);

Add indexes
CREATE INDEX ON :Label(property)
CREATE INDEX ON :Crime(id);
CREATE INDEX ON :Location(name);
CREATE INDEX ON :CrimeType(name);
...

Periodic Commit
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
MERGE (crime:Crime {
id: row.ID,
description: row.Description});

Periodic Commit
• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT commits the transaction after a certain number of rows
• Default is 1000 rows but it’s configurable, e.g. USING PERIODIC COMMIT 10000
• Currently only works with LOAD CSV
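
The effect of periodic commit can be pictured as batching: instead of holding every row in one transaction, rows are grouped and each group is committed before the next one is read. The following is only an illustration of that idea in plain Python. batches and commit are hypothetical names; Neo4j does the equivalent internally when USING PERIODIC COMMIT is present.

```python
from itertools import islice


def batches(rows, size=1000):
    """Yield lists of at most `size` rows, mirroring the default of 1000."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


committed = []


def commit(batch):
    # Stand-in for committing one transaction; here it only records the size.
    committed.append(len(batch))


for batch in batches(range(2500)):
    commit(batch)

print(committed)  # [1000, 1000, 500]
```

Only one batch is held in memory at a time, which is why the row count per commit bounds the transaction state.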

Avoiding the Eager
• Cypher has an Eager operator, which runs one part of a query to completion before the next part starts so that reads and writes cannot interfere with each other
• We don’t want to see this operator when we’re importing data, because it slows things down a lot
• Put a diagram of eager => slow (maybe a query plan?)

LOAD CSV in summary
• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium-sized data (up to 10M rows)

Bulk loading an initial data set
• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used for large initial data sets
• Skips the transactional layer of Neo4j and writes store files directly

Expects files in a certain format
Nodes
:ID(Crime),:LABEL,description
:ID(Beat),:LABEL
Relationships
:START_ID(Crime),:END_ID(Beat),:TYPE
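
As a rough illustration of that layout, the sketch below writes one node file and one relationship file with Python's csv module. The header fields are the ones on the slide; the sample values (crime id 1, beat 0624, the ON_BEAT relationship type) are made up for the example, and write_csv is a hypothetical helper.

```python
import csv
import io


def write_csv(header, rows):
    """Return CSV text with the given header row followed by the data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Node file: an id scoped to the Crime id space, a label, and one property.
crimes_csv = write_csv(
    [":ID(Crime)", ":LABEL", "description"],
    [["1", "Crime", "SIMPLE"]],  # made-up sample row
)

# Relationship file: start id, end id, and relationship type.
crimes_beats_csv = write_csv(
    [":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
    [["1", "0624", "ON_BEAT"]],  # made-up sample row; ON_BEAT is hypothetical
)

print(crimes_csv)
print(crimes_beats_csv)
```

The id spaces in parentheses are what let the relationship file refer to a crime and a beat whose raw ids might otherwise collide.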

What we have…

Translation Phase required
Chicago Crime CSV file -> Translation Phase -> Neo4j ready CSV files

Spark all the things
Chicago Crime CSV file -> processed by -> Spark Job -> spits out -> Neo4j ready CSV files -> imported into -> Neo4j

The Spark Job
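
The translation amounts to pulling the distinct entities, and the links between them, out of the raw rows. The sketch below shows that logic in plain Python rather than Spark, and for beats only. The "Beat" column name and the ON_BEAT relationship type are assumptions, and translate is a hypothetical helper, not the GenerateCSVFiles job from the talk.

```python
def translate(rows):
    """Split raw crime rows into crime nodes, distinct beat nodes, and links."""
    crimes, beats, crimes_beats = [], [], []
    seen_beats = set()
    for row in rows:
        crimes.append([row["ID"], "Crime", row["Description"]])
        beat = row["Beat"]  # assumed column name
        if beat not in seen_beats:  # each beat becomes one node only
            seen_beats.add(beat)
            beats.append([beat, "Beat"])
        crimes_beats.append([row["ID"], beat, "ON_BEAT"])  # hypothetical type
    return crimes, beats, crimes_beats


raw_rows = [  # made-up sample data
    {"ID": "1", "Description": "SIMPLE", "Beat": "0624"},
    {"ID": "2", "Description": "RETAIL THEFT", "Beat": "0624"},
]
crimes, beats, crimes_beats = translate(raw_rows)
print(len(crimes), len(beats), len(crimes_beats))  # 2 1 2
```

Deduplicating the beats matters because the import tool expects each node id to appear once per id space.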

Submitting the Spark Job
./spark-1.3.0-bin-hadoop1/bin/spark-submit \
--driver-memory 5g \
--class GenerateCSVFiles \
--master local[8] \
target/scala-2.10/playground_2.10-1.0.jar
real 1m25.506s
user 8m2.183s
sys 0m24.267s

The generated files
$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv

Importing into Neo4j
DATA=tmp
NEO=./neo4j-enterprise-2.2.3
$NEO/bin/neo4j-import \
--into $DATA/crimes.db \
--nodes $DATA/crimes.csv \
--nodes $DATA/beats.csv \
--nodes $DATA/primaryTypes.csv \
--nodes $DATA/locations.csv \
--relationships $DATA/crimesBeats.csv \
--relationships $DATA/crimesPrimaryTypes.csv \
--relationships $DATA/crimesLocations.csv \
--stacktrace
IMPORT DONE in 36s 208ms
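
When the list of generated files changes, it can be easier to assemble the command programmatically. A minimal sketch, assuming the same flags as above; import_command is a hypothetical helper and nothing is executed here.

```python
def import_command(neo_home, into, nodes, relationships):
    """Build the neo4j-import argument list; pass it to subprocess to run it."""
    command = [neo_home + "/bin/neo4j-import", "--into", into]
    for path in nodes:
        command += ["--nodes", path]
    for path in relationships:
        command += ["--relationships", path]
    return command + ["--stacktrace"]


command = import_command(
    "./neo4j-enterprise-2.2.3",
    "tmp/crimes.db",
    nodes=["tmp/crimes.csv", "tmp/beats.csv"],
    relationships=["tmp/crimesBeats.csv"],
)
print(" ".join(command))
```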

Enriching the crime graph

2 options
JSON -> jq -> CSV -> LOAD CSV
JSON -> Language Driver -> HTTP API

Using py2neo to load JSON into Neo4j
import json
from py2neo import Graph, authenticate

# Register credentials for the local server before connecting.
authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# Load the whole JSON document; the name avoids shadowing the json module.
with open('categories.json') as data_file:
    document = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

# Pass the document as the {json} query parameter.
print(graph.cypher.execute(query, json=document))

Translate from JSON to CSV
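
A sketch of that translation in Python rather than jq, assuming the document shape implied by the query on the previous slide: a categories list whose entries have a name and a sub_categories list with code and description. to_csv is a hypothetical helper, and the sample document and the CSV column names are made up.

```python
import csv
import io
import json


def to_csv(document):
    """Flatten categories and their sub-categories into one CSV row each."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["category", "code", "description"])
    for category in document["categories"]:
        for sub in category["sub_categories"]:
            writer.writerow([category["name"], sub["code"], sub["description"]])
    return buffer.getvalue()


# Made-up sample document with the assumed shape.
sample = json.loads("""
{"categories": [{"name": "Index Crime",
                 "sub_categories": [{"code": "01A", "description": "Homicide"}]}]}
""")
print(to_csv(sample))
```

The resulting file has one row per sub-category, which is the shape LOAD CSV WITH HEADERS can consume directly.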

Import using LOAD CSV

Updating the graph
• As new crimes come in, we want to update the graph to take them into account

Updating the graph
• Import this using the REST Transactional API
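
A minimal sketch of what such a request body might look like, assuming the transactional HTTP endpoint of Neo4j 2.x (POST to /db/data/transaction/commit) and a MERGE on the crime id. build_payload and the sample crime are illustrative, and the request itself is not sent here.

```python
import json

# One parameterised statement handles a whole batch of new crimes.
STATEMENT = """
UNWIND {crimes} AS row
MERGE (crime:Crime {id: row.id})
ON CREATE SET crime.description = row.description
"""


def build_payload(crimes):
    """Wrap one parameterised statement in the transactional request format."""
    return {"statements": [{"statement": STATEMENT,
                            "parameters": {"crimes": crimes}}]}


payload = build_payload([{"id": "3", "description": "SIMPLE"}])  # made-up crime
body = json.dumps(payload)
# To send it: POST `body` to http://localhost:7474/db/data/transaction/commit
# with Content-Type: application/json and the server's credentials.
print(body)
```

Using MERGE keeps the update idempotent, so re-sending a crime that is already in the graph does not duplicate it.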

This talk brought to you by…

And that’s it…
