Neo4j Import Webinar
Mark Needham (@markhneedham)
30th July 2015
Neo Technology, Inc Confidential
#neo4j
Chicago Crime dataset
The goal
Chicago Crime CSV file, imported into Neo4j
Exploring the data
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
RETURN row
LIMIT 1
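The same first look can be had outside Neo4j. A minimal Python sketch, assuming a small in-memory sample in place of the real file; the column names are the ones used in the LOAD CSV queries in this deck, and the values are made up for illustration:

```python
import csv
import io

# Hypothetical two-line sample standing in for the real Chicago Crime CSV.
sample = io.StringIO(
    "ID,Case Number,Primary Type,Description,Arrest,Domestic\n"
    "1,HX100001,THEFT,OVER $500,false,false\n"
)


def first_row(handle):
    """Return the first data row as a mapping keyed by column header."""
    return next(csv.DictReader(handle))


row = first_row(sample)
for column, value in row.items():
    print(f"{column}: {value}")
```

With the real file, `open(path, newline="")` would replace the `io.StringIO` sample.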
Sketch a rough initial model
Import a sample: Crimes
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description,
  caseNumber: row.`Case Number`,
  arrest: row.Arrest,
  domestic: row.Domestic});
Show how to do this better by splitting up the attributes
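Why splitting up the attributes matters can be shown without a database. The sketch below is a toy stand-in for MERGE written in Python, not Neo4j's implementation; the `merge` helper and the sample rows are hypothetical. Matching on every attribute creates a second node as soon as one value differs, while matching on the key alone and setting the rest on creation does not:

```python
def merge(nodes, match, on_create=None):
    """Toy MERGE: reuse a node containing every key/value in `match`,
    otherwise create one from `match` plus the `on_create` properties."""
    for node in nodes:
        if all(node.get(key) == value for key, value in match.items()):
            return node
    node = dict(match, **(on_create or {}))
    nodes.append(node)
    return node


# Hypothetical rows: the same crime ID appears twice with a changed flag.
rows = [
    {"ID": "1", "Description": "SIMPLE", "Arrest": "false"},
    {"ID": "1", "Description": "SIMPLE", "Arrest": "true"},
]

# Merging on every attribute: the changed flag produces a duplicate crime.
all_props = []
for row in rows:
    merge(all_props, {"id": row["ID"],
                      "description": row["Description"],
                      "arrest": row["Arrest"]})

# Merging on the key only and setting the other attributes on creation.
key_only = []
for row in rows:
    merge(key_only, {"id": row["ID"]},
          {"description": row["Description"], "arrest": row["Arrest"]})

print(len(all_props), len(key_only))
```

In Cypher terms this corresponds to a MERGE on the id alone followed by ON CREATE SET for the remaining properties, the same pattern the py2neo slide uses for sub-categories.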
Import a sample: Crime Types
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MERGE (:CrimeType {
  name: row.`Primary Type`});
Import a sample: Crimes -> Crime Types
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
WITH row LIMIT 100
MATCH (crime:Crime {
  id: row.ID,
  description: row.Description})
MATCH (crimeType:CrimeType {
  name: row.`Primary Type`})
MERGE (crime)-[:TYPE]->(crimeType);
Add indexes
CREATE INDEX ON :Label(property)
CREATE INDEX ON :Crime(id);
CREATE INDEX ON :Location(name);
CREATE INDEX ON :CrimeType(name);
...
Periodic Commit
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///Users/markneedham/projects/neo4j-spark-chicago/Crimes_-_2001_to_present.csv"
AS row
MERGE (crime:Crime {
  id: row.ID,
  description: row.Description});
• Neo4j keeps all transaction state in memory, which becomes problematic for large CSV files
• USING PERIODIC COMMIT flushes the transaction after a certain number of rows
• Default is 1000 rows but it’s configurable
• Currently only works with LOAD CSV
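The batching idea behind periodic commit can be sketched in plain Python. This illustrates the concept only, not how Neo4j implements it; the `batches` helper is hypothetical, and the 2500-row input is a placeholder for rows read from a CSV file:

```python
from itertools import islice


def batches(rows, size=1000):
    """Yield lists of at most `size` rows, one list per commit."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


committed = []
for batch in batches(range(2500), size=1000):
    # A real importer would write the batch and commit the transaction here,
    # so that at most `size` rows of state are held in memory at once.
    committed.append(len(batch))

print(committed)  # [1000, 1000, 500]
```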
Avoiding the Eager
• Cypher has an Eager operator, which runs one part of a query over every row before the next part starts, so that later writes cannot affect earlier reads
• We don’t want to see this operator when we’re importing data: it will slow things down a lot
• Put a diagram of eager => slow (maybe a query plan?)
LOAD CSV in summary
• ETL power tool
• Built into Neo4j since version 2.1
• Can load data from any URL
• Good for medium size data (up to 10M rows)
Bulk loading an initial data set
• Introducing the Neo4j Import Tool
• Find it in the bin folder of your Neo4j download
• Used to load large initial data sets
• Skips the transactional layer of Neo4j and writes store files directly
Expects files in a certain format
Nodes
:ID(Crime) :LABEL description
:ID(Beat) :LABEL
Relationships
:START_ID(Crime) :END_ID(Beat) :TYPE
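As a rough illustration of what such files could look like on disk, here is a Python sketch that renders one node file per label and one relationship file. The header fields come from the slide; the sample values, the `ON_BEAT` relationship type and the `render` helper are assumptions made for the example:

```python
import csv
import io


def render(header, rows):
    """Render a header line and data rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Node files: an ID column scoped to an ID space, a label, then properties.
crimes = render([":ID(Crime)", ":LABEL", "description"],
                [["1", "Crime", "OVER $500"]])
beats = render([":ID(Beat)", ":LABEL"],
               [["0624", "Beat"]])

# Relationship file: start ID, end ID and relationship type (hypothetical).
crimes_beats = render([":START_ID(Crime)", ":END_ID(Beat)", ":TYPE"],
                      [["1", "0624", "ON_BEAT"]])

print(crimes, beats, crimes_beats, sep="\n")
```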
What we have…
Translation Phase required
Chicago Crime CSV file → Translation Phase → Neo4j ready CSV files
Spark all the things
Chicago Crime CSV file → processed by → Spark Job → spits out → Neo4j ready CSV files → imported into → Neo4j
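The Spark job itself is not reproduced in this text. To give a feel for the translation phase, here is a small sketch that uses Python's csv module instead of Spark: it splits raw rows into crimes, distinct beats and the relationships between them. The `Beat` column name, the sample rows and the `ON_BEAT` type are assumptions for illustration:

```python
import csv
import io

# Hypothetical sample of the raw file; only the columns used here are shown.
raw = io.StringIO(
    "ID,Description,Beat\n"
    "1,OVER $500,0624\n"
    "2,SIMPLE,0624\n"
    "3,TO VEHICLE,1112\n"
)


def translate(handle):
    """Split raw crime rows into node rows and relationship rows."""
    crimes, beats, crimes_beats = [], {}, []
    for row in csv.DictReader(handle):
        crimes.append((row["ID"], "Crime", row["Description"]))
        # Each beat becomes one node, however many crimes refer to it.
        beats.setdefault(row["Beat"], (row["Beat"], "Beat"))
        crimes_beats.append((row["ID"], row["Beat"], "ON_BEAT"))
    return crimes, list(beats.values()), crimes_beats


crimes, beats, crimes_beats = translate(raw)
print(len(crimes), len(beats), len(crimes_beats))
```

A plain script like this only makes sense for small inputs; the deck uses Spark for the full data set.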
The Spark Job
Submitting the Spark Job
./spark-1.3.0-bin-hadoop1/bin/spark-submit \
  --driver-memory 5g \
  --class GenerateCSVFiles \
  --master local[8] \
  target/scala-2.10/playground_2.10-1.0.jar

real 1m25.506s
user 8m2.183s
sys 0m24.267s
The generated files
$ ls -1 tmp/*.csv
tmp/beats.csv
tmp/crimeDates.csv
tmp/crimes.csv
tmp/crimesBeats.csv
tmp/crimesDates.csv
tmp/crimesLocations.csv
tmp/crimesPrimaryTypes.csv
tmp/dates.csv
tmp/locations.csv
tmp/primaryTypes.csv
Importing into Neo4j
DATA=tmp
NEO=./neo4j-enterprise-2.2.3
$NEO/bin/neo4j-import \
  --into $DATA/crimes.db \
  --nodes $DATA/crimes.csv \
  --nodes $DATA/beats.csv \
  --nodes $DATA/primaryTypes.csv \
  --nodes $DATA/locations.csv \
  --relationships $DATA/crimesBeats.csv \
  --relationships $DATA/crimesPrimaryTypes.csv \
  --relationships $DATA/crimesLocations.csv \
  --stacktrace

IMPORT DONE in 36s 208ms
Enriching the crime graph
2 options
Option 1: JSON → jq → CSV → LOAD CSV
Option 2: JSON → Language Driver → HTTP API
Using py2neo to load JSON into Neo4j
import json
from py2neo import Graph, authenticate

# The API used here matches py2neo 2.x (an assumption about the installed version)
authenticate("localhost:7474", "neo4j", "foobar")
graph = Graph()

# Parse the file into a separate name so the json module is not shadowed
with open('categories.json') as data_file:
    document = json.load(data_file)

query = """
WITH {json} AS document
UNWIND document.categories AS category
UNWIND category.sub_categories AS subCategory
MERGE (c:CrimeCategory {name: category.name})
MERGE (sc:SubCategory {code: subCategory.code})
ON CREATE SET sc.description = subCategory.description
MERGE (c)-[:CHILD]->(sc)
"""

# Pass the parsed document as the {json} query parameter
print(graph.cypher.execute(query, json=document))
Enriching the crime graph
Translate from JSON to CSV
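The options slide does this step with jq. As an alternative sketch, the same flattening can be done in Python. The field names (`categories`, `name`, `sub_categories`, `code`, `description`) follow the Cypher query on the py2neo slide; the sample values and the output column names are made up:

```python
import csv
import io
import json

# Hypothetical document shaped like the one the py2neo query expects.
document = json.loads("""
{"categories": [
  {"name": "Theft", "sub_categories": [
    {"code": "0810", "description": "Over $500"},
    {"code": "0820", "description": "$500 and under"}
  ]}
]}
""")


def to_csv(doc):
    """Flatten category -> sub-category nesting into one CSV row per pair."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["category", "code", "description"])
    for category in doc["categories"]:
        for sub in category["sub_categories"]:
            writer.writerow([category["name"], sub["code"], sub["description"]])
    return out.getvalue()


print(to_csv(document))
```

The resulting file has one row per sub-category, which is the flat shape LOAD CSV reads.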
Import using LOAD CSV
Updating the graph
• As new crimes come in we want to update the graph to take them into account
• Import this using REST Transactional API
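A sketch of what a request to the transactional endpoint might carry, built in Python without sending anything. It assumes the Neo4j 2.x JSON shape (a `statements` list of statement/parameters pairs) and the `{param}` parameter syntax of that era; the Cypher statement, the sample crime and the `build_payload` helper are hypothetical:

```python
import json


def build_payload(crimes):
    """Build one parameterised statement per new crime."""
    statement = (
        "MERGE (crime:Crime {id: {id}}) "
        "ON CREATE SET crime.description = {description}"
    )
    return {
        "statements": [
            {"statement": statement, "parameters": crime} for crime in crimes
        ]
    }


# Hypothetical new crime arriving after the initial import.
new_crimes = [{"id": "4", "description": "SIMPLE"}]
payload = build_payload(new_crimes)
body = json.dumps(payload)

# The body would be POSTed to the server's transactional endpoint, for
# example http://localhost:7474/db/data/transaction/commit on a local 2.x
# install (host and port assumed from the py2neo slide).
print(body)
```

Sending several statements in one request keeps them in a single transaction, so a batch of new crimes is applied together or not at all.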
This talk brought to you by…
And that’s it…

Editor's Notes

  • #3: We’re going to look at how we’d go about importing the Chicago Crime open dataset
  • #4: Available as a CSV dump much like a Hadoop or relational database dump with lots of different records on one row
  • #5: The goal is to get this into Neo4j and then make some money from having that data imported
  • #6: Now before we do anything it’s time to do a bit of exploration of the data so we know what we’re dealing with. We could choose to do that using command line tools like grep, awk and so on but we could also use Neo4j’s LOAD CSV command.
  • #7: Let’s have a look at one row to see what we’ve got in the data set.
  • #8: These are the ones that I’d probably extract to create a graph. The whole record is a ‘crime’ but then we’ve got some other latent entities that we can reify.
  • #11: Let’s start by importing 100 rows so that we can iterate really quickly and get a feel for the model that we’ve come up with
  • #14: Mention that it’s best to separate your queries so you don’t have to do multiple MERGEs in the same query – it’s fine for playing around but when you need speed, split them up
  • #15: Don’t forget to add indexes or this is going to be incredibly slow
  • #16: For us we might add an index for each of our main types of entity so we can easily look them up later
  • #17: This will flush the transaction every 1000 rows by default.
  • #18: Here’s some more information about periodic commit
  • #20: Ok so if we’re to summarise LOAD CSV this is what we’ve got
  • #21: This is a new tool introduced in Neo4j 2.2. It’s super fast but doesn’t give you the transactional guarantees of normal Neo4j. Effectively we’re building an offline copy of the database
  • #22: This is the format of the files that the import tool expects
  • #24: We need to translate those files to something more suitable
  • #25: We need to translate those files to something more suitable. We could write a program to do that but Spark provides quite a nice API for doing this.
  • #32: We can get the categories from the Chicago Police website and enrich our graph with a little hierarchy of crimes.
  • #33: Now that we’ve got this as JSON we’re going to import it into the graph
  • #35: We can get the categories from the Chicago Police website
  • #41: So it probably goes without saying that the Chicago tourism board sponsored this talk
  • #42: This is a new tool introduced in Neo4j 2.2. It’s super fast but doesn’t give you the transactional guarantees of normal Neo4j. Effectively we’re building an offline copy of the database