Neo4j Essentials - Sample Chapter
Neo4j Essentials
Understanding data is the key to any system, and the success stories of large-scale enterprise systems are built on analytical models that follow the paradigm of real-world entities and relationships.
Architects, developers, and data scientists have been struggling to derive real-world models and relationships from discrete and disparate systems consisting of structured and unstructured data.
We would all agree that without relationships, data is useless. If we cannot derive relationships between entities, the data will be of little or no use. After all, it is all about the connections in the data.
Coming from a relational background, our first choice would be to model this data with relational models and use an RDBMS.
No doubt we can use an RDBMS, but relational models focus more on entities and less on the relationships between them. There are certain disadvantages in using an RDBMS for modeling the above data structure:
Relationships are dynamic and evolving, which makes it more difficult to create models.
Let's consider the example of social networks:
Social networks are highly complex to define. They are agile and evolving.
Consider the preceding example: John is linked to Kevin because they work in the same company, and this will change as soon as either of them switches companies. Moreover, their IS_SUPERVISOR relationship depends on company dynamics and is not static. The same is the case with the IS_FRIEND relationship, which can also change at any time.
All the preceding examples point to the need for a generic data structure that can elegantly represent any kind of data and, at the same time, is easy to query in a highly accessible manner.
Let's introduce a different form of database, one that focuses on the relationships between entities rather than the entities themselves: Neo4j.
Neo4j is a NoSQL database that leverages the theory of graphs. Though there is no single definition of graphs, here is the simplest one, which helps in understanding the theory of graphs (http://en.wikipedia.org/wiki/Graph_(abstract_data_type)):
"A graph data structure consists of a finite (and possibly mutable) set
of nodes or vertices, together with a set of ordered pairs of these nodes
(or, in some cases, a set of unordered pairs). These pairs are known as edges
or arcs. As in mathematics, an edge (x,y) is said to point or go from x to y.
The nodes may be part of the graph structure, or may be external entities
represented by integer indices or references."
Neo4j is an open source graph database implemented in Java.
Its first version (1.0) was released in February 2010, and development has not stopped since.
It is amazing to see the pace at which Neo4j has evolved over the years. At the time of
writing this book, the current stable version is 2.1.5, released in September 2014.
Let's move forward and jump into the nitty-gritty of Neo4j. In the subsequent chapters, we will cover various aspects of Neo4j, including installation, data modeling and design, querying, performance tuning, security, extensions, and many more.
Neo4j Deployment
Designing a scalable and distributed software deployment architecture is another
challenge for developers/architects. Development teams are constantly striving to
deploy software in such a way that various enterprise concerns such as maintenance,
backups, restores, and disaster recovery are easier to perform and flexible enough to
scale and accommodate future needs.
Application software does provide various deployment options, but which one to use, and how, will largely depend upon the end users. For example, you may have more reads being performed than writes, or vice versa. So your deployment architecture would need to support and be optimized for a high volume of reads, a high volume of writes, or both.
Scalability is another aspect that is closely linked to deployments. In simple terms, it is the ability of a software system to support x number of user requests, either by adding more nodes to a cluster (horizontal scalability, or scaling out) or by upgrading the existing hardware with more processing power (CPUs) or memory (RAM) (vertical scalability). Now, which one to use, without impacting the SLAs, is again a challenging question that requires pushing the software to its limits, so that you can understand its behavior in extreme or rare circumstances and then define an appropriate strategy and plan for your production deployment.
We should also remember that the production deployments are evolving and will
change over time due to many reasons such as new versions of software, innovation
in hardware, changes in user behavior, and so on.
In this chapter, we will discuss the Neo4j deployment scenarios and also talk about
recommended setup and monitoring. This chapter will cover the following topics:
- Monitoring
- Neo4j Deployment
- Fault tolerance
As shown in the preceding diagram, each Neo4j server instance has two parts: one is the Neo4j HA database and the other is the cluster manager. The Neo4j HA database is responsible for storing and retrieving the data, and the cluster manager is responsible for ensuring an HA and fault-tolerant cluster.
Chapter 7
The Neo4j HA database communicates directly with the other instances for data replication, with the help of the cluster manager.
The following are the features provided by the cluster manager:
- Enabling data replication from the master node by polling at regular intervals
- Fault tolerance
Neo4j provides a custom implementation of the Multi-Paxos paradigm (http://en.wikipedia.org/wiki/Paxos_%28computer_science%29#Multi-Paxos).
Although reads can be served even with a single node, when it comes to writes, it is essential to have a quorum of nodes available in the cluster.
Whenever a Neo4j database instance becomes unavailable, due to any reason such as
hardware or network failure, the other database instances in the cluster will detect
that and mark it as temporarily failed.
Once the temporarily failed instance is available and ready to serve user requests, it
is automatically synced with the other nodes in the cluster.
If the master goes down, then another (best-suited) member will be elected and have
its role switched from slave to master after a quorum has been reached within the
cluster. Writes are blocked during the election process of the master node.
For all those cases where a quorum is not available and we still want to elect a master, Neo4j provides arbiter nodes that can be deployed to achieve the quorum. Arbiter nodes do not contain data, nor do they serve read or write requests. They are used only in the election of the master node, with the single purpose of breaking ties.
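The quorum arithmetic behind these rules can be sketched in a few lines of plain Java. This is illustrative math only, not a Neo4j API; the class and method names are our own:

```java
// Sketch of the majority-quorum arithmetic described above.
// A cluster of n voting members (data nodes plus arbiters) needs a
// strict majority, i.e. floor(n / 2) + 1 members, to elect a master.
public class QuorumMath {

    // Minimum number of reachable members required for a quorum.
    public static int quorumSize(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // True if the reachable members can still elect a master.
    public static boolean hasQuorum(int reachable, int clusterSize) {
        return reachable >= quorumSize(clusterSize);
    }

    public static void main(String[] args) {
        // A 3-member cluster survives one failure (2 >= 2) ...
        System.out.println(hasQuorum(2, 3)); // prints true
        // ... but a 2-member cluster cannot survive any (1 < 2), which is
        // why an arbiter is added as a third, data-less voter to break ties.
        System.out.println(hasQuorum(1, 2)); // prints false
    }
}
```

This is also why production clusters are usually sized with an odd number of voting members.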
Arbiter instances are configured in the same way as Neo4j HA members, in the neo4j.properties file under conf/. The following command is used to start an arbiter instance:
<$NEO4J_HOME>/bin/neo4j-arbiter start
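For reference, a minimal arbiter configuration sketch follows. The server ID and host addresses are placeholders, and the keys shown (ha.server_id, ha.initial_hosts) are the standard Neo4j 2.x HA settings:

```
# conf/neo4j.properties on the arbiter machine (illustrative values)
ha.server_id=3
ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001
```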
In cases where the new master performs some changes to the data before the old master recovers, there will be two branches of the database, diverging from the point at which the old master became unavailable. The old master will move away from its database (its branch), download a full copy from the new master, be marked as available, and be added as a slave node in the cluster.
- ha.tx_push_factor: This is the number of slave nodes the master will try to push a transaction to before returning success to the client. We can also set this to 0, which switches off synchronous data writes to slave nodes; this would increase write performance, but would also increase the risk of data loss, as the master would be the only node containing the transaction.
- ha.tx_push_strategy: This decides the priority of the nodes that will be selected to push the transactions to. In the case of fixed, the priority is decided based on the value of ha.server_id, on the principle of highest first.
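Neo4j 2.x exposes these push settings as ha.tx_push_factor and ha.tx_push_strategy; an illustrative neo4j.properties sketch (the values are examples, not recommendations):

```
# conf/neo4j.properties (illustrative values)
# Push each committed transaction to 2 slaves before acknowledging the client
ha.tx_push_factor=2
# fixed = prefer slaves with the highest ha.server_id; round_robin rotates
ha.tx_push_strategy=fixed
```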
All write transactions on a slave will be first synchronized with the master. When the
transaction commits, it will be first committed on the master, and if successful, then
it will be committed on the slave. To ensure consistency, the slave has to be updated
and synchronized with the master before performing a write operation. This is built
into the communication protocol between the slave and the master so that updates
are applied automatically to a slave node communicating with its master node.
Neo4j provides full data replication on each node, so that each node is self-sufficient to serve read/write requests, which also helps in achieving low latency. In order to serve a global audience, additional Neo4j servers can be configured as read-only slave servers, and these servers can be placed geographically near the customers. These read-only slave servers are synced up with the master in real time, and all local read requests are directed to and served by them, which provides data locality for our client applications.
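A geographically remote, read-only instance of this kind can be pinned to the slave role. Assuming the Neo4j 2.x HA setting ha.slave_only, a configuration sketch is:

```
# conf/neo4j.properties on a remote read-only instance (illustrative values)
ha.server_id=5
# Never participate in master elections; serve reads only
ha.slave_only=true
```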
2. Full backup: Create a blank directory on the machine where you want to take the full backup and run the backup tool: <$NEO4J_HOME>/bin/neo4j-backup -host <IP-ADDRESS> -port <PORT#> -to <DIR location on remote server>.
3. Incremental backup: Run the same command that we used to take the full backup, and neo4j-backup will copy only the updates since the last backup. Incremental backups can fail if the provided directory does not contain a valid backup, or if the previous backup is from an older version of the Neo4j database.
4. Recovering the database from a backup: Modify the org.neo4j.server.database.location property in <$NEO4J_HOME>/conf/neo4j-server.properties to the location of the directory where the backup is stored, and restart your Neo4j server.
Online backups can also be performed programmatically in Java, using org.neo4j.backup.OnlineBackup.
Advanced settings
Let's discuss the advanced settings exposed by Neo4j, which should be considered
for any production deployment:
<$NEO4J_HOME>/conf/neo4j-server.properties - Neo4j server configuration file:

Parameter                              Default value  Description
org.neo4j.server.webserver.address     0.0.0.0        Interface the web server listens on; 0.0.0.0 accepts connections on all interfaces
org.neo4j.server.webserver.maxthreads  200            Maximum number of worker threads used by the web server
org.neo4j.server.transaction.timeout   60             Timeout (in seconds) after which orphaned transactions are rolled back

<$NEO4J_HOME>/conf/neo4j.properties - Neo4j database configuration file:

Parameter                              Default value  Description
query_cache_size                       100            Number of Cypher execution plans cached for repeated queries
keep_logical_logs                      7 days         How long the logical (transaction) logs are retained for backups and cluster catch-up
logical_log_rotation_threshold         25M            Size at which the logical log file is rotated
Apart from the properties mentioned in the preceding table, we can also configure and tune the JVM by defining configurations such as the type of GC, GC logs, and max and min memory in <$NEO4J_HOME>/conf/neo4j-wrapper.conf.
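As a sketch, such JVM tuning entries in neo4j-wrapper.conf could look like the following; the flags and sizes are illustrative, not recommendations:

```
# conf/neo4j-wrapper.conf (illustrative values)
# Fixed heap size, in MB, to avoid heap-resize pauses
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
# GC choice and GC logging
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
```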
To summarize, the Neo4j HA architecture meets enterprise needs in a true sense: it provides a high degree of fault tolerance, can operate even with a single machine, and exposes various configurations that help in tuning our database for maximum throughput.
Especially in cases where we need high throughput from our write operations, it is recommended to introduce a queuing solution, so that the cluster can service a steady and manageable stream of write operations. It also protects us from losing write transactions in extreme situations where our master node abruptly goes down.
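The queuing idea can be sketched with a small bounded buffer in plain Java. The class is hypothetical, the queued String is a stand-in for a real write command, and no Neo4j API is involved:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the queuing idea above: request threads enqueue write
// commands, and a single drainer thread forwards them to the cluster
// at a steady, manageable rate.
public class WriteBuffer {

    private final BlockingQueue<String> queue;

    public WriteBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called by request threads; returns false when the buffer is full,
    // so the caller can retry or reject instead of overwhelming the master.
    public boolean submit(String write) {
        return queue.offer(write);
    }

    // Called by the drainer thread; removes up to maxBatch pending writes
    // so they can be sent to the master together.
    public List<String> drain(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    public int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        WriteBuffer buffer = new WriteBuffer(1000);
        buffer.submit("CREATE (:Person {name: 'John'})"); // hypothetical write
        System.out.println(buffer.drain(100).size()); // prints 1
    }
}
```

In a real deployment the buffer would typically be an external queue (or messaging system) sitting in front of the cluster, but the back-pressure idea is the same.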
Batch writes
We talked about batch writes and the optimization techniques in Chapter 2, Ready for
Take Off.
We should remember that Neo4j provides ACID properties, and every request to Neo4j is wrapped in a transaction, which induces some overhead. Planning your writes and submitting them to the server in batches using a low-level API like BatchInserter helps in inserting data directly into Neo4j, avoiding overheads such as transactions, and definitely helps in achieving higher throughput.
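The batching idea can be sketched generically. The chunking below is plain Java; handing each chunk to BatchInserter (or any bulk API) is left as a comment, since the exact insertion calls depend on your data:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic sketch of grouping individual inserts into fixed-size batches,
// so each batch can be handed to a low-level bulk API in one go instead
// of paying per-request transaction overhead.
public class Batches {

    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                    items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);
        for (List<Integer> batch : partition(rows, 2)) {
            // For each batch: create the nodes via BatchInserter here.
            System.out.println(batch.size()); // prints 2, 2, 1
        }
    }
}
```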
Vertical scaling
Vertical scaling, or upgrading the hardware of the master node, is another option for scaling our write operations. Neo4j does support SSD drives such as Fusion-io, which provide exceptionally high write performance.
Load balancer
The Neo4j HA cluster does not provide any load balancing to distribute the load over the nodes in a cluster, but we can introduce a software or hardware load balancer for distributing the load equally over the various nodes in the cluster.
For example, we can introduce a software load balancer such as HAProxy (http://www.haproxy.org/) or Apache mod_proxy (http://httpd.apache.org/docs/2.2/mod/mod_proxy.html), which would intercept REST calls and, based on routing rules, delegate the request to the appropriate node in the cluster.
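As an illustration, a minimal HAProxy sketch for a three-node cluster might look like the following; the hostnames, ports, and health-check URL are placeholders to adapt to your setup:

```
# haproxy.cfg sketch (illustrative addresses)
frontend neo4j-in
    bind *:80
    default_backend neo4j-cluster

backend neo4j-cluster
    balance roundrobin
    option httpchk GET /db/manage/server/ha/available
    server neo4j1 10.0.0.1:7474 check
    server neo4j2 10.0.0.2:7474 check
    server neo4j3 10.0.0.3:7474 check
```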
Apart from distributing the load across the cluster, the following are the other
benefits of introducing a load balancer to our Neo4j cluster:
- The client does not have to be aware of the location and address of the physical nodes of the cluster
- The nodes can be removed or added at any point of time, while the customer can still perform reads
Cache-based sharding
Caching and sharding are two important strategies for any production system.
Neo4j provides a high-performance cache with the Enterprise Edition, which can be leveraged for fast lookups. It also provides the flexibility to assign specific amounts of memory to the node and relationship caches.
Enable the high-performance cache by setting cache_type=hpc in <$NEO4J_HOME>/conf/neo4j.properties.
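Assuming the Neo4j 2.x Enterprise Edition cache settings (node_cache_size, relationship_cache_size), a sketch of such a configuration is:

```
# conf/neo4j.properties (Enterprise Edition; sizes are illustrative)
cache_type=hpc
node_cache_size=2G
relationship_cache_size=800M
```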
In the Neo4j world, we consider data large when it cannot fit into the provided memory (RAM), so the next option would be to introduce shards and distribute these shards to individual nodes.
Sharding data and then caching it on individual nodes is a reasonable and scalable solution, but it is difficult to shard graphs with a traditional sharding approach, and it may not scale for real-time transactions either. That is the reason there is no utility/API provided in Neo4j to shard the data.
So what's next?
The answer is cache-based sharding.
In cache-based sharding, all nodes in a cluster contain the full data, but we partition the type of requests served by each database instance to increase the likelihood of hitting a warm cache for a given request. Warm caches in Neo4j deliver very high performance, especially the high-performance cache (HPC).
In short, we would recommend a routing strategy that routes the user read requests
in such a manner that they are always served by a specific set of nodes in a cluster.
The strategy could be based on the domain, or maybe on a specific type of query or characteristic of the data. We could also use sticky sessions, where the first and all subsequent requests from a client are served by the same node.
In any case, we need to ensure that the majority of read requests are served by a warm cache and not from disk.
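A deterministic routing rule of this kind can be sketched in plain Java. The router class and server names are hypothetical, and in a real deployment this logic would live in the load balancer:

```java
// Sketch of a cache-friendly routing rule: always send the same user's
// reads to the same (hypothetical) server, so that user's neighborhood
// of the graph stays warm in that server's cache.
public class CacheRouter {

    private final String[] readServers;

    public CacheRouter(String... readServers) {
        this.readServers = readServers;
    }

    // Deterministic: the same userId always maps to the same server.
    public String serverFor(String userId) {
        return readServers[Math.floorMod(userId.hashCode(), readServers.length)];
    }

    public static void main(String[] args) {
        CacheRouter router = new CacheRouter("neo4j-a", "neo4j-b", "neo4j-c");
        System.out.println(
                router.serverFor("alice").equals(router.serverFor("alice"))); // prints true
    }
}
```

Hashing on a user ID is just one possible partitioning key; the same sketch works for any domain attribute that groups related graph traversals together.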
Monitoring
We discussed the Neo4j browser in Chapter 1, Installation and the First Query. The Neo4j browser exposes the basic configuration of our server and database, but that is not enough for enterprise-class systems, where we need detailed monitoring, statistics, and options to modify certain configurations at runtime without restarting the server.
Neo4j exposes JMX beans for an advanced level of monitoring and management, which not only expose the overall health of our Neo4j server and database, but also provide certain operations that can be invoked on live Neo4j instances, without restarting the server. Most of the monitoring options exposed through JMX beans are only available with the Enterprise version of Neo4j.
For more information on JMX, refer to http://www.oracle.com/technetwork/java/javase/tech/javamanagement-140525.html.
And we are done! You will be able to see the JConsole UI, which displays the health of your system, such as memory utilization, details about the JVM, and so on.
4. Click on the last tab, MBeans, and you will see two JMX beans: org.neo4j and org.neo4j.ServerManagement.
-Dcom.sun.management.jmxremote.access.file=conf/jmx.access
All the preceding properties define the configuration of JMX beans such as
communication port, username and password files, and so on.
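These are the standard JVM remote-JMX flags; a complete sketch in <$NEO4J_HOME>/conf/neo4j-wrapper.conf, with the port number as a placeholder, might look like this:

```
# conf/neo4j-wrapper.conf (port is a placeholder)
wrapper.java.additional=-Dcom.sun.management.jmxremote.port=3637
wrapper.java.additional=-Dcom.sun.management.jmxremote.authenticate=true
wrapper.java.additional=-Dcom.sun.management.jmxremote.ssl=false
wrapper.java.additional=-Dcom.sun.management.jmxremote.password.file=conf/jmx.password
wrapper.java.additional=-Dcom.sun.management.jmxremote.access.file=conf/jmx.access
```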
2. Next, we need to modify the username/password for connecting to the JMX server remotely. Open <$NEO4J_HOME>/conf/jmx.access and uncomment the line control readwrite. Uncommenting this line enables the admin role for the JMX beans, so you can modify and invoke the operations exposed by Neo4j JMX beans.
3. Now we will add the username and password in jmx.password. Open <$NEO4J_HOME>/conf/jmx.password and, at the end of the file, enter control <space><password>, for example control Sumit, where the first word is the username and the second word is the password. The username should match the entry made in the jmx.access file.
4. We also need to make sure that the permissions for conf/jmx.password are 0600 on Linux. Open the console and execute sudo chmod 0600 conf/jmx.password. For Windows, follow the instructions given at http://docs.oracle.com/javase/7/docs/technotes/guides/management/security-windows.html.
6. And we are done! Click on Connect and you will be able to see the JConsole UI exposing the health statistics of your system, along with the MBeans exposed by Neo4j.
Summary
In this chapter, we discussed the Neo4j architecture and its various components, which converge to provide a scalable and HA architecture. We also talked about various principles of the Neo4j cluster and provided recommendations for various production scenarios. Lastly, we discussed the monitoring options available to Neo4j users.
In the next chapter, we will discuss Neo4j security and extensions.