R23-IDS-Unit4-PPT_2.0
UNIT IV
Tools and Applications of Data Science:
Introducing Neo4j for dealing with graph databases, graph query
language Cypher, Applications of graph databases, Python libraries
like nltk and SQLite for handling Text mining and analytics, case
study on classifying Reddit posts
Introducing Neo4j: a graph database
Connected data is generally stored in graph
databases. These databases are specifically
designed to cope with the structure of connected
data. The landscape of available graph databases is
rather diverse these days.
The three best-known ones, in order of decreasing
popularity, are Neo4j, OrientDB, and Titan.
This section introduces the concept of connected data
and its representation as graph data:
■ Connected data—As the name indicates, connected
data is characterized by the fact that the data at
hand has a relationship that makes it connected.
■ Graphs—Often referred to in the same sentence as
connected data. Graphs are well suited to represent
the connectivity of data in a meaningful way.
■ Graph databases—The reason this subject is
meriting particular attention is that data, besides
increasing in size, is also becoming more
interconnected. Well-known examples of connected
data are easy to find.
A prominent example of data that takes a network
form is social media data.
Social media allows us to share and exchange data in
networks, thereby generating a great amount of
connected data.
Features of Neo4j
Neo4j is a graph database that stores the data in a graph
containing nodes and relationships (both are allowed to
contain properties).
This type of graph database is known as a property graph and
is well suited for storing connected data.
It has a flexible schema that gives us the freedom to change
our data structure, so we can add new data and new
relationships whenever they are needed.
It’s an open source project, mature technology, easy to install,
user-friendly, and well documented.
Neo4j also has a browser-based interface that facilitates the
creation of graphs for visualization purposes.
Neo4j can be downloaded from
http://neo4j.com/download/.
Four basic structures in Neo4j:
■ Nodes—Represent entities such as documents, users,
recipes, and so on. Certain properties could be assigned
to nodes.
■ Relationships—Exist between the different nodes. They
can be accessed either stand-alone or through the
nodes they’re attached to. Relationships can also
contain properties, hence the name property graph
model. Every relationship has a name and a direction,
which together provide semantic context for the nodes
connected by the relationship.
■ Properties—Both nodes and relationships can have
properties. Properties are defined by key-value pairs.
■ Labels—Can be used to group similar nodes to facilitate
faster traversal through graphs.
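The four structures above can be sketched in plain Python. This is
only an illustration of the property graph model, not Neo4j's API;
the class names Node and Relationship, the "likes" relationship,
and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Labels group similar nodes, e.g. {"User"} or {"Recipe"}.
    labels: set
    # Properties are key-value pairs attached to the node.
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    # A relationship has a direction (start -> end) and a name (rel_type).
    start: Node
    rel_type: str
    end: Node
    # Relationships can carry properties too.
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: a user who likes a recipe.
paul = Node({"User"}, {"name": "Paul"})
pasta = Node({"Recipe"}, {"title": "Pasta"})
likes = Relationship(paul, "likes", pasta, {"rating": 5})

graph_nodes = [paul, pasta]

# Labels let us pick out one group of nodes without inspecting properties.
users = [n for n in graph_nodes if "User" in n.labels]
print([n.properties["name"] for n in users])
```

A real graph database stores and indexes these structures for fast
traversal; the sketch only shows how the four concepts relate.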
Before conducting an analysis, a good habit is to design
your database carefully so it fits the queries you’d like
to run down the road when performing your analysis.
Graph databases have the pleasant characteristic that
they’re whiteboard friendly. If one tries to draw the
problem setting on a whiteboard, this drawing will
closely resemble the database design for the defined
problem.
Now, how do we retrieve the data? To explore our data, we
need to traverse the graph, following predefined paths
to find the patterns we’re searching for.
The Neo4j browser is an ideal environment to create
and play around with your connected data until you get
to the right kind of representation for optimal queries.
The flexible schema of the graph database suits us
well here. In this browser you can retrieve your data
in rows or as a graph.
Neo4j has its own query language, Cypher, which eases
both the creation and the querying of graphs.
Cypher is a highly expressive language that shares
enough with SQL to make it easier to learn.
We can create our own data using Cypher and
insert it into Neo4j. Then we can play around with
the data.
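As a rough sketch of what creating data with Cypher looks like, the
Python helpers below assemble a Cypher CREATE statement for two
User nodes joined by a "knows" relationship. The helper names and
the sample property values are hypothetical, and the script only
builds the statement text; it does not connect to Neo4j.

```python
def cypher_literal(value):
    """Render a Python bool, number, or string as a Cypher literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and single quotes inside string values.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def node_pattern(variable, label, properties):
    """Build a node pattern such as (p1:User {name: 'Paul'})."""
    props = ", ".join(
        f"{key}: {cypher_literal(val)}" for key, val in properties.items()
    )
    return f"({variable}:{label} {{{props}}})"


def create_relationship(start_pattern, rel_type, end_pattern):
    """Build a CREATE statement for start -[:rel_type]-> end."""
    return f"CREATE {start_pattern}-[:{rel_type}]->{end_pattern}"


# Hypothetical sample values, chosen only for illustration.
paul = node_pattern("p1", "User", {"name": "Paul", "lastname": "Example"})
anna = node_pattern("p2", "User", {"name": "Anna", "lastname": "Sample"})
statement = create_relationship(paul, "knows", anna)
print(statement)
```

The printed statement can be pasted into the Neo4j browser. When
running Cypher from a program, passing values as query parameters
is generally safer than building strings by hand.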
Cypher: a graph query language
For a more extensive introduction to Cypher you can
visit http://neo4j.com/docs/stable/cypher-query-lang.html.
We’ll start by drawing a simple social graph
accompanied by a basic query to retrieve a predefined
pattern as an example.
Figure 7.8 shows a simple social graph of two nodes,
connected by a relationship of type “knows”. Both nodes
have the properties “name” and “lastname”.
Now, to answer the question “Who does Paul know?”, we query
the graph using Cypher.
To find a pattern in Cypher, we’ll start with a Match clause.
In this query we’ll start searching at the User node whose
name property is “Paul”.
Note how the node is enclosed within parentheses, as shown
in the code snippet below, and the relationship is enclosed by
square brackets.
Relationship types are written with a colon (:) prefix, and the
direction is described using arrows. The placeholder p2 will
contain all the User nodes that have an inbound relationship
of type “knows” coming from Paul.
With the return clause we can retrieve the results of the
query.
MATCH (p1:User {name: 'Paul'})-[:knows]->(p2:User)
RETURN p2.name
Notice how closely the verbal formulation of our
question matches the way the graph database
translates it into a traversal.
In Neo4j, this impressive expressiveness is made
possible by its graph query language, Cypher.
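To make the traversal behind the “Who does Paul know?” query more
concrete, here is a minimal plain-Python sketch over a tiny
in-memory graph. It is not how Neo4j executes Cypher internally;
the data layout, the function name, and the sample users other
than Paul are assumptions made for illustration.

```python
# Nodes keyed by an id, each with a label and a name property.
nodes = {
    1: {"label": "User", "name": "Paul"},
    2: {"label": "User", "name": "Anna"},  # hypothetical user
    3: {"label": "User", "name": "Ravi"},  # hypothetical user
}

# Directed relationships as (start_id, type, end_id).
relationships = [
    (1, "knows", 2),  # Paul knows Anna
    (3, "knows", 1),  # Ravi knows Paul (wrong direction for our query)
]


def names_known_by(name):
    """Mimic: (p1:User {name})-[:knows]->(p2:User), returning p2.name."""
    results = []
    for start_id, rel_type, end_id in relationships:
        p1, p2 = nodes[start_id], nodes[end_id]
        if (
            p1["label"] == "User"
            and p1["name"] == name
            and rel_type == "knows"
            and p2["label"] == "User"
        ):
            results.append(p2["name"])
    return results


known_by_paul = names_known_by("Paul")
print(known_by_paul)
```

Only the relationship that starts at Paul matches, which mirrors
the role of the arrow direction in the Cypher pattern.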