INTRODUCTION TO DATA SCIENCE

(R23 – II Year I Sem)

UNIT IV
Tools and Applications of Data Science:
Introducing Neo4j for dealing with graph databases, graph query
language Cypher, Applications of graph databases, Python libraries
like nltk and SQLite for handling Text mining and analytics, case
study on classifying Reddit posts
Introducing Neo4j: a graph database
Connected data is generally stored in graph
databases. These databases are specifically
designed to cope with the structure of connected
data. The landscape of available graph databases is
rather diverse these days.
The three best-known ones, in order of decreasing
popularity, are Neo4j, OrientDB, and Titan.
Let's first look at the concept of connected data and its
representation as graph data.
■ Connected data—As the name indicates, connected
data is characterized by the fact that the data at
hand has a relationship that makes it connected.
■ Graphs—Often referred to in the same sentence as
connected data. Graphs are well suited to represent
the connectivity of data in a meaningful way.
■ Graph databases—This subject merits particular
attention because, besides increasing in size, data is
also becoming more interconnected. Not much effort
is needed to come up with well-known examples of
connected data.
A prominent example of data that takes a network
form is social media data.
Social media allows us to share and exchange data in
networks, thereby generating a great amount of
connected data.
Features of Neo4j
 Neo4j is a graph database that stores the data in a graph
containing nodes and relationships (both are allowed to
contain properties).
 This type of graph database is known as a property graph and
is well suited for storing connected data.
 It has a flexible schema that gives us the freedom to change
our data structure if needed, allowing us to add new data
and new relationships.
 It’s an open source project, mature technology, easy to install,
user-friendly, and well documented.
 Neo4j also has a browser-based interface that facilitates the
creation of graphs for visualization purposes.
 Neo4j can be downloaded from http://neo4j.com/download/.
Four basic structures in Neo4j:
■ Nodes—Represent entities such as documents, users,
recipes, and so on. Certain properties could be assigned
to nodes.
■ Relationships—Exist between the different nodes. They
can be accessed either stand-alone or through the
nodes they’re attached to. Relationships can also
contain properties, hence the name property graph
model. Every relationship has a name and a direction,
which together provide semantic context for the nodes
connected by the relationship.
■ Properties—Both nodes and relationships can have
properties. Properties are defined by key-value pairs.
■ Labels—Can be used to group similar nodes to facilitate
faster traversal through graphs.
 Before conducting an analysis, a good habit is to design
your database carefully so it fits the queries you’d like
to run down the road when performing your analysis.
 Graph databases have the pleasant characteristic that
they’re whiteboard friendly. If one tries to draw the
problem setting on a whiteboard, this drawing will
closely resemble the database design for the defined
problem.
 Now, how do we retrieve the data? To explore our data, we
need to traverse through the graph following
predefined paths to find the patterns we’re searching
for.
 The Neo4j browser is an ideal environment to create
and play around with your connected data until you get
to the right kind of representation for optimal queries.
The flexible schema of the graph database suits us
well here. In this browser you can retrieve your data
in rows or as a graph.
Neo4j has its own query language to ease the
creation and query capabilities of graphs.
Cypher is a highly expressive language that shares
enough with SQL to make it easier to learn.
We can create our own data using Cypher and
insert it into Neo4j. Then we can play around with
the data.
Cypher: a graph query language:
 For a more extensive introduction to Cypher you can
visit http://neo4j.com/docs/stable/cypher-query-lang.html.
 We’ll start by drawing a simple social graph
accompanied by a basic query to retrieve a predefined
pattern as an example.
 Figure 7.8 shows a simple social graph of two nodes,
connected by a relationship of type “knows”. Both nodes
have the properties “name” and “lastname”.
 Now, if we’d like to find out the following pattern, “Who does
Paul know?” we’d query this using Cypher.
 To find a pattern in Cypher, we’ll start with a Match clause.
 In this query we’ll start searching at the node User with the
name property “Paul”.
 Note how the node is enclosed within parentheses, as shown
in the code snippet below, and the relationship is enclosed by
square brackets.
 Relationships are named with a colon (:) prefix, and the
direction is described using arrows. The placeholder p2 will
contain all the User nodes having the relationship of type
“knows” as an inbound relationship.
 With the return clause we can retrieve the results of the
query.
Match(p1:User { name: 'Paul' } )-[:knows]->(p2:User)
Return p2.name
 Notice the close relationship between how we
formulated our question verbally and the way the
graph database translates this into a traversal.
 In Neo4j, this impressive expressiveness is made
possible by its graph query language, Cypher.
 To make the examples more interesting, let's assume
that our data is represented by the graph in figure 7.9.
 We can insert the connected data in figure 7.9 into
Neo4j by using Cypher. We can write Cypher commands
directly in the browser-based interface of Neo4j, or
alternatively through a Python driver (see
http://neo4j.com/developer/python/ for an overview).
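As an illustration (a minimal sketch, not the original slides' listing), the snippet below shows how such Cypher commands could be sent from Python with the official neo4j driver. The connection URI, the credentials, and the small sample of users and countries are assumptions made for this example.

# Hypothetical sketch: inserting and querying a tiny piece of connected data
# through the neo4j Python driver. URI, credentials, and data are assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

create_statement = """
CREATE (annelies:User { name: 'Annelies' }),
       (paul:User { name: 'Paul' }),
       (france:Country { name: 'France' }),
       (annelies)-[:knows]->(paul),
       (annelies)-[:Has_been_in]->(france)
"""

with driver.session() as session:
    session.run(create_statement)  # one create statement builds the whole sample graph
    result = session.run(
        "MATCH (u:User { name: 'Annelies' })-[:Has_been_in]->(c:Country) "
        "RETURN u.name AS user, c.name AS country"
    )
    for record in result:
        print(record["user"], "has been in", record["country"])

driver.close()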
To write an appropriate create statement in Cypher,
first we should have a good understanding of which
data we’d like to store as nodes and which as
relationships, what their properties should be, and
whether labels would be useful.
The first step is to decide which data should be
regarded as nodes and which as relationships, so that
the relationships provide a semantic context for the nodes.
In the following listing we demonstrate how the
different objects could be encoded in Cypher
through one big create statement.
Be aware that Cypher is case sensitive.
Running this create statement in one go has the
advantage that a successful execution assures us
that the graph database has been created correctly.
If an error exists, the graph won't be created.
In a real scenario, one should also define indexes
and constraints to ensure a fast lookup and not
search the entire database.
Now that we've created our data, we can query it.
The following query, for example, will return all
nodes and relationships in the database:
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN n, r
We can ask many questions here. For example:
■ Question 1: Which countries has Annelies visited?
The Cypher code to create the answer is
Match (u:User { name: 'Annelies' })-[:Has_been_in]->(c:Country)
Return u.name, c.name
■ Question 2: Who has been where? The Cypher code
is
Match ()-[r:Has_been_in]->()
Return r LIMIT 25
 The following query demonstrates how to delete all
nodes and relationships in the database:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n, r
Applications of graph databases:
 A social graph, for example, can be used to find clusters
of tightly connected nodes inside the graph
communities. People in a cluster who don’t know each
other can then be introduced to each other.
 One of the most popular use cases for graph databases
is the development of recommender engines, e.g., a
recipe recommendation engine that recommends recipes
based on the dish preferences of users and a network
of ingredients.
Text mining in the real world
In your day-to-day life you’ve already come across
text mining and natural language applications.
Autocomplete and spelling correctors are constantly
analyzing the text you type before sending an email
or text message.
Google uses many types of text mining when
presenting you with the results of a query.
Besides shielding its Gmail users from spam, Google also
divides emails into different categories such as
social, updates, and forums.
Text mining has many applications, including, but not
limited to, the following:
■ Entity identification
■ Plagiarism detection
■ Topic identification
■ Text clustering
■ Translation
■ Automatic text summarization
■ Fraud detection
■ Spam filtering
■ Sentiment analysis
Text mining is useful, but is it difficult?
Sorry to disappoint: Yes, it is.
Text mining techniques
The first important concept in text mining is the “bag
of words.”
Bag of words
Bag of words is the simplest way of structuring textual
data: every document is turned into a word vector.
If a certain word is present in a document, its entry in that
document's vector is labeled “True”; words that don't occur
are labeled “False”.
The word vectors of all documents together form the document-
term matrix. The document-term matrix holds a
column for every term and a row for every
document.
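As a tiny illustration of this idea, the sketch below (with two made-up documents) builds such boolean word vectors in plain Python:

# Minimal bag-of-words sketch with two invented documents.
docs = ["data science is fun", "game of thrones is a series"]
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column (dictionary key) per term:
# True if the term occurs in the document, False otherwise.
document_term_matrix = [
    {term: term in doc.split() for term in vocabulary}
    for doc in docs
]
for row in document_term_matrix:
    print(row)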
Before getting to the actual bag of words, many other data
manipulation steps take place (a combined Python sketch follows this list):
■ Tokenization—The text is cut into pieces called “tokens” or
“terms” (the most basic units of information).
We’ll use unigrams: terms consisting of one word. Often,
however, it’s useful to include bigrams (two words per token)
or trigrams (three words per token) to capture extra meaning
and increase the performance of your models.
■ Stop word filtering—Every language comes with words that
have little value in text analytics because they’re used so
often. NLTK comes with a short list of English stop words we
can filter. If the text is tokenized into words, it often makes
sense to rid the word vector of these low-information stop
words.
■ Lowercasing—Some words are capitalized because they appear at
the beginning of a sentence, others because they're proper nouns
or adjectives. We gain no added value from making that distinction
in our term matrix, so all terms are set to lowercase.
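A minimal Python sketch of these three steps with NLTK follows; it assumes the 'punkt' and 'stopwords' resources have been downloaded, and the example sentence is invented.

import nltk
from nltk.corpus import stopwords

# nltk.download('punkt'); nltk.download('stopwords')   # one-time downloads, if missing

text = "The Force will be with you. Always."
tokens = nltk.word_tokenize(text)                       # tokenization into unigrams
tokens = [t.lower() for t in tokens]                    # lowercasing
stop_words = set(stopwords.words('english'))            # NLTK's English stop word list
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # filter stop words and punctuation
print(tokens)   # e.g. ['force', 'always']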
Stemming and lemmatization
Stemming is the process of bringing words back to their root
form; this way you end up with less variance in the data.
This makes sense if words have similar meanings but are written
differently because, for example, one is in its plural form.
Stemming attempts to unify by cutting off parts of the word. For
example “planes” and “plane” both become “plane.”
Another technique, called lemmatization, has this same goal but
does so in a more grammatically sensitive way.
For example, while both stemming and lemmatization would
reduce “cars” to “car,” lemmatization can also bring back
conjugated verbs to their unconjugated forms such as “are” to
“be.”
Which one you use depends on your case, and lemmatization
profits heavily from POS Tagging (Part of Speech Tagging).
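A quick sketch of the difference, using NLTK's Porter stemmer and WordNet lemmatizer (the WordNet resource is assumed to be downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet')   # one-time download for the lemmatizer, if missing
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("planes"))                 # 'plane'  - the plural ending is cut off
print(stemmer.stem("are"))                    # 'are'    - a stemmer can't unconjugate verbs
print(lemmatizer.lemmatize("cars"))           # 'car'
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'     - lemmatization uses grammatical knowledge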
POS Tagging is the process of attributing a grammatical label to
every part of a sentence.
You probably did this manually in school as a language exercise.
Take the sentence “Game of Thrones is a television series.”
If we apply POS Tagging on it we get
({"game":"NN"}, {"of":"IN"}, {"thrones":"NNS"}, {"is":"VBZ"},
{"a":"DT"}, {"television":"NN"}, {"series":"NN"})
NN is a noun, IN is a preposition, NNS is a noun in its plural
form, VBZ is a third-person singular verb, and DT is a determiner.
Table 8.1 has the full list.
POS Tagging is a use case of sentence-tokenization rather than
word-tokenization.
After the POS Tagging is complete you can still proceed to word
tokenization, but a POS Tagger requires whole sentences.
Combining POS Tagging and lemmatization is likely to give
cleaner data than using only a stemmer.
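A minimal sketch with NLTK's built-in tagger (it assumes the 'punkt' and 'averaged_perceptron_tagger' resources are downloaded; the exact tags may differ slightly from the example above):

import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time downloads
sentence = "Game of Thrones is a television series."
tokens = nltk.word_tokenize(sentence)   # the tagger needs a whole, tokenized sentence
print(nltk.pos_tag(tokens))             # list of (token, tag) pairs, e.g. ('is', 'VBZ'), ('series', 'NN')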
Decision tree classifier
The Naïve Bayes classifier is called that because it considers
each input variable to be independent of all the others, which is
naïve, especially in text mining.
Take the simple examples of “data science,” “data analysis,” or
“game of thrones.” If we cut our data in unigrams we get the
following separate variables (if we ignore stemming and such):
“data,” “science,” “analysis,” “game,” “of,” and “thrones.”
Obviously links will be lost.
This can, in turn, be overcome by creating bigrams (data
science, data analysis) and trigrams (game of thrones).
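NLTK can generate such n-grams directly, as in this small sketch:

import nltk

tokens = "game of thrones".split()
print(list(nltk.bigrams(tokens)))     # [('game', 'of'), ('of', 'thrones')]
print(list(nltk.ngrams(tokens, 3)))   # [('game', 'of', 'thrones')]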
The decision tree classifier, however, doesn’t consider the
variables to be independent of one another and actively creates
interaction variables and buckets.
An interaction variable is a variable that combines other
variables.
For instance, “data” and “science” might be good predictors in their
own right, but the two of them co-occurring in the same text
probably has predictive value of its own.
A bucket is somewhat the opposite.
Instead of combining two variables, a variable is split into multiple
new ones. This makes sense for numerical variables. Figure 8.8 shows
what a decision tree might look like and where you can find
interaction and bucketing.
Whereas Naïve Bayes supposes independence of all the input
variables, a decision tree is built upon the assumption of
interdependence. But how does it build this structure?
A decision tree has a few possible criteria it can use to split into
branches and decide which variables are more important (are closer
to the root of the tree) than others.
The one we’ll use in the NLTK decision tree classifier is “information
gain.”
To understand information gain, we first need to look at entropy.
Entropy is a measure of unpredictability or chaos.
A simple example would be the gender of a baby. When a
woman is pregnant, the gender of the fetus can be male or
female, but we don’t know which one it is. If you were to guess,
you have a 50% chance to guess correctly (give or take,
because gender distribution isn’t 100% uniform).
However, during the pregnancy you have the opportunity to do
an ultrasound to determine the gender of the fetus. An
ultrasound is never 100% conclusive, but the farther along in
fetal development, the more accurate it becomes.
This accuracy gain, or information gain, is there because
uncertainty or entropy drops. Let’s say an ultrasound at 12
weeks pregnancy has a 90% accuracy in determining the
gender of the baby. A 10% uncertainty still exists, but the
ultrasound did reduce the uncertainty from 50% to 10%. That’s
a pretty good discriminator. A decision tree follows this same
principle when deciding which variable to split on first.
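In formula form, entropy is H = -Σ p·log2(p) over the possible outcomes. The small sketch below simply plugs in the 50/50 and 90/10 situations from the ultrasound example:

from math import log2

def entropy(probabilities):
    # Entropy in bits: a measure of unpredictability.
    return -sum(p * log2(p) for p in probabilities if p > 0)

before = entropy([0.5, 0.5])   # no test yet: 1.0 bit of uncertainty
after = entropy([0.9, 0.1])    # after the 12-week ultrasound: about 0.47 bits
print(before, after, "information gain:", before - after)   # gain of roughly 0.53 bits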
If another gender test has more predictive power, it could
become the root of the tree with the ultrasound test being in the
branches, and this can go on until we run out of variables or
observations.
We can run out of observations, because at every branch split
we also split the input data.
This is a big weakness of the decision tree, because at the leaf
level of the tree robustness breaks down if too few observations
are left; the decision tree starts to overfit the data.
Overfitting causes the model to mistake randomness for real
correlations. To counteract this, a decision tree is pruned: its
meaningless branches are left out of the final model.
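To make this concrete, here is a minimal, hypothetical sketch of NLTK's decision tree classifier trained on tiny bag-of-words feature sets; the training data is invented and far too small for a real model.

import nltk

# Invented toy data: bag-of-words feature dictionaries with a label per document.
train = [
    ({'data': True, 'science': True}, 'datascience'),
    ({'data': True, 'analysis': True}, 'datascience'),
    ({'game': True, 'of': True, 'thrones': True}, 'gameofthrones'),
    ({'thrones': True, 'series': True}, 'gameofthrones'),
]

classifier = nltk.DecisionTreeClassifier.train(train)          # splits on the most informative features
print(classifier.classify({'data': True, 'science': True}))    # expected: 'datascience'
print(classifier.pseudocode())   # the learned tree printed as nested if/else rules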
