INTRODUCTION TO DATA SCIENCE

(R23 – II Year I Sem)

UNIT IV
Tools and Applications of Data Science:
Introducing Neo4j for dealing with graph databases, graph query
language Cypher, Applications of graph databases, Python libraries
like nltk and SQLite for handling Text mining and analytics, case
study on classifying Reddit posts
Introducing Neo4j: a graph database
Connected data is generally stored in graph
databases. These databases are specifically
designed to cope with the structure of connected
data. The landscape of available graph databases is
rather diverse these days.
The three best-known ones, in order of decreasing
popularity, are Neo4j, OrientDB, and Titan.
Let's first look at the concept of connected data and its
representation as graph data.
■ Connected data—As the name indicates, connected
data is characterized by the fact that the data at
hand has a relationship that makes it connected.
■ Graphs—Often referred to in the same sentence as
connected data. Graphs are well suited to represent
the connectivity of data in a meaningful way.
■ Graph databases—This subject merits particular
attention because, besides increasing in size, data is
also becoming more interconnected. Not much effort
is needed to come up with well-known examples of
connected data.
A prominent example of data that takes a network
form is social media data.
Social media allows us to share and exchange data in
networks, thereby generating a great amount of
connected data.
Features of Neo4j
 Neo4j is a graph database that stores the data in a graph
containing nodes and relationships (both are allowed to
contain properties).
 This type of graph database is known as a property graph and
is well suited for storing connected data.
 It has a flexible schema that gives us the freedom to change
our data structure if needed, allowing us to add new data
and new relationships.
 It’s an open source project, mature technology, easy to install,
user-friendly, and well documented.
 Neo4j also has a browser-based interface that facilitates the
creation of graphs for visualization purposes.
 Neo4j can be downloaded from http://neo4j.com/download/.
Four basic structures in Neo4j:
■ Nodes—Represent entities such as documents, users,
recipes, and so on. Certain properties could be assigned
to nodes.
■ Relationships—Exist between the different nodes. They
can be accessed either stand-alone or through the
nodes they’re attached to. Relationships can also
contain properties, hence the name property graph
model. Every relationship has a name and a direction,
which together provide semantic context for the nodes
connected by the relationship.
■ Properties—Both nodes and relationships can have
properties. Properties are defined by key-value pairs.
■ Labels—Can be used to group similar nodes to facilitate
faster traversal through graphs.
 Before conducting an analysis, a good habit is to design
your database carefully so it fits the queries you’d like
to run down the road when performing your analysis.
 Graph databases have the pleasant characteristic that
they’re whiteboard friendly. If one tries to draw the
problem setting on a whiteboard, this drawing will
closely resemble the database design for the defined
problem.
 Now, how do we retrieve the data? To explore our data, we
need to traverse through the graph following
predefined paths to find the patterns we’re searching
for.
 The Neo4j browser is an ideal environment to create
and play around with your connected data until you get
to the right kind of representation for optimal queries.
The flexible schema of the graph database suits us
well here. In this browser you can retrieve your data
in rows or as a graph.
Neo4j has its own query language to ease the
creation and query capabilities of graphs.
Cypher is a highly expressive language that shares
enough with SQL to make it easier to learn.
We can create our own data using Cypher and
insert it into Neo4j. Then we can play around with
the data.
Cypher: a graph query language:
 For a more extensive introduction to Cypher you can
visit http://neo4j.com/docs/stable/cypher-query-lang.html.
 We’ll start by drawing a simple social graph
accompanied by a basic query to retrieve a predefined
pattern as an example.
 Figure 7.8 shows a simple social graph of two nodes,
connected by a relationship of type “knows”. Both nodes
have the properties “name” and “lastname”.
 Now, if we’d like to find out the following pattern, “Who does
Paul know?” we’d query this using Cypher.
 To find a pattern in Cypher, we’ll start with a Match clause.
 In this query we’ll start searching at the node User with the
name property “Paul”.
 Note how the node is enclosed within parentheses, as shown
in the code snippet below, and the relationship is enclosed by
square brackets.
 Relationships are named with a colon (:) prefix, and the
direction is described using arrows. The placeholder p2 will
contain all the User nodes having the relationship of type
“knows” as an inbound relationship.
 With the return clause we can retrieve the results of the
query.
Match(p1:User { name: 'Paul' } )-[:knows]->(p2:User)
Return p2.name
 Notice the close relationship between how we
formulated our question verbally and the way the
graph database translates this into a traversal.
 In Neo4j, this impressive expressiveness is made
possible by its graph query language, Cypher.
 To make the examples more interesting, let's assume
that our data is represented by the graph in figure 7.9.
 We can insert the connected data in figure 7.9 into
Neo4j by using Cypher. We can write Cypher commands
directly in the browser-based interface of Neo4j, or
alternatively through a Python driver (see
http://neo4j.com/developer/python/ for an overview).
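As an illustration (a minimal sketch, not the original slides' listing), the snippet below shows how such Cypher commands could be sent from Python with the official neo4j driver. The connection URI, the credentials, and the small sample of users and countries are assumptions made for this example.

# Hypothetical sketch: inserting and querying a tiny piece of connected data
# through the neo4j Python driver. URI, credentials, and data are assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

create_statement = """
CREATE (annelies:User { name: 'Annelies' }),
       (paul:User { name: 'Paul' }),
       (france:Country { name: 'France' }),
       (annelies)-[:knows]->(paul),
       (annelies)-[:Has_been_in]->(france)
"""

with driver.session() as session:
    session.run(create_statement)  # one create statement builds the whole sample graph
    result = session.run(
        "MATCH (u:User { name: 'Annelies' })-[:Has_been_in]->(c:Country) "
        "RETURN u.name AS user, c.name AS country"
    )
    for record in result:
        print(record["user"], "has been in", record["country"])

driver.close()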
To write an appropriate create statement in Cypher,
first we should have a good understanding of which
data we’d like to store as nodes and which as
relationships, what their properties should be, and
whether labels would be useful.
The first step is to decide which data should be
regarded as nodes and which as relationships, so that
the relationships provide a semantic context for the nodes.
In the following listing we demonstrate how the
different objects could be encoded in Cypher
through one big create statement.
Be aware that Cypher is case sensitive.
Running this create statement in one go has the
advantage that a successful execution assures us
that the graph database has been created correctly.
If an error exists, the graph won't be created.
In a real scenario, one should also define indexes
and constraints to ensure a fast lookup and not
search the entire database.
Now that we've created our data, we can query it.
The following query, for example, will return all
nodes and relationships in the database:
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN n, r
We can ask many questions here. For example:
■ Question 1: Which countries has Annelies visited?
The Cypher code to create the answer is
Match (u:User { name: 'Annelies' })-[:Has_been_in]->(c:Country)
Return u.name, c.name
■ Question 2: Who has been where? The Cypher code
is
Match ()-[r:Has_been_in]->()
Return r LIMIT 25
 The following query demonstrates how to delete all
nodes and relationships in the database:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n, r
Applications of graph databases:
 A social graph, for example, can be used to find clusters
of tightly connected nodes inside the graph
communities. People in a cluster who don’t know each
other can then be introduced to each other.
 One of the most popular use cases for graph databases
is the development of recommender engines, e.g., a
recipe recommendation engine that recommends recipes
based on the dish preferences of users and a network
of ingredients.
Text mining in the real world
In your day-to-day life you’ve already come across
text mining and natural language applications.
Autocomplete and spelling correctors are constantly
analyzing the text you type before sending an email
or text message.
Google uses many types of text mining when
presenting you with the results of a query.
Besides shielding its Gmail users from spam, Google also
divides emails into different categories such as
social, updates, and forums.
Text mining has many applications, including, but not
limited to, the following:
■ Entity identification
■ Plagiarism detection
■ Topic identification
■ Text clustering
■ Translation
■ Automatic text summarization
■ Fraud detection
■ Spam filtering
■ Sentiment analysis
Text mining is useful, but is it difficult?
Sorry to disappoint: Yes, it is.
Text mining techniques
The first important concept in text mining is the “bag
of words.”
Bag of words
Bag of words is the simplest way of structuring textual
data: every document is turned into a word vector.
If a certain word is present in a document, its entry in that
document's vector is labeled “True”; words that don't occur
are labeled “False”.
The word vectors of all documents together form the document-
term matrix. The document-term matrix holds a
column for every term and a row for every
document.
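As a tiny illustration of this idea, the sketch below (with two made-up documents) builds such boolean word vectors in plain Python:

# Minimal bag-of-words sketch with two invented documents.
docs = ["data science is fun", "game of thrones is a series"]
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column (dictionary key) per term:
# True if the term occurs in the document, False otherwise.
document_term_matrix = [
    {term: term in doc.split() for term in vocabulary}
    for doc in docs
]
for row in document_term_matrix:
    print(row)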
Before getting to the actual bag of words, many other data
manipulation steps take place (a combined Python sketch follows this list):
■ Tokenization—The text is cut into pieces called “tokens” or
“terms” (the most basic units of information).
We’ll use unigrams: terms consisting of one word. Often,
however, it’s useful to include bigrams (two words per token)
or trigrams (three words per token) to capture extra meaning
and increase the performance of your models.
■ Stop word filtering—Every language comes with words that
have little value in text analytics because they’re used so
often. NLTK comes with a short list of English stop words we
can filter. If the text is tokenized into words, it often makes
sense to rid the word vector of these low-information stop
words.
■ Lowercasing—Some words are capitalized because they appear at
the beginning of a sentence, others because they're proper nouns
or adjectives. We gain no added value from making that distinction
in our term matrix, so all terms are set to lowercase.
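A minimal Python sketch of these three steps with NLTK follows; it assumes the 'punkt' and 'stopwords' resources have been downloaded, and the example sentence is invented.

import nltk
from nltk.corpus import stopwords

# nltk.download('punkt'); nltk.download('stopwords')   # one-time downloads, if missing

text = "The Force will be with you. Always."
tokens = nltk.word_tokenize(text)                       # tokenization into unigrams
tokens = [t.lower() for t in tokens]                    # lowercasing
stop_words = set(stopwords.words('english'))            # NLTK's English stop word list
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # filter stop words and punctuation
print(tokens)   # e.g. ['force', 'always']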
Stemming and lemmatization
Stemming is the process of bringing words back to their root
form; this way you end up with less variance in the data.
This makes sense if words have similar meanings but are written
differently because, for example, one is in its plural form.
Stemming attempts to unify by cutting off parts of the word. For
example “planes” and “plane” both become “plane.”
Another technique, called lemmatization, has this same goal but
does so in a more grammatically sensitive way.
For example, while both stemming and lemmatization would
reduce “cars” to “car,” lemmatization can also bring back
conjugated verbs to their unconjugated forms such as “are” to
“be.”
Which one you use depends on your case, and lemmatization
profits heavily from POS Tagging (Part of Speech Tagging).
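A quick sketch of the difference, using NLTK's Porter stemmer and WordNet lemmatizer (the WordNet resource is assumed to be downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download('wordnet')   # one-time download for the lemmatizer, if missing
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("planes"))                 # 'plane'  - the plural ending is cut off
print(stemmer.stem("are"))                    # 'are'    - a stemmer can't unconjugate verbs
print(lemmatizer.lemmatize("cars"))           # 'car'
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'     - lemmatization uses grammatical knowledge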
POS Tagging is the process of attributing a grammatical label to
every part of a sentence.
You probably did this manually in school as a language exercise.
Take the sentence “Game of Thrones is a television series.”
If we apply POS Tagging on it we get
({"game":"NN"}, {"of":"IN"}, {"thrones":"NNS"}, {"is":"VBZ"},
{"a":"DT"}, {"television":"NN"}, {"series":"NN"})
NN is a noun, IN is a preposition, NNS is a noun in its plural
form, VBZ is a third-person singular verb, and DT is a determiner.
Table 8.1 has the full list.
POS Tagging is a use case of sentence-tokenization rather than
word-tokenization.
After the POS Tagging is complete you can still proceed to word
tokenization, but a POS Tagger requires whole sentences.
Combining POS Tagging and lemmatization is likely to give
cleaner data than using only a stemmer.
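A minimal sketch with NLTK's built-in tagger (it assumes the 'punkt' and 'averaged_perceptron_tagger' resources are downloaded; the exact tags may differ slightly from the example above):

import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time downloads
sentence = "Game of Thrones is a television series."
tokens = nltk.word_tokenize(sentence)   # the tagger needs a whole, tokenized sentence
print(nltk.pos_tag(tokens))             # list of (token, tag) pairs, e.g. ('is', 'VBZ'), ('series', 'NN')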
Decision tree classifier
The Naïve Bayes classifier is called that because it considers
each input variable to be independent of all the others, which is
naïve, especially in text mining.
Take the simple examples of “data science,” “data analysis,” or
“game of thrones.” If we cut our data in unigrams we get the
following separate variables (if we ignore stemming and such):
“data,” “science,” “analysis,” “game,” “of,” and “thrones.”
Obviously links will be lost.
This can, in turn, be overcome by creating bigrams (data
science, data analysis) and trigrams (game of thrones).
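NLTK can generate such n-grams directly, as in this small sketch:

import nltk

tokens = "game of thrones".split()
print(list(nltk.bigrams(tokens)))     # [('game', 'of'), ('of', 'thrones')]
print(list(nltk.ngrams(tokens, 3)))   # [('game', 'of', 'thrones')]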
The decision tree classifier, however, doesn’t consider the
variables to be independent of one another and actively creates
interaction variables and buckets.
An interaction variable is a variable that combines other
variables.
For instance, “data” and “science” might be good predictors in their
own right, but the two of them co-occurring in the same text
probably has predictive value of its own.
A bucket is somewhat the opposite.
Instead of combining two variables, a variable is split into multiple
new ones. This makes sense for numerical variables. Figure 8.8 shows
what a decision tree might look like and where you can find
interaction and bucketing.
Whereas Naïve Bayes supposes independence of all the input
variables, a decision tree is built upon the assumption of
interdependence. But how does it build this structure?
A decision tree has a few possible criteria it can use to split into
branches and decide which variables are more important (are closer
to the root of the tree) than others.
The one we’ll use in the NLTK decision tree classifier is “information
gain.”
To understand information gain, we first need to look at entropy.
Entropy is a measure of unpredictability or chaos.
A simple example would be the gender of a baby. When a
woman is pregnant, the gender of the fetus can be male or
female, but we don’t know which one it is. If you were to guess,
you have a 50% chance to guess correctly (give or take,
because gender distribution isn’t 100% uniform).
However, during the pregnancy you have the opportunity to do
an ultrasound to determine the gender of the fetus. An
ultrasound is never 100% conclusive, but the farther along in
fetal development, the more accurate it becomes.
This accuracy gain, or information gain, is there because
uncertainty or entropy drops. Let’s say an ultrasound at 12
weeks pregnancy has a 90% accuracy in determining the
gender of the baby. A 10% uncertainty still exists, but the
ultrasound did reduce the uncertainty from 50% to 10%. That’s
a pretty good discriminator. A decision tree follows this same
principle when deciding which variable to split on first.
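In formula form, entropy is H = -Σ p·log2(p) over the possible outcomes. The small sketch below simply plugs in the 50/50 and 90/10 situations from the ultrasound example:

from math import log2

def entropy(probabilities):
    # Entropy in bits: a measure of unpredictability.
    return -sum(p * log2(p) for p in probabilities if p > 0)

before = entropy([0.5, 0.5])   # no test yet: 1.0 bit of uncertainty
after = entropy([0.9, 0.1])    # after the 12-week ultrasound: about 0.47 bits
print(before, after, "information gain:", before - after)   # gain of roughly 0.53 bits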
If another gender test has more predictive power, it could
become the root of the tree with the ultrasound test being in the
branches, and this can go on until we run out of variables or
observations.
We can run out of observations, because at every branch split
we also split the input data.
This is a big weakness of the decision tree, because at the leaf
level of the tree robustness breaks down if too few observations
are left; the decision tree starts to overfit the data.
Overfitting causes the model to mistake randomness for real
correlations. To counteract this, a decision tree is pruned: its
meaningless branches are left out of the final model.
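To make this concrete, here is a minimal, hypothetical sketch of NLTK's decision tree classifier trained on tiny bag-of-words feature sets; the training data is invented and far too small for a real model.

import nltk

# Invented toy data: bag-of-words feature dictionaries with a label per document.
train = [
    ({'data': True, 'science': True}, 'datascience'),
    ({'data': True, 'analysis': True}, 'datascience'),
    ({'game': True, 'of': True, 'thrones': True}, 'gameofthrones'),
    ({'thrones': True, 'series': True}, 'gameofthrones'),
]

classifier = nltk.DecisionTreeClassifier.train(train)          # splits on the most informative features
print(classifier.classify({'data': True, 'science': True}))    # expected: 'datascience'
print(classifier.pseudocode())   # the learned tree printed as nested if/else rules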
