Text Analytics
with Python
A Practical Real-World Approach to
Gaining Actionable Insights from
Your Data
—
Dipanjan Sarkar
Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable
Insights from Your Data
Dipanjan Sarkar
Bangalore, Karnataka
India
ISBN-13 (pbk): 978-1-4842-2387-1 ISBN-13 (electronic): 978-1-4842-2388-8
DOI 10.1007/978-1-4842-2388-8
Library of Congress Control Number: 2016960760
Copyright © 2016 by Dipanjan Sarkar
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical
way, and transmission or information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark
symbol with every occurrence of a trademarked name, logo, or image we use the names, logos,
and images only in an editorial fashion and to the benefit of the trademark owner, with no
intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even
if they are not identified as such, is not to be taken as an expression of opinion as to whether or
not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the
date of publication, neither the authors nor the editors nor the publisher can accept any legal
responsibility for any errors or omissions that may be made. The publisher makes no warranty,
express or implied, with respect to the material contained herein.
Managing Director: Welmoed Spahr
Lead Editor: Mr. Sarkar
Technical Reviewer: Shanky Sharma
Editorial Board: Steve Anglin, Pramila Balan, Laura Berendson, Aaron Black,
Louise Corrigan, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John,
Nikhil Karkal, James Markham, Susan McDermott, Matthew Moodie, Natalie Pao,
Gwenan Spearing
Coordinating Editor: Sanchita Mandal
Copy Editor: Corbin Collins
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York,
233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505,
e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a
California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc
(SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected], or visit www.apress.com.
Apress and friends of ED books may be purchased in bulk for academic, corporate,
or promotional use. eBook versions and licenses are also available for most titles.
For more information, reference our Special Bulk Sales–eBook Licensing web page at
www.apress.com/bulk-sales.
Any source code or other supplementary materials referenced by the author in this text are
available to readers at www.apress.com. For detailed information about how to locate your
book’s source code, go to www.apress.com/source-code/. Readers can also access source code
at SpringerLink in the Supplementary Material section for each chapter.
Printed on acid-free paper
This book is dedicated to my parents, partner, well-wishers,
and especially to all the developers, practitioners, and
organizations who have created a wonderful and thriving
ecosystem around analytics and data science.
Contents at a Glance
Chapter 1: Natural Language Basics
Chapter 2: Python Refresher
Chapter 3: Processing and Understanding Text
Chapter 4: Text Classification
Chapter 5: Text Summarization
Chapter 6: Text Similarity and Clustering
Chapter 7: Semantic and Sentiment Analysis
Index
Contents
Chapter 1: Natural Language Basics
  Natural Language
    What Is Natural Language?
    The Philosophy of Language
    Language Acquisition and Usage
  Linguistics
  Language Syntax and Structure
    Words
    Phrases
    Clauses
    Grammar
    Word Order Typology
  Language Semantics
    Lexical Semantic Relations
    Semantic Networks and Models
    Representation of Semantics
  Text Corpora
    Corpora Annotation and Utilities
    Popular Corpora
    Accessing Text Corpora
  Text Analytics
  Summary
Chapter 2: Python Refresher
  Getting to Know Python
    The Zen of Python
    Applications: When Should You Use Python?
    Drawbacks: When Should You Not Use Python?
    Python Implementations and Versions
  Installation and Setup
    Which Python Version?
    Which Operating System?
    Integrated Development Environments
    Environment Setup
    Virtual Environments
  Functional Programming
    Functions
    Recursive Functions
    Anonymous Functions
    Iterators
    Comprehensions
    Generators
    The itertools and functools Modules
  Classes
  Working with Text
    String Literals
    String Operations and Methods
Chapter 3: Processing and Understanding Text
  Text Tokenization
    Sentence Tokenization
    Word Tokenization
  Summary
Chapter 4: Text Classification
  What Is Text Classification?
  Automated Text Classification
  Text Classification Blueprint
  Text Normalization
  Feature Extraction
Chapter 5: Text Summarization
  Summary
Chapter 6: Text Similarity and Clustering
  Important Concepts
    Information Retrieval (IR)
    Feature Engineering
    Similarity Measures
    Unsupervised Machine Learning Algorithms
  Summary
Chapter 7: Semantic and Sentiment Analysis
  Semantic Analysis
  Exploring WordNet
    Understanding Synsets
    Analyzing Lexical Semantic Relations
Index
About the Author
Dipanjan Sarkar has been an analytics practitioner for over four years, specializing in statistical,
predictive, and text analytics. He has also authored a couple of books on R and machine
learning, reviews technical books, and acts as a course beta tester for Coursera.
Dipanjan’s interests include learning about new technology, financial markets, disruptive
startups, data science, and more recently, artificial intelligence and deep learning. In his
spare time he loves reading, gaming, and watching popular sitcoms and football.
About the Technical Reviewer
Shanky Sharma currently leads the AI team at Nextremer India. His work
entails implementing various AI and machine learning–related projects and working on
deep learning for speech recognition in Indic languages. He hopes to grow and scale new
horizons in AI and machine learning technologies. Statistics intrigue him and he loves
playing with numbers, designing algorithms, and giving solutions to people. He sees
himself as a solution provider rather than a scripter or another IT nerd who codes. He
loves heavy metal and trekking and giving back to society, which, he believes, is the task
of every engineer. He also loves teaching and helping people. He is a firm believer that we
learn more by helping others learn.
Acknowledgments
This book would definitely not be a reality without the help and support from some
excellent people in my life. I would like to thank my parents, Digbijoy and Sampa,
my partner Durba, and my family and well-wishers for their constant support and
encouragement, which really motivates me and helps me strive to achieve more.
This book is based on various experiences and lessons learned over time. For that I
would like to thank my managers, Nagendra Venkatesh and Sanjeev Reddy, for believing
in me and giving me an excellent opportunity to tackle challenging problems and also
grow personally. For the wealth of knowledge I gained in text analytics in my early days,
I would like to acknowledge Dr. Mandar Mutalikdesai and Dr. Sanket Patil for not only
being good managers but excellent mentors.
A special mention goes out to my colleagues Roopak Prajapat and Sailaja
Parthasarathy for collaborating with me on various problems in text analytics. Thanks to
Tamoghna Ghosh for being a great mentor and friend who keeps teaching me something
new every day, and to my team, Raghav Bali, Tushar Sharma, Nitin Panwar, Ishan
Khurana, Ganesh Ghongane, and Karishma Chug, for making tough problems look easier
and more fun.
A lot of the content in this book would not have been possible without Christine Doig
Cardet, Brandon Rose, and all the awesome people behind Python, Continuum Analytics,
NLTK, gensim, pattern, spaCy, scikit-learn, and many more excellent open source
frameworks and libraries out there that make our lives easier. Also to my friend Jyotiska,
thank you for introducing me to Python and for learning and collaborating with me on
various occasions that have helped me become what I am today.
Last, but never least, a big thank you to the entire team at Apress, especially
to Celestin Suresh John, Sanchita Mandal, and Laura Berendson for giving me this
wonderful opportunity to share my experience and what I’ve learned with the community
and for guiding me and working tirelessly behind the scenes to make great things happen!
Introduction
I have been into mathematics and statistics since high school, when numbers began to
really interest me. Analytics, data science, and more recently text analytics came much
later, perhaps around four or five years ago when the hype about Big Data and Analytics
was getting bigger and crazier. Personally I think a lot of it is over-hyped, but a lot of it is
also exciting and presents huge possibilities with regard to new jobs, new discoveries, and
solving problems that were previously deemed impossible to solve.
Natural Language Processing (NLP) has always caught my eye because the human
brain and our cognitive abilities are really fascinating. The ability to communicate
information, complex thoughts, and emotions with so little effort is staggering once
you think about trying to replicate that ability in machines. Of course, we are advancing
by leaps and bounds with regard to cognitive computing and artificial intelligence (AI),
but we are not there yet. Passing the Turing Test is perhaps not enough; can a machine
truly replicate a human in all aspects?
The ability to extract useful information and actionable insights from heaps of
unstructured and raw textual data is in great demand today with regard to applications in
NLP and text analytics. In my journey so far, I have struggled with various problems, faced
many challenges, and learned various lessons over time. This book contains a major
chunk of the knowledge I’ve gained in the world of text analytics, where building a fancy
word cloud from a bunch of text documents is not enough anymore.
Perhaps the biggest problem with regard to learning text analytics is not a lack of
information but too much information, often called information overload. There are
so many resources, documentation, papers, books, and journals containing so much
theoretical material, concepts, techniques, and algorithms that they often overwhelm
someone new to the field. What is the right technique to solve a problem? How does
text summarization really work? Which are the best frameworks to solve multi-class text
categorization? By combining mathematical and theoretical concepts with practical
implementations of real-world use-cases using Python, this book tries to address this
problem and help readers avoid the pressing issues I’ve faced in my journey so far.
This book follows a comprehensive and structured approach. First it tackles the
basics of natural language understanding and Python constructs in the initial chapters.
Once you’re familiar with the basics, it addresses interesting problems in text analytics
in each of the remaining chapters, including text classification, clustering, similarity
analysis, text summarization, and topic models. In this book we will also analyze text
structure, semantics, sentiment, and opinions. For each topic, I cover the basic concepts
and use some real-world scenarios and data to implement techniques covering each
concept. The idea of this book is to give you a flavor of the vast landscape of text analytics
and NLP and arm you with the necessary tools, techniques, and knowledge to tackle your
own problems and start solving them. I hope you find this book helpful and wish you the
very best in your journey through the world of text analytics!
CHAPTER 1
Natural Language Basics
We have ushered in the age of Big Data where organizations and businesses are having
difficulty managing all the data generated by various systems, processes, and transactions.
However, the term Big Data is often misused because its popular definition, “the 3 V’s”
(volume, variety, and velocity of data), is vague, and it is difficult to quantify exactly
what data counts as “Big.” Some might think a billion records in a database would be
Big Data, but that number seems minute compared to the
petabytes of data being generated by various sensors or even social media. There is a large
volume of unstructured textual data present across all organizations, irrespective of their
domain. Just to take some examples, we have vast amounts of data in the form of tweets,
status updates, comments, hashtags, articles, blogs, wikis, and much more on social
media. Even retail and e-commerce stores generate a lot of textual data from new product
information and metadata with customer reviews and feedback.
The main challenges associated with textual data are twofold. The first challenge
deals with effective storage and management of this data. Usually textual data is
unstructured and does not adhere to any specific predefined data model or schema,
which is usually followed by relational databases. However, based on the data semantics,
you can store it in either SQL-based database management systems (DBMS) like SQL
Server or even NoSQL-based systems like MongoDB. Organizations having enormous
amounts of textual datasets often resort to file-based systems like Hadoop where they
dump all the data in the Hadoop Distributed File System (HDFS) and access it as needed,
which is one of the main principles of a data lake.
The second challenge is with regard to analyzing this data and trying to extract
meaningful patterns and useful insights that would be beneficial to the organization.
Even though we have a large number of machine learning and data analysis techniques
at our disposal, most of them are tuned to work with numerical data. We therefore have
to resort to areas like natural language processing (NLP) and to specialized techniques,
transformations, and algorithms to analyze text data, or more specifically natural
language, which is quite different from programming languages that are easily
understood by machines. Remember that textual data, being highly unstructured, does
not adhere to structured or regular syntax and patterns, so we cannot
directly use mathematical or statistical models to analyze it.
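As a rough illustration of why a transformation step is needed, the following minimal sketch turns raw text into count vectors that numerical techniques can work with. It uses only the Python standard library; the function names and the two sample documents are hypothetical and chosen purely for illustration, and real pipelines typically involve far more careful normalization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # A sorted vocabulary gives every word a fixed column position.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents, used only for illustration.
docs = ["The sky is blue", "The sun is bright and the sky is clear"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a list of word counts over a shared vocabulary. Word order is discarded in this representation, which is one reason more specialized techniques are usually needed for natural language.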
Before we dive into specific techniques and algorithms to analyze textual data, we will be
going over some of the main concepts and theoretical principles associated with the nature
of text data in this chapter. The primary intent here is to get you familiarized with concepts
and domains associated with natural language understanding, processing, and text analytics.
We will be using the Python programming language in this book primarily for accessing and
analyzing text data. The examples in this chapter will be pretty straightforward and fairly easy
to follow. However, you can quickly skim over Chapter 2 in case you want to brush up on
Python before going through this chapter. All the examples are available with this book and
also in my GitHub repository at https://github.com/dipanjanS/text-analytics-with-python,
which includes programs, code snippets, and datasets. This chapter covers concepts
relevant to natural language, linguistics, text data formats, syntax, semantics, and grammars
before moving on to more advanced topics like text corpora, NLP, and text analytics.
Natural Language
Textual data is unstructured data but it usually belongs to a specific language following
specific syntax and semantics. Any piece of text data—a simple word, sentence, or
document—relates back to some natural language most of the time. In this section, we
will be looking at the definition of natural language, the philosophy of language, language
acquisition, and the usage of language.
One of the most popular models is the triangle of reference, which is used to explain
how words convey meaning and ideas in the mind of the receiver and how that meaning
relates back to a real-world entity or fact. The triangle of reference was proposed by
Charles Ogden and Ivor Richards in their book, The Meaning of Meaning, first published
in 1923, and is depicted in Figure 1-1.
The triangle of reference model is also known as the meaning of meaning model,
and I have depicted it in Figure 1-1 with a real example of a person perceiving a couch
that is present in front of him. A symbol is a linguistic symbol,
like a word or an object, that evokes thought in a person’s mind. In this case, the symbol
is the couch, and this evokes thoughts such as what a couch is: a piece of furniture that can
be used for sitting on or lying down and relaxing, something that gives us comfort. These
thoughts are known as a reference, and through this reference the person is able to relate it
to something that exists in the real world, termed a referent. In this case the referent is the
couch which the person perceives to be present in front of him.
The second way to find out relationships between language and reality is known as
the direction of fit, and we will talk about two main directions here. The word-to-world
direction of fit covers instances where the usage of language reflects reality, that is,
where words are used to match or relate to something that is happening or has already
happened in the real world. An example would be the sentence The Eiffel Tower is really
big, which states a fact about reality. The other direction of fit, known as world-to-word,
covers instances where the usage of language can change reality. An example here
would be the sentence I am going to take a swim, where the speaker, I, commits to changing
reality by taking a swim and represents that intention in the sentence being communicated.
Figure 1-2 shows the relationship between both directions of fit.
It is quite clear from the preceding depiction that based on the referent that is
perceived from the real world, a person can form a representation in the form of a symbol
or word and consequently can communicate the same to another person, who forms a
representation of the real world based on the received symbol, thus forming a cycle.
The history of language acquisition dates back centuries. Philosophers and scholars
have tried to reason about and understand the origins of language acquisition and came up
with several theories, such as language being a god-gifted ability that is passed down
from generation to generation. Plato indicated that a form of word-meaning mapping
would have been responsible for language acquisition. Modern theories have been
proposed by various scholars and philosophers, and some of the popular ones, most
notably that of B.F. Skinner, indicated that knowledge, learning, and use of language were
more of a behavioral consequence. Human beings, or to be more specific, children, when
using specific words or symbols of any language, experience language based on certain
stimuli, which get reinforced in their memory thanks to the reactions that repeatedly
follow their usage. This theory is based on operant or instrumental conditioning,
which is a type of conditional learning where the strength of a particular behavior or
action is modified based on its consequences, such as reward or punishment, and these
consequent stimuli help in reinforcing or controlling behavior and learning. An example
would be that children learn that a specific combination of sounds makes up a word
from repeated usage of it by their parents, or by being rewarded with appreciation when
they speak it correctly, or by being corrected when they make a mistake while speaking
it. This repeated conditioning ends up reinforcing the actual meaning and
understanding of the word in a child’s memory for the future. To sum it up, children try to
learn and use language mostly behaviorally, by hearing and imitating adults.
However, this behavioral theory was challenged by renowned linguist Noam
Chomsky, who proclaimed that it would be impossible for children to learn language just
by imitating everything from adults. This hypothesis holds in the following
example. Although words like go and give are valid, children often end up using an
invalid form of the word, like goed or gived instead of went or gave, in the past tense.
It is safe to assume that their parents didn’t utter these words in front of them, so it would be
impossible to pick these up based on Skinner’s theory. Consequently,
Chomsky proposed that children must not only be imitating words they hear but also
extracting patterns, syntax, and rules from the same language constructs, which is
separate from just utilizing generic cognitive abilities based on behavior.
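The goed and gived example can be mimicked with a small sketch. This is only a toy model of rule extraction, not a claim about how children actually learn. The lookup table and function names are hypothetical, and the "-ed" rule is deliberately simplified (it ignores cases such as consonant doubling).

```python
# Hypothetical, tiny lookup of memorized irregular past-tense forms.
IRREGULAR_PAST = {"go": "went", "give": "gave"}


def regular_past(verb):
    """Apply only the general '-ed' rule, ignoring all exceptions."""
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


def past_tense(verb):
    """Prefer a memorized irregular form; otherwise fall back to the rule."""
    return IRREGULAR_PAST.get(verb, regular_past(verb))


for verb in ["walk", "go", "give"]:
    # Print the verb, the rule-only form, and the form with exceptions applied.
    print(verb, regular_past(verb), past_tense(verb))
```

A learner that has extracted the general rule but has not yet memorized the exceptions produces goed and gived, forms it could never have picked up by imitation alone.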
Considering Chomsky’s view, cognitive abilities along with language-specific
knowledge and abilities like syntax, semantics, concepts of parts of speech, and grammar
together form what he termed a language acquisition device, which enables humans to
acquire language. Besides cognitive abilities, what is unique
and important in language learning is the syntax of the language itself, which is
illustrated by his famous sentence Colorless green ideas sleep furiously. However many
times you read the sentence, it does not make sense. Colorless cannot be
associated with green, and neither can ideas be associated with green, nor can they sleep
furiously. However, the sentence is grammatically correct. This is precisely
what Chomsky tried to explain: syntax and grammar carry information that is
independent of the meaning and semantics of words. Hence, he proposed that
learning and identifying language syntax is a human capability separate from
other cognitive abilities. This proposed hypothesis is also known as the autonomy
of syntax. These theories are still widely debated among scholars and linguists, but it is
useful to explore how the human mind tends to acquire and learn language. We will now
look at the typical patterns in which language is generally used.
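To make the autonomy of syntax a little more concrete, here is a minimal sketch that checks a sentence against a single part-of-speech template without looking at meaning at all. The lexicon, the template, and the function name are all assumptions made for illustration; a real grammar is far richer than one pattern.

```python
import re

# Hypothetical hand-written lexicon mapping words to part-of-speech tags.
LEXICON = {
    "colorless": "ADJ",
    "green": "ADJ",
    "ideas": "NOUN",
    "sleep": "VERB",
    "furiously": "ADV",
}

# One assumed sentence template: any adjectives, a noun, a verb, an optional adverb.
TEMPLATE = re.compile(r"(ADJ )*NOUN VERB( ADV)?")


def is_well_formed(sentence):
    """Check only the tag sequence of a sentence; meaning is never consulted."""
    tags = [LEXICON.get(word) for word in sentence.lower().split()]
    if None in tags:
        return False  # unknown word: this toy lexicon cannot judge it
    return TEMPLATE.fullmatch(" ".join(tags)) is not None


print(is_well_formed("Colorless green ideas sleep furiously"))
print(is_well_formed("Furiously sleep ideas green colorless"))
```

The first sentence fits the template even though it is meaningless, while the same words in reverse order do not, which mirrors the point that syntactic well-formedness can be judged separately from semantics.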
Language Usage
The previous section talked about speech acts and how the direction of fit model
is used for relating words and symbols to reality. In this section we will cover some
concepts related to speech acts that highlight different ways in which language is used in
communication.
There are three main categories of speech acts: locutionary, illocutionary, and
perlocutionary acts. Locutionary acts are mainly concerned with the actual delivery
of the sentence when communicated from one human being to another by speaking
it. Illocutionary acts focus further on the actual semantics and significance of the
sentence which was communicated. Perlocutionary acts refer to the actual effect the
communication had on its receiver, which is more psychological or behavioral.
A simple example would be the phrase Get me the book from the table spoken by a
father to his child. The phrase when spoken by the father forms the locutionary act. The
significance of this sentence is a directive, which directs the child to get the book from the
table and forms an illocutionary act. The action the child takes after hearing this, that is, if
he brings the book from the table to his father, forms the perlocutionary act.
The illocutionary act was a directive in this case. According to the philosopher John
Searle, there are a total of five different classes of illocutionary speech acts, as follows:
• Assertives are speech acts that communicate how things are already
existent in the world. They are spoken by the sender when he tries
to assert a proposition that could be true or false in the real world.
These assertions could be statements or declarations. A simple
example would be The Earth revolves round the Sun. These messages
represent the word-to-world direction of fit discussed earlier.
• Directives are speech acts that the sender communicates to the
receiver asking or directing them to do something. This represents
a voluntary act which the receiver might do in the future after
receiving a directive from the sender. Directives can either be
complied with or not complied with, since they are voluntary. These
directives could be simple requests or even orders or commands.
An example directive would be Get me the book from the table,
discussed earlier when we talked about types of speech acts.
• Commissives are speech acts that commit the sender or speaker
who utters them to some future voluntary act or action. Acts like
promises, oaths, pledges, and vows represent commissives, and
the direction of fit could be either way. An example commissive
would be I promise to be there tomorrow for the ceremony.
• Expressives reveal a speaker or sender’s disposition and outlook
toward a particular proposition communicated through the
message. These can be various forms of expression or emotion,
such as congratulatory, sarcastic, and so on. An example
expressive would be Congratulations on graduating top of the class.
Linguistics
We have touched on what natural language means, how language is learned and used,
and the origins of language acquisition. These kinds of things are formally researched
and studied in linguistics by researchers and scholars called linguists. Formally, linguistics
is defined as the scientific study of language, including form and syntax of language,
meaning, and semantics depicted by the usage of language and context of use. The origins
of linguistics can be dated back to the 4th century BCE, when Indian scholar and linguist
Panini formalized the Sanskrit language description. The term linguistics was first used
to denote the scientific study of languages around 1847; before then, the term
philology was generally used for the same purpose. Although a detailed exploration of linguistics is
not needed for text analytics, it is useful to know the different areas of linguistics because
some of them are used extensively in natural language processing and text analytics
algorithms. The main distinctive areas of study under linguistics are as follows:
• Phonetics: This is the study of the acoustic properties of sounds
produced by the human vocal tract during speech. It includes
studying the properties of sounds as well as how they are created
by human beings. The smallest individual unit of human
speech in a specific language is called a phoneme. A more generic
term across languages for this unit of speech is phone.
• Phonology: This is the study of sound patterns as interpreted in
the human mind and used for distinguishing between different
phonemes to find out which ones are significant. The structure,
combination, and interpretations of phonemes are studied in
detail, usually by taking into account a specific language at a
time. The English language consists of around 45 phonemes.
Phonology usually extends beyond just studying phonemes and
includes things like accents, tone, and syllable structures.
• Syntax: This is usually the study of sentences, phrases, words, and
their structures. It includes researching how words are combined
together grammatically to form phrases and sentences. The syntactic
order of words in a phrase or a sentence matters because the
order can change the meaning entirely.
From the collection of words in Figure 1-3, it is very difficult to ascertain what it
might be trying to convey or mean. Indeed, languages are not just comprised of groups of
unstructured words. Sentences with proper syntax not only help us give proper structure
and relate words together but also help them convey meaning based on the order or
position of the words. Considering our previous hierarchy of sentence → clause → phrase
→ word, we can construct the hierarchical sentence tree in Figure 1-4 using shallow
parsing, a technique used for finding the constituents in a sentence.
From the hierarchical tree in Figure 1-4, we get the sentence The brown fox is quick
and he is jumping over the lazy dog. We can see that the leaf nodes of the tree consist of
words, which are the smallest unit here, and combinations of words form phrases, which
in turn form clauses. Clauses are connected together through various filler terms or words
such as conjunctions and form the final sentence. In the next section, we will look at each
of these constituents in further detail and understand how to analyze them and find out
what the major syntactic categories are.
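The hierarchy described above can be sketched as a small nested data structure. This is only an illustration: the node labels and the exact shape of the tree are assumptions standing in for Figure 1-4, and the helper name leaves is hypothetical. Collecting the leaf nodes from left to right gives back the original sentence.

```python
# Hypothetical nested encoding of sentence -> clause -> phrase -> word.
# Each node is (label, children); a plain string is a word (leaf node).
# The tree shape is an assumption, not a copy of Figure 1-4.
sentence_tree = (
    "SENTENCE",
    [
        ("CLAUSE", [
            ("PHRASE", ["The", "brown", "fox"]),
            ("PHRASE", ["is"]),
            ("PHRASE", ["quick"]),
        ]),
        "and",  # conjunction connecting the two clauses
        ("CLAUSE", [
            ("PHRASE", ["he"]),
            ("PHRASE", ["is", "jumping"]),
            ("PHRASE", ["over"]),
            ("PHRASE", ["the", "lazy", "dog"]),
        ]),
    ],
)


def leaves(node):
    """Return the words at the leaf nodes in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


reconstructed = " ".join(leaves(sentence_tree))
print(reconstructed)
```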
Words
Words are the smallest units in a language that are independent and have a meaning of
their own. Although morphemes are the smallest distinctive units, morphemes are not
independent like words, and a word can be comprised of several morphemes. It is useful
to annotate and tag words and analyze them into their parts of speech (POS) to see the
major syntactic categories. Here, we will cover the main categories and significance of the
various POS tags. Later in Chapter 3 we will examine them in further detail and look
at methods of generating POS tags programmatically.
Usually, words can fall into one of the following major categories.
• N(oun): This usually denotes words that depict some object or
entity which may be living or nonliving. Some examples would be
fox, dog, book, and so on. The POS tag symbol for nouns is N.
• V(erb): Verbs are words that are used to describe certain actions,
states, or occurrences. There are a wide variety of further
subcategories, such as auxiliary, reflexive, and transitive verbs (and
many more). Some typical examples of verbs would be running,
jumping, read, and write. The POS tag symbol for verbs is V.
• Adj(ective): Adjectives are words used to describe or qualify other
words, typically nouns and noun phrases. The phrase beautiful
flower has the noun (N) flower which is described or qualified
using the adjective (ADJ) beautiful. The POS tag symbol for
adjectives is ADJ.
• Adv(erb): Adverbs usually act as modifiers for other words
including nouns, adjectives, verbs, or other adverbs. The phrase
very beautiful flower has the adverb (ADV) very, which modifies
the adjective (ADJ) beautiful, indicating the degree to which the
flower is beautiful. The POS tag symbol for adverbs is ADV.
Besides these four major categories of parts of speech, there are other categories
that occur frequently in the English language. These include pronouns, prepositions,
interjections, conjunctions, determiners, and many others. Furthermore, each POS tag
like the noun (N) can be further subdivided into categories like singular nouns (NN),
singular proper nouns (NNP), and plural nouns (NNS). We will be looking at POS tags in
further detail in Chapter 3 when we process and parse textual data and implement POS
taggers to annotate text.
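The relationship between the fine-grained tags and their parent category can be sketched as a simple lookup. This is a minimal illustration, not a tagger: the table covers only the three noun tags named above, and the function name coarse_tag and the example words are assumptions made for this sketch.

```python
# Fine-grained noun tags named in the text, mapped to the coarse category N.
FINE_TO_COARSE = {
    "NN": "N",   # singular noun
    "NNP": "N",  # singular proper noun
    "NNS": "N",  # plural noun
}


def coarse_tag(fine_tag):
    """Return the coarse category for a fine-grained tag.

    Tags that are not in the table are returned unchanged.
    """
    return FINE_TO_COARSE.get(fine_tag, fine_tag)


# Illustrative (word, fine-grained tag) pairs, assigned by hand.
tagged = [("foxes", "NNS"), ("London", "NNP"), ("book", "NN")]
coarse = [(word, coarse_tag(tag)) for word, tag in tagged]
print(coarse)
```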
Considering our previous example sentence (The brown fox is quick and he is
jumping over the lazy dog) where we built the hierarchical syntax tree, if we were to
annotate it using basic POS tags, it would look like Figure 1-5.
In Figure 1-5 you may notice a few unfamiliar tags. The tag DET stands for
determiner, which is used to depict articles like a, an, the, and so on. The tag CONJ
indicates conjunction, which is usually used to bind together clauses to form sentences.
The PRON tag stands for pronoun, which represents words that are used to represent or
take the place of a noun.
The tags N, V, ADJ and ADV are typical open classes and represent words belonging
to an open vocabulary. Open classes are word classes with an open-ended set of words;
they commonly accept new words that people invent and add to the vocabulary.
Words are usually added to open classes through processes like morphological
derivation, invention based on usage, and creating compound lexemes. Some popular
nouns added fairly recently include Internet and multimedia. Closed classes consist of a
closed and finite set of words and do not accept new additions. Pronouns are a closed class.
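The annotation and the open/closed distinction can be sketched with plain Python data. The tags below are assigned by hand as a stand-in for Figure 1-5, so treat them as assumptions; in particular, tagging over as ADV is inferred from the fact that the text names only DET, CONJ, and PRON as additional tags.

```python
from collections import Counter

sentence = "The brown fox is quick and he is jumping over the lazy dog"

# Hand-assigned basic POS tags, one per word (an assumption, not tagger output).
tags = ["DET", "ADJ", "N", "V", "ADJ", "CONJ", "PRON",
        "V", "V", "ADV", "DET", "ADJ", "N"]

tagged_sentence = list(zip(sentence.split(), tags))

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged_sentence)

# N, V, ADJ, and ADV are the open classes; every other tag used here is closed.
OPEN_CLASSES = {"N", "V", "ADJ", "ADV"}
open_words = [w for w, t in tagged_sentence if t in OPEN_CLASSES]
closed_words = [w for w, t in tagged_sentence if t not in OPEN_CLASSES]

print(tagged_sentence)
print(tag_counts)
print(open_words)
print(closed_words)
```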
The following section looks at the next level of the hierarchy: phrases.
Phrases
Words have their own lexical properties like parts of speech, which we saw earlier. Using
these words, we can order them in ways that give meaning to the words such that each
word belongs to a corresponding phrasal category and one of the words is the main or head
word. In the hierarchy tree, groups of words make up phrases, which form the third level
in the syntax tree. By principle, phrases are assumed to have at least two or more words,
considering the pecking order of words ← phrases ← clauses ← sentences. However, a
phrase can be a single word or a combination of words based on the syntax and position
of the phrase in a clause or sentence. For example, the sentence Dessert was good has only
three words, and each of them forms a phrase of its own. The word dessert is a noun as well
as a noun phrase, was is a verb as well as a verb phrase, and good is an adjective
as well as an adjective phrase describing the aforementioned dessert.
There are five major categories of phrases:
• Noun phrase (NP): These are phrases where a noun acts as
the head word. Noun phrases act as a subject or object to a
verb. Usually a noun phrase can be a set of words that can be
replaced by a pronoun without rendering the sentence or clause
syntactically incorrect. Some examples would be dessert, the lazy
dog, and the brown fox.
• Verb phrase (VP): These phrases are lexical units that have a
verb acting as the head word. Usually there are two forms of verb
phrases. One form has the verb components as well as other
entities such as nouns, adjectives, or adverbs as parts of the
object. The verb here is known as a finite verb. It acts as a single
unit in the hierarchy tree and can function as the root in a clause.
This form is prominent in constituency grammars. The other form
is where the finite verb acts as the root of the entire clause and
is prominent in dependency grammars. Another derivation of
this includes verb phrases strictly consisting of verb components
including main, auxiliary, infinitive, and participles. The sentence
He has started the engine can be used to illustrate the two types of
verb phrases that can be formed. They would be has started the
engine and has started, based on the two forms just discussed.
• Adjective phrase (ADJP): These are phrases with an adjective as
the head word. Their main role is to describe or qualify nouns
and pronouns in a sentence, and they will be either placed before
or after the noun or pronoun. The sentence The cat is too quick
has an adjective phrase, too quick, qualifying cat, which is a noun
phrase.
• Adverb phrase (ADVP): These phrases act like adverbs since
the adverb acts as the head word in the phrase. Adverb phrases
are used as modifiers for nouns, verbs, or adverbs themselves
by providing further details that describe or qualify them. In
the sentence The train should be at the station pretty soon, the
adverb phrase pretty soon describes when the train would be
arriving.
• Prepositional phrase (PP): These phrases usually contain a
preposition as the head word and other lexical components like
nouns, pronouns, and so on. It acts like an adjective or adverb
describing other words or phrases. The phrase going up the stairs
contains the prepositional phrase up the stairs, describing the
direction of movement.
These five major syntactic categories of phrases can be generated from words using
several rules, some of which have been discussed, like utilizing syntax and grammars
of different types. We will be exploring some of the popular grammars in a later section.
Shallow parsing is a popular natural language processing technique to extract these
constituents, including POS tags as well as phrases from a sentence. For our sentence The
brown fox is quick and he is jumping over the lazy dog, we have obtained seven phrases
from shallow parsing, as shown in Figure 1-6.
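To make the idea of shallow parsing concrete, here is a minimal rule-based sketch that groups hand-tagged words into flat phrases. It is not the method used to produce Figure 1-6: the function name chunk, the rules, and the tags are assumptions for illustration. In particular, over is tagged P here so that a rule can emit a PP, and the PP chunk holds only the preposition, which is a common convention in flat chunking.

```python
def chunk(tagged):
    """Greedy left-to-right chunker over (word, tag) pairs.

    Rules (assumed for this sketch):
      NP   = optional DET, any ADJ, one or more N; or a single PRON
      VP   = one or more V
      ADJP = one or more ADJ not followed by a noun
      PP   = a single P
      O    = anything else, left outside every phrase
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        tag = tagged[i][1]
        j = i
        label = "O"
        if tag in ("DET", "ADJ", "N"):
            if tagged[j][1] == "DET":
                j += 1
            adj_end = j
            while adj_end < n and tagged[adj_end][1] == "ADJ":
                adj_end += 1
            noun_end = adj_end
            while noun_end < n and tagged[noun_end][1] == "N":
                noun_end += 1
            if noun_end > adj_end:      # a noun was found: noun phrase
                label, j = "NP", noun_end
            elif tag == "ADJ":          # adjectives with no noun after them
                label, j = "ADJP", adj_end
            else:                       # lone determiner stays outside
                j = i
        elif tag == "PRON":
            label = "NP"
        elif tag == "V":
            label = "VP"
            while j < n and tagged[j][1] == "V":
                j += 1
        elif tag == "P":
            label = "PP"
        end = max(j, i + 1)             # always consume at least one token
        chunks.append((label, [w for w, _ in tagged[i:end]]))
        i = end
    return chunks


words = "The brown fox is quick and he is jumping over the lazy dog".split()
tags = ["DET", "ADJ", "N", "V", "ADJ", "CONJ", "PRON",
        "V", "V", "P", "DET", "ADJ", "N"]  # hand-assigned tags

chunks = chunk(list(zip(words, tags)))
phrases = [c for c in chunks if c[0] != "O"]
for label, phrase_words in chunks:
    print(label, " ".join(phrase_words))
```

With these rules the sentence yields seven phrases, and the conjunction and is left outside all of them.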
Language: English
The
CRYSTAL BALL
By
ROY J. SNELL
COPYRIGHT 1936
BY
THE REILLY & LEE CO.
PRINTED IN THE U. S. A.
CONTENTS
CHAPTER
I Midnight Blue Velvet
II “Just Nothing at All”
III Danger Tomorrow
IV The “Tiger Woman”
V Florence Gazes into the Crystal
VI Gypsies That Are Not Gypsies
VII The Bright Shawl
VIII A Vision for Another
IX Jeanne Plans an Adventure
X A Voodoo Priestess
XI Fireside Reflections
XII Jeanne’s Fortune
XIII A Startling Revelation
XIV Fire Destroys All
XV The Interpreter of Dreams
XVI The Secret of Lost Lake
XVII From Out the Past
XVIII D.X.123
XIX One Wild Dream
XX Some Considerable Treasure
XXI Battle Royal
XXII Little Lady in Gray
XXIII Strange Treasure
XXIV Through the Picture
XXV A Visit in the Night
XXVI In Which Some Things Are Well Finished
For all that, she put out a hand to grasp the knob. In a
city office building, ten stories up, one does not knock.
Florence did not so much as allow the yielding door to
make a sound. She turned the knob as one imagines a
robber might turn the dial of a safe—slowly, silently.
Why did she do this? Could she have answered this
question? Probably not! Certainly she was not spying on
the occupants of that room—at least, not yet. Perhaps
that was the way she always opened a door. We all have
our ways of doing things. Some of us seize a door knob,
give it a quick turn, a yank, and there we are. And
some, like Florence, move with the slyness and softness
of a cat. It is their nature.
Well, for what? She did not finish. That was the reason
for her visit, to find out what. She was engaged, these
days, in finding out all manner of curious and fantastic
goings on. Was this to be one of the strangest,
weirdest, most fantastic, or was it, like many another, to
turn out as a simple, flat, uninteresting corner of a sad
little world?
What she saw gave her a start. It was, she thought, like
entering the “Holy of Holies” of Bible times or the
“Forbidden City” of Mongol kings. For there, resting in a
low receptacle at the exact center of a large room, was
a faintly gleaming crystal ball. This ball, which might
have been six inches in diameter with its holder, rested
on a cloth of midnight blue. Before it sat a silent figure.
The whole affair, the midnight blue of the curtains, the
spot of light that was a crystal ball, the girl sitting there
like a statue, all seemed so unreal that Florence found
herself pinching her arm. “No,” she whispered, “it is not
a dream.”
Just then the voice began again to speak. This time the
voice was low. Words were said in a distinct tone and all
just alike. This is what it said:
The effect on the girl was strange. She shook like one
with a chill. She gripped the arms of the black chair
until, in that strange light, her hands appeared
glistening white. Then, seeming to gain control of
herself, she settled back in her place and, at the
command of that slow, monotonous voice, “Keep your
eyes on the crystal,” fell into an attitude of repose. Not,
however, before Florence had noted a strange fact.
“That girl in the glass ball,” she told herself, “is the one
sitting in that black chair.
“But no! How could she be? Besides, the one in the ball
is younger, much younger. This is impossible. And yet,
there are the same eyes, the same hair, the same
profile. It is strange.”
Now again there was a change coming over the crystal
ball. A sudden lighting up of its gray interior announced
the opening of a door in that fanciful house, the letting
in of bright sunshine. The door closed. Gray shadows
reappeared. Into those shadows walked a distinguished
appearing, tall, gray-haired man. At once, into his arms
sprang the fair-haired child. All this appeared to go
forward in that astonishing crystal ball.
Florence did not resist. Before she knew what had
happened she was out in the dark and dusty hallway.
The door she had entered was closed and locked
against her.
“So that’s that!” she said with a forced smile. But was
that that? Was there to be much more? Very much
more? Only time would tell. When one discovers an
enthralling mystery, one does not soon forget. Such a
mystery was contained in that crystal ball.
“That’s one of them!” Florence declared emphatically to
herself. “It surely must be!
“This,” she concluded, “is a case that calls for action. I’ll
see Frances Ward first thing in the morning.
And then, out of the clear October sky that shone over
the park by the lake there in Chicago, their good angel
had appeared.
It was not she who had appeared at once. Far from it.
Instead, when Jeanne went to the park that day she
had found at first only a group of tired and rather
ragged gypsies, who, having parked their rusty cars,
had gathered on the grass to eat a meager lunch.
CHAPTER II
“JUST NOTHING AT ALL”
The one cherished bit from the Old World that adorned
the room was a picture. It was a masterpiece of the
nineteenth century. In that picture the sun shone bright
upon a flock of sheep hurrying for shelter from a storm
that lay black as night against the rugged hills behind.
Trees were bending before a gale, the shepherd’s cloak
was flying, every touch told of the approaching storm.
She broke short off to listen. A stairway led up from the
top of the elevator shaft, one floor below. She did not
recognize the tread of the person coming up the stairs.
She wondered and shuddered. Somehow she felt that
on leaving that room of midnight blue and a crystal ball,
she had been followed. Had she? If so, why? She was
not long in guessing the reason. Twice in the last few
weeks she had whispered a few well-chosen words in
the ears of Patrick Moriarity, a bright young policeman
who was interested in people, just any kind of people.
Patrick had rapped on certain doors and had said his
little say. When next Florence passed that way, there
was a “For Rent” sign on the door, right where Patrick
had rapped.
At once her face sobered. Was it, after all, a laughing
matter, this having your fortune told? For some surely it
was not. She had seen them seated on hard chairs,
waiting. There were lines of sorrow and disappointment
on their faces. They had come to ask the crystal-gazer,
the palmist, the phrenologist, the reader of cards or
stars, to tell their fortune. They wanted terribly to know
when the tide of fortune would turn for them, when
prosperity would come ebbing back again. And she,
Florence, all too often could read in their faces the
answer which came to her like the wash of the waves
on a sandy shore:
“Never—never—never.”
“And yet—” Once again her smile vanished. Was there,
after all, in some of it something real? That crystal ball
now—the one she had seen that very afternoon. She
had been told that visions truly do come to those who
gaze into the crystal ball. Had she not seen visions? And
that fair-haired girl, had she not seen visions as well?
Once again her mood changed. What was it this girl had
wanted to know? She had said, “My long lost father!”
Was her father really lost? Who was her father? She was
dressed like a child of the rich. Was she rich? And was
she in danger?
“She had the money. They told her to leave it with them
for luck. The luck was all wrong. They vanished.”