Regular Expressions
Technologies
• Speech recognition
– Spoken language is recognized and
transformed into text as in
dictation systems, into commands
as in robot control systems, or into
some other internal representation.
• Speech synthesis
– Utterances in spoken language are
produced from text (text-to-speech
systems) or from internal
representations of words or
sentences (concept-to-speech
systems).
Introduction to NLP 2
• Text categorization
– This technology assigns texts to
categories. Texts may belong to
more than one category, and categories
may contain other categories.
Filtering is a special case of
categorization with just two
categories.
• Text Summarization
– The most relevant portions of a
text are extracted as a summary.
The difficulty of the task depends on
the required length of the summary.
Summarization is harder if the
summary has to be specific to a
certain query.
• Text Indexing
– As a precondition for document
retrieval, texts are stored in an indexed
database. Usually a text is indexed for
all word forms or – after lemmatization
– for all lemmas. Sometimes indexing
is combined with categorization and
summarization.
• Text Retrieval
– The texts that best match a given query
or document are retrieved from a database.
The candidate documents are ordered
with respect to their expected relevance.
Indexing, categorization,
summarization and retrieval are often
subsumed under the term information
retrieval.
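The indexing and retrieval steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy documents, the function names (tokenize, build_index, retrieve), and the scoring rule (number of distinct query words a document contains) are all hypothetical, and lemmatization is left out. A real system would use a more careful estimate of expected relevance.

```python
import re
from collections import defaultdict

# Toy document collection (hypothetical data, for illustration only).
documents = {
    "d1": "The woodchuck chucks wood.",
    "d2": "Wood is stored in the shed.",
    "d3": "The dog chases the woodchuck.",
}

def tokenize(text):
    # Index lowercased word forms; lemmatization is omitted in this sketch.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # Inverted index: word form -> set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def retrieve(index, query):
    # Crude relevance score: number of distinct query words found in a document.
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Best-scoring documents first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(documents)
ranking = retrieve(index, "woodchuck wood")
print(ranking)
```

Document d1 contains both query words and is therefore ranked first; d2 and d3 each contain one.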
• Information Extraction
– Relevant pieces of
information are discovered and marked
for extraction. The extracted pieces can
be: the topic, named entities such as
company, place or person names,
simple relations such as prices,
destinations, functions etc. or complex
relations describing accidents, company
mergers or football matches.
• Data Fusion and Text Data Mining
– Extracted pieces of information from
several sources are combined in one
database. Previously undetected
relationships may be discovered.
• Question Answering
– Natural language queries are used
to access information in a
database. The database may be a
base of structured data or a
repository of digital texts in which
certain parts have been marked as
potential answers.
• Report Generation
– A report in natural language is
produced that describes the
essential contents or changes of a
database. The report can contain
accumulated numbers, maxima,
minima and the most drastic
changes.
Regular Expressions
• In computer science, a regular expression (RE) is a language for specifying text
search strings.
• RE search requires
– a pattern that we want to search for, and
– a corpus of texts to search through.
• Regular expressions are case sensitive; lower case /s/ is distinct from
upper case /S/. This means that the pattern /woodchucks/ will not
match the string Woodchucks.
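A minimal sketch of such a search using Python's standard re module: a pattern is searched for in a small corpus of texts. The two example sentences are made up for illustration, and the re.IGNORECASE flag is a Python feature rather than something the slides introduce.

```python
import re

# A tiny "corpus" of texts to search through (hypothetical examples).
corpus = ["Woodchucks are rodents.", "I saw two woodchucks today."]

# The pattern /woodchucks/ is case sensitive by default.
pattern = re.compile(r"woodchucks")
case_sensitive_hits = [text for text in corpus if pattern.search(text)]

# Python can relax case with a flag; this is an option of the re module.
relaxed = re.compile(r"woodchucks", re.IGNORECASE)
case_insensitive_hits = [text for text in corpus if relaxed.search(text)]

print(case_sensitive_hits)
print(case_insensitive_hits)
```

Only the second sentence matches the case-sensitive pattern; both match once case is ignored.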
Basic Regular Expression Patterns
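• The string of characters inside square brackets [ ] specifies a disjunction of
characters to match. For example, /[wW]oodchuck/ matches Woodchuck or woodchuck.
• If the caret ^ is the first symbol after the open square bracket [, the resulting
pattern is negated. For example, /[^a]/ matches any single character except a.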
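The bracket notation can be tried out with Python's re module. The sample strings below are assumptions chosen for illustration; re.findall returns every non-overlapping match in the text.

```python
import re

# Disjunction of characters: either w or W, followed by "oodchuck".
either_case = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# A range inside brackets: any single digit.
digits = re.findall(r"[0-9]", "plenty of 7 to 5")

# A caret right after [ negates the class: any character that is NOT a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", "aBc!")

print(either_case)
print(digits)
print(not_lowercase)
```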
The Kleene star means “zero or more occurrences of the immediately
previous character or regular expression”.
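Sheep Language
The sheep language consists of strings such as:
baa!
baaa!
baaaa!
baaaaa!
It can be written as /baaa*!/ or, using the Kleene plus ("one or more occurrences of
the previous character"), as /baa+!/.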
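As a quick check, the two sheep-language patterns can be compared in Python. The candidate strings and the helper name accepted are hypothetical; fullmatch is used so that the whole string must belong to the language.

```python
import re

sheep_star = re.compile(r"baaa*!")  # b, two a's, then zero or more further a's, then !
sheep_plus = re.compile(r"baa+!")   # b, one a, then one or more further a's, then !

candidates = ["ba!", "baa!", "baaa!", "baaaaa!", "baa"]

def accepted(pattern, strings):
    # Keep only the strings that the pattern matches in their entirety.
    return [s for s in strings if pattern.fullmatch(s)]

star_hits = accepted(sheep_star, candidates)
plus_hits = accepted(sheep_plus, candidates)
print(star_hits)
print(plus_hits)
```

Both patterns accept the same strings: at least two a's are required, and the exclamation mark is mandatory.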
The period (/./) is a wildcard expression that matches any single character
(except a carriage return). For example, /beg.n/ matches begin, beg'n, and begun.
Anchors tie a pattern to a particular place in a string. The most common anchors are
the caret ^ and the dollar sign $.
The caret ^ matches the start of a line. The dollar sign $ matches the end of a line.
/^The dog\.$/ matches a line that contains only the phrase The dog. The period is
backslashed so that it matches a literal period rather than acting as a wildcard.
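The wildcard and the anchors can be illustrated with a short Python sketch. The sample text is an assumption, and re.MULTILINE is the Python flag that makes ^ and $ apply to every line rather than only to the whole string.

```python
import re

# The period matches any single character between "beg" and "n".
wildcard_hits = re.findall(r"beg.n", "begin begun beg'n began bean")

lines = "The dog.\nThe dog. barked\nSee The dog.\nThe dogs"

# Anchored at both ends, with the final period backslashed (literal period).
exact_line = re.findall(r"^The dog\.$", lines, re.MULTILINE)

# Without the backslash, the period is a wildcard and also accepts "The dogs".
loose_line = re.findall(r"^The dog.$", lines, re.MULTILINE)

print(wildcard_hits)
print(exact_line)
print(loose_line)
```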
Disjunction, Grouping, and Precedence
• Disjunction operator, also called the pipe symbol |.
• The pattern /cat|dog/ matches either the string cat or the string dog.
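• How can we specify both guppy and guppies?
• We cannot simply say /guppy|ies/, because that would match only the
strings guppy and ies: sequences such as guppy take precedence over the
disjunction operator |.
• To apply the disjunction only to part of the pattern, we group that part with
the parenthesis operators ( and ): /gupp(y|ies)/.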
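The precedence problem can be seen directly in Python. The grouped pattern /gupp(y|ies)/ is the standard way to restrict the disjunction to the suffix; the test string is an assumption. finditer with group(0) is used because re.findall would return only the contents of the parenthesized group.

```python
import re

text = "guppy guppies"

# Without grouping, the alternatives are the whole sequences "guppy" and "ies".
ungrouped = re.findall(r"guppy|ies", text)

# With parentheses, the disjunction applies only to the suffix.
grouped = [m.group(0) for m in re.finditer(r"gupp(y|ies)", text)]

# The plain pipe between two whole words works as expected.
pets = re.findall(r"cat|dog", "a cat and a dog")

print(ungrouped)
print(grouped)
print(pets)
```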
A Simple Example
Suppose we want to write an RE to find cases of the English article the. A simple
first attempt is:
/the/
This pattern will miss the word when it begins a sentence and hence is capitalized
(i.e., The). This leads to:
/[tT]he/
This pattern still matches the when it is embedded in other words (e.g., other or
theology), so we need to specify that we want instances with a word boundary on both
sides:
/\b[tT]he\b/
Alternatively, without \b, we can specify that we want instances in which there are
no alphabetic letters on either side of the the:
/[^a-zA-Z][tT]he[^a-zA-Z]/
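The successive refinements can be compared on one sentence in Python. The test sentence is made up for illustration. Note that the last pattern requires an actual non-letter character on each side, so in this sketch it misses the sentence-initial The, which has no character before it.

```python
import re

text = "The other day the theology student read the book"

# /the/: misses "The" and also fires inside "other" and "theology".
plain = re.findall(r"the", text)

# /[tT]he/: now finds "The", but still fires inside other words.
either_case = re.findall(r"[tT]he", text)

# /\b[tT]he\b/: word boundaries on both sides.
bounded = re.findall(r"\b[tT]he\b", text)

# /[^a-zA-Z][tT]he[^a-zA-Z]/: a non-letter must be present on both sides.
non_alpha = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

print(len(plain), len(either_case))
print(bounded)
print(non_alpha)
```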
Some characters that need to be backslashed
Here’s a regular expression for a dollar sign followed by a string of digits. The
dollar sign is backslashed so that it matches a literal $ rather than acting as the
end-of-line anchor: /\$[0-9]+/
Now we just need to deal with fractions of dollars. We’ll add a decimal point and two
digits afterwards: /\$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary: /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
This pattern allows prices like $199999.99, which would be far too expensive! We
need to limit the dollars:
/(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b/
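The price patterns can be checked in Python. In Python's re module an unescaped $ acts as the end-of-line anchor, so this sketch writes the literal dollar sign as \$. The variable names and sample prices are assumptions, and the patterns are applied to isolated price strings with fullmatch rather than to running text.

```python
import re

with_cents = re.compile(r"\$[0-9]+\.[0-9][0-9]")
optional_cents = re.compile(r"(^|\W)\$[0-9]+(\.[0-9][0-9])?\b")
limited = re.compile(r"(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b")

# An unescaped $ is an anchor in Python, so this pattern cannot match a price.
unescaped = re.search(r"$[0-9]+", "$199")

# Cents are mandatory in the second pattern, optional in the third.
needs_cents = [bool(with_cents.fullmatch(p)) for p in ("$199.99", "$199")]
cents_optional = [bool(optional_cents.fullmatch(p)) for p in ("$199.99", "$199")]

# The limited pattern accepts at most three dollar digits.
within_limit = [bool(limited.fullmatch(p)) for p in ("$199.99", "$199999.99")]

print(unescaped)
print(needs_cents, cents_optional, within_limit)
```

Because {0,3} also permits zero digits, a string such as $.99 is accepted as well; a stricter variant could use {1,3}, though that change is not part of the slides.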
A similar pattern handles disk space specifications. We’ll need to allow for optional
fractions again (5.5 GB); note the use of ? for making the final s optional, and the
use of / */ to mean “zero or more spaces” since there might always be extra spaces
lying around:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
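A short Python sketch of the disk space pattern applied to a made-up product description. The sample sentence is an assumption; finditer with group(0) is used so that the full matched text is returned rather than the parenthesized groups.

```python
import re

disk_space = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

text = "A 500 GB disk, 5.5 gigabytes free, 2Gigabyte cache, 16 MB RAM"

# Collect the full text of every match.
sizes = [m.group(0) for m in disk_space.finditer(text)]

# Extra spaces between the number and the unit are tolerated by " *".
spaced = disk_space.search("100   GB")

print(sizes)
print(spaced.group(0))
```

The 16 MB entry is skipped because neither GB nor gigabyte(s) follows the number.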