Regular Expressions


Introduction to NLP

Technologies
• Speech recognition
– Spoken language is recognized and
transformed into text as in
dictation systems, into commands
as in robot control systems, or into
some other internal representation.
• Speech synthesis
– Utterances in spoken language are
produced from text (text-to-speech
systems) or from internal
representations of words or
sentences (concept-to-speech
systems)

• Text categorization
– This technology assigns texts to
categories. Texts may belong to
more than one category, and categories
may contain other categories.
Filtering is a special case of
categorization with just two
categories.
• Text Summarization
– The most relevant portions of a
text are extracted as a summary.
The task depends on the needed
lengths of the summaries.
Summarization is harder if the
summary has to be specific to a
certain query.

• Text Indexing
– As a precondition for document
retrieval, texts are stored in an indexed
database. Usually a text is indexed for
all word forms or – after lemmatization
– for all lemmas. Sometimes indexing
is combined with categorization and
summarization.
• Text Retrieval
– Texts that best match a given query
or document are retrieved from a
database. The candidate documents are
ordered with respect to their expected
relevance.
Indexing, categorization,
summarization and retrieval are often
subsumed under the term information
retrieval.

• Information Extraction
– Relevant pieces of
information are discovered and marked
for extraction. The extracted pieces can
be: the topic, named entities such as
company, place or person names,
simple relations such as prices,
destinations, functions etc. or complex
relations describing accidents, company
mergers or football matches.
• Data Fusion and Text Data Mining
– Extracted pieces of information from
several sources are combined in one
database. Previously undetected
relationships may be discovered.

• Question Answering
– Natural language queries are used
to access information in a
database. The database may be a
base of structured data or a
repository of digital texts in which
certain parts have been marked as
potential answers.
• Report Generation
– A report in natural language is
produced that describes the
essential contents or changes of a
database. The report can contain
accumulated numbers, maxima,
minima and the most drastic
changes.

Regular Expressions
• In computer science, a regular expression (RE) is a language for specifying text search strings.

• A regular expression is a formula in a special language that is used for specifying a simple class of strings.

• Formally, a regular expression is an algebraic notation for characterizing a set of strings.

• RE search requires
– a pattern that we want to search for, and
– a corpus of texts to search through.

Regular Expressions and Automata 8



• A regular expression search function will search through the corpus, returning all texts that match the pattern.

• For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression.
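
The search described above can be sketched in a few lines. This is only a rough imitation of what grep does, assuming Python's re module; the function name grep and the three-line corpus are made up for illustration.

```python
import re

def grep(pattern, text):
    """Return every line of `text` that contains a match for `pattern`."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

# A tiny made-up corpus of three lines.
corpus = "a woodchuck ran\nno rodents here\ntwo woodchucks sat"

# Both lines containing the character sequence "woodchuck" are returned.
matching_lines = grep(r"woodchuck", corpus)
print(matching_lines)
```

Here the pattern is the first argument and the corpus is the second, mirroring the two things an RE search requires.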

Basic Regular Expression Patterns
• The simplest kind of regular expression is a sequence of simple
characters; putting characters in sequence is called concatenation.

• To search for woodchuck, we type /woodchuck/.

• Regular expressions are case sensitive; lower case /s/ is distinct from
upper case /S/. This means that the pattern /woodchucks/ will not
match the string Woodchucks.
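
A small check of concatenation and case sensitivity, assuming Python's re module (the example sentences are invented). The re.IGNORECASE flag shown at the end is a Python feature, not something the slides discuss.

```python
import re

# The lower-case pattern does not match the capitalized word.
miss = re.search(r"woodchucks", "Woodchucks are rodents")

# It does match when the text contains the same lower-case sequence.
hit = re.search(r"woodchucks", "links to woodchucks and lemurs")

# Python can be told to ignore case explicitly.
relaxed = re.search(r"woodchucks", "Woodchucks are rodents", re.IGNORECASE)

print(miss, hit.group(0), relaxed.group(0))
```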

• Square brackets [] specify a disjunction of characters: the pattern matches any one of the characters listed inside the brackets.

• The regular expression /[1234567890]/ specifies any single digit.

• Where there is a well-defined sequence associated with a set of characters, the brackets can be used with the dash (-) to specify any one character in a range. For example, /[0-9]/ also specifies any single digit.

The square brackets can also be used to specify what a single character cannot be, by use of the caret ^.

If the caret ^ is the first symbol after the open square bracket [, the resulting pattern is negated. For example, /[^a]/ matches any single character except a.

The question mark ? marks optionality of the previous expression. For example, /woodchucks?/ matches woodchuck or woodchucks.
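
The bracket, range, negation and optionality operators can be tried out as follows. This is a sketch assuming Python's re module; the test strings are invented for illustration.

```python
import re

# A range inside brackets: any single digit.
digits = re.findall(r"[0-9]", "Chapter 12, page 7")

# Another range: any single upper-case letter.
capitals = re.findall(r"[A-Z]", "Dr. Who met Mr. X")

# A caret right after the open bracket negates the set:
# any single character that is not a lower-case letter.
not_lower = re.findall(r"[^a-z]", "ab1 C")

# The question mark makes the preceding character optional.
optional_s = re.findall(r"woodchucks?", "one woodchuck, two woodchucks")

print(digits, capitals, not_lower, optional_s)
```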

The Kleene star (*) means “zero or more occurrences of the immediately preceding character or regular expression”.

So /a*/ means “any string of zero or more a’s”.

The Kleene + means “one or more occurrences of the immediately preceding character or regular expression”.

So /a+/ means “any string of one or more a’s”.

Sheep Language
baa!
baaa!
baaaa!
baaaaa!

There are thus two ways to specify the sheep language:

/baaa*!/ or /baa+!/.
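
To check that the two sheep-language patterns accept the same strings, one can compare them on a few samples. This is a sketch assuming Python's re module; fullmatch is used so that the whole string, not just a part of it, must fit the pattern.

```python
import re

sheep_star = re.compile(r"baaa*!")  # b, two a's, then zero or more a's, then !
sheep_plus = re.compile(r"baa+!")   # b, one a, then one or more a's, then !

samples = ["b!", "ba!", "baa!", "baaa!", "baaaaa!", "baa"]

# For each sample, record whether each pattern accepts the whole string.
results = {
    s: (bool(sheep_star.fullmatch(s)), bool(sheep_plus.fullmatch(s)))
    for s in samples
}

for s, (star_ok, plus_ok) in results.items():
    print(s, star_ok, plus_ok)

# /a*/ also accepts the empty string, since "zero or more" includes zero.
empty_ok = re.fullmatch(r"a*", "") is not None
```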
The period (/./) is a wildcard expression that matches any single character (except a carriage return).

Anchors are special characters that anchor regular expressions to particular places in a string.

The most common anchors are the caret ^ and the dollar sign $.

The caret ^ matches the start of a line. The dollar sign $ matches the end of a line.

/^The dog\.$/ matches a line that contains only the phrase The dog.
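
A short sketch of the wildcard and the anchors, assuming Python's re module (where the period matches any character except a newline by default). The sample strings are invented for illustration.

```python
import re

# The period stands for any single character between "beg" and "n".
wildcard_hits = re.findall(r"beg.n", "begin began begun beg'n bgn")

# Anchored at both ends, the pattern accepts only a line that is exactly "The dog."
lines = ["The dog.", "The dog. barked", "See The dog."]
whole_line = [line for line in lines if re.search(r"^The dog\.$", line)]

# Without the backslash, the period is a wildcard again and also accepts "The dogs".
unescaped = re.search(r"^The dog.$", "The dogs")
escaped = re.search(r"^The dog\.$", "The dogs")

print(wildcard_hits, whole_line, unescaped is not None, escaped is not None)
```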

Disjunction, Grouping, and Precedence
• The disjunction operator, also called the pipe symbol, is written |.
• The pattern /cat|dog/ matches either the string cat or the string dog.
• How can we specify both guppy and guppies?

• We cannot simply say /guppy|ies/, because that would match only the strings guppy and ies.

• This is because sequences like guppy take precedence over the disjunction operator |.

• To make the disjunction apply only to a specific part of the pattern, use the parenthesis operators ( and ): the pattern /gupp(y|ies)/ matches guppy or guppies.
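
The effect of grouping on the disjunction can be seen by running both guppy patterns on the same text. This is a sketch assuming Python's re module; the helper name matches and the sample text are made up. finditer is used rather than findall because findall returns only the parenthesized group when a pattern contains one.

```python
import re

def matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "a guppy, two guppies, and ies"

# Without parentheses the alternatives are the whole sequences "guppy" and "ies".
ungrouped = matches(r"guppy|ies", text)

# With parentheses the disjunction covers only the suffix.
grouped = matches(r"gupp(y|ies)", text)

# Simple disjunction of two whole words.
pets = matches(r"cat|dog", "dogs chase cats")

print(ungrouped, grouped, pets)
```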

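Operator precedence hierarchy, from highest to lowest:
Parentheses: ( )
Counters: * + ? {}
Sequences and anchors: the ^my end$
Disjunction: |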

A Simple Example
Suppose we want to write an RE to find cases of the English article the.

A simple (but incorrect) pattern might be: /the/

This pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The). This leads to:
/[tT]he/

This pattern still matches the inside other words (e.g., other), so we need to specify that we want instances with a word boundary on both sides:
/\b[tT]he\b/

Alternatively, without \b, we can specify that we want instances in which there are no alphabetic letters on either side of the the:
/[^a-zA-Z][tT]he[^a-zA-Z]/

This pattern fails when the begins or ends a line, because it requires a character on each side. So before the ‘the’ we require either the beginning-of-line or a non-alphabetic character, and after it either a non-alphabetic character or the end of the line:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
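
The successive refinements can be compared on one sentence. This is a sketch assuming Python's re module; the helper name matches and the sample sentence are invented. In Python, \b treats digits and the underscore as word characters, which is where the last two patterns differ.

```python
import re

def matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "The other one saw the end"

naive = matches(r"the", text)                        # misses "The", hits "other"
either_case = matches(r"[tT]he", text)               # finds "The", still hits "other"
bounded = matches(r"\b[tT]he\b", text)               # only the two articles
no_letters = matches(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)          # misses line-initial "The"
final = matches(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)", text)       # both articles again

# "the" followed by an underscore: no word boundary for \b, but a
# non-alphabetic neighbour for the final pattern.
bounded_underscore = matches(r"\b[tT]he\b", "the_25")
final_underscore = matches(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)", "the_25")

print(naive, either_case, bounded, no_letters, final)
```

Note that the last two patterns include the neighbouring character in the match, so the matched text is, for example, "The " rather than "The".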
More Operators
Aliases for common sets of characters

Regular expression operators for counting

Some characters that need to be backslashed
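
A minimal sketch of these three kinds of operators, assuming Python's re module and invented sample strings: \d, \w and \s are aliases for digit, word and whitespace characters; {n} and {n,m} count repetitions; and a backslash makes a special character such as . or $ literal.

```python
import re

# Aliases for common character sets.
numbers = re.findall(r"\d+", "Party of 5 at 19:30")   # runs of digits
words = re.findall(r"\w+", "Blue moon!")               # runs of word characters
fields = re.split(r"\s+", "in  Concord\tnow")          # split on runs of whitespace

# Counters: exactly three digits, then two to three digits.
exactly_three = re.findall(r"\b\d{3}\b", "7 42 365 2024")
two_to_three = re.findall(r"\b\d{2,3}\b", "7 42 365 2024")

# Backslashed special characters are matched literally.
literal_dot = re.findall(r"Dr\.", "Dr. Drs")
wildcard_dot = re.findall(r"Dr.", "Dr. Drs")
literal_dollar = re.findall(r"\$\d+", "costs $5 or 5$")

print(numbers, words, fields, exactly_three, two_to_three)
print(literal_dot, wildcard_dot, literal_dollar)
```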

Suppose a user wants “any machine with at least 6 GHz and 500 GB of
disk space for less than $1000”. To find such offers, we first need regular expressions for prices and disk sizes.

Here’s a regular expression for a dollar sign followed by a string of digits (the dollar sign is backslashed so that it is read as a literal character rather than as the end-of-line anchor): /\$[0-9]+/

Now we just need to deal with fractions of dollars. We’ll add a decimal point and two
digits afterwards: /\$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary: /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/

This pattern still allows prices like $199999.99, which would be far too expensive! We
need to limit the dollars to between one and three digits:
/(^|\W)\$[0-9]{1,3}(\.[0-9][0-9])?\b/

For disk space, we’ll need to allow for optional fractions again (5.5 GB); note the use of ? for making
the final s optional, and the use of / */ to mean “zero or more spaces”, since there
might always be extra spaces lying around:

/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
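
The price and disk-space patterns can be tried on a made-up product listing. This is a sketch assuming Python's re module, with two assumptions in the price pattern: the dollar sign is backslashed so that it is read literally, and the dollar amount must have between one and three digits. The names PRICE, DISK and listing are hypothetical.

```python
import re

# Assumption: literal dollar sign (\$) and one to three dollar digits.
PRICE = re.compile(r"(^|\W)\$[0-9]{1,3}(\.[0-9][0-9])?\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

listing = "Laptop with 500 GB disk for $999.99, tablet with 5.5 gigabytes for $199"

# The price match may include the preceding non-word character, so strip it.
prices = [m.group(0).strip() for m in PRICE.finditer(listing)]
disks = [m.group(0) for m in DISK.finditer(listing)]

# A six-digit dollar amount is rejected by the three-digit limit.
too_expensive = PRICE.search("$199999.99")

# Zero spaces between the number and the unit are allowed.
no_space = DISK.search("500GB")

print(prices, disks, too_expensive, no_space.group(0))
```

Comparing the extracted numbers against the user's thresholds (at least 500 GB, less than $1000) would be a separate step after matching.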
