Regular Expressions

Uploaded by

divyamanjari1604


Introduction to NLP

Technologies
• Speech recognition
– Spoken language is recognized and
transformed into text as in
dictation systems, into commands
as in robot control systems, or into
some other internal representation.
• Speech synthesis
– Utterances in spoken language are
produced from text (text-to-speech
systems) or from internal
representations of words or
sentences (concept-to-speech
systems)

Technologies
• Text categorization
– This technology assigns texts to
categories. Texts may belong to
more than one category, categories
may contain other categories.
Filtering is a special case of
categorization with just two
categories.
• Text Summarization
– The most relevant portions of a
text are extracted as a summary.
The task depends on the needed
lengths of the summaries.
Summarization is harder if the
summary has to be specific to a
certain query.

Technologies
• Text Indexing
– As a precondition for document
retrieval, texts are stored in an indexed
database. Usually a text is indexed for
all word forms or – after lemmatization
– for all lemmas. Sometimes indexing
is combined with categorization and
summarization.
• Text Retrieval
– The texts that best match a given query or
document are retrieved from a database.
The candidate documents are ordered
with respect to their expected relevance.
Indexing, categorization,
summarization and retrieval are often
subsumed under the term information
retrieval.

Technologies
• Information Extraction
– Relevant pieces of information
are discovered and marked
for extraction. The extracted pieces can
be: the topic, named entities such as
company, place or person names,
simple relations such as prices,
destinations, functions etc. or complex
relations describing accidents, company
mergers or football matches.
• Data Fusion and Text Data Mining
– Extracted pieces of information from
several sources are combined in one
database. Previously undetected
relationships may be discovered.

Technologies
• Question Answering
– Natural language queries are used
to access information in a
database. The database may be a
base of structured data or a
repository of digital texts in which
certain parts have been marked as
potential answers.
• Report Generation
– A report in natural language is
produced that describes the
essential contents or changes of a
database. The report can contain
accumulated numbers, maxima,
minima and the most drastic
changes.

Regular Expressions
• In computer science, a regular expression (RE) is a language for specifying text search strings.

• A regular expression is a formula in a special language that is used for specifying simple classes of strings.

• Formally, a regular expression is an algebraic notation for characterizing a set of strings.

• RE search requires
– a pattern that we want to search for, and
– a corpus of texts to search through.


Regular Expressions

• A regular expression search function will search through the corpus, returning all texts that match the pattern.

• For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression.
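
The grep behaviour described above can be sketched in a few lines of Python. This is only an illustration using the standard `re` module; the function name `grep` and the three-line corpus are assumptions made for the example, not part of the slides.

```python
import re

def grep(pattern, lines):
    """Return every line that contains a match for the pattern."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

# Hypothetical corpus, used only for illustration.
corpus = [
    "How much wood would a woodchuck chuck",
    "if a woodchuck could chuck wood?",
    "Nobody knows.",
]

matching_lines = grep(r"woodchuck", corpus)
print(matching_lines)  # the first two lines of the corpus
```

Like grep, the sketch returns whole lines rather than the matched substrings.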


Basic Regular Expression Patterns
• The simplest kind of regular expression is a sequence of simple
characters; putting characters in sequence is called concatenation.

• To search for woodchuck, we type /woodchuck/.

• Regular expressions are case sensitive; lower case /s/ is distinct from
upper case /S/. This means that the pattern /woodchucks/ will not
match the string Woodchucks.
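
A small Python sketch of the case-sensitivity point, assuming the standard `re` module (the sample sentence is made up for the example). The `re.IGNORECASE` flag is a Python feature, not something the slides discuss, and is shown only as one way to relax the default.

```python
import re

text = "Woodchucks are related to squirrels."

# Lower-case pattern: no match, because the text has a capital W.
lower_match = re.search(r"woodchucks", text)

# Pattern with the same capitalization as the text: matches.
exact_match = re.search(r"Woodchucks", text)

# Python-specific flag that turns case sensitivity off.
ignore_case_match = re.search(r"woodchucks", text, re.IGNORECASE)

print(lower_match)                 # None
print(exact_match.group())         # Woodchucks
print(ignore_case_match.group())   # Woodchucks
```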

Basic Regular Expression Patterns
• Square brackets [] specify a disjunction of characters: the pattern matches any one of the characters inside the brackets.

• The regular expression /[1234567890]/ specifies any single digit.

• Where there is a well-defined sequence associated with a set of characters, the brackets can be used with the dash (-) to specify any one character in a range. For example, /[0-9]/ matches any single digit.
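
Bracket disjunctions and ranges can be tried out as follows. This is a minimal sketch with Python's `re` module; the sample text and variable names are assumptions chosen for the example.

```python
import re

text = "Chapter 7: the Woodchuck and the woodchuck"

# [wW] matches either a lower-case or an upper-case w.
either_case = re.findall(r"[wW]oodchuck", text)

# An explicit list of digits and the equivalent range.
digits_listed = re.findall(r"[1234567890]", text)
digits_range = re.findall(r"[0-9]", text)

# A range of upper-case letters.
capitals = re.findall(r"[A-Z]", text)

print(either_case)    # ['Woodchuck', 'woodchuck']
print(digits_range)   # ['7']
print(capitals)       # ['C', 'W']
```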


The square brackets can also be used to specify what a single character cannot be, by use of the caret ^.

If the caret ^ is the first symbol after the open square bracket [, the resulting pattern is negated.

The question mark ? marks optionality of the previous expression.
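
The negated bracket and the optional marker can be illustrated with a short Python sketch (standard `re` module; the sample strings are invented for the example). It also shows that a caret that is not the first symbol inside the brackets is just an ordinary character.

```python
import re

# Caret first inside the brackets: match anything that is NOT a lower-case letter.
non_lower = re.findall(r"[^a-z]", "ab1C d")

# Caret not in first position: it stands for a literal caret.
e_or_caret = re.findall(r"[e^]", "look up ^ here")

# The question mark makes the preceding character optional.
singular_or_plural = re.findall(r"woodchucks?", "one woodchuck, two woodchucks")
spelling_variants = re.findall(r"colou?r", "color colour")

print(non_lower)            # ['1', 'C', ' ']
print(e_or_caret)           # ['^', 'e', 'e']
print(singular_or_plural)   # ['woodchuck', 'woodchucks']
print(spelling_variants)    # ['color', 'colour']
```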

The Kleene star (*) means “zero or more occurrences of the immediately previous character or regular expression”.

So /a*/ means “any string of zero or more a’s”.

The Kleene plus (+) means “one or more occurrences of the immediately preceding character or regular expression”.

So /a+/ means “any string of one or more a’s”.

Sheep Language
baa!
baaa!
baaaa!
baaaaa!

There are thus two ways to specify the sheep language:

/baaa*!/ or /baa+!/.
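
As a quick check that the two sheep-language patterns accept the same strings, here is a hedged Python sketch. It uses `re.fullmatch` so that the whole string must belong to the language; the list of test strings is an assumption made for the example.

```python
import re

star_form = re.compile(r"baaa*!")   # b, two a's, then zero or more a's, then !
plus_form = re.compile(r"baa+!")    # b, one a, then one or more a's, then !

samples = ["baa!", "baaa!", "baaaaa!", "ba!", "b!", "baa"]

star_results = [star_form.fullmatch(s) is not None for s in samples]
plus_results = [plus_form.fullmatch(s) is not None for s in samples]

print(star_results)   # [True, True, True, False, False, False]
print(plus_results)   # same as star_results

# /a*/ also accepts the empty string, while /a+/ does not.
print(re.fullmatch(r"a*", "") is not None)   # True
print(re.fullmatch(r"a+", "") is not None)   # False
```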
The period (/./) is a wildcard expression that matches any single character (except a carriage return).

Anchors are special characters that anchor regular expressions to particular places in a string.

The most common anchors are the caret ^ and the dollar sign $.

The caret ^ matches the start of a line. The dollar sign $ matches the end of a line.

/^The dog\.$/ matches a line that contains only the phrase The dog.
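
The wildcard and the anchors can be sketched in Python as below. The sample strings are assumptions for the example. Two Python-specific details are worth noting: by default the period does not match the newline character, and ^ and $ match at every line only when the `re.MULTILINE` flag is given.

```python
import re

# The period matches any single character between "beg" and "n".
wildcard = re.findall(r"beg.n", "begin began begun beg'n bgn")
print(wildcard)   # ['begin', 'began', 'begun', "beg'n"]

# In Python, the period does not match a newline by default.
print(re.search(r"a.b", "a\nb"))   # None

anchored = re.compile(r"^The dog\.$")
print(anchored.search("The dog.") is not None)             # True
print(anchored.search("The dog. It barked.") is not None)  # False

# Without the backslash, the period is a wildcard again.
print(re.search(r"^The dog.$", "The dogs") is not None)    # True

# With re.MULTILINE, ^ and $ anchor at the start and end of each line.
lines = "The dog.\nA cat.\nThe dog."
per_line = re.findall(r"^The dog\.$", lines, re.MULTILINE)
print(per_line)   # ['The dog.', 'The dog.']
```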

Disjunction, Grouping, and Precedence
• Disjunction operator, also called the pipe symbol |.
• The pattern /cat|dog/ matches either the string cat or the string dog.
• How can I specify both guppy and guppies?

• We cannot simply say /guppy|ies/, because that would match only the
strings guppy and ies.

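• Sequences like guppy take precedence over the disjunction operator |.

• To make the disjunction apply only to a specific part of the pattern, use the parenthesis operators ( and ): the pattern /gupp(y|ies)/ matches both guppy and guppies.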
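
The difference between the ungrouped and the grouped pattern can be seen in a small Python sketch (the sample sentence is invented for the example). `re.finditer` is used instead of `re.findall` because, in Python, `findall` would return only the contents of the parenthesized group.

```python
import re

text = "a guppy, two guppies, and some ies"

# Without parentheses, | separates the whole sequences "guppy" and "ies".
ungrouped = [m.group(0) for m in re.finditer(r"guppy|ies", text)]

# With parentheses, the disjunction applies only to the suffix.
grouped = [m.group(0) for m in re.finditer(r"gupp(y|ies)", text)]

print(ungrouped)   # ['guppy', 'ies', 'ies']
print(grouped)     # ['guppy', 'guppies']
```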

Operator precedence hierarchy

A Simple Example
Suppose we want to write an RE to find cases of the English article the.

A simple (but incorrect) pattern might be: /the/

This will miss the word when it begins a sentence and hence is capitalized (i.e., The), which leads to:
/[tT]he/

This still matches the when it is embedded in other words (e.g., other), so we need to specify that we want instances with a word boundary on both sides:
/\b[tT]he\b/

Without using \b, we can instead specify that we want instances in which there are no alphabetic letters on either side of the the:
/[^a-zA-Z][tT]he[^a-zA-Z]/

This pattern fails when the begins or ends a line. Before the ‘the’ we require either the beginning-of-line or a non-alphabetic character, and after it either a non-alphabetic character or the end of the line:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
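
The successive refinements can be compared by counting matches on one sample sentence. This is a sketch with Python's `re` module; the sentence and the helper `count_matches` are assumptions made for the example. The last two lines show one case where the \b version and the final version differ in Python: the underscore counts as a word character, so \b sees no boundary between "the" and "_".

```python
import re

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

text = "The cat saw the other cat near the"

print(count_matches(r"the", text))                        # 3 (includes "other", misses "The")
print(count_matches(r"[tT]he", text))                     # 4 (still includes "other")
print(count_matches(r"\b[tT]he\b", text))                 # 3
print(count_matches(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))   # 1 (misses line start and end)
print(count_matches(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)", text))  # 3

# Underscores and digits are word characters for \b, but not letters.
print(re.search(r"\b[tT]he\b", "the_99"))                                     # None
print(re.search(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)", "the_99") is not None)   # True
```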
More Operators
Aliases for common sets of characters

Regular expression operators for counting

Some characters that need to be backslashed


The user might want “any machine with at least 6 GHz and 500 GB of
disk space for less than $1000”

Here’s a regular expression for a dollar sign followed by a string of digits (the dollar sign is backslashed so that it is read literally rather than as the end-of-line anchor): /\$[0-9]+/

Now we just need to deal with fractions of dollars. We’ll add a decimal point and two digits afterwards: /\$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the cents optional and to make sure we’re at a word boundary: /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/

This pattern allows prices like $199999.99 which would be far too expensive! We need to limit the dollars:
/(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b/
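
The four price patterns can be compared on individual tokens with a short Python sketch. Two assumptions are made here: the dollar sign is escaped as `\$`, because Python's `re` module otherwise reads `$` as the end anchor, and each token is tested with `re.fullmatch`, so the whole token must be a price. The helper name `accepts` and the sample tokens are invented for the example.

```python
import re

digits_only = r"\$[0-9]+"
with_cents = r"\$[0-9]+\.[0-9][0-9]"
optional_cents = r"(^|\W)\$[0-9]+(\.[0-9][0-9])?\b"
limited_dollars = r"(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b"

def accepts(pattern, token):
    """True if the whole token matches the pattern."""
    return re.fullmatch(pattern, token) is not None

print(accepts(digits_only, "$199"))               # True
print(accepts(with_cents, "$199.99"))             # True
print(accepts(with_cents, "$199"))                # False: cents are required
print(accepts(optional_cents, "$199"))            # True
print(accepts(optional_cents, "$199999.99"))      # True: too expensive, but allowed
print(accepts(limited_dollars, "$199.99"))        # True
print(accepts(limited_dollars, "$199999.99"))     # False

# When searching in running text, {0,3} can shrink to zero digits.
partial = re.search(limited_dollars, "$199999.99")
print(partial.group(0))                           # $
```

Because {0,3} permits zero digits, a search over running text can still match just the $ of a longer price, as the last lines show. A lower bound of 1 would avoid that, at the cost of rejecting prices such as $.99.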

For disk space, we’ll need to allow for optional fractions again (5.5 GB); note the use of ? for making the final s optional, and the use of / */ to mean “zero or more spaces” since there might always be extra spaces lying around:

/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
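
The disk-space pattern can be tried on a sample sentence as follows. This is a minimal sketch with Python's `re` module; the sentence is an assumption made for the example, and `re.finditer` is used so that the full matched text is returned rather than the group contents.

```python
import re

disk_space = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

text = "Options: 500 GB, 5.5 GB, 1024   gigabytes, 1 Gigabyte, or 500 MB."

found = [m.group(0) for m in disk_space.finditer(text)]
print(found)
# ['500 GB', '5.5 GB', '1024   gigabytes', '1 Gigabyte']
```

The entry 500 MB is skipped because the unit matches neither GB nor gigabyte(s).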
