Regular Expressions For NLP
What is a Regular Expression?
Notation for specifying a set of strings
Used for search
Corpus: text(s) to search through / learn from
Used to define a (formal) language
Creating a Regular Expression
Perl notation uses / / around regexes
Expressions are composed of:

Category            Symbols             Example          Matches
Literal characters  A, a, t, S, Z, ab   /the/            the, other (but not The)
Disjunction                             /T|the/          The, the
Boundaries          \b \B ^ $ \n \t     /\bthe\b/
Quantifiers         * + ? {}            /colou?r/        color, colour
Special characters                      /.+\.com/        Yahoo.com
Capturing           ( ) \1              /(\d{5}).+\1/
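The example patterns above can be tried directly. The following is a minimal Python sketch (Python's re module accepts the same Perl-style syntax, without the surrounding slashes). The test strings and the helper name `matching` are made up for illustration and are not from the slides.

```python
import re

# Each entry pairs a pattern from the slide with illustrative test strings
# (the strings themselves are assumptions chosen for this sketch).
examples = {
    r"the": ["the", "other", "The"],
    r"T|the": ["The", "the"],
    r"\bthe\b": ["the", "other"],
    r"colou?r": ["color", "colour", "colr"],
    r".+\.com": ["Yahoo.com", ".com"],
    r"(\d{5}).+\1": ["90210 and 90210", "90210 and 90211"],
}

def matching(pattern, strings):
    """Return the strings in which `pattern` is found anywhere (re.search)."""
    return [s for s in strings if re.search(pattern, s)]

results = {p: matching(p, strs) for p, strs in examples.items()}
for pattern, hits in results.items():
    print(f"/{pattern}/ matches {hits}")
```

Note that re.search looks for the pattern anywhere in the string, which is why /the/ is found inside "other".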
Creating a Regular Expression
Defining a regex involves iteratively improving:
Accuracy/Precision: minimizing false positives
e.g. /the/ -> /\bthe\b/
Coverage/Recall: minimizing false negatives
e.g. /the/ -> /T|the/
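One way to see this trade-off is to count false positives and false negatives for each version of the pattern on a small hand-labeled sample. The sketch below assumes the goal is to find the word "the" regardless of capitalization. The sample, the helper name `errors`, and the final combined pattern are hypothetical additions, not part of the slides.

```python
import re

# Hypothetical labeled sample: (text, should_match), where the goal is to
# find the word "the" regardless of capitalization.
SAMPLE = [
    ("the", True),
    ("The", True),
    ("other", False),
    ("theology", False),
    ("Tom", False),
]

def errors(pattern, sample=SAMPLE):
    """Return (false_positives, false_negatives) for `pattern` on `sample`."""
    fp = sum(1 for text, want in sample if re.search(pattern, text) and not want)
    fn = sum(1 for text, want in sample if not re.search(pattern, text) and want)
    return fp, fn

# The first three patterns are from the slide; the last combines both fixes.
for pattern in [r"the", r"\bthe\b", r"T|the", r"\b(T|t)he\b"]:
    fp, fn = errors(pattern)
    print(f"/{pattern}/: {fp} false positives, {fn} false negatives")
```

On this sample, the word boundaries remove false positives and the disjunction removes false negatives. Because /T|the/ parses as "T" or "the", it also matches any string containing a capital T, so the combined pattern groups the alternation.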
Using Regular Expressions
Generally used to search or replace:
Perl:
my $str = "other people";
if ($str =~ /the/) { print "match\n"; }
Java:
import java.util.regex.*;
Pattern r = Pattern.compile("\\d");
Matcher m = r.matcher("D0es th1s c0nta1n d1g1ts?");
if (m.find()) { System.out.println("found a digit"); }
Python:
import re
searchObj = re.search(r"the", "other people")
phone = "Tel: 209-867-5309"
re.sub(r"\d", "#", phone)  # "Tel: ###-###-####"
References
Good tutorials and cheat sheets available online:
http://regexone.com/lesson
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
http://donovanh.com/pages/regex_list.html
ELIZA (1966)
FSA representation
Formal Representation
Specify the following:
Q = {q0, q1, ..., qN-1}: a finite set of N states
Transition Table
Convenient for computer representation, too:
[Transition table indexed by State and Input]
D-Recognize
Deterministic: no choice points
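A table-driven deterministic recognizer can be sketched in a few lines. The slides do not preserve the table or the algorithm itself, so everything below is an assumption for illustration: the example machine (a "b", two or more "a"s, then "!"), the dictionary encoding of the transition table, and the function name `d_recognize`.

```python
# Hypothetical example machine: accepts "baa!", "baaa!", and so on.
# States are integers; 0 is the start state and 4 is the only accept state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START = 0
ACCEPT = {4}

def d_recognize(tape, transitions=TRANSITIONS, start=START, accept=ACCEPT):
    """Return True if the deterministic machine accepts the string `tape`."""
    state = start
    for symbol in tape:
        # A missing table entry means there is no legal move: reject.
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    # Accept only if the input is exhausted in an accept state.
    return state in accept
```

Because each (state, input) pair has at most one entry, the loop never has to choose between alternatives, which is what "no choice points" means here.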
Generative Uses
Non-Deterministic FSAs
More than one transition possible for a
particular state and input combination:
Non-Deterministic FSAs
In an NFSA, at least one path through the
machine leads to an accept state for any
string in the language defined by the machine.
Not all paths through the machine for an
accepted string lead to an accept state.
No path through the machine leads to an
accept state for a string not in the language.
Challenge: what to do if we make the wrong choice?
Resolving Non-Determinism
Backup: when we reach a choice point, mark the
state and input position (search-state),
then roll back if needed.
Look-Ahead: look at the following input
symbols to try to choose the correct transition.
Parallelism: follow each of the transition
options in parallel.
Convert: any NFSA can be converted to an
equivalent deterministic FSA.
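The parallelism and conversion strategies are closely related, and both can be sketched briefly. The code below assumes a hypothetical non-deterministic machine for "b", two or more "a"s, then "!", with no epsilon transitions. The names `parallel_recognize` and `to_deterministic` are illustrative, and the conversion shown is the standard subset construction rather than anything taken from the slides.

```python
# Hypothetical non-deterministic machine: in state 2, reading "a" may lead
# to state 2 or to state 3. Each table entry is a set of destination states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
START, ACCEPT = 0, {4}

def parallel_recognize(tape, table=NFSA, start=START, accept=ACCEPT):
    """Follow every transition option at once by tracking a set of states."""
    current = {start}
    for symbol in tape:
        current = {nxt for s in current for nxt in table.get((s, symbol), set())}
        if not current:  # every path has died
            return False
    return bool(current & accept)

def to_deterministic(table=NFSA, start=START, accept=ACCEPT):
    """Subset construction: each new state is a frozenset of NFSA states."""
    alphabet = {symbol for (_, symbol) in table}
    start_set = frozenset({start})
    dfa, todo, seen = {}, [start_set], {start_set}
    while todo:
        subset = todo.pop()
        for symbol in alphabet:
            target = frozenset(
                n for s in subset for n in table.get((s, symbol), set())
            )
            if not target:
                continue  # no move on this symbol from this subset
            dfa[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {subset for subset in seen if subset & accept}
    return dfa, start_set, accepting
```

The sets of states that `parallel_recognize` passes through are exactly the states of the machine built by `to_deterministic`, so conversion can be read as doing the parallel bookkeeping once, ahead of time.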
Backup
Need to modify the transition table:
Add an epsilon transition column
Allow multiple destination states for a given
search-state.
[Transition tables indexed by State and Input; one entry lists two destination states: 2,3]
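The backup strategy can be sketched with an agenda of saved search-states. This is a minimal sketch under stated assumptions: the table below is a hypothetical machine (not the one from the slide), `EPSILON` is an arbitrary marker chosen here for the epsilon column, and `nd_recognize` is an illustrative name.

```python
EPSILON = ""  # hypothetical marker for transitions that consume no input

# Hypothetical table: each (state, symbol) entry lists all destination states.
# State 2 on "a" can go to 2 or 3, and state 3 has an epsilon move back to 2.
TABLE = {
    (0, "b"): [1],
    (1, "a"): [2],
    (2, "a"): [2, 3],
    (3, EPSILON): [2],
    (3, "!"): [4],
}

def nd_recognize(tape, table=TABLE, start=0, accept=frozenset({4})):
    """Depth-first search over search-states (state, input position)."""
    agenda = [(start, 0)]  # stack of saved search-states
    visited = set()        # guards against revisiting the same search-state
    while agenda:
        state, pos = agenda.pop()  # back up to the most recently saved option
        if (state, pos) in visited:
            continue
        visited.add((state, pos))
        if pos == len(tape) and state in accept:
            return True
        # Epsilon moves keep the input position unchanged.
        for nxt in table.get((state, EPSILON), []):
            agenda.append((nxt, pos))
        # Ordinary moves consume one input symbol.
        if pos < len(tape):
            for nxt in table.get((state, tape[pos]), []):
                agenda.append((nxt, pos + 1))
    return False
```

Using a stack for the agenda gives depth-first backup; swapping it for a queue would explore the same search-states breadth-first.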
Computing Theory
You may recall from (or learn in) COMP 147:
The class of languages definable by regular
expressions is the same as the class definable by
FSAs. These are called regular languages.
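The equivalence can be spot-checked for a single language by comparing a regular expression with a hand-built FSA on every short string over a small alphabet. This is only a sanity check on one hypothetical example, not a proof. The pattern, table, alphabet, and function names below are assumptions made for illustration.

```python
import re
from itertools import product

# Hypothetical example language: "b", two or more "a"s, then "!".
PATTERN = re.compile(r"baa+!")
TABLE = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def fsa_accepts(s, table=TABLE, start=0, accept=(4,)):
    """Run the deterministic table over `s`; reject on any missing entry."""
    state = start
    for ch in s:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accept

def disagreements(max_len=6, alphabet="ab!"):
    """List strings up to `max_len` on which the regex and the FSA disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(PATTERN.fullmatch(s)) != fsa_accepts(s):
                bad.append(s)
    return bad
```

re.fullmatch is used instead of re.search so that the regex, like the FSA, must account for the entire string.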
Your Turn
Lab 1: Regular Expression Practice
Project 1: ELIZA reborn