Chap-2 2 (RegularExpression)

Regular expressions are used to describe patterns in strings and are the basis for lexical analysis. They allow defining the valid token types of a programming language using operators like union, concatenation, and Kleene closure. A regular expression consists of constants denoting string sets and operators denoting operations on these sets. Common operators include union (+), concatenation (.), and Kleene closure (*) which allows zero or more repetitions. Regular expressions are an equivalent formalism to finite automata for representing regular languages.

Uploaded by

Mohmed Ibrahim
Copyright
© © All Rights Reserved

Chapter_2 (2-2)

Regular Expressions

Regular expressions: a method for describing regular languages in formal language theory.
Lexical Analysis (scanner)

[Figure: Source program (character stream) → Lexical analyzer → Tokens]

▪ The lexical analyzer is also known as a scanner.
Example input: average = (sum/count)

◼ The lexical analyzer needs to scan and identify only a finite set of valid strings (tokens/lexemes) that belong to the language at hand.
average = (sum/count)

Lexeme     Token type
average    identifier
=          assignment operator
(          open parenthesis
sum        identifier
/          division operator
count      identifier
)          close parenthesis
Valid Token/string

▪ There are predefined rules that every lexeme must satisfy to be identified as a valid token.
▪ These rules are defined by the grammar of the language, by means of patterns.

o A pattern describes what can be a token, and these patterns are defined by means of regular expressions.
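
As a rough illustration of patterns defining tokens, the following Python sketch scans the example input with one regular expression per token type. The token names and patterns are assumptions made for this sketch; they are not taken from any particular language definition.

```python
import re

# Hypothetical token names and patterns, chosen only for this illustration.
TOKEN_SPEC = [
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("DIVIDE", r"/"),
    ("SKIP", r"\s+"),  # whitespace is matched but not reported as a token
]
# One combined pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (token_type, lexeme) pairs; raise ValueError on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token_type, lexeme in tokenize("average = (sum/count)"):
    print(f"{lexeme:10}{token_type}")
```

A character that no pattern matches (for example `$`) makes `tokenize` raise an error, which mirrors the idea that only lexemes matching some pattern are valid tokens.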
History

Stephen Cole Kleene

Regular expressions originated in 1951, when mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets.

These arose in theoretical computer science, in the subfields of automata theory (models of computation) and the description and classification of formal languages.
Formal Language and Natural Language

• Natural language: a language that is normally spoken by people (Arabic, English, …).

• Formal language: a language that can be specified precisely and is amenable to use with computers.
• A (formal) language is a set of strings over a given alphabet.

• The syntax of Java is an example of a formal language.


Regular Language

A regular language (also called a rational language) is a formal language that can be defined by a regular expression.

A simple example of a language that is not regular is the set of strings { aⁿbⁿ | n ≥ 0 }.
Intuitively, it cannot be recognized by a finite automaton, since a finite automaton has finite memory and cannot remember the exact number of a's.
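
To make the memory argument concrete, here is a minimal Python sketch of a recognizer for this language. The function name `is_anbn` is made up for the example. Note that it relies on an integer counter that can grow without bound, which is exactly the resource a finite automaton does not have.

```python
def is_anbn(s):
    """Return True if s consists of n a's followed by exactly n b's (n >= 0)."""
    count = 0  # unbounded counter: more memory than any fixed set of states
    i = 0
    while i < len(s) and s[i] == "a":
        count += 1
        i += 1
    while i < len(s) and s[i] == "b":
        count -= 1
        i += 1
    # Accept only if the whole string was consumed and the counts balance.
    return i == len(s) and count == 0


for word in ["", "ab", "aabb", "aab", "ba", "abab"]:
    print(repr(word), is_anbn(word))
```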
Equivalent formalisms

A regular language satisfies the following equivalent properties:

✓ it is the language of a regular expression (by the above definition)


✓ it is the language accepted by a nondeterministic finite automaton (NFA)
✓ it is the language accepted by a deterministic finite automaton (DFA)
✓ it can be generated by a regular grammar
✓ it is the language accepted by an alternating finite automaton
✓ it is the language accepted by a two-way finite automaton
✓ it can be generated by a prefix grammar
✓ it can be accepted by a read-only Turing machine
✓ it can be defined in monadic second-order logic (Büchi–Elgot–Trakhtenbrot theorem)
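
As a small illustration of the regular expression / DFA equivalence, the sketch below hand-builds a DFA for the language of `ab*` and compares it with Python's `re` module on every string over {a, b} up to length 5. The state names and the transition table are assumptions made for this example, and a bounded comparison is only a sanity check, not a proof.

```python
import re
from itertools import product

# Hypothetical DFA for the language of the regular expression ab*.
# Any missing transition leads to an implicit, non-accepting "dead" state.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_a",
}
ACCEPTING = {"seen_a"}


def dfa_accepts(word):
    """Run the DFA over word and report whether it ends in an accepting state."""
    state = "start"
    for symbol in word:
        state = TRANSITIONS.get((state, symbol), "dead")
    return state in ACCEPTING


def all_words(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


agree = all(
    dfa_accepts(w) == (re.fullmatch(r"ab*", w) is not None)
    for w in all_words("ab", 5)
)
print(agree)  # True
```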
What is a Regular Expression?

Regular expressions are used to represent regular languages. If a language cannot be represented by any regular expression, then that language is not regular.
What is a Regular Expression?

A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that defines a search pattern.

Usually, this pattern is used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique that developed in theoretical computer science and formal language theory.
Example: the "find" and "find and replace" operations in a word processor.
REGULAR EXPRESSIONS

• Regular expressions are the metalanguage used to define the token types of a programming language.
• Regular expressions consist of constants, which denote sets of strings, and operator symbols, which denote operations over these sets.

Example of Regular Expressions :

( a ( b + c )* )* d
Given a finite alphabet Σ, the following constants are defined as regular expressions:
• (empty set) ∅, denoting the set ∅.

• (empty string) ε, denoting the set containing only the "empty" string, which has no characters at all.

• (literal character) a in Σ, denoting the set containing only the character a.
Language Elements:

An alphabet  is a finite set of symbols (characters)

A string s is a finite sequence of symbols from 

• s denotes the length of string s

•  denotes the empty string, thus  = 0

A language is a specific set of strings over some fixed


alphabet 
Language Elements:

Example (assume the language alphabet is {0, 1}):

Then Σ = {0, 1}

L1 is {0, 10, 1011}
L2 is {ε, 0, 00, 000, 0000, 00000, ...}
A language is a set of strings

A string is a finite sequence of symbols taken from a finite alphabet Σ.
• The C language is the (infinite) set of all strings that constitute legal C programs.

• The language of C reserved words is the (finite) set of all alphabetic strings that cannot be used as identifiers in C programs.
REGULAR EXPRESSIONS

There are a number of algebraic laws obeyed by regular expressions, which can be used to manipulate regular expressions into equivalent forms.

• Regular expressions are formulas built from three possible operations on languages:

1. Union.
2. Concatenation.
3. Kleene star.
REGULAR EXPRESSIONS
Language Operators

(1) Union of two languages:

• L ∪ M = all strings that are in L or in M (or in both)

• Note: the union of two languages produces a third language

The union of two sets is that set which contains all the
elements in each of the two sets and nothing else.

The union operation on languages is designated with a ‘+’.

For example,

1. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}


2. L + ∅ = L
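
For finite languages, union can be tried out directly with Python sets of strings. This is only a sketch: the variable names are made up, and Python writes set union as `|` where the slides write `+`.

```python
# Finite languages modeled as Python sets of strings.
A = {"abc", "ab", "ba"}
B = {"ba", "bb"}
EMPTY_LANGUAGE = set()  # the language containing no strings at all

union = A | B
print(sorted(union))               # ['ab', 'abc', 'ba', 'bb']
print((A | EMPTY_LANGUAGE) == A)   # True: union with the empty language changes nothing
```

The shared string `ba` appears only once in the result, because a language is a set.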
REGULAR EXPRESSIONS
Language Operators

(2) Concatenation of two languages:

• L . M = all strings of the form xy such that x ∈ L and y ∈ M

• The dot operator is usually omitted, i.e., LM is the same as L.M

In order to define concatenation of languages, we must first define concatenation of strings.

This is simply the two strings written one after the other to form a new string.

For example,

abc . ba = abcba

Note that any string concatenated with the null string is that string itself:

s . ε = s.

The concatenation of two languages is the language formed by concatenating each string in one language with each string in the other language.


For example:
{ab, a, c} . {b, ε} = {ab.b, ab.ε, a.b, a.ε, c.b, c.ε}
= {abb, ab, a, cb, c}
In this example, the string ab is produced twice (as ab.ε and as a.b) but need not be listed twice.

Note that if L1 and L2 are two languages, then L1 . L2 is not necessarily equal to L2 . L1.

Also, L . {ε} = L, but L . ∅ = ∅.
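
The concatenation example can be reproduced with a short Python sketch for finite languages. The helper name `concat` is an assumption of this sketch, and the empty string `""` stands in for ε.

```python
def concat(L, M):
    """Concatenate every string of L with every string of M."""
    return {x + y for x in L for y in M}


EPSILON = ""  # the empty string plays the role of ε

L = {"ab", "a", "c"}
M = {"b", EPSILON}

print(sorted(concat(L, M)))                        # ['a', 'ab', 'abb', 'c', 'cb']
print(concat(L, {EPSILON}) == L)                   # True:  L . {ε} = L
print(concat(L, set()) == set())                   # True:  L . ∅ = ∅
print(concat({"a"}, {"b"}) == concat({"b"}, {"a"}))  # False: order matters
```

Because the result is a set, the string `ab` shows up once even though it is produced in two different ways.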


REGULAR EXPRESSIONS

(3) Kleene Closure (the * operator)

This is a unary operation (designated by a postfix asterisk) and is often called closure.

If L is a language, we define:

L⁰ = {ε}
L¹ = L
L² = L . L
L³ = L . L . L
Lⁿ = L . Lⁿ⁻¹
L* = L⁰ + L¹ + L² + L³ + L⁴ + L⁵ + ...
The exponent i refers to how many strings from the parent language L are concatenated to produce each string in the language Lⁱ.

(3) Kleene Closure (the * operator)

Kleene Closure of a given language L:

• L0= {}

• L1= {w | for some w  L}

• L2= { w1w2 | w1  L, w2  L (duplicates allowed)}

• Li= { w1w2…wi | all w’s chosen are  L (duplicates allowed)}

• (Note: the choice of each wi is independent)

• L* = Ui≥0 Li (arbitrary number of concatenations)


special notes

L* is an infinite set iff |L| ≥ 1 and L ≠ {ε}

If L = {ε}, then L* = {ε}
If L = ∅, then L* = {ε}

Σ* denotes the set of all words over an alphabet Σ
• Therefore, an abbreviated way of saying there is an arbitrary language L over an alphabet Σ is:
• L ⊆ Σ*
Precedence of Operators

Highest to lowest:

(*) The star operator has the highest precedence

(.) Concatenation has the second highest precedence

(+) The union operator has the lowest precedence of all

Example:
01* + 1 = ( 0 . ((1)*) ) + 1
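
Python's `re` module follows the same precedence order, so the example can be checked with a quick sketch. Two assumptions to keep in mind: Python writes union as `|` instead of `+`, and a handful of sample strings is a sanity check, not a proof of equivalence.

```python
import re

implicit = re.compile(r"01*|1")      # relies on operator precedence
explicit = re.compile(r"(0(1*))|1")  # fully parenthesized form from the example
wrong = re.compile(r"(01)*|1")       # what a different grouping would mean

samples = ["", "0", "1", "01", "0111", "11", "10", "010", "0101"]
for s in samples:
    same = (implicit.fullmatch(s) is None) == (explicit.fullmatch(s) is None)
    print(repr(s), implicit.fullmatch(s) is not None, same)

# The alternative grouping accepts different strings, e.g. "0101" and "".
print(wrong.fullmatch("0101") is not None, implicit.fullmatch("0101") is not None)
```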
REGULAR EXPRESSIONS
(3) Kleene Closure (the * operator)
Example:

Let L = {1, 00}

L⁰ = {ε}

L¹ = {1, 00}

L² = {11, 100, 001, 0000}

L³ = {111, 1100, 1001, 10000, 000000, 00001, 00100, 0011}
...

L* = L⁰ ∪ L¹ ∪ L² ∪ …
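
The powers in this example can be recomputed with a small Python sketch. The helper names `concat`, `power` and `star_up_to` are made up for the illustration. Since L* is infinite here, the sketch only builds the union of the powers up to a chosen bound.

```python
def concat(L, M):
    """Concatenate every string of L with every string of M."""
    return {x + y for x in L for y in M}


def power(L, i):
    """Return the i-th power of L, i.e. all concatenations of i strings from L."""
    result = {""}  # the 0-th power is {ε}
    for _ in range(i):
        result = concat(result, L)
    return result


def star_up_to(L, k):
    """Return the finite approximation of L*: the union of powers 0..k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result


L = {"1", "00"}
for i in range(4):
    print(i, sorted(power(L, i), key=lambda w: (len(w), w)))
print(len(star_up_to(L, 3)))
```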
Examples
Regular expression    Language
a | b                 L = {a, b}
(a | b)(a | b)        L = {aa, ab, ba, bb}
a*                    L = {ε, a, aa, aaa, ...}
(a | b)*              The set of all strings of a's and b's, including the empty string.
                      L = {ε, a, b, aa, ab, ba, bb, ...}
a | a*b               The set containing the string a and all strings consisting of zero or more a's followed by b.
                      L = {a, b, ab, aab, aaab, ...}
Examples

a|b* denotes {ε, "a", "b", "bb", "bbb", …}

(a|b)* denotes the set of all strings with no symbols other than "a"
and "b", including the empty string: {ε, "a", "b", "aa", "ab", "ba", "bb",
"aaa", …}

ab*(c|ε) denotes the set of strings starting with "a", then zero or
more "b"s and finally optionally a "c": {"a", "ac", "ab", "abc", "abb",
"abbc", …}

(0|(1(01*0)*1))* denotes the set of binary numbers that are multiples of 3: {ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", …}
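
The multiples-of-3 expression is easy to spot-check in Python, since this particular expression uses only syntax that `re` shares with the notation above. The sketch below compares the match result with ordinary arithmetic for the numbers 0 to 199; the bound and the variable names are arbitrary choices for the example.

```python
import re

MULTIPLE_OF_3 = re.compile(r"(0|(1(01*0)*1))*")

mismatches = []
for n in range(200):
    binary = format(n, "b")  # e.g. 6 -> "110"
    matched = MULTIPLE_OF_3.fullmatch(binary) is not None
    if matched != (n % 3 == 0):
        mismatches.append(n)

print(mismatches)  # [] : the pattern and n % 3 agree on this range
print(MULTIPLE_OF_3.fullmatch("0011") is not None)  # True: leading zeros are allowed
```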
Algebraic Laws of Regular Expressions

Commutative:
• E+F = F+E
Associative:
• (E+F)+G = E+(F+G)
• (EF)G = E(FG)
Identity:
• E + ∅ = E
• εE = Eε = E
Annihilator:
• ∅E = E∅ = ∅
Algebraic Laws of Regular Expressions

Distributive:
• E(F+G) = EF + EG
• (F+G)E = FE+GE
Idempotent:
• E + E = E
Involving Kleene closures:
• (E*)* = E*
• ∅* = ε
• ε* = ε
• E⁺ = EE*
• E? = ε + E
Summary
• Regular expressions are formulas built from three possible operations on languages: union, concatenation, and Kleene star.
• Union: the union of two sets is the set that contains all the elements of each of the two sets and nothing else. It is designated with a ‘+’.
• For example: {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Concatenation: concatenating each string in one set with each string in the other set. It is designated with a ‘.’.
• For example, {ab, a, c} . {b} = {ab.b, a.b, c.b} = {abb, ab, cb}
• Kleene star: generates zero or more concatenations of strings from the language to which it is applied. It is designated with a ‘*’.
• For example, a* = {ε, a, aa, aaa, aaaa, aaaaa, ...}
◼ Repetition operators ?, * and +
➢ ? (0 or 1)
◼ /colou?r/ ➔ color or colour
➢ * (0 or more)
◼ /oo*h!/ ➔ oh! or ooh! or ooooh!

➢ + (1 or more)
◼ /o+h!/ ➔ oh! or ooh! or ooooh!
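
These three operators can be tried out with Python's `re` module, as in the sketch below. Two caveats: in this practical regex syntax a postfix `+` means "one or more", not union (which Python writes as `|`), and matching is case-sensitive unless a flag says otherwise.

```python
import re

examples = [
    (r"colou?r", ["color", "colour", "colouur"]),
    (r"oo*h!", ["oh!", "ooh!", "ooooh!", "h!"]),
    (r"o+h!", ["oh!", "ooh!", "ooooh!", "h!"]),
]
for pattern, words in examples:
    for word in words:
        print(pattern, repr(word), re.fullmatch(pattern, word) is not None)

# Matching is case-sensitive by default.
print(re.fullmatch(r"o+h!", "Ooh!") is not None)                 # False
print(re.fullmatch(r"o+h!", "Ooh!", re.IGNORECASE) is not None)  # True
```

The patterns `oo*h!` and `o+h!` accept the same strings, which is the law E⁺ = EE* applied to E = o.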
Examples
• For each of the following regular expressions, list six strings that are in its language.

1. (a(b+c)*)*d

2. (a+b)*(c+d)

3. (a*b*)*
Regular Expression for a Language

Example:
Strings of a's and b's that start and end with a.

a | a (a | b)* a
(The first alternative covers the one-character string a, which also starts and ends with a.)
Example:
All strings of lowercase letters in which the letters are in ascending lexicographic order.

a* b* c* ... z*
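
The ascending-order expression is long to type by hand, so the sketch below builds it programmatically in Python and tries a few words. The function name `is_ascending` is made up for the example. Note that the expression allows repeated letters and also accepts the empty string.

```python
import re
from string import ascii_lowercase

# Builds the pattern a*b*c*...z* from the 26 lowercase letters.
ASCENDING = re.compile("".join(letter + "*" for letter in ascii_lowercase))


def is_ascending(word):
    """Return True if word is in the language of a*b*c*...z*."""
    return ASCENDING.fullmatch(word) is not None


for word in ["abc", "aabbcz", "", "ba", "abca"]:
    print(repr(word), is_ascending(word))
```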
Exercises
• Suppose L1 represents the set of all strings over the alphabet {0, 1} that contain an even number of ones (even parity). Which of the following strings belong to L1?
(a) 0101
(b) 110211
(c) 000
(d) 010011
(e) ε
Exercises
• Suppose L2 represents the set of all strings over the alphabet {a, b, c} that contain an equal number of a's, b's, and c's. Which of the following strings belong to L2?
(a) bca
(b) accbab
(c) ε
(d) aaa
(e) aabbcc
Exercises
• Which of the following strings belong to the
language specified by this regular expression:
(a+bb)*a
(a) ε
(b) aaa
(c) ba
(d) bba
(e) abba

TRUE OR FALSE?
Let R and S be two regular expressions.
Then:

1. ((R*)*)* = R* ?

2. (R+S)* = R* + S* ?

3. (RS + R)* RS = (RR*S)* ?


More Examples of Regular Expression

1. Regular expression for no 0 or many triples of 0's and many 1's in the strings.

2. Regular expression for strings of one or many 11's or no 11.

3. Regular expression for strings ending with abb.

4. Regular expression for all strings having 010 or 101.

5. Regular expression for even-length strings defined over {a, b}.

6. Regular expression for strings having at least one double 0 or double 1.

7. Regular expression for strings starting with 0 and having multiple even 1's or no 1.

8. Regular expression for an odd number of 0's or an odd number of 1's in the strings.

9. Regular expression for strings of multiple double 1's or null.

10. Regular expression for strings starting with 0 and ending with 1.
