
Specification of Tokens Using Regular Expressions

Regular expressions can be used to specify patterns for tokens. They are built recursively from smaller expressions using rules like:
- Symbols represent single-character languages
- Union (|) combines alternative languages
- Concatenation (.) combines sequential languages
- Kleene star (*) represents zero or more repetitions
Common operations include union, concatenation, and Kleene star. Precedence is * highest, then concatenation, then union. Regular expressions can define languages, and names can be given to subexpressions for readability.

Uploaded by

Sam Alex
Copyright
© All Rights Reserved

SPECIFICATION OF TOKENS USING REGULAR EXPRESSIONS
• Strings and Languages
– An alphabet is any finite set of symbols.
– Examples of symbols are letters, digits, and punctuation.
– The set {0,1} is the binary alphabet.
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
– "Sentence" and "word" are often used as synonyms for "string."
– The empty string, denoted ε, is the string of length zero.
• A language is any countable set of strings over some fixed alphabet.
– The empty set ∅ and {ε}, the set containing only the empty string, are languages.
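The definitions above can be made concrete with a small sketch. It assumes strings are ordinary Python `str` values and languages are Python sets; the helper names `is_string_over` and `strings_up_to` are hypothetical, chosen only for illustration.

```python
from itertools import product

BINARY = {"0", "1"}   # the binary alphabet {0,1}
EPSILON = ""          # the empty string, of length zero

def is_string_over(s, alphabet):
    """True if every symbol of s is drawn from the given alphabet."""
    return all(symbol in alphabet for symbol in s)

def strings_up_to(alphabet, max_len):
    """All strings over the alphabet with length <= max_len (a finite language)."""
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(sorted(alphabet), repeat=n)}

empty_language = set()      # the empty set is a language
only_epsilon = {EPSILON}    # {ε} is a different language: it contains one string
```

Note that `is_string_over("", BINARY)` holds, since the empty string is a string over every alphabet.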
REGULAR EXPRESSION
Regular expressions are an important notation for specifying lexeme patterns. They are effective in specifying the types of patterns that we need for tokens.

The regular expression notation for identifiers is

identifier = letter (letter | digit)*
– Regular expressions are built recursively out of smaller regular expressions, using the rules described below.
– Regular expression construction rules
• ε is a regular expression denoting {ε}, that is, the language containing only the empty string.
• If a is a symbol in Σ (the alphabet), then a is a regular expression denoting {a}, the language containing only the one-symbol string a.
• If r and s are regular expressions denoting languages L(r) and L(s) respectively, then
» (r)|(s) is a regular expression denoting L(r) ∪ L(s)
» (r).(s) is a regular expression denoting L(r).L(s)
» (r)* is a regular expression denoting (L(r))*
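The three language operations behind these rules can be sketched on finite sets of strings. This is only an illustration: languages are assumed to be Python sets, the function names are hypothetical, and because L* is usually infinite the `star` helper takes an assumed `max_len` bound and returns only the strings up to that length.

```python
def union(L, M):
    """L(r) ∪ L(s): strings that belong to either language."""
    return L | M

def concat(L, M):
    """L(r).L(s): every string of L followed by every string of M."""
    return {x + y for x in L for y in M}

def star(L, max_len):
    """The strings of L* whose length is at most max_len."""
    result = {""}        # zero repetitions always yields the empty string
    frontier = {""}
    while frontier:
        # Append one more string of L, keeping only new strings within the bound.
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result
```

For example, `star({"a"}, 3)` gives the empty string together with `a`, `aa` and `aaa`.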
• Precedence of operations
• The unary operator * has the highest precedence and is left associative.
• Concatenation has the second highest precedence and is left associative.
• | has the lowest precedence and is left associative.
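The effect of these precedence rules can be observed with Python's `re` module, which uses the same ordering for `*`, concatenation and `|`. The `matches` helper is a hypothetical convenience wrapper, not part of the slides.

```python
import re

def matches(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None

# * binds tighter than concatenation: ab* means a(b*), not (ab)*.
assert matches("ab*", "abbb")
assert not matches("ab*", "abab")
assert matches("(ab)*", "abab")

# Concatenation binds tighter than |: a|bc means a|(bc), not (a|b)c.
assert matches("a|bc", "a")
assert matches("a|bc", "bc")
assert not matches("a|bc", "ac")
assert matches("(a|b)c", "ac")
```

Parentheses are needed only when the intended grouping differs from the default precedence.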
• For any regular expressions R, S and T, the following axioms hold:
• R|S = S|R (| is commutative)
• R|(S|T) = (R|S)|T (| is associative)
• R(ST) = (RS)T (concatenation is associative)
• R(S|T) = RS|RT (concatenation distributes over |)
• εR = Rε = R (ε is the identity for concatenation)
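These axioms can be spot-checked by comparing the languages of both sides on all short strings. The sketch below assumes the alphabet {a, b}, a length bound of 4, and three arbitrarily chosen sample expressions; agreement on short strings is supporting evidence, not a proof. The empty group `()` stands in for ε.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """Strings over the alphabet, up to max_len, that the pattern fully matches."""
    candidates = ("".join(p)
                  for n in range(max_len + 1)
                  for p in product(alphabet, repeat=n))
    return {s for s in candidates if re.fullmatch(pattern, s)}

def same(p, q):
    return language(p) == language(q)

# Hypothetical sample expressions, parenthesised so precedence cannot interfere.
R, S, T = "(a*)", "(ab)", "(b|ba)"

checks = {
    "| is commutative": same(f"{R}|{S}", f"{S}|{R}"),
    "| is associative": same(f"{R}|({S}|{T})", f"({R}|{S})|{T}"),
    "concatenation is associative": same(f"{R}({S}{T})", f"({R}{S}){T}"),
    "concatenation distributes over |": same(f"{R}({S}|{T})", f"{R}{S}|{R}{T}"),
    "ε is the identity": same(f"(){R}", R) and same(f"{R}()", R),
}

for axiom, holds in checks.items():
    print(axiom, holds)
```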
• The regular expression a|b denotes the language {a, b}.
• (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of length two over the alphabet {a, b}.
• a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
• (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, {ε, a, b, aa, ab, ba, bb, aaa, ...}.
• a|a*b denotes the language {a, b, ab, aab, aaab, ...}.
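The example languages above can be listed mechanically. The sketch below assumes Python's `re` syntax (which agrees with the slide notation for these expressions) and a hypothetical helper `members` that enumerates strings over {a, b} up to an assumed length bound of 3, shortest first.

```python
import re
from itertools import product

def members(pattern, alphabet="ab", max_len=3):
    """Strings over the alphabet (shortest first) that the pattern fully matches."""
    found = []
    for n in range(max_len + 1):
        for p in product(alphabet, repeat=n):
            s = "".join(p)
            if re.fullmatch(pattern, s):
                found.append(s)
    return found

for pattern in ["a|b", "(a|b)(a|b)", "a*", "(a|b)*", "a|a*b"]:
    print(pattern, members(pattern))
```

With the bound of 3, `members("a|a*b")` yields `a`, `b`, `ab` and `aab`, the first four strings of the language listed above.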
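REGULAR DEFINITION
• We may wish to give names to certain regular expressions and use those names in subsequent expressions as if the names were themselves symbols.
• Each definition has the form d_i -> r_i, where d_i is a new name and r_i is a regular expression over the alphabet and the previously defined names.
• e.g. for the language of C identifiers:
• letter_ -> A|B|…|Z|a|b|…|z|_
• digit -> 0|1|…|9
• id -> letter_ (letter_|digit)*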
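A regular definition maps naturally onto named pattern strings that are substituted into later patterns. The sketch below assumes Python's `re` module and uses the character classes `[A-Za-z_]` and `[0-9]` as shorthand for the long alternations; the names `LETTER_`, `DIGIT`, `ID` and `is_identifier` are illustrative only.

```python
import re

# Each name is defined once and reused in later definitions.
LETTER_ = "[A-Za-z_]"                      # letter_ -> A|B|...|Z|a|b|...|z|_
DIGIT = "[0-9]"                            # digit   -> 0|1|...|9
ID = f"{LETTER_}({LETTER_}|{DIGIT})*"      # id      -> letter_ (letter_|digit)*

ID_RE = re.compile(ID)

def is_identifier(s):
    """True if the whole string s matches the id definition."""
    return ID_RE.fullmatch(s) is not None
```

Under this definition `is_identifier("_tmp1")` holds, while `is_identifier("1abc")` does not, because an identifier may not start with a digit.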
EXTENSIONS OF REGULAR EXPRESSIONS
• + one or more instances
• * zero or more instances
• ? zero or one instance
• [ ] character classes, e.g. [a-z]
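Each extension is shorthand for something the basic operators can already express: r+ is rr*, r? is r|ε, and a character class is a union of its symbols. The sketch below checks these equivalences on a handful of sample strings using Python's `re` module; `same_on` is a hypothetical helper and the sample list is arbitrary.

```python
import re

def same_on(p, q, samples):
    """True if patterns p and q accept exactly the same strings among samples."""
    return all((re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
               for s in samples)

samples = ["", "a", "aa", "aaa", "b", "ab", "c", "abc", "d"]

assert same_on("a+", "aa*", samples)         # one or more = one, then zero or more
assert same_on("a?", "(a|)", samples)        # zero or one = a or the empty string
assert same_on("[abc]", "a|b|c", samples)    # a class is a union of single symbols
assert same_on("[a-c]", "[abc]", samples)    # a range abbreviates the listed symbols
```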

• ws -> (blank | tab | newline)+
• When ws is recognized, we do not return it to the parser; instead, the lexical analyzer restarts from the character following the whitespace.
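The whitespace rule can be sketched as a small scanning loop. This is a minimal illustration, not a full lexical analyzer: the `id` and `number` patterns and the `tokenize` function are assumptions added for the example; only the `ws` pattern comes from the slide.

```python
import re

WS = re.compile(r"[ \t\n]+")                 # ws -> (blank | tab | newline)+
TOKEN_PATTERNS = [
    ("id", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("number", re.compile(r"[0-9]+")),       # hypothetical second token class
]

def tokenize(text):
    """Return (token name, lexeme) pairs; whitespace produces no token."""
    tokens = []
    pos = 0
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:
            pos = ws.end()                   # restart just after the whitespace
            continue
        for name, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens
```

For the input `"count \t 42\nx1"` this yields an `id`, a `number` and another `id`, with no entries for the blanks, tab or newline between them.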
