Conversion DFA To Regular Expression
Background
Kleene's seminal article defines regular expressions and their relationship to finite automata [7]. Kleene proves the equivalence of finite automata and regular expressions, thereby providing the first technique, the transitive closure method, for converting DFAs to regular expressions. Later, Brzozowski expanded on Kleene's method by introducing the notion of derivatives of regular expressions [3], but his paper passed into obscurity until G. Berry and R. Sethi brought it to the forefront in [2] (see footnote 1). The state removal method appears in [4], but Linz presents a more straightforward method in [8].
Definitions
We will be using the Moore model [9] for finite automata. Given an automaton $M$ with an input alphabet $A_k = \{0, 1, \ldots, k-1\}$, $M$ has $m$ states $M_s = \{q_1, q_2, \ldots, q_m\}$, where $q_s$ is the starting state of $M$ and $M_f = \{q_{f_1}, q_{f_2}, \ldots, q_{f_n}\}$ are the $n$ final states of $M$, with $n \le m$. For clarity, we will use letters to represent each value in the alphabet $A_k$ instead of the numeric representation ($a = 0$, $b = 1$, etc.), and for convenience we will assume $q_1$ is the starting state unless otherwise noted.

We will use Kleene's definition [7] of regular expressions. Regular expressions are defined recursively as:

1. The symbols $0, 1, \ldots, k-1$, $\varepsilon$, and $\emptyset$ are regular expressions, where $\varepsilon$ is the empty string and $\emptyset$ is the empty set.
2. Set Union: Given the regular expressions $x$ and $y$, the union of $x$ and $y$, expressed as $x + y$, is a regular expression.
3. Concatenation: Given the regular expressions $x$ and $y$, the concatenation (or product) of $x$ and $y$, expressed as $xy$, is a regular expression.
4. Iteration: Given the regular expression $x$, the iteration (or star) of $x$, expressed as $x^{*}$, is a regular expression.
5. Given a regular expression $x$, $(x)$ is a regular expression.
6. All regular expressions can be constructed by a finite application of rules 1-5.

The regular expressions in rule 1 are terminals. The concatenation of terminals is a string. For a given regular expression $x$, there is a set $X$ which contains all the strings represented by $x$. We will use $x$ and $X$ interchangeably. A single string is simply represented by the set containing that single string.

We will also refer to and use the following identities:

$$
\begin{aligned}
(ab)c &= a(bc) = abc \\
\varepsilon x &= x \varepsilon = x \\
\emptyset x &= x \emptyset = \emptyset \\
\emptyset + x &= x \\
x + x &= x \\
(\varepsilon + x)^{*} &= x^{*}
\end{aligned}
$$

Since $+$ is commutative, all the commutative versions apply.

Footnote 1: However, efficiently converting regular expressions to automata is the focus of the G. Berry and R. Sethi paper, not converting DFAs to regular expressions.
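The recursive definition and the identities above can be sketched in Python as a small set of "smart constructors" that simplify while building an expression. This is only an illustration: the tuple encoding and the names `sym`, `union`, `concat`, `star`, and `show` are assumptions made here, not notation from the paper, and the rules marked "standard" in the comments are well-known identities that the text does not list.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch).
EMPTY = ("empty",)   # the empty set
EPS = ("eps",)       # the empty string

def sym(a):
    """Rule 1: a single alphabet symbol is a regular expression."""
    return ("sym", a)

def union(x, y):
    """Rule 2, simplified with: empty + x = x and x + x = x."""
    if x == EMPTY:
        return y
    if y == EMPTY:
        return x
    if x == y:
        return x
    return ("union", x, y)

def concat(x, y):
    """Rule 3, simplified with: empty.x = x.empty = empty and eps.x = x.eps = x."""
    if x == EMPTY or y == EMPTY:
        return EMPTY
    if x == EPS:
        return y
    if y == EPS:
        return x
    return ("concat", x, y)

def star(x):
    """Rule 4, simplified with (eps + x)* = x*."""
    if x in (EMPTY, EPS):
        return EPS                      # standard: empty* = eps* = eps
    if x[0] == "union" and x[1] == EPS:
        return star(x[2])
    if x[0] == "union" and x[2] == EPS:
        return star(x[1])
    if x[0] == "star":
        return x                        # standard: (x*)* = x*
    return ("star", x)

def show(x):
    """Render an expression in the paper's notation, adding parentheses (rule 5)."""
    tag = x[0]
    if tag == "empty":
        return "∅"
    if tag == "eps":
        return "ε"
    if tag == "sym":
        return x[1]
    if tag == "union":
        return show(x[1]) + "+" + show(x[2])
    if tag == "concat":
        return "".join("(" + show(p) + ")" if p[0] == "union" else show(p)
                       for p in x[1:])
    inner = show(x[1])                  # tag == "star"
    return inner + "*" if x[1][0] == "sym" else "(" + inner + ")*"
```

For instance, `show(star(union(EPS, sym("a"))))` applies the last identity while the expression is being built, so the union with the empty string never reaches the output.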
[Figure: automaton with states q1, q2, q3, and q4]
[...] $q_s$ to $q_f$, there exists only one regular expression $R$ such that $R$ represents the same strings as the DFA $M$. However, this is a trivial automaton; let us examine how to expand this to a more general case.
[Figure: automaton with states q1 and q2]
As we can see, this successive construction builds up regular expressions until we have $R_{ij}$. We can then construct a regular expression representing $M$ as the union of all $R_{sf}$, where $q_s$ is the starting state and $q_f \in M_f$ (the final states of $M$). This technique is similar in nature to the all-pairs shortest path problem; the only difference is that we take the union and concatenation of regular expressions instead of summing distances. This solution is of the same form as transitive closure and belongs to the constellation of problems associated with closed semirings.

The chief problem of the transitive closure approach is that it creates very large regular expressions. Examining the formula for an $R_{ij}^{k}$, it is clear that the significant length is due to the repeated union of concatenated terms. Even when using the previous identities, we still have long expressions.
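The successive construction can be sketched in Python as follows. The sketch assumes the standard formulation of Kleene's recurrence, R^k_ij = R^(k-1)_ij + R^(k-1)_ik (R^(k-1)_kk)* R^(k-1)_kj, which is not taken from this text. The function name `kleene`, the 0-based state numbering, the dictionary encoding of transitions, and the string helpers are all assumptions made for illustration. Symbols are assumed to be single characters other than `(`, `)`, `+`, `*`, and `ε`.

```python
EPS = "ε"   # the empty string; None stands for the empty set

def _is_union(r):
    """True if r contains a '+' outside all parentheses."""
    depth = 0
    for ch in r:
        depth += (ch == "(") - (ch == ")")
        if ch == "+" and depth == 0:
            return True
    return False

def union(x, y):
    if x is None:
        return y                                  # empty + x = x
    if y is None:
        return x
    return x if x == y else x + "+" + y           # x + x = x

def concat(x, y):
    if x is None or y is None:
        return None                               # empty.x = x.empty = empty
    if x == EPS:
        return y                                  # eps.x = x
    if y == EPS:
        return x                                  # x.eps = x
    wrap = lambda r: "(" + r + ")" if _is_union(r) else r
    return wrap(x) + wrap(y)

def star(x):
    if x is None or x == EPS:
        return EPS                                # standard: empty* = eps* = eps
    if x.startswith(EPS + "+"):
        return star(x[2:])                        # (eps + x)* = x*
    return x + "*" if len(x) == 1 else "(" + x + ")*"

def kleene(m, trans, start, finals):
    """Transitive closure method for a DFA with states 0..m-1.

    trans maps (state, symbol) -> state.
    """
    # Level 0: direct transitions, plus the empty string on the diagonal.
    R = [[(EPS if i == j else None) for j in range(m)] for i in range(m)]
    for (i, a), j in sorted(trans.items()):
        R[i][j] = union(R[i][j], a)
    # Level k+1 additionally allows paths through state k.
    for k in range(m):
        loop = star(R[k][k])
        R = [[union(R[i][j], concat(concat(R[i][k], loop), R[k][j]))
              for j in range(m)] for i in range(m)]
    # Union over all final states.
    result = None
    for f in sorted(finals):
        result = union(result, R[start][f])
    return result
```

Even on the two-state automaton described later by the equations R1 = aR1 + bR2 and R2 = bR2 + ε, encoded as `kleene(2, {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})`, the result is a correct but noticeably longer expression than a*bb*, which illustrates the length problem described above. The triple loop has the same shape as the Floyd-Warshall all-pairs shortest path algorithm.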
The state removal approach identifies patterns within the graph and removes states, building up regular expressions along each transition. The advantage of this technique over the transitive closure method is that it is easier to visualize. This technique is described by Du and Ko [4], but we examine a much simpler approach given by Linz [8].

First, any multi-edges are unified into a single edge that contains the union of inputs. Suppose from $q_2$ to $q_5$ there is an edge $a$ and an edge $b$; those would be unified into one edge from $q_2$ to $q_5$ that has the value $a + b$. Now, consider a subgraph of the automaton $M$ which is of the form given in Figure 3. State $q$ may be removed and the automaton reduced to the form in Figure 4. The pattern may still be applied if edges are missing: for an edge that is missing, leave out the corresponding edge in Figure 4. This process repeats until the automaton is of the form in Figure 5. Then, by direct calculation, the regular expression is:

$$r = r_1^{*} r_2 (r_4 + r_3 r_1^{*} r_2)^{*}$$
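Figure 3: Desired pattern for state removal. [States $q_i$, $q$, $q_j$; edge labels $a$, $b$, $c$, $d$, $e$.]

Figure 4: Results after state removal. [States $q_i$, $q_j$; edge labels $ae^{*}d$, $ae^{*}b$, $ce^{*}d$, $ce^{*}b$.]

Figure 5: Final form. [States $q_1$, $q_f$; edge labels $r_1$, $r_2$, $r_3$, $r_4$.]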
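A minimal Python sketch of state removal follows. It uses a common variant rather than the exact procedure above: it adds a fresh start state and a fresh final state connected by ε-edges, then removes every original state, so the label left on the single remaining edge is the answer and the two-state formula of Figure 5 is not needed. The function name `state_removal`, the 0-based state numbering, and the string helpers are assumptions made for illustration. Symbols are assumed to be single characters other than `(`, `)`, `+`, `*`, and `ε`.

```python
EPS = "ε"   # the empty string; None stands for the empty set

def _is_union(r):
    """True if r contains a '+' outside all parentheses."""
    depth = 0
    for ch in r:
        depth += (ch == "(") - (ch == ")")
        if ch == "+" and depth == 0:
            return True
    return False

def union(x, y):
    if x is None:
        return y
    if y is None:
        return x
    return x if x == y else x + "+" + y

def concat(x, y):
    if x is None or y is None:
        return None
    if x == EPS:
        return y
    if y == EPS:
        return x
    wrap = lambda r: "(" + r + ")" if _is_union(r) else r
    return wrap(x) + wrap(y)

def star(x):
    if x is None or x == EPS:
        return EPS
    if x.startswith(EPS + "+"):
        return star(x[2:])
    return x + "*" if len(x) == 1 else "(" + x + ")*"

def state_removal(m, trans, start, finals):
    """State removal for a DFA with states 0..m-1; trans maps (state, symbol) -> state."""
    S, F = m, m + 1                    # fresh start and final states (assumed variant)
    E = {}                             # E[(i, j)] is the expression on edge i -> j

    def add(i, j, r):
        E[(i, j)] = union(E.get((i, j)), r)   # multi-edges are unified by union

    for (i, a), j in sorted(trans.items()):
        add(i, j, a)
    add(S, start, EPS)
    for f in sorted(finals):
        add(f, F, EPS)

    for q in range(m):                 # remove the original states one at a time
        loop = star(E.pop((q, q), None))
        ins = [(i, r) for (i, j), r in E.items() if j == q]
        outs = [(j, r) for (i, j), r in E.items() if i == q]
        for i, _ in ins:
            del E[(i, q)]
        for j, _ in outs:
            del E[(q, j)]
        # Each path i -> q -> j becomes one edge labelled in.loop*.out,
        # as in the step from Figure 3 to Figure 4.
        for i, rin in ins:
            for j, rout in outs:
                add(i, j, concat(concat(rin, loop), rout))
    return E.get((S, F))
```

The order in which states are removed affects the size of the result but not the language it denotes; this sketch simply removes them in numerical order.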
Brzozowski's method [3] (see footnote 2) takes a unique approach to generating regular expressions. We create a system of regular expressions with one regular expression unknown for each state in $M$, and then we solve the system for $R_s$, where $R_s$ is the regular expression associated with the starting state $q_s$. These equations are the characteristic equations of $M$.

Constructing the characteristic equations is straightforward. For each state $q_i$ in $M$, the equation for $R_i$ is a union of terms. Each term is constructed as follows: for a transition $a$ from $q_i$ to $q_j$, the term is $aR_j$. If $q_i$ is a final state, $\varepsilon$ is also one of the terms. This leads to a system of equations of the form:

$$
\begin{aligned}
R_1 &= a_{11} R_1 + a_{12} R_2 + \cdots + a_{1m} R_m \\
R_2 &= a_{21} R_1 + a_{22} R_2 + \cdots + a_{2m} R_m \\
&\;\;\vdots \\
R_m &= a_{m1} R_1 + a_{m2} R_2 + \cdots + a_{mm} R_m
\end{aligned}
$$

where $a_{ij} = \emptyset$ if there is no transition from $q_i$ to $q_j$, and the term $\varepsilon$ is added to the equation of each final state.

The system can be solved via straightforward substitution, except when an unknown appears on both the right and left hand side of an equation. This situation occurs when there is a self loop on state $q_i$. Arden's theorem [1] is the key to solving these situations. The theorem is as follows: given an equation of the form $X = AX + B$ where $\varepsilon \notin A$, the equation has the solution $X = A^{*}B$. We use this to isolate $R_i$ on the left hand side and successively substitute $R_i$ into the other equations. We repeat the process until we have found $R_s$ with no unknowns on the right hand side.

For example, consider again the automaton in Figure 2. The characteristic equations are as follows (where $R_s = R_1$):

$$
\begin{aligned}
R_1 &= aR_1 + bR_2 \\
R_2 &= bR_2 + \varepsilon
\end{aligned}
$$

We solve for $R_2$ using Arden's theorem and the previously mentioned identities:

$$R_2 = bR_2 + \varepsilon = b^{*}\varepsilon = b^{*}$$

We substitute into $R_1$ and solve:

$$R_1 = aR_1 + b(b^{*}) = aR_1 + bb^{*} = a^{*}(bb^{*}) = a^{*}bb^{*}$$

Thus, the regular expression for the automaton in Figure 2 is $a^{*}bb^{*}$.
Footnote 2: Kain [6] explains this method in more detail and gives illustrative examples.
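The algebraic method can be sketched in Python as Gaussian-style elimination over the characteristic equations, applying Arden's theorem whenever an unknown appears on both sides. The function name `brzozowski`, the coefficient matrix `A`, the constant vector `B`, and the string helpers are assumptions made for illustration. States are numbered from 0, with state 0 playing the role of the starting state q1, and symbols are assumed to be single characters other than `(`, `)`, `+`, `*`, and `ε`.

```python
EPS = "ε"   # the empty string; None stands for the empty set

def _is_union(r):
    """True if r contains a '+' outside all parentheses."""
    depth = 0
    for ch in r:
        depth += (ch == "(") - (ch == ")")
        if ch == "+" and depth == 0:
            return True
    return False

def union(x, y):
    if x is None:
        return y
    if y is None:
        return x
    return x if x == y else x + "+" + y

def concat(x, y):
    if x is None or y is None:
        return None
    if x == EPS:
        return y
    if y == EPS:
        return x
    wrap = lambda r: "(" + r + ")" if _is_union(r) else r
    return wrap(x) + wrap(y)

def star(x):
    if x is None or x == EPS:
        return EPS
    if x.startswith(EPS + "+"):
        return star(x[2:])
    return x + "*" if len(x) == 1 else "(" + x + ")*"

def brzozowski(m, trans, finals):
    """Solve the characteristic equations of a DFA with states 0..m-1.

    trans maps (state, symbol) -> state; state 0 is assumed to be the start.
    Equation i reads: R_i = A[i][0] R_0 + ... + A[i][m-1] R_{m-1} + B[i].
    """
    A = [[None] * m for _ in range(m)]
    B = [(EPS if i in finals else None) for i in range(m)]
    for (i, a), j in sorted(trans.items()):
        A[i][j] = union(A[i][j], a)

    # Eliminate the unknowns R_{m-1}, ..., R_1, then solve for R_0.
    for n in range(m - 1, -1, -1):
        # Arden's theorem: R_n = A_nn R_n + rest  has the solution  R_n = A_nn* rest.
        loop = star(A[n][n])
        B[n] = concat(loop, B[n])
        for j in range(n):
            A[n][j] = concat(loop, A[n][j])
        # Substitute R_n into the equations that are still unsolved.
        for i in range(n):
            if A[i][n] is None:
                continue
            B[i] = union(B[i], concat(A[i][n], B[n]))
            for j in range(n):
                A[i][j] = union(A[i][j], concat(A[i][n], A[n][j]))
    return B[0]
```

On the worked example above, encoded as `brzozowski(2, {(0, "a"): 0, (0, "b"): 1, (1, "b"): 1}, {1})`, the elimination follows the same two steps as the derivation: it first solves R2 with Arden's theorem, then substitutes into R1 and solves again.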
Conclusions
The state removal approach seems useful for determining regular expressions via manual inspection, but is not as straightforward to implement as the transitive closure approach and the algebraic approach. The transitive closure approach gives a clear and simple implementation, but tends to create very long regular expressions. The algebraic approach is elegant, leans toward a recursive approach, and generates reasonably compact regular expressions. Brzozowski's method is particularly suited for recursion-oriented languages, such as functional languages, where the transitive closure approach would be cumbersome to implement.
References
[1] Dean N. Arden. Delayed-logic and finite-state machines. In Theory of Computing Machine Design, pages 1-35. U. of Michigan Press, Ann Arbor, MI, 1960.
[2] G. Berry and R. Sethi. From regular expressions to deterministic automata. TCS: Theoretical Computer Science, 48:117-126, 1987.
[3] Janusz A. Brzozowski. Derivatives of regular expressions. J. ACM, 11(4):481-494, 1964.
[4] Ding-Zhu Du and Ker-I Ko. Problem Solving in Automata, Languages, and Complexity. John Wiley & Sons, New York, NY, 2001.
[5] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, 1979.
[6] Richard Y. Kain. Automata Theory: Machines and Languages. Robert E. Krieger Publishing Company, Malabar, FL, 1981.
[7] S. C. Kleene. Representation of events in nerve nets and finite automata. In Automata Studies, pages 3-40. Ann. of Math. Studies No. 34, Princeton University Press, Princeton, NJ, 1956.
[8] Peter Linz. An Introduction to Formal Languages and Automata. Jones and Bartlett Publishers, Sudbury, MA, third edition, 2001.
[9] E. F. Moore. Gedanken experiments on sequential machines. In Automata Studies, pages 129-153. Ann. of Math. Studies No. 34, Princeton University Press, Princeton, NJ, 1956.
[10] Arto Salomaa. Jewels of Formal Language Theory. Computer Science Press, Rockville, MD, 1984.