LR (K) Parser Construction Using Bottom-Up Formal Analysis: Nazir Ahmad Zafar
ABSTRACT
Design and construction of an error-free compiler is a difficult and challenging process. The main function of a compiler is to translate source code into executable machine code correctly and efficiently. In formal verification of software, the semantics of a language carries more meaning than its syntax. Consequently, verification of a source program does not guarantee that the generated code is correct, because bugs in the compiler itself may lead to an incorrect target program. It means verification of a compiler is much more important than verification of a source program. In this paper, we present a new approach that links context-free grammar and Z notation to construct an LR(K) parser. This has several advantages because correctness of the compiler depends on describing rules that must be written in formal languages. First, the grammar is defined; then the language derivation procedure is given using right-most derivations. Verification of a given language is done by recursive procedures based on its words. Ambiguity of a language is checked and verified. The specification is analyzed and validated using the Z/Eves tool. Formal proofs are presented using the powerful reduction and rewriting techniques available in Z/Eves.
Keywords: Compiler Construction; LR(K) Parser; Context-Free Grammar; Z Specification; Correctness; Verification
1. Introduction
A compiler is a program that translates source code into equivalent machine readable code. The translation process is termed compilation, and its result can be used to execute what was specified in the original source code. The source language is at a higher level than machine code. Higher level languages not only increase the abstraction level between source and resulting code but also increase the complexity of formalizing such abstract structures. The target language is normally a low level language generated from the source code. Compiler construction has always been considered a more advanced research area than other programming practices, mainly due to the size and complexity of the code generated. The design and construction of a fully verified compiler will remain a challenge of the twenty-first century. As mentioned above, the main function of a compiler is to translate source code written by programmers into executable machine code correctly and efficiently. Although a lot of work has been done and compiler construction is a mature area of research, it still needs further investigation. This is because bugs in the compiler can lead to incorrect machine code even if the source code is fully verified to be correct. Further, when the generated executable code is tested and bugs are detected, they might be due to the source program or to the compiler itself. This issue has led to verification of a compiler that proves that a source program is correct before allowing it to run on the machine.

Formal methods are mathematics-based techniques used for specification, proving and verification of software and hardware systems [1]. Formal verification means applying these approaches to verify the properties that ensure correctness of a system. Formal verification of software targets the source program, where the semantics of the language gives precise meaning to the program analyzed. On the other hand, program verification does not mean that the resultant executable code is correct as specified by the semantics of the source program. This is because bugs in the compiler may lead to an incorrect target program and can invalidate the guarantees ensured by the formal methods. This suggests that verification of a compiler is much more important than verification of a source program to be compiled.

A parser, or syntactic analyzer, is an important part of a compiler. Parsing is the process of analyzing a sequence of tokens generated by the lexical analyzer to determine its grammatical structure with respect to a given grammar. More precisely, the task of a parser is to determine how an input string can be derived from the start symbol of the grammar using a set of rules called the productions of the grammar. There are two main approaches to parsing, i.e., top-down and bottom-up parsing. In top-down parsing, left-most derivations are used to accept an input stream and tokens are consumed from left to right. In bottom-up parsing, a right-most derivation is used (traced in reverse) to accept an input stream, and tokens are again consumed from left to right. LR(k) and shift-reduce parsers are examples of bottom-up parsers.

In the previous work [2], formal verification of top-down parsing was done. A few other preliminary results of this research were presented in [3,4] by formalizing some important concepts of context-free grammar. In this paper, the bottom-up parsing analysis of a sequence of tokens generated by the lexical analyzer is presented using Z by right-most derivations. Ambiguity of the language is checked and its well-definedness is verified. Initially, a formal definition of context-free grammar is given. Next, a right-most derivation procedure is described by replacing non-terminals with terminals and non-terminals using the bottom-up approach. Then the LR(1) parser is described, first for a word and then extended to a language. The derivation procedure is defined to analyze a sequence of tokens using production rules of the context-free grammar. The parsing analysis for a language is specified by introducing recursion over the derivations used in generating a word. Ambiguity of a word is checked by specifying whether there exist two or more distinct right-most derivation trees for the given word. The same notion is formalized for the language to check whether it is ambiguous or well-defined. The formal specification is analyzed and validated using the Z/Eves tool set. The results of this paper will be used in our ongoing project on construction and verification of a compiler.

Copyright 2012 SciRes. JSEA
The major objectives of this research are:
Linking context-free grammar and formal techniques to be useful in the verification of a compiler;
Preparing a synthesis of approaches to be used in the development of automated tools;
Identifying and proposing an integration of existing traditional and formal approaches;
Establishing a syntactically and semantically verified relationship between Z and context-free grammar.

At the current stage of development of formal methods, it is not possible to develop a complete and consistent software system using a single formal technique, and hence integration of approaches is required. Although integration of approaches is a well-researched area [5-11], there does not exist much work on formalization of automata and context-free languages. Dong et al. have described an integration of timed automata and Object-Z [12,13]. Constable has proposed a formalization of a few important concepts of automata theory using Nuprl, which is a formal language [14,15]. A formal linkage between Petri nets and Z notation is investigated in [16]. An integration of B, a formal technique, and UML, a semi-formal technique, is presented in [17,18]. Wechler has introduced a few algebraic structures using fuzzy automata [19]. A formal treatment of fuzzy automata and language theory is discussed in [20]. In [21], an important notion of algebraic theory and automata theory is presented.

The rest of the paper is organized as follows. In Section 2, an introduction to formal methods is given. In Section 3, the role of context-free grammar in parsers for compiler construction is provided. Formal construction of LR(K), for K = 1, is given in Section 4. Model analysis for validating the specification is given in Section 5. Finally, conclusion and future work are discussed in Section 6.
2. Formal Methods
Formal methods are mathematical techniques and notations used for describing and analyzing properties of software and hardware systems. These techniques are based on discrete mathematics such as sets, sequences, relations, functions, graphs, automata, first order logic and higher order logic. Formal approaches may be classified mainly into property oriented and model oriented methods. Property oriented formal methods describe software in terms of properties and invariants that must hold. Model oriented formal methods construct a model of a system focusing on both the statics and the dynamics of the system [22]. Although formal methods are used in almost all major areas of computer science, their main use is to improve quality by describing and specifying software systems in a well-defined and structured manner. Although there are various notations of formal methods, at the current stage of their development an integration of formal and existing traditional approaches is needed for a consistent design and complete description of a system.

Z notation is a specification language used at an abstract level of modeling systems. Z is a model centered approach based on sets, sequences, bags, relations and first order predicate logic [23]. Usually, Z is used for specifying the behavior of sequential programs by means of abstract data types. Z is selected for this research to be linked with context-free language because both have abstract power of expressing systems. Z has the standard set operators, for example, union, intersection, comprehensions, Cartesian products and power sets. The logic of Z is formulated using first order predicate calculus and refinements. Z allows organizing a system into smaller components using a powerful structure named schema. The schema defines a way in which the state of a system can be specified, refined and modified.
Mathematical refinement is a promising aspect of Z supporting verifiable stepwise transformation of an abstract specification into executable code [24]. Once a formal specification is written in Z, it can further be refined and transformed into an implemented system.
nomial time algorithm. Context-free languages have their own limitations as well. For example, some operators which are well-defined in other models of automata theory do not behave well for context-free grammars. As an example, the intersection of two context-free languages is not context-free in general. Similarly, the complement of a context-free language may not be context-free.

There are various applications of context-free grammar in addition to compilers. Robotics, software engineering and maintenance, and speech recognition are a few of its application areas [26]. Applications of context-free grammar in pattern recognition increase the accuracy of the patterns to be recognized. This is because it can provide a higher level of abstraction by defining semantic rules for patterns as compared to other specification techniques, for example, strings and regular expressions. This abstract level semantic analysis can be used to reduce false identification of patterns [27]. Applications of pattern recognition can be observed everywhere from language processing to computer networks. In speech recognition, the spoken words can be generated by a CFG using dynamic programming algorithms. In software engineering, the components in a source code are recognized using context-free grammar [28]. Since the output of parsing is larger and less ambiguous, the use of CFG can be highly effective for interactive voice response systems [29,30].
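The non-closure of context-free languages under intersection mentioned above can be illustrated with a standard textbook example. The languages L_1 and L_2 below are illustrative choices, not taken from this paper.

```latex
\begin{align*}
L_1 &= \{\, a^n b^n c^m \mid n, m \ge 1 \,\} && \text{(context-free)}\\
L_2 &= \{\, a^m b^n c^n \mid n, m \ge 1 \,\} && \text{(context-free)}\\
L_1 \cap L_2 &= \{\, a^n b^n c^n \mid n \ge 1 \,\} && \text{(not context-free, by the pumping lemma)}
\end{align*}
```

Each of L_1 and L_2 needs only one counter to match a single pair of blocks, whereas the intersection requires matching all three blocks at once, which a single stack cannot do.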
always a single non-terminal. Since every rule has a single non-terminal on the left hand side, that non-terminal can always be replaced with the string on the right hand side of the production rule. The context in which the symbols occur is not important and, hence, the grammar is called context-free grammar. A context-free language is always recognized by a finite state machine equipped with a single infinite tape used as a stack, i.e., a pushdown automaton. The current state is pushed at the start and is recovered at the end for keeping track of the nested units. In the formal analysis, CFG is represented using Z notation as the 4-tuple defined above. Mathematically, R in the definition of CFG is a relation from N to $(N \cup T)^*$ such that $\exists\, t \in (N \cup T)^*$ with $S \in N$ and $(S, t) \in R$. The notation * denotes any finite combination of symbols of N and T. In the specification of CFG, X is defined as a set of symbols which is a collection of terminals and non-terminals. We define the set of non-terminals by N and the set of terminals by T based on the definition of X. X, N and T are defined as sets at an abstract level of specification, over which operators cannot be defined.
[X]; T == X; N == X
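As a small worked instance of the 4-tuple and of a right-most derivation, consider the following grammar. The grammar itself and the symbols A and B are hypothetical and chosen only for illustration.

```latex
\begin{align*}
G &= (N, T, R, S), \qquad N = \{S, A, B\}, \qquad T = \{a, b\},\\
R &= \{\, (S, AB),\ (A, a),\ (B, b) \,\},\\
S &\Rightarrow AB \Rightarrow Ab \Rightarrow ab
  && \text{(the right-most non-terminal is replaced at each step)}
\end{align*}
```

A bottom-up parser reading ab from left to right would discover these steps in reverse order: it reduces a to A, then b to B, and finally AB to S.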
4.1. Invariants
1) The start variable is an element of the non-terminals.
2) The sets of terminals and non-terminals are nonempty.
3) There does not exist any element which is common to both the set of terminals and the set of non-terminals.
4) Each element in the sets of terminals and non-terminals is an element of the set of symbols.
5) Each element in the set of symbols is an element of the set of terminals or of the set of non-terminals.
6) The domain of the production relation is a subset of the non-terminals.
7) Each element in the range of the production relation is a sequence over the set of symbols.
8) There exists at least one production rule which contains the start variable on its left hand side.
The formal definition of context-free grammar is given below and is represented by the schema Grammer. The schema consists of five components, i.e., terminals, nonterminals, symbols, productions and inistate, representing the set of terminals, the set of non-terminals, the set of all symbols of the grammar, the set of productions and the start variable, respectively. The set of terminals is of type power set of T, the set of non-terminals is of type power set of N, and the set of symbols is of type power set of X. The productions are a set of rules defined by a relation between N and seq X. The start variable is of type N. In the schema, it is described that there exists at least one rule (S0, t) in productions, where S0 is the start non-terminal and t is a string of type seq X. The components of the grammar are given in the first part of the schema and the invariants are defined in the second part.
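The five components and the invariants described above can be mirrored in ordinary code. The following is a minimal Python sketch, not the Z schema itself: the class name `Grammar`, the method `invariants` and the example grammar `g` are assumptions made for illustration, and Python sets stand in for the Z power-set types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """Illustrative stand-in for the schema's five components."""
    terminals: frozenset
    nonterminals: frozenset
    symbols: frozenset
    productions: frozenset  # set of (lhs, rhs) pairs; rhs is a tuple of symbols
    inistate: str

    def __post_init__(self):
        # Reject any instance that breaks one of the stated invariants.
        violated = [name for name, ok in self.invariants().items() if not ok]
        if violated:
            raise ValueError("violated invariants: " + ", ".join(violated))

    def invariants(self):
        lhs = {left for left, _ in self.productions}
        return {
            "start variable is a non-terminal": self.inistate in self.nonterminals,
            "terminals and non-terminals are nonempty":
                bool(self.terminals) and bool(self.nonterminals),
            "terminals and non-terminals are disjoint":
                self.terminals.isdisjoint(self.nonterminals),
            "symbols are exactly terminals plus non-terminals":
                self.symbols == self.terminals | self.nonterminals,
            "domain of productions is within non-terminals":
                lhs <= self.nonterminals,
            "right hand sides are sequences of symbols":
                all(set(rhs) <= self.symbols for _, rhs in self.productions),
            "start variable has at least one rule": self.inistate in lhs,
        }


# Hypothetical example grammar: S -> a S b | a b
g = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    symbols=frozenset({"a", "b", "S"}),
    productions=frozenset({("S", ("a", "S", "b")), ("S", ("a", "b"))}),
    inistate="S",
)
```

Checking the invariants in the constructor plays a role loosely comparable to the predicate part of a schema: a value that violates them is never accepted as a grammar.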
In the schema given below, it is verified that the given word of the language is generated unambiguously. In the schema, it is stated that a word is unambiguously generated if, whenever there exist two derivations or parse trees for the same word, both parse trees are the same.
Now we check the ambiguity of the generated word using the schema LRW1A. The schema consists of the same three components gram, word? and multiple as in the case of derivation of the word. In the schema, it is stated that a word is ambiguously generated if there exist two distinct derivations or parse trees for the same word.
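The word-level ambiguity check can be made concrete by counting parse trees, since each parse tree of a word corresponds to exactly one right-most derivation. The following Python sketch is an illustration rather than the paper's Z specification: the names `count_parse_trees`, `is_ambiguous_word`, `AMBIGUOUS` and `UNAMBIGUOUS` are hypothetical, and the grammar is assumed to contain no empty and no unit productions so that the recursion terminates.

```python
from functools import lru_cache


def count_parse_trees(productions, start, word):
    """Count parse trees of `word` from `start`.

    `productions` maps each non-terminal to a list of right hand sides
    (tuples of symbols). Assumes no empty and no unit productions.
    """
    word = tuple(word)

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of trees rooted at `symbol` that yield word[i:j].
        if symbol not in productions:  # terminal symbol
            return 1 if j - i == 1 and word[i] == symbol else 0
        return sum(seq_trees(rhs, i, j) for rhs in productions[symbol])

    @lru_cache(maxsize=None)
    def seq_trees(rhs, i, j):
        # Number of ways the symbol sequence `rhs` yields word[i:j].
        if not rhs:
            return 1 if i == j else 0
        if len(rhs) == 1:
            return trees(rhs[0], i, j)
        head, rest = rhs[0], rhs[1:]
        # Every symbol yields at least one token, so keep len(rest) tokens back.
        return sum(trees(head, i, k) * seq_trees(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return trees(start, 0, len(word))


def is_ambiguous_word(productions, start, word):
    # A word is ambiguous when it has two or more distinct parse trees.
    return count_parse_trees(productions, start, word) >= 2


# Hypothetical grammars: E -> E + E | a   and   E -> E + a | a
AMBIGUOUS = {"E": [("E", "+", "E"), ("a",)]}
UNAMBIGUOUS = {"E": [("E", "+", "a"), ("a",)]}
```

With these definitions, `is_ambiguous_word(AMBIGUOUS, "E", "a+a+a")` holds because the word can be grouped from the left or from the right, while the left-recursive grammar admits a single tree for the same word.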
Next, we check the ambiguity of the language using the schema LRL1A. The schema consists of the same three components gram, language? and multiple as in the case of derivation of the language. In the schema, it is stated that a language is ambiguously generated if there is a word in the language such that there exist two distinct derivations or parse trees for that word.
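The word-level and language-level notions described in this section can be summarized as predicates. This is a sketch in ordinary logical notation rather than the Z schemas themselves; the symbol D_G(w), denoting the set of right-most derivations of a word w in grammar G, and the predicate names are notation introduced here for illustration.

```latex
\begin{align*}
\mathit{ambiguousW}(G, w) &\iff \exists\, d_1, d_2 \in D_G(w) \,\bullet\, d_1 \neq d_2\\
\mathit{unambiguousW}(G, w) &\iff \forall\, d_1, d_2 \in D_G(w) \,\bullet\, d_1 = d_2\\
\mathit{ambiguousL}(G, L) &\iff \exists\, w \in L \,\bullet\, \mathit{ambiguousW}(G, w)\\
\mathit{unambiguousL}(G, L) &\iff \forall\, w \in L \,\bullet\, \mathit{unambiguousW}(G, w)
\end{align*}
```

Written this way, each unambiguous predicate is the logical negation of the corresponding ambiguous one, so a single witness word with two distinct derivations suffices to make the whole language ambiguous.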
5. Model Analysis
In this section, formal analysis of the specification is carried out. Although computer tools are used rigorously for formal specification, there does not exist any computer tool which can assure complete correctness of a formal model. Therefore, even if a specification is well-written in one of the formal specification languages, it may contain potential bugs or errors. That is, the art of writing a formal specification never guarantees that the system is correct, complete and consistent. However, if the specification is checked and analyzed with a computer tool, confidence in the system to be developed certainly increases, because errors in the syntax and semantics of the formal specification, if any, are identified. Z/Eves is one of the powerful tools of this kind and is used for analyzing the specification written for construction of LR(1). A snapshot of the tool analyzing the formal specification is presented in Figure 1. The first column on the left of the figure shows the status of the syntax checking and the second column represents the proof correctness of the specification. The symbol Y indicates that the specification is syntactically correct and that its proof is also correct, while the symbol N shows that errors exist, which can be listed with the tool support. All the schemas are checked to prove that the specification is correct in syntax and has a correct proof. Some proofs were conducted by the reduction and rewriting techniques available in the tool.

A summary of the results of the formal specification is presented in Table 1. The first column of the table gives the name of the schema for which the specification is described. These schemas are analyzed using the model exploration techniques provided in the Z/Eves tool. The symbol Y in column 2 indicates that all the schemas are well-written and proved automatically. Similarly, domain checking, reduction and proof by reduction are represented in columns 3, 4 and 5, respectively.
The character Y* annotated with * describes that the schemas are proved by performing reduction on the predicates.

Table 1. Results of model analysis.

Schema Name        Syntax Type Check   Domain Check   Reduction   Proof
Grammer            Y                   Y              Y           Y
RightDerivations   Y                   Y              Y           Y
LRW1               Y                   Y              Y*          Y
LRW1A              Y                   Y              Y           Y
LRW1U              Y                   Y              Y           Y
LRL1               Y                   Y              Y           Y
LRL1A              Y                   Y              Y*          Y
LRL1U              Y                   Y              Y*          Y
In the schema given below, it is verified that the given language is generated unambiguously. In the schema, it is stated that a language is unambiguous if, for every word of the language, whenever there exist two derivations or parse trees for that word, both parse trees are the same. The grammar used, the input language to be generated and the derivation rules are defined in the first part of the schema, and the language derivation process is described in the second part. This completes our formal model for construction of LR(1).
works [31-39] were found but our approach is different because of the abstract and conceptual level integration of CFG and Z. Among the benefits of using Z, every object is assigned a unique type, which provides a useful programming practice. Several type checking tools exist to support the formal specification. Z/Eves is a powerful tool for proving and analyzing the specification. The rich mathematical notation made it possible to reason about the behavior of a system more effectively. Formalization of some other concepts useful in compiler verification is in progress and will appear in our future work.
REFERENCES
[1] C. J. Burgess, The Role of Formal Methods in Software Engineering Education and Industry, Technical Report, University of Bristol, Bristol, 1995.
[2] K. A. Buragga and N. A. Zafar, Formal Parsing Analysis of Context-Free Grammar Using Left Most Derivations, International Conference on Software Engineering Advances, 2011.
[3] N. A. Zafar, S. A. Khan and B. Kamran, Formal Procedure of Deriving Language from Context-Free Grammar, International Conference on Intelligence and Information Technology, Vol. 1, 2010, pp. 533-536.
[4] N. A. Zafar and B. Kamran, Formal Construction of Possible Operators on Context-Free Grammar, International Conference on Intelligence and Information Technology, 2010.
[5] H. Beek, A. Fantechi, S. Gnesi and F. Mazzanti, State/Event-Based Software Model Checking, Integrated Formal Methods, Springer, Berlin, 2004, pp. 128-147.
[6] O. Hasan and S. Tahar, Verification of Probabilistic Properties in the HOL Theorem Prover, Integrated Formal Methods, Springer, Berlin, 2007, pp. 333-352.
[7] F. Gervais, M. Frappier and R. Laleau, Synthesizing B Specifications from EB3 Attribute Definitions, Integrated Formal Methods, Springer, Berlin, 2005, pp. 207-226. doi:10.1007/11589976_13
[8] K. Araki, A. Galloway and K. Taguchi, Integrated Formal Methods, Proceedings of the 1st International Conference on Integrated Formal Methods, Springer, Berlin, 1999.
[9] B. Akbarpour, S. Tahar and A. Dekdouk, Formalization of Cadence SPW Fixed-Point Arithmetic in HOL, Integrated Formal Methods, Springer, Berlin, 2002, pp. 185-204.
[10] J. Derrick and G. Smith, Structural Refinement of Object-Z/CSP Specifications, The Institute of Finance Management, Springer, Berlin, 2000, pp. 194-213.
[11] T. B. Raymond, Integrating Formal Methods by Unifying Abstractions, Springer, Berlin, 2004, pp. 441-460.
[12] J. S. Dong, R. Duke and P. Hao, Integrating Object-Z with Timed Automata, 2005, pp. 488-497.
[13] J. S. Dong, et al., Timed Patterns: TCOZ to Timed Automata, The 6th International Conference on Formal Engineering Methods, 2004, pp. 483-498.
[14] R. L. Constable, et al., Formalizing Automata II: Decidable Properties, Technical Report, Cornell University, Cornell, 1997.
[15] R. L. Constable, et al., Constructively Formalizing Automata Theory, Foundations of Computing Series, MIT Press, Cambridge, 2000.
[16] M. Heiner and M. Heisel, Modeling Safety Critical Systems with Z and Petri Nets, International Conference on Computer Safety, Reliability and Security, Springer, Berlin, 1999, pp. 361-374. doi:10.1007/3-540-48249-0_31
[17] H. Leading and J. Souquieres, Integration of UML and B Specification Techniques: Systematic Transformation from OCL Expressions into B, Asia-Pacific Software Engineering Conference, 2002, pp. 495-504.
[18] H. Leading and J. Souquieres, Integration of UML Views Using B Notation, Proceedings of Workshop on Integration and Transformation of UML Models, 2002.
[19] W. Wechler, The Concept of Fuzziness in Automata and Language Theory, Akademie-Verlag, Berlin, 1978.
[20] N. M. John and S. M. Davender, Fuzzy Automata and Languages: Theory and Applications, Chapman & Hall, London, 2002.
[21] M. Ito, Algebraic Theory of Automata and Languages, World Scientific Publishing Co., Singapore, 2004. doi:10.1142/9789812562685
[22] M. Brendan and J. S. Dong, Blending Object-Z and Timed CSP: An Introduction to TCOZ, 20th International Conference on Software Engineering, IEEE Computer Society, Kyoto, 1998.
[23] J. M. Spivey, The Z Notation: A Reference Manual, Prentice-Hall, Austin, 1989.
[24] J. M. Wing, A Specifier's Introduction to Formal Methods, IEEE Computer, Vol. 23, No. 9, 1990, pp. 8-24. doi:10.1109/2.58215
[25] C. Lindig, Random Testing of C Calling Conventions, ACM, 2005.
[26] J. A. Anderson, Automata Theory with Modern Applications, Cambridge University Press, Cambridge, 2006. doi:10.1017/CBO9780511607202
[27] H. C. Young, J. Moscola and J. W. Lockwood, Context-Free Grammar Based Token Tagger in Reconfigurable Devices, Proceedings of International Conference of Data Engineering, 2005, p. 78.
[28] M. V. D. Brand, A. Sellink and C. Verhoef, Generation of Components for Software Renovation Factories from Context-Free Grammars, Counselors of Real Estate, 2001, pp. 144-153.
[29] M. Balakrishna, D. Moldovan and E. K. Cave, Automatic Creation and Tuning of Context-Free Grammars for Interactive Voice Response Systems, IEEE NLP-KE, 2005, pp. 158-163.
[30] L. Pedersen and H. Reza, A Formal Specification of a Programming Language: Design of Pit, 2nd International Symposium on Leveraging Applications of Formal Methods, Verification and Validation, 2008, pp. 111-118.
[31] D. P. Tuan, Computing with Words in Formal Methods, Technical Report, University of Canberra, Canberra, 2000.
[32] S. A. Vilkomir and J. P. Bowen, Formalization of Software Testing Criterion, South Bank University, London, 2001.
[33] A. Hall, Correctness by Construction: Integrating Formality into a Commercial Development Process, Praxis Critical Systems Limited, Springer, Berlin, Vol. 2391, 2002, pp. 139-157.
[34] B. A. L. Gwandu and D. J. Creasey, Importance of Formal Specification in the Design of Hardware Systems, Birmingham University, Birmingham, 1994.
[35] D. K. Kaynar and N. Lynch, The Theory of Timed I/O Automata, Morgan & Claypool Publishers, 2006.
[36] D. Jackson, I. Schechter and I. Shlyakhter, Alcoa: The Alloy Constraint Analyzer, Proceedings of the 22nd International Conference of Software Engineering, 2000, pp. 730-733.
[37] D. Aspinall and L. Beringer, Optimisation Validation, Electronic Notes in Theoretical Computer Science, Vol. 176, No. 3, 2007, pp. 37-59. doi:10.1016/j.entcs.2006.06.017
[38] S. Briaisa and U. Nestmannb, A Formal Semantics for Protocol Narrations, Theoretical Computer Science, Vol. 389, No. 3, 2007, pp. 484-511.