PPL UNIT 1 NOTES

The document covers the principles of programming languages, including syntax, semantics, and the evolution of various programming languages. It discusses the reasons for studying programming languages, programming domains, language evaluation criteria, influences on language design, and categories of programming languages. Additionally, it provides a historical overview of significant programming languages and their development, highlighting their features and impact on computing.

CCS358 / PRINCIPLES OF PROGRAMMING LANGUAGES

UNIT I SYNTAX AND SEMANTICS


Evolution of programming languages – describing syntax – context-free grammars – attribute grammars
– describing semantics – lexical analysis – parsing – recursive-descent – bottom up parsing

Preliminary Concepts

1.1 Reasons for Studying Concepts of Programming Languages

• Increased ability to express ideas


• Improved background for choosing appropriate languages
• Increased ability to learn new languages
• Better understanding of significance of implementation
• Overall advancement of computing

Increased ability to express ideas


It is believed that the depth at which we can think is influenced by the expressive power of the language
in which we communicate our thoughts. It is difficult for people to conceptualize structures they
cannot describe, verbally or in writing.
The language in which programmers develop software places limits on the kinds of control structures,
data structures, and abstractions they can use.
Awareness of a wider variety of programming language features can reduce such limitations in software
development: constructs from one language can often be simulated in another language that does not
support them directly.
Improved background for choosing appropriate languages
Many programmers, when given a choice of languages for a new project, continue to use the
language with which they are most familiar, even if it is poorly suited to the project.
If these programmers were familiar with the other languages available, they would be in a better
position to make informed language choices.
Greater ability to learn new languages
Programming languages are still in a state of continuous evolution, which means continuous
learning is essential.
Programmers who understand the concepts of object-oriented programming will have an easier time
learning Java. Once a thorough understanding of the fundamental concepts of languages is acquired, it
becomes easier to see how those concepts are incorporated into the design of the language being
learned.
Understand significance of implementation
Understanding implementation issues leads to an understanding of why languages are designed
the way they are.
This in turn leads to the ability to use a language more intelligently, as it was designed to be used.
Ability to design new languages
The more languages you know, the better your understanding of programming language concepts.
Overall advancement of computing
In some cases, a language became widely used, at least in part, because those in positions to choose
languages were not sufficiently familiar with programming language concepts.
Many believe that ALGOL 60 was a better language than Fortran; however, Fortran was the most widely
used. This is attributed to the fact that programmers and managers did not understand the
conceptual design of ALGOL 60.

1.2 Programming Domains

• Scientific applications
– In the early 40s computers were invented for scientific applications.
– The applications require large numbers of floating-point computations.
– Fortran was the first language developed for scientific applications.
– ALGOL 60 was intended for the same use.
• Business applications
– The first successful language for business was COBOL.
– Produce reports, use decimal arithmetic numbers and characters.
– The arrival of PCs started new ways for businesses to use computers.
– Spreadsheets and database systems were developed for business.

• Artificial intelligence
– Symbolic rather than numeric computations are manipulated.
– Symbolic computation is more suitably done with linked lists than arrays.
– LISP was the first widely used AI programming language.
• Systems programming
– The O/S and all of the programming support tools are collectively known as its
system software.
– Need efficiency because of continuous use.
• Scripting languages
– Put a list of commands, called a script, in a file to be executed.
– PHP is a scripting language used on Web server systems. Its code is embedded in
HTML documents. The code is interpreted on the server before the document is sent to a
requesting browser.
• Special-purpose languages

1.3 Language Evaluation Criteria

• Readability : the ease with which programs can be read and understood
• Writability : the ease with which a language can be used to create programs
• Reliability : conformance to specifications (i.e., performs to its specifications)
• Cost : the ultimate total cost
Readability
• Overall simplicity
– A manageable set of features and constructs
– Little feature multiplicity (multiple ways of doing the same operation)
– Minimal operator overloading
• Orthogonality
– A relatively small set of primitive constructs can be combined in a relatively small
number of ways
– Every possible combination is legal
• Control statements
– The presence of well-known control structures (e.g., while statement)
• Data types and structures
– The presence of adequate facilities for defining data structures
• Syntax considerations
– Identifier forms: flexible composition
– Special words and methods of forming compound statements
– Form and meaning: self-descriptive constructs, meaningful keywords

Writability
• Simplicity and Orthogonality
– Few constructs, a small number of primitives, a small set of rules for combining them
• Support for abstraction
– The ability to define and use complex structures or operations in ways that allow details
to be ignored
• Expressivity
– A set of relatively convenient ways of specifying operations
– Example: the inclusion of for statement in many modern languages
Reliability
• Type checking
– Testing for type errors
• Exception handling
– Intercept run-time errors and take corrective measures
• Aliasing
– Presence of two or more distinct referencing methods for the same memory location
• Readability and writability
– A language that does not support “natural” ways of expressing an algorithm will
necessarily use “unnatural” approaches, which reduces reliability
Cost
• Training programmers to use language
• Writing programs (closeness to particular applications)
• Compiling programs
• Executing programs
• Language implementation system: availability of free compilers
• Reliability: poor reliability leads to high costs
• Maintaining programs
Others

• Portability
– The ease with which programs can be moved from one implementation to another
• Generality
– The applicability to a wide range of applications
• Well-defined
– The completeness and precision of the language's official definition

1.4 Influences on Language Design


• Computer Architecture
– Languages are developed around the prevalent computer architecture, known as the von
Neumann architecture
• Programming Methodologies
– New software development methodologies (e.g., object-oriented software development) led to
new programming paradigms and by extension, new programming languages

Computer Architecture
• Well-known computer architecture: Von Neumann
• Imperative languages, most dominant, because of von Neumann computers
– Data and programs stored in memory
– Memory is separate from CPU
– Instructions and data are piped from memory to CPU
– Basis for imperative languages
• Variables model memory cells
• Assignment statements model piping
• Iteration is efficient

Figure 1.1 The von Neumann Computer Architecture

Programming Methodologies
• 1950s and early 1960s: Simple applications; worry about machine efficiency
• Late 1960s: People efficiency became important; readability, better control structures
– structured programming
– top-down design and step-wise refinement
• Late 1970s: Process-oriented to data-oriented
– data abstraction
• Middle 1980s: Object-oriented programming
– Data abstraction + inheritance + polymorphism
1.5 Language Categories

• Imperative
– Central features are variables, assignment statements, and iteration
– Examples: C, Pascal
• Functional
– Main means of making computations is by applying functions to given parameters
– Examples: LISP, Scheme
• Logic
– Rule-based (rules are specified in no particular order)
– Example: Prolog
• Object-oriented
– Data abstraction, inheritance, late binding
– Examples: Java, C++
• Markup
– New; not a programming language per se, but used to specify the layout of information in Web
documents
– Examples: XHTML, XML
Language Design Trade-Offs
• Reliability vs. cost of execution
– Conflicting criteria
– Example: Java demands all references to array elements be checked for proper indexing
but that leads to increased execution costs
• Readability vs. writability
– Another pair of conflicting criteria
– Example: APL provides many powerful operators (and a large number of new symbols),
allowing complex computations to be written in a compact program, but at the cost of poor
readability
• Writability (flexibility) vs. reliability
– Another pair of conflicting criteria
– Example: C++ pointers are powerful and very flexible but can easily be used unreliably

1.6 EVOLUTION OF PROGRAMMING LANGUAGE


A. Plankalkül - the first programming language, designed in 1945 by Konrad Zuse

- Was never implemented


- It had advanced data structures
- floating point, arrays, records
- Notation of code:
Each statement had three lines of code. The first line is like the statements in current
languages; the second line indicates array subscripts; the third line indicates data types.

The assignment A(7) := 5 * B(6) would be written:

   | 5 * B => A
 V |     6    7     (array subscripts)
 S |   1.n  1.n     (data types)

where 1.n indicates an integer of n bits


B. Pseudocodes – were invented in 1949

The disadvantages of machine code were

• Poor readability
• Poor modifiability
• Expression coding was tedious
• Machine deficiencies--no indexing or floating point

These were overcome by pseudocodes

- SHORT CODE was a pseudocode invented in 1949 for the BINAC machine by Mauchly
- Expressions were coded, left to right
- Some operations:
1n => (n+2)nd power

2n => (n+2)nd root

- SPEEDCODING was a second pseudocode, invented in 1954 for the IBM 701 machine by Backus
- Pseudo ops for arithmetic and math functions
- Conditional and unconditional branching
- Autoincrement registers for array access
- Slow!
- Only 700 words left for user program
C. Laning and Zierler System - 1953

- Implemented on the MIT Whirlwind computer


- First "algebraic" compiler system
- Subscripted variables, function calls, expression translation
- Never ported to any other machine
D. FORTRAN I - 1957
(FORTRAN 0 - 1954 - not implemented)

- Designed for the new IBM 704, which had index registers and floating point hardware
- The Environment under which FORTRAN was developed was :
1. Computers were small and unreliable
2. Applications were scientific
3. No programming methodology or tools
4. Machine efficiency was most important

- The environment had a significant impact on the design:


1. No need for dynamic storage
2. Need good array handling and counting loops
3. No string handling, decimal arithmetic, or powerful input/output (commercial stuff)

The characteristics of first implemented version of FORTRAN

- Names could have up to six characters


- Posttest counting loop (DO)
- Formatted i/o
- User-defined subprograms
- Three-way selection statement (arithmetic IF)
- No data typing statements
- No separate compilation
- Compiler released in April 1957, after 18 worker-years of effort
- Programs larger than 400 lines rarely compiled correctly, mainly due to poor reliability of the 704
- Code was very fast
- Quickly became widely used
E. FORTRAN II - 1958

- Independent compilation of subroutines


- Fix the bugs of FORTRAN I
F. FORTRAN IV – 1960-62

- Explicit type declarations


- Logical selection statement
- Subprogram names could be parameters
- ANSI standard in 1966
G. FORTRAN 77 – 1978

- Character string handling


- Logical loop control statement
- IF-THEN-ELSE statement
H. FORTRAN 90 – 1990

- Dynamic arrays
- Pointers
- Recursion
- CASE statement
- Parameter type checking
FORTRAN Evaluation

- Dramatically changed forever the way computers are used


I. LISP - 1959

- LISt Processing language (Designed at MIT by McCarthy)


- AI research needed a language that:
- Process data in lists (rather than arrays)
- Symbolic computation (rather than numeric)

- Only two data types: atoms and lists


- Syntax is based on lambda calculus
- Pioneered functional programming
- No need for variables or assignment
- Control via recursion and conditional expressions
- Still the dominant language for AI
- COMMON LISP and Scheme are contemporary dialects of LISP.
- ML, Miranda, and Haskell are related languages
J. ALGOL 58 – 1958

- Environment of development:
1. FORTRAN had (barely) arrived for IBM 70x
2. Many other languages were being developed, all for specific machines
3. No portable language; all were machine-dependent
4. No universal language for communicating algorithms

- Goals of the language:


1. Close to mathematical notation
2. Good for describing algorithms
3. Must be translatable to machine code
- Language Features:
- Concept of type was formalized
- Names could have any length
- Arrays could have any number of subscripts
- Parameters were separated by mode (in & out)
- Subscripts were placed in brackets
- Compound statements (begin ... end)
- Semicolon as a statement separator
- Assignment operator was :=
- If had an else-if
- Comments:

- Not meant to be implemented, but variations were (MAD, JOVIAL)


- Although IBM was initially enthusiastic, all support was dropped by mid-1959

K. ALGOL 60 - 1960

- Modified ALGOL 58 at 6-day meeting in Paris


- New Features:
- Block structure (local scope)
- Two parameter passing methods
- Subprogram recursion
- Stack-dynamic arrays
- Still no i/o and no string handling

- Successes:
- It was the standard way to publish algorithms for over 20 years
- All subsequent imperative languages are based on it
- First machine-independent language
- First language whose syntax was formally defined
- Failure:

- Never widely used, especially in U.S.


Reasons:
1. No i/o, and the character set made programs nonportable
2. Too flexible - hard to implement
3. Entrenchment of FORTRAN
4. Formal syntax description
5. Lack of support from IBM

L. COBOL - 1960

- Environment of development:
- UNIVAC was beginning to use FLOW-MATIC
- USAF was beginning to use AIMACO
- IBM was developing COMTRAN
Based on FLOW-MATIC
- FLOW-MATIC features:
- Names up to 12 characters, with embedded hyphens
- English names for arithmetic operators
- Data and code were completely separate
- Verbs were first word in every statement
First Design Meeting - May 1959
- Design goals:
1. Must look like simple English
2. Must be easy to use, even if that means it will be less powerful
3. Must broaden the base of computer users
4. Must not be biased by current compiler problems
- Design committee members were all from computer manufacturers and DoD branches
- Design Problems: arithmetic expressions? subscripts? Fights among manufacturers
- Contributions:
- First macro facility in a high-level language
- Hierarchical data structures (records)
- Nested selection statements
- Long names (up to 30 characters), with hyphens
- Data Division
- Comments:
- First language required by DoD; would have failed without DoD
- Still the most widely used business applications language

M. BASIC - 1964

- Designed by Kemeny & Kurtz at Dartmouth


- Design Goals:
- Easy to learn and use for non-science students
- Must be “pleasant and friendly”
- Fast turnaround for homework
- Free and private access
- User time is more important than computer time
- Current popular dialects: Quick BASIC and Visual BASIC

N. PL/I - 1965

- Designed by IBM and SHARE


Computing situation in 1964 (IBM's point of view)

1. Scientific computing
- IBM 1620 and 7090 computers
- FORTRAN
- SHARE user group
2. Business computing

- IBM 1401, 7080 computers


- COBOL
- GUIDE user group
- By 1963, however,
- Scientific users began to need more elaborate i/o, like COBOL had; business users began to
need floating point and arrays (MIS)

- It looked like many shops would begin to need two kinds of computers, languages, and
support staff--too costly

- The obvious solution:


1. Build a new computer to do both kinds of applications
2. Design a new language to do both kinds of applications

- PL/I contributions:
1. First unit-level concurrency
2. First exception handling
3. Switch-selectable recursion
4. First pointer data type
5. First array cross sections

- Comments:
- Many new features were poorly designed
- Too large and too complex
- Was (and still is) actually used for both scientific and business applications
O. Early Dynamic Languages

a. Characterized by dynamic typing and dynamic storage allocation

b. APL (A Programming Language) 1962


i. Designed as a hardware description language (at IBM by Ken Iverson)
ii. Highly expressive (many operators, for both scalars and arrays of various dimensions)

Programs are very difficult to read

c. SNOBOL(1964)
i. Designed as a string manipulation language
(at Bell Labs by Farber, Griswold, and Polensky)
ii. Powerful operators for string pattern matching

P. SIMULA 67 – 1967

- Designed primarily for system simulation (in Norway by Nygaard and Dahl)
- Based on ALGOL 60 and SIMULA I
- Primary Contribution:
- Coroutines - a kind of subprogram
- Implemented in a structure called a class
- Classes are the basis for data abstraction
- Classes are structures that include both local data and functionality

Q. ALGOL 68 – 1968

- From the continued development of ALGOL 60, but it is not a superset of that language
- Design is based on the concept of orthogonality
- Contributions:
1. User-defined data structures
2. Reference types
3. Dynamic arrays (called flex arrays)

- Comments:
- Had even less usage than ALGOL 60
- Had strong influence on subsequent languages, especially Pascal, C, and Ada
P. Pascal – 1971
- Designed by Wirth, who quit the ALGOL 68 committee (didn't like the direction of that work)
- Designed for teaching structured programming
- Small, simple, nothing really new
- Still the most widely used language for teaching programming in colleges (but use is shrinking)
Q. C - 1972

- Designed for systems programming (at Bell Labs by Dennis Ritchie)


- Evolved primarily from B, but also ALGOL 68
- Powerful set of operators, but poor type checking
- Initially spread through UNIX
R. Other descendants of ALGOL

- Modula-2 (developed in the mid-1970s by Niklaus Wirth)

- Pascal plus modules and some low-level features designed for systems programming

- Modula-3 (late 1980s at Digital & Olivetti)

- Modula-2 plus classes, exception handling, garbage collection, and concurrency

- Oberon (late 1980s by Wirth at ETH)


- Adds support for OOP to Modula-2
- Many Modula-2 features were deleted (e.g., for statement, enumeration types, with
statement, noninteger array indices)

- Delphi (Borland)

- Pascal plus features to support OOP


- More elegant and safer than C++

S. Prolog - 1972

- Developed at the University of Aix-Marseille, by Colmerauer and Roussel, with some
help from Kowalski at the University of Edinburgh
- Based on formal logic

- Non-procedural

- Can be summarized as being an intelligent database system that uses an


inferencing process to infer the truth of given queries

T. Ada - 1983 (began in mid-1970s)


- Huge design effort, involving hundreds of people, much money, and about eight years
- Contributions:
1. Packages - support for data abstraction
2. Exception handling – elaborate
3. Generic program units
4. Concurrency - through the tasking model
- Comments:
- Competitive design
- Included all that was then known about software engineering and language design
- First compilers were very difficult; the first really usable compiler came nearly five years after the
language design was completed

- Ada 95 (began in 1988)


- Support for OOP through type derivation
- Better control mechanisms for shared data(new concurrency features)
- More flexible libraries

U. Smalltalk - 1972-1980

- Developed at Xerox PARC, initially by Alan Kay, later by Adele Goldberg


- First full implementation of an object-oriented language (data abstraction, inheritance, and dynamic
type binding)
- Pioneered the graphical user interface everyone now uses
V. C++ - 1985

- Developed at Bell Labs by Stroustrup


- Evolved from C and SIMULA 67
- Facilities for object-oriented programming, taken partially from SIMULA 67, were added to C
- Also has exception handling
- A large and complex language, in part because it supports both procedural and OO programming
- Rapidly grew in popularity, along with OOP
- ANSI standard approved in November, 1997
- Eiffel - a related language that supports OOP
- (Designed by Bertrand Meyer - 1992)
- Not directly derived from any other language
- Smaller and simpler than C++, but still has most of the power

W. Java (1995)
- Developed at Sun in the early 1990s
- Based on C++
- Significantly simplified
- Supports only OOP
- Has references, but not pointers
- Includes support for applets and a form of concurrency
1.7 Syntax and Semantics

Introduction
• Syntax: the form or structure of the expressions, statements, and program units
• Semantics: the meaning of the expressions, statements, and program units
• Syntax and semantics provide a language's definition
– Users of a language definition
– Other language designers
– Implementers
– Programmers (the users of the language)

The General Problem of Describing Syntax

• A sentence is a string of characters over some alphabet


• A language is a set of sentences
• A lexeme is the lowest level syntactic unit of a language (e.g., *, sum, begin)
• A token is a category of lexemes (e.g., identifier)

• Language Recognizers
– A recognition device reads input strings of the language and decides whether the input
strings belong to the language
– Example: syntax analysis part of a compiler

• Language Generators
– A device that generates sentences of a language
– One can determine if the syntax of a particular sentence is correct by comparing it to
the structure of the generator

Formal Methods of Describing Syntax

• Backus-Naur Form and Context-Free Grammars


– Most widely known method for describing programming language syntax
• Extended BNF
– Improves readability and writability of BNF
• Grammars and Recognizers

Backus-Naur Form and Context-Free Grammars

• Context-Free Grammars
• Developed by Noam Chomsky in the mid-1950s
• Language generators, meant to describe the syntax of natural languages
• Define a class of languages called context-free languages

Backus-Naur Form (BNF)


• Backus-Naur Form (1959)
– Invented by John Backus to describe ALGOL 58
– BNF is equivalent to context-free grammars
– BNF is a metalanguage used to describe another language
– In BNF, abstractions are used to represent classes of syntactic structures; they act
like syntactic variables (also called nonterminal symbols)
BNF Fundamentals
• Non-terminals: BNF abstractions
• Terminals: lexemes and tokens
• Grammar: a collection of rules
– Examples of BNF rules:
<ident_list> → identifier | identifier, <ident_list>
<if_stmt> → if <logic_expr> then <stmt>

BNF Rules

• A rule has a left-hand side (LHS) and a right-hand side (RHS), and consists of
terminal and nonterminal symbols
• A grammar is a finite nonempty set of rules
• An abstraction (or nonterminal symbol) can have more than one RHS
<stmt> → <single_stmt>
| begin <stmt_list> end
Describing Lists
• Syntactic lists are described using recursion
<ident_list> → ident
| ident, <ident_list>
• A derivation is a repeated application of rules, starting with the start symbol and ending
with a sentence (all terminal symbols)
An Example Grammar
<program> → <stmts>
<stmts> → <stmt> | <stmt> ; <stmts>
<stmt> → <var> = <expr>
<var> → a | b | c | d
<expr> → <term> + <term> | <term> - <term>
<term> → <var> | const

Parse Tree
A hierarchical representation of a derivation (Figure 1.2 shows the parse tree for the derivation below)
An example derivation:
<program> => <stmts>
          => <stmt>
          => <var> = <expr>
          => a = <expr>
          => a = <term> + <term>
          => a = <var> + <term>
          => a = b + <term>
          => a = b + const

Derivation
• Every string of symbols in the derivation is a sentential form
• A sentence is a sentential form that has only terminal symbols
• A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the
one that is expanded
• A derivation may be neither leftmost nor rightmost
Ambiguity in Grammars
• A grammar is ambiguous iff it generates a sentential form that has two or more distinct
parse trees
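For example, the following grammar (an illustrative one, not taken from the figures below) is ambiguous,
because a sentence such as const - const / const has two distinct parse trees:
<expr> → <expr> - <expr> | <expr> / <expr> | const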
An Unambiguous Expression Grammar
If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity:
<expr> → <expr> - <term> | <term>
<term> → <term> / const | const
Figure 1.3 An Ambiguous Expression Grammar
Figure 1.4 An Unambiguous Expression Grammar

Associativity of Operators
Operator associativity can also be indicated by a grammar:
<expr> → <expr> + <expr> | const    (ambiguous)
<expr> → <expr> + const | const     (unambiguous)
Figure 1.5 Parse Tree for Operator Associativity

Extended Backus-Naur Form (EBNF)


• Optional parts are placed in brackets ([ ])
<proc_call> → ident [(<expr_list>)]
• Alternative parts of RHSs are placed inside parentheses and separated via vertical bars
<term> → <term> (+|-) const
• Repetitions (0 or more) are placed inside braces ({ })
<ident> → letter {letter|digit}
BNF and EBNF
• BNF
<expr> → <expr> + <term>
       | <expr> - <term>
       | <term>
<term> → <term> * <factor>
       | <term> / <factor>
       | <factor>
• EBNF
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}

Attribute Grammars
• Context-free grammars (CFGs) cannot describe all of the syntax of programming languages
• Additions to CFGs to carry some semantic info along parse trees
• Primary value of attribute grammars (AGs):
– Static semantics specification
– Compiler design (static semantics checking)
Definition
• An attribute grammar is a context-free grammar G = (S, N, T, P) with the following additions:
– For each grammar symbol x there is a set A(x) of attribute values
– Each rule has a set of functions that define certain attributes of the nonterminals in
the rule
– Each rule has a (possibly empty) set of predicates to check for attribute consistency
– Let X0 → X1 ... Xn be a rule
– Functions of the form S(X0) = f(A(X1), ... , A(Xn)) define synthesized attributes
– Functions of the form I(Xj) = f(A(X0), ... , A(Xn)), for 1 <= j <= n, define
inherited attributes
– Initially, there are intrinsic attributes on the leaves

Example
• Syntax
<assign> → <var> = <expr>
<expr> → <var> + <var> | <var>
<var> → A | B | C
• actual_type: synthesized for <var> and <expr>
• expected_type: inherited for <expr>
• Syntax rule: <expr> → <var>[1] + <var>[2]
Semantic rule: <expr>.actual_type ← <var>[1].actual_type
Predicates: <var>[1].actual_type == <var>[2].actual_type
            <expr>.expected_type == <expr>.actual_type
• Syntax rule: <var> → id
Semantic rule: <var>.actual_type ← lookup(<var>.string)

• How are attribute values computed?


– If all attributes were inherited, the tree could be decorated in top-down order.
– If all attributes were synthesized, the tree could be decorated in bottom-up order.
– In many cases, both kinds of attributes are used, and it is some combination of top-
down and bottom-up that must be used.

<expr>.expected_type ← inherited from parent
<var>[1].actual_type ← lookup(A)
<var>[2].actual_type ← lookup(B)
<var>[1].actual_type =? <var>[2].actual_type
<expr>.actual_type ← <var>[1].actual_type
<expr>.actual_type =? <expr>.expected_type
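To make the flow of attribute values concrete, the following is a minimal C sketch (not part of the notes)
of how the semantic rules and predicates above might be evaluated for an assignment such as A = A + B.
The Type enumeration, the lookup function, and the hard-coded symbol table are illustrative assumptions
only, not an actual compiler implementation.

#include <stdio.h>
#include <string.h>

typedef enum { TYPE_INT, TYPE_REAL, TYPE_ERROR } Type;

/* Hypothetical symbol table: C is real, everything else is int */
Type lookup(const char *name) {
    if (strcmp(name, "C") == 0) return TYPE_REAL;
    return TYPE_INT;
}

/* Decorate the <expr> node for <expr> -> <var>[1] + <var>[2]:
   synthesize actual_type from the two <var> children and check it
   against the inherited expected_type */
Type expr_actual_type(const char *var1, const char *var2, Type expected_type) {
    Type t1 = lookup(var1);              /* <var>[1].actual_type <- lookup */
    Type t2 = lookup(var2);              /* <var>[2].actual_type <- lookup */
    if (t1 != t2) {                      /* predicate: operand types must match */
        printf("type error: operands of + have different types\n");
        return TYPE_ERROR;
    }
    Type actual = t1;                    /* <expr>.actual_type <- <var>[1].actual_type */
    if (actual != expected_type)         /* predicate: actual == expected */
        printf("type error: expression type does not match target\n");
    return actual;
}

int main(void) {
    /* Check the assignment A = A + B; A's type is the inherited expected type */
    expr_actual_type("A", "B", lookup("A"));
    return 0;
}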

Describing the Meanings of Programs: Dynamic Semantics


• There is no single widely acceptable notation or formalism for describing semantics
• Operational Semantics
– Describe the meaning of a program by executing its statements on a machine, either
simulated or actual. The change in the state of the machine (memory, registers, etc.)
defines the meaning of the statement
• To use operational semantics for a high-level language, a virtual machine is needed
• A hardware pure interpreter would be too expensive
• A software pure interpreter also has problems:
– The detailed characteristics of the particular computer would make actions difficult to
understand
– Such a semantic definition would be machine- dependent
Operational Semantics
• A better alternative: A complete computer simulation
• The process:
– Build a translator (translates source code to the machine code of an
idealized computer)
– Build a simulator for the idealized computer
• Evaluation of operational semantics:
– Good if used informally (language manuals, etc.)
– Extremely complex if used formally (e.g., VDL, which was used for describing the
semantics of PL/I)
• Axiomatic Semantics
– Based on formal logic (predicate calculus)
– Original purpose: formal program verification
– Approach: Define axioms or inference rules for each statement type in the
language (to allow transformations of expressions to other expressions)
– The expressions are called assertions
Axiomatic Semantics
• An assertion before a statement (a precondition) states the relationships
and constraints among variables that are true at that point in execution
• An assertion following a statement is a postcondition
• A weakest precondition is the least restrictive precondition that will guarantee
the postcondition
• Pre-post form: {P} statement {Q}
• An example: a = b + 1 {a > 1}
• One possible precondition: {b > 10}
• Weakest precondition: {b > 0}
• Program proof process: The postcondition for the whole program is the desired
result. Work back through the program to the first statement. If the precondition
on the first statement is the same as the program spec, the program is correct.
• An axiom for assignment statements (x = E):
{Qx->E} x = E {Q}
where Qx->E denotes Q with all free occurrences of x replaced by E
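For example, applying this axiom to the earlier assignment a = b + 1 with postcondition {a > 1}
gives the precondition {b + 1 > 1}, which simplifies to {b > 0}, the weakest precondition stated above.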
• An inference rule for sequences
– For a sequence S1;S2:
– If {P1} S1 {P2} and {P2} S2 {P3}, then {P1} S1; S2 {P3}
• An inference rule for logical pretest loops
– For the loop construct {P} while B do S end {Q}, the loop invariant I must meet the
following conditions:
– P => I (the loop invariant must be true initially)
– {I} B {I} (evaluation of the Boolean must not change the validity of I)
– {I and B} S {I} (I is not changed by executing the body of the loop)
– (I and (not B)) => Q (if I is true and B is false, Q is implied)
– The loop terminates (this can be difficult to prove)
• The loop invariant I is a weakened version of the loop postcondition, and it
is also a precondition.
• I must be weak enough to be satisfied prior to the beginning of the loop, but when
combined with the loop exit condition, it must be strong enough to force the truth
of the postcondition.
Evaluation of Axiomatic Semantics:
– Developing axioms or inference rules for all of the statements in a language
is difficult
– It is a good tool for correctness proofs, and an excellent framework for
reasoning about programs, but it is not as useful for language users and
compiler writers
– Its usefulness in describing the meaning of a programming language is
limited for language users or compiler writers

Denotational Semantics
– Based on recursive function theory
– The most abstract semantics description method
– Originally developed by Scott and Strachey (1970)
– The process of building a denotational spec for a language (not necessarily easy):
– Define a mathematical object for each language entity
– Define a function that maps instances of the language entities onto instances of the
corresponding mathematical objects
– The meaning of language constructs is defined by only the values of the
program's variables
– The difference between denotational and operational semantics: In
operational semantics, the state changes are defined by coded algorithms;
in denotational semantics, they are defined by rigorous mathematical
functions
– The state of a program is the values of all its current variables
s = {<i1, v1>, <i2, v2>, …, <in, vn>}
– Let VARMAP be a function that, when given a variable name and a state, returns
the current value of the variable
VARMAP(ij, s) = vj
• Decimal Numbers
– The following denotational semantics description maps decimal numbers asstrings
of symbols into numeric values
<dec_num> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 | <dec_num> (0 | 1 | 2 | 3 | 4 |5 | 6 | 7 | 8 | 9)
Mdec('0') = 0, Mdec('1') = 1, …, Mdec('9') = 9
Mdec(<dec_num> '0') = 10 * Mdec(<dec_num>)
Mdec(<dec_num> '1') = 10 * Mdec(<dec_num>) + 1
…
Mdec(<dec_num> '9') = 10 * Mdec(<dec_num>) + 9
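As an illustration only (not part of the notes), the Mdec mapping can be expressed as a small recursive
C function that mirrors the semantic equations above; the function and file names are assumptions.

#include <stdio.h>
#include <string.h>

/* Mdec over the first n characters of the digit string s */
long mdec(const char *s, size_t n) {
    if (n == 1)
        return s[0] - '0';                      /* Mdec('d') = d */
    return 10 * mdec(s, n - 1) + (s[n - 1] - '0'); /* Mdec(<dec_num> 'd') = 10*Mdec(<dec_num>) + d */
}

int main(void) {
    const char *num = "3072";
    printf("Mdec(\"%s\") = %ld\n", num, mdec(num, strlen(num)));  /* prints 3072 */
    return 0;
}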

Expressions
• Map expressions onto Z ∪ {error}
• We assume expressions are decimal numbers, variables, or binary expressions
having one arithmetic operator and two operands, each of which can be an
expression
• Assignment Statements
– Maps state sets to state sets
• Logical Pretest Loops
– Maps state sets to state sets
• The meaning of the loop is the value of the program variables after the statements
in the loop have been executed the prescribed number of times, assuming there
have been no errors
• In essence, the loop has been converted from iteration to recursion, where the
recursive control is mathematically defined by other recursive state mapping
functions
• Recursion, when compared to iteration, is easier to describe with mathematical rigor
• Evaluation of denotational semantics
– Can be used to prove the correctness of programs
– Provides a rigorous way to think about programs
– Can be an aid to language design
– Has been used in compiler generation systems
– Because of its complexity, it is of little use to language users
1.8 Lexical Analysis

Introduction
A lexical analyzer is essentially a pattern matcher. A pattern matcher attempts to find
a substring of a given string of characters that matches a given character pattern.
Pattern matching is a traditional part of computing. One of the earliest uses of pattern
matching was with text editors, such as the ed line editor, which was introduced in an
early version of UNIX. Since then, pattern matching has found its way into some
programming languages—for example,Perl and JavaScript. It is also available through
the standard class libraries of Java, C++, and C#.
Lexemes and tokens

A lexical analyzer serves as the front end of a syntax analyzer. Technically, lexical
analysis is a part of syntax analysis. A lexical analyzer performs syntax analysis at the
lowest level of program structure. An input program appears to a compiler as a single
string of characters. The lexical analyzer collects characters into logical groupings and
assigns internal codes to the groupings according to their structure. These logical
groupings are named lexemes, and the internal codes for categories of these
groupings are named tokens. Lexemes are recognized by matching the input
character string against character string patterns. Although tokens are usually
represented as integer values, for the sake of readability of lexical and syntax
analyzers, they are often referenced through named constants.

Consider the following example of an assignment statement:

result = oldsum - value / 100;

Following are the tokens and lexemes of this statement:

Lexeme     Token
result     IDENT
=          ASSIGN_OP
oldsum     IDENT
-          SUB_OP
value      IDENT
/          DIV_OP
100        INT_LIT
;          SEMICOLON

Integration with syntax analyzer

Lexical analyzers extract lexemes from a given input string and produce the
corresponding tokens. Lexical analyzers are subprograms that locate the next lexeme
in the input, determine its associated token code, and return them to the caller, which
is the syntax analyzer. So, each call to the lexical analyzer returns a single lexeme and
its token. The only view of the input program seen by the syntax analyzer is the output
of the lexical analyzer, one token at a time.

The lexical-analysis process includes skipping comments and white space outside
lexemes, as they are not relevant to the meaning of the program. Also, the lexical
analyzer inserts lexemes for user-defined names into the symbol table, which is used
by later phases of the compiler. Finally, lexical analyzers detect syntactic errors in
tokens, such as ill-formed floating-point literals, and report such errors to the user.

Building a lexical analyzer

There are three approaches for building a lexical analyzer:

1. Write a formal description of the token patterns of the language using a descriptive
language related to regular expressions. These descriptions are used as input to a
software tool that automatically generates a lexical analyzer. There are many such
tools available for this. The oldest of these, named lex, is commonly included as part
of UNIX systems.

These regular expressions are the basis for the pattern-matching facilities now
part of many programming languages, either directly or through a class library.

2. Design a state transition diagram that describes the token patterns of the language
and write a program that implements the diagram.

3. Design a state transition diagram that describes the token patterns of the language
and hand-construct a table-driven implementation of the state diagram.

Lexical analyzer construction using state diagram

Suppose we need a lexical analyzer that recognizes only arithmetic expressions,


including variable names and integer literals as operands.

Assume that the variable names consist of strings of uppercase letters, lowercase
letters, and digits but must begin with a letter. Names have no length limitation. The
first thing to observe is that there are 52 different characters (any uppercase or
lowercase letter) that can begin a name, which require 52 transitions from the
transition diagram’s initial state.

However, a lexical analyzer is interested only in determining that it is a name and
is not concerned with which specific name it happens to be. Therefore, we define a
character class named LETTER for all 52 letters and use a single transition on the first
letter of any name.
Another opportunity for simplifying the transition diagram is with the integer literal
tokens. There are 10 different characters that could begin an integer literal lexeme.
This would require 10 transitions from the start state of the state diagram. Because
specific digits are not a concern of the lexical analyzer, we can build a much more
compact state diagram if we define a character class named DIGIT for digits and use
a single transition on any character in this character class to a state that collects integer
literals.
Because our names can include digits, the transition from the node following the first
character of a name can use a single transition on LETTER or DIGIT to continue
collecting the characters of a name.
The state diagram describes the patterns for our tokens. It includes the actions required
on each transition of the state diagram.

A state diagram to recognize names, parentheses, and arithmetic operators
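As an illustration (not the complete analyzer from the notes), the state diagram can be implemented in C
roughly as follows. The character classes LETTER and DIGIT play the roles described above; the helper
names (getChar, addChar, lookupChar) are assumptions, and the token codes are chosen to match the
parse trace shown later in this unit. A driver must call getChar() once before the first call to lex().

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Character classes */
#define LETTER  0
#define DIGIT   1
#define UNKNOWN 99

/* Token codes (illustrative; chosen to match the trace in Section 1.9) */
#define INT_LIT     10
#define IDENT       11
#define ADD_OP      21
#define SUB_OP      22
#define MULT_OP     23
#define DIV_OP      24
#define LEFT_PAREN  25
#define RIGHT_PAREN 26
#define EOF_TOKEN  -1

FILE *in_fp;          /* input file                               */
int  charClass;       /* class of the current input character     */
int  nextChar;        /* the current input character              */
char lexeme[100];     /* the lexeme being collected               */
int  lexLen;          /* current length of lexeme                 */
int  nextToken;       /* token code handed to the syntax analyzer */

/* addChar - append nextChar to the current lexeme */
void addChar(void) {
    lexeme[lexLen++] = (char) nextChar;
    lexeme[lexLen] = '\0';
}

/* getChar - read the next character and determine its character class */
void getChar(void) {
    nextChar = getc(in_fp);
    if (nextChar == EOF)        charClass = EOF;
    else if (isalpha(nextChar)) charClass = LETTER;
    else if (isdigit(nextChar)) charClass = DIGIT;
    else                        charClass = UNKNOWN;
}

/* lookupChar - token codes for single-character lexemes */
int lookupChar(int ch) {
    addChar();
    switch (ch) {
        case '(': return LEFT_PAREN;
        case ')': return RIGHT_PAREN;
        case '+': return ADD_OP;
        case '-': return SUB_OP;
        case '*': return MULT_OP;
        case '/': return DIV_OP;
        default:  return EOF_TOKEN;   /* unrecognized character */
    }
}

/* lex - follow the state diagram once: skip white space, then collect
   a name, an integer literal, or a single-character token */
int lex(void) {
    lexLen = 0;
    while (isspace(nextChar)) getChar();
    switch (charClass) {
        case LETTER:              /* names: LETTER {LETTER | DIGIT} */
            addChar(); getChar();
            while (charClass == LETTER || charClass == DIGIT) {
                addChar(); getChar();
            }
            nextToken = IDENT;
            break;
        case DIGIT:               /* integer literals: DIGIT {DIGIT} */
            addChar(); getChar();
            while (charClass == DIGIT) {
                addChar(); getChar();
            }
            nextToken = INT_LIT;
            break;
        case UNKNOWN:             /* parentheses and operators */
            nextToken = lookupChar(nextChar);
            getChar();
            break;
        case EOF:
            nextToken = EOF_TOKEN;
            strcpy(lexeme, "EOF");
            break;
    }
    printf("Next token is: %d Next lexeme is %s\n", nextToken, lexeme);
    return nextToken;
}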
1.9 The Parsing Problem

The part of the process of analyzing syntax that is referred to as syntax analysis is often called
parsing. We will use these two interchangeably.

This section discusses the general parsing problem and introduces the two main categories
of parsing algorithms, top-down and bottom-up, as well as the complexity of the parsing
process.

Introduction to Parsing

Parsers for programming languages construct parse trees for given programs. In some
cases, the parse tree is only implicitly constructed, meaning that perhaps only a traversal of the
tree is generated. But in all cases, the information required to build the parse tree is created
during the parse. Both parse trees and derivations include all of the syntactic information needed
by a language processor.

There are two distinct goals of syntax analysis: First, the syntax analyzer must check the input
program to determine whether it is syntactically correct. When an error is found, the analyzer
must produce a diagnostic message and recover. In this case, recovery means it must get back
to a normal state and continue its analysis of the input program. This step is required so that
the compiler finds as many errors as possible during a single analysis of the input program. If it
is not done well, error recovery may create more errors, or at least more error messages. The
second goal of syntax analysis is to produce a complete parse tree, or at least trace the structure
of the complete parse tree, for syntactically correct input. The parse tree (or its trace) is used
as the basis for translation.

Parsers are categorized according to the direction in which they build parse trees. The
two broad classes of parsers are top-down, in which the tree is built from the root downward
to the leaves, and bottom-up, in which the parse tree is built from the leaves upward to the
root.

Notational symbols for grammars

1. Terminal symbols—lowercase letters at the beginning of the alphabet (a, b, ...)

2. Nonterminal symbols—uppercase letters at the beginning of the alphabet (A, B, ...)

3. Terminals or nonterminals—uppercase letters at the end of the alphabet (W, X, Y, Z)

4. Strings of terminals—lowercase letters at the end of the alphabet (w, x, y, z)

5. Mixed strings (terminals and/or nonterminals)—lowercase Greek letters (α, β, δ, γ)

For programming languages, terminal symbols are the small-scale syntactic constructs of the
language, what we have referred to as lexemes. The nonterminal symbols of programming
languages are usually connotative names or abbreviations, surrounded by angle brackets—for
example, <while_statement>, <expr>, and <function_def>. The sentences of a language
(programs, in the case of a programming language) are strings of terminals. Mixed strings
describe right-hand sides (RHSs) of grammar rules and are used in parsing algorithms.

Top-Down Parsers

A top-down parser traces or builds a parse tree in preorder. A preorder traversal of a parse tree
begins with the root. Each node is visited before its branches are followed. Branches from a
particular node are followed in left-to-right order. This corresponds to a leftmost derivation.

Given a sentential form that is part of a leftmost derivation, the parser's task is to find the
next sentential form in that leftmost derivation. The general form of a left sentential form is xAα,
where, by our notational conventions, x is a string of terminal symbols, A is a nonterminal, and α is a
mixed string. Because x contains only terminals, A is the leftmost nonterminal in the sentential
form, so it is the one that must be expanded to get the next sentential form in a leftmost derivation.
Determining the next sentential form is a matter of choosing the correct grammar rule that has A
as its LHS. For example, if the current sentential form is xAα and the A-rules are A→bB, A→cBb,
and A→a, a top-down parser must choose among these three rules to get the next sentential form,
which could be xbBα, xcBbα, or xaα. This is the parsing decision problem for top-down parsers.

Different top-down parsing algorithms use different information to make parsing


decisions. The most common top-down parsers choose the correct RHS for the leftmost
nonterminal in the current sentential form by comparing the next token of input with the first
symbols that can be generated by the RHSs of those rules. Whichever RHS has that token at
the left end of the string it generates is the correct one. So, in the sentential form xAα, the parser
would use whatever token would be the first generated by A to determine which A-rule should
be used to get the next sentential form. In the example above, the three RHSs of the A-rules all
begin with different terminal symbols. The parser can easily choose the correct RHS based on
the next token of input, which must be a, b, or c in this example. In general, choosing the correct
RHS is not so straightforward, because some of the RHSs of the leftmost nonterminal in the
current sentential form may begin with a nonterminal.

The most common top-down parsing algorithms are closely related. A recursive-descent
parser is a coded version of a syntax analyzer based directly on the BNF description of the
syntax of the language. The most common alternative to recursive descent is to use a parsing table,
rather than code, to implement the BNF rules. Both of these, which are called LL algorithms,
are equally powerful, meaning they work on the same subset of all context-free grammars. The
first L in LL specifies a left-to-right scan of the input; the second L specifies that a leftmost
derivation is generated. The recursive-descent approach to implementing an LL parser is
introduced below.

Bottom-Up Parsers

A bottom-up parser constructs a parse tree by beginning at the leaves and progressing toward
the root. This parse order corresponds to the reverse of a rightmost derivation. That is, the
sentential forms of the derivation are produced in order of last to first. In terms of the derivation,
a bottom-up parser can be described as follows: Given a right sentential form α, the parser must
determine what substring of α is the RHS of the rule in the grammar that must be reduced to its
LHS to produce the previous sentential form in the rightmost derivation. For example, the first
step for a bottom-up parser is to determine which substring of the initial given sentence is the
RHS to be reduced to its corresponding LHS to get the second last sentential form in the
derivation. The process of finding the correct RHS to reduce is complicated by the fact that a
given right sentential form may include more than one RHS from the grammar of the language
being parsed. The correct RHS is called the handle. A right sentential form is a sentential form
that appears in a rightmost derivation.
Consider the following grammar and derivation:
S → aAc
A→aA | b
S => aAc => aaAc => aabc

A bottom-up parser of this sentence, aabc, starts with the sentence and must find the handle in
it. In this example, this is an easy task, for the string contains only one RHS, b. When the parser
replaces b with its LHS, A, it gets the second to last sentential form in the derivation, aaAc. In the
general case, as stated previously, finding the handle is much more difficult, because a sentential
form may include several different RHSs.
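Continuing the example, the complete bottom-up parse of aabc is the rightmost derivation in reverse;
at each step the handle is reduced to the LHS of its rule:

aabc   (handle b,   rule A → b)     reduces to   aaAc
aaAc   (handle aA,  rule A → aA)    reduces to   aAc
aAc    (handle aAc, rule S → aAc)   reduces to   S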
A bottom-up parser finds the handle of a given right sentential form by examining the symbols on
one or both sides of a possible handle. Symbols to the right of the possible handle are usually
tokens in the input that have not yet been analysed. The most common bottom-up parsing
algorithms are in the LR family, where the L specifies a left-to-right scan of the input and the R
specifies that a rightmost derivation is generated.

Recursive-Descent Parsing

A recursive-descent parser is so named because it consists of a collection of subprograms, many of which


are recursive, and it produces a parse tree in top-down order. This recursion is a reflection of the nature of
programming languages, which include several different kinds of nested structures. For example,
statements are often nested in other statements. Also, parentheses in expressions must be properly nested.
The syntax of these structures is naturally described with recursive grammar rules.
EBNF is ideally suited for recursive-descent parsers. The primary EBNF extensions are braces, which
specify that what they enclose can appear zero or more times, and brackets, which specify that what they
enclose can appear once or not at all. Note that in both cases, the enclosed symbols are optional. Consider
the following examples:
<if_statement> → if <logic_expr> <statement> [else <statement>]
<ident_list> → ident {, ident}
In the first rule, the else clause of an if statement is optional. In the second, an <ident_list> is an identifier, followed
by zero or more repetitions of a comma and an identifier.
A recursive-descent parser has a subprogram for each nonterminal in its associated grammar. The
responsibility of the subprogram associated with a particular nonterminal is as follows: When given an input
string, it traces out the parse tree that can be rooted at that nonterminal and whose leaves match the input
string. In effect, a recursive-descent parsing subprogram is a parser for the language (set of strings) that is
generated by its associated nonterminal.

Consider the following EBNF description of simple arithmetic expressions:

<expr> → <term> {(+ | -) <term>}


<term> → <factor> {(* | /) <factor>}
<factor> → id | int_constant | ( <expr> )

An EBNF grammar for arithmetic expressions, such as this one, does not force any associativity rule.
Therefore, when using such a grammar as the basis for a compiler, one must take care to ensure that the
code generation process, which is normally driven by syntax analysis, produces code that adheres to the
associativity rules of the language. This can be done easily when recursive-descent parsing is used. In the
following recursive-descent function, expr, the lexical analyzer is the function lex. It gets
the next lexeme and puts its token code in the global variable nextToken. The token codes are defined as
named constants.
A recursive-descent subprogram for a rule with a single RHS is relatively simple. For each terminal
symbol in the RHS, that terminal symbol is compared with nextToken. If they do not match, it is a syntax
error. If they match, the lexical analyser is called to get the next input token. For each nonterminal, the
parsing subprogram for that nonterminal is called.

The recursive-descent subprogram for the first rule in the previous example grammar, written in C, is

/* expr
   Parses strings in the language generated by the rule:
   <expr> -> <term> {(+ | -) <term>}
*/
void expr() {
    printf("Enter <expr>\n");

    /* Parse the first term */
    term();

    /* As long as the next token is + or -, get the next token and
       parse the next term */
    while (nextToken == ADD_OP || nextToken == SUB_OP) {
        lex();
        term();
    }
    printf("Exit <expr>\n");
} /* End of function expr */

Notice that the expr function includes tracing output statements, which are included to produce the example
output shown later.
Recursive-descent parsing subprograms are written with the convention that each one leaves the next
token of input in nextToken. So, whenever a parsing function begins, it assumes that nextToken has the
code for the leftmost token of the input that has not yet been used in the parsing process.
The part of the language that the expr function parses consists of one or more terms, separated by either
plus or minus operators. This is the language generated by the nonterminal <expr>. Therefore, first it calls
the function that parses terms (term). Then it continues to call that function as long as it finds ADD_OP or
SUB_OP tokens (which it passes over by calling lex). This recursive-descent function is simpler than most,
because its associated rule has only one RHS. Furthermore, it does not include any code for syntax error
detection or recovery, because there are no detectable errors associated with the grammar rule.
A recursive-descent parsing subprogram for a nonterminal whose rule has more than one RHS begins with
code to determine which RHS is to be parsed. Each RHS is examined (at compiler construction time) to
determine the set of terminal symbols that can appear at the beginning of sentences it can generate. By
matching these sets against the next token of input, the parser can choose the correct RHS.

The parsing subprogram for <term> is similar to that for <expr>:


/* term
   Parses strings in the language generated by the rule:
   <term> -> <factor> {(* | /) <factor>}
*/
void term() {
    printf("Enter <term>\n");

    /* Parse the first factor */
    factor();

    /* As long as the next token is * or /, get the next token and
       parse the next factor */
    while (nextToken == MULT_OP || nextToken == DIV_OP) {
        lex();
        factor();
    }
    printf("Exit <term>\n");
} /* End of function term */
The function for the <factor> nonterminal of our arithmetic expression grammar must choose between its
two RHSs. It also includes error detection. In the function for <factor>, the reaction to detecting a syntax
error is simply to call the error function. In a real parser, a diagnostic message must be produced when an
error is detected. Furthermore, parsers must recover from the error so that the parsing process can
continue.
/* factor
   Parses strings in the language generated by the rule:
   <factor> -> id | int_constant | ( <expr> )
*/
void factor() {
    printf("Enter <factor>\n");

    /* Determine which RHS */
    if (nextToken == IDENT || nextToken == INT_LIT)
        /* Get the next token */
        lex();
    /* If the RHS is ( <expr> ), call lex to pass over the left
       parenthesis, call expr, and check for the right parenthesis */
    else {
        if (nextToken == LEFT_PAREN) {
            lex();
            expr();
            if (nextToken == RIGHT_PAREN)
                lex();
            else
                error();
        } /* End of if (nextToken == LEFT_PAREN) */
        /* It was not an id, an integer literal, or a left parenthesis */
        else
            error();
    } /* End of else */
    printf("Exit <factor>\n");
} /* End of function factor */

Following is the trace of the parse of the example expression (sum + 47) / total, using the parsing functions
expr, term, and factor, and the function lex. Note that the parse begins by calling lex and the start symbol
routine, in this case, expr.

Next token is: 25 Next lexeme is (
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 11 Next lexeme is sum
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 21 Next lexeme is +
Exit <factor>
Exit <term>
Next token is: 10 Next lexeme is 47
Enter <term>
Enter <factor>
Next token is: 26 Next lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24 Next lexeme is /
Exit <factor>
Next token is: 11 Next lexeme is total
Enter <factor>
Next token is: -1 Next lexeme is EOF
Exit <factor>
Exit <term>
Exit <expr>
The parse tree traced by the parser for the preceding expression is shown below

Parse tree for (sum + 47) / total


Following is a grammatical description of the Java if statement:
<ifstmt> → if (<boolexpr>) <statement> [else <statement>]
The recursive-descent subprogram for this rule follows:
/* Function ifstmt
   Parses strings in the language generated by the rule:
   <ifstmt> -> if (<boolexpr>) <statement> [else <statement>]
*/
void ifstmt() {
    /* Be sure the first token is 'if' */
    if (nextToken != IF_CODE)
        error();
    else {
        /* Call lex to get to the next token */
        lex();
        /* Check for the left parenthesis */
        if (nextToken != LEFT_PAREN)
            error();
        else {
            /* Parse the Boolean expression */
            boolexpr();
            /* Check for the right parenthesis */
            if (nextToken != RIGHT_PAREN)
                error();
            else {
                /* Parse the then clause */
                statement();
                /* If an else is next, parse the else clause */
                if (nextToken == ELSE_CODE) {
                    /* Call lex to get over the else */
                    lex();
                    statement();
                } /* end of if (nextToken == ELSE_CODE ... */
            } /* end of else of if (nextToken != RIGHT ... */
        } /* end of else of if (nextToken != LEFT ... */
    } /* end of else of if (nextToken != IF_CODE ... */
} /* end of ifstmt */

Notice that this function uses parser functions for statements and Boolean expressions
that are not described here.
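To show how these subprograms might be driven, the following is a minimal, illustrative main program
(not part of the notes). It assumes the lexical analyzer sketched in Section 1.8 (in_fp, nextToken,
getChar, and lex) and the expr function given earlier; the input file name is also an assumption.

#include <stdio.h>

extern FILE *in_fp;      /* defined in the lexical analyzer sketch */
extern int  nextToken;
void getChar(void);
int  lex(void);
void expr(void);

int main(void) {
    /* "expr.in" is just an illustrative file name */
    if ((in_fp = fopen("expr.in", "r")) == NULL) {
        printf("ERROR - cannot open expr.in\n");
        return 1;
    }
    getChar();   /* prime the first character */
    lex();       /* prime nextToken with the first token */
    expr();      /* parse an expression */
    return 0;
}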
