Chapter 2
Syntax and Semantics
Zebiba N 1
Description of a Language
• Syntax: the form or structure of the
expressions, statements, and program units
Describing Syntax and Semantics
• Syntax is defined using some kind of rules
– Specifying how statements, declarations, and other
language constructs are written
• Semantics is more complex and harder to define; it is often
described informally, e.g., in natural-language documentation
• Example: if statement
– Syntax: if (<expr>) <statement>
– Semantics: if <expr> is true, execute <statement>
• Detecting syntax errors is easier; detecting semantic errors is
much harder
What is a Language?
• In programming language terminology, a
language is a set of sentences
• A sentence is a string of characters over some
alphabet
– The meaning of a “sentence” is very general. In
English, it may be an English sentence, a paragraph,
or all the text in a book, or hundreds of books, …
• Every C program that compiles properly is a
sentence of the C language
– No matter whether it is “hello world” or a program
with several million lines of code
Definition of a Language
• The syntax of a language can be defined by a
set of syntax rules
• The syntax rules of a language specify which
sentences are in the language, i.e., which
sentences are legal sentences of the language
Syntax Rules
• A hierarchical structure of the language
• A more concise representation:
<sentence> → <noun> <verb> <preposition> <noun>
<noun> → place
<verb> → “is” | “belongs”
<preposition> → “in” | “to”
• With these rules, we can generate the following sentences:
A is in B
B is in A
B belongs to A
• They are all in language X
– Its alphabet includes “is”, “belongs”, “in”, “to”, place
Checking Syntax of a Sentence
• How to check if the following sentence is in
the language X?
A belongs in B
• Idea: check if you can generate that sentence
This is called parsing
• How?
Try to match the input sentence with the
structure of the language
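A minimal sketch of this check in Python, assuming that a place lexeme is any single capital letter (the slides only show A and B) and that words are separated by spaces. The function names are illustrative only.

```python
# Grammar of language X, as given in the syntax rules:
#   <sentence>    -> <noun> <verb> <preposition> <noun>
#   <noun>        -> place
#   <verb>        -> "is" | "belongs"
#   <preposition> -> "in" | "to"
VERBS = {"is", "belongs"}
PREPOSITIONS = {"in", "to"}


def is_place(word):
    # Assumption: a place is a single capital letter such as A or B.
    return len(word) == 1 and word.isalpha() and word.isupper()


def in_language_x(text):
    """Return True if text has the shape <noun> <verb> <preposition> <noun>."""
    words = text.split()
    if len(words) != 4:
        return False
    noun1, verb, prep, noun2 = words
    return (is_place(noun1) and verb in VERBS
            and prep in PREPOSITIONS and is_place(noun2))


print(in_language_x("A belongs in B"))   # matches the sentence structure
print(in_language_x("A in belongs B"))   # verb and preposition swapped
```

Under these rules "A belongs in B" is accepted, because the grammar lets any verb combine with any preposition; the grammar checks form only, not whether the sentence sounds natural.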
Matching the Language Structure
[Figure: parse tree rooted at <sentence>, matched against the input sentence]
Formal Description of Syntax
Most widely known methods for describing
syntax:
• Context-Free Grammars
– Developed by Noam Chomsky
– Define a class of languages: context-free
languages
• Backus-Naur Form
– Invented by John Backus to describe ALGOL
– Equivalent to context-free grammars
Backus-Naur Form
• Backus-Naur Form (BNF)
– Add recursion to regular expressions
• Nested constructions
– Equivalent to CFGs in power
– CFG
expression → identifier | number | - expression
| ( expression )
| expression operator expression
operator → + | - | * | /
– BNF
expression ::= identifier | number | - expression
| ( expression )
| expression operator expression
operator ::= + | - | * | /
BNF
• BNF stands for either Backus-Naur Form or
Backus Normal Form
• BNF is a metalanguage used to describe the
grammar of a programming language
• BNF is formal and precise
– BNF is a notation for context-free grammars
• BNF is essential in compiler construction
• There are many dialects of BNF in use, but…
• …the differences are almost always minor
BNF Terminologies
• A lexeme is the lowest level syntactic unit of a
language (e.g., A, B, is, in)
• A token is a category of lexemes (e.g., place)
• A BNF grammar consists of four parts:
– The set of tokens and lexemes (terminals)
– The set of non-terminals, e.g., <sentence>, <verb>
– The start symbol, e.g., <sentence>
– The set of production rules, e.g.,
<sentence> → <noun> <verb> <preposition> <noun>
<noun> → place
<verb> → “is” | “belongs”
<preposition> → “in” | “to”
BNF Terminologies
• Tokens and lexemes are the smallest units of syntax
– Lexemes appear literally in program text
• Non-terminals stand for larger pieces of syntax
– Do NOT occur literally in program text
– The grammar says how they can be expanded into
strings of tokens or lexemes
• The start symbol is the particular non-terminal
that forms the starting point of generating a
sentence of the language
BNF Rules
• A rule has a left-hand side (LHS) and a right-hand
side (RHS)
– LHS is a single non-terminal (this is what makes the grammar context-free)
– RHS contains one or more terminals or non-terminals
– A rule tells how LHS can be replaced by RHS (moving down the parse tree),
or how RHS is grouped together to form a larger syntactic unit, the LHS
(moving up the parse tree)
– A nonterminal can have more than one RHS
– A syntactic list can be described using recursion
<ident_list> → ident | ident, <ident_list>
BNF
• < > indicate a nonterminal that needs to be
further expanded, e.g. <variable>
• Symbols not enclosed in < > are terminals;
they represent themselves, e.g. if, while, (
• The symbol ::= means is defined as
• The symbol | means or; it separates
alternatives, e.g. <addop> ::= + | -
• This is all there is to “plain” BNF; but we will
discuss extended BNF (EBNF) later in this
lecture
BNF uses recursion
• <integer> ::= <digit> | <integer> <digit>
or
<integer> ::= <digit> | <digit> <integer>
• Recursion is all that is needed (at least, in a
formal sense)
• "Extended BNF" allows repetition as well as
recursion
• Repetition is usually better when using BNF to
construct a compiler
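The difference between recursion and repetition can be sketched with two recognizers for <integer>. The first follows the right-recursive rule <integer> ::= <digit> | <digit> <integer> directly; the second uses a loop, which corresponds to an EBNF-style rule <integer> ::= <digit> {<digit>} (that EBNF form is an assumption here, written in the notation introduced later). The function names are illustrative.

```python
DIGITS = set("0123456789")


def is_integer_recursive(s):
    # <integer> ::= <digit> | <digit> <integer>
    if len(s) == 0 or s[0] not in DIGITS:
        return False
    if len(s) == 1:
        return True                      # first alternative: a single <digit>
    return is_integer_recursive(s[1:])   # second alternative: <digit> <integer>


def is_integer_repetition(s):
    # EBNF-style: <integer> ::= <digit> {<digit>}
    if len(s) == 0 or s[0] not in DIGITS:
        return False
    i = 1
    while i < len(s) and s[i] in DIGITS:  # the { } part becomes a loop
        i += 1
    return i == len(s)


for text in ["2024", "7", "12a", ""]:
    print(repr(text), is_integer_recursive(text), is_integer_repetition(text))
```

Both functions accept the same strings. The loop version avoids one function call per digit, which is one reason repetition tends to be convenient when a grammar is turned into code by hand.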
BNF Examples I
• <digit> ::=
0|1|2|3|4|5|6|7|8|9
BNF Examples II
• <unsigned integer> ::=
<digit> | <unsigned integer> <digit>
• <integer> ::=
<unsigned integer>
| + <unsigned integer>
| - <unsigned integer>
BNF Examples III
• <identifier> ::=
<letter>
| <identifier> <letter>
| <identifier> <digit>
• <block> ::= { <statement list> }
• <statement list> ::=
<statement>
| <statement list> <statement>
BNF Examples IV
• <statement> ::=
<block>
| <assignment statement>
| <break statement>
| <continue statement>
| <do statement>
| <for loop>
| <goto statement>
| <if statement>
| ...
Limitations of BNF
• No easy way to impose length limitations, such
as maximum length of variable names
• No easy way to describe ranges, such as 1 to 31
• No way at all to impose distributed
requirements, such as, a variable must be
declared before it is used
• Describes only syntax, not semantics
• Nothing clearly better has been devised
Grammar and Derivation
A grammar is a generative device for defining a language.
- The sentences of the language are generated through a sequence of applications of
the rules, beginning with a special nonterminal of the grammar called the start
symbol.
<program> → <stmts>
<stmts> → <stmt> | <stmt> ; <stmts>
<stmt> → <var> = <expr>
<var> → a | b | c | d
<expr> → <term> + <term> | <term> - <term>
<term> → <var> | const
<program> is the start symbol (a non-terminal).
a, b, c, d, const, +, -, ;, = are the terminals
• A derivation is a repeated application of rules,
starting with the start symbol and ending with a
sentence (all terminal symbols),
• e.g. a=b+const.
<program> => <stmts>
=> <stmt>
=> <var> = <expr>
=> a = <expr>
=> a = <term> + <term>
=> a = <var> + <term>
=> a = b + <term>
=> a = b + const
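The derivation above can be replayed mechanically. The following is a sketch, not a general parser: the rule choices are supplied by hand, and each step replaces the leftmost non-terminal. The second <stmts> alternative is written as <stmt> ; <stmts>, which is an assumption based on ; being listed among the terminals. The names GRAMMAR and leftmost_derivation are illustrative.

```python
GRAMMAR = {
    "<program>": [["<stmts>"]],
    "<stmts>": [["<stmt>"], ["<stmt>", ";", "<stmts>"]],  # second form assumed
    "<stmt>": [["<var>", "=", "<expr>"]],
    "<var>": [["a"], ["b"], ["c"], ["d"]],
    "<expr>": [["<term>", "+", "<term>"], ["<term>", "-", "<term>"]],
    "<term>": [["<var>"], ["const"]],
}


def leftmost_derivation(start, choices):
    """Apply each (lhs, rhs) choice to the leftmost non-terminal in turn."""
    form = [start]
    steps = [" ".join(form)]
    for lhs, rhs in choices:
        if rhs not in GRAMMAR[lhs]:
            raise ValueError(f"{lhs} -> {' '.join(rhs)} is not a rule")
        # Non-terminals are the symbols written as <...>.
        idx = next(i for i, sym in enumerate(form) if sym.startswith("<"))
        if form[idx] != lhs:
            raise ValueError(f"leftmost non-terminal is {form[idx]}, not {lhs}")
        form[idx:idx + 1] = rhs
        steps.append(" ".join(form))
    return steps


choices = [
    ("<program>", ["<stmts>"]),
    ("<stmts>", ["<stmt>"]),
    ("<stmt>", ["<var>", "=", "<expr>"]),
    ("<var>", ["a"]),
    ("<expr>", ["<term>", "+", "<term>"]),
    ("<term>", ["<var>"]),
    ("<var>", ["b"]),
    ("<term>", ["const"]),
]
steps = leftmost_derivation("<program>", choices)
for step in steps:
    print("=>", step)
```

Every intermediate line is a sentential form; the last one contains only terminals and is therefore a sentence of the language.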
Parse Tree
• Shows the hierarchical structure of a sentence of the language
• A hierarchical representation of a derivation
                <program>
                    |
                 <stmts>
                    |
                 <stmt>
               /    |    \
          <var>     =     <expr>
            |            /   |   \
            a       <term>   +   <term>
                      |            |
                    <var>        const
                      |
                      b
Sentence: a = b + const
Grammar and Parse Tree
• The grammar can be viewed as a set of rules
that say how to build a parse tree
• Put the start symbol <S> at the root of the tree
• Add children to every non-terminal, following
any one of the rules for that non-terminal
• Done when all the leaves are tokens
• Read off leaves from left to right—that is the
string derived by the tree
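As an illustration of the last step, the sketch below stores the parse tree for a = b + const as nested tuples and reads the leaves from left to right. The tuple representation (symbol, children) is an assumption made for this example.

```python
# A node is (symbol, children); a leaf (token) has an empty child list.
tree = ("<program>", [
    ("<stmts>", [
        ("<stmt>", [
            ("<var>", [("a", [])]),
            ("=", []),
            ("<expr>", [
                ("<term>", [("<var>", [("b", [])])]),
                ("+", []),
                ("<term>", [("const", [])]),
            ]),
        ]),
    ]),
])


def leaves(node):
    """Collect the leaves of the tree from left to right."""
    symbol, children = node
    if not children:
        return [symbol]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


print(" ".join(leaves(tree)))
```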
Ambiguity in Grammars
• If a sentential form can be generated by two or
more distinct parse trees, the grammar is said to
be ambiguous, because that sentential form can then
be given two or more different meanings
• Problem with ambiguity:
– Consider the following grammar and the sentence
a+b*c
<exp> → <exp> + <exp> | <exp> * <exp>
| (<exp>) | a | b | c
An Ambiguous Grammar
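• Two different parse trees for a+b*c
[Figure: two parse trees rooted at <exp>. The left tree groups a and b first, i.e., (a+b)*c; the right tree groups b and c first, i.e., a+(b*c).]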
Consequences
• The compiler will generate different code,
depending on which parse tree it builds
– According to convention, we would like to use the
parse tree at the right, i.e., performing a+(b*c)
• Cause of the problem:
The grammar does not encode operator precedence
– Applies when the order of evaluation is not
completely decided by parentheses
– Each operator has a precedence level, and those with
higher precedence are performed before those with
lower precedence, as if parenthesized
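To see why the choice of parse tree matters, the sketch below evaluates both trees for a+b*c. The trees are written as nested tuples (operator, left, right), and the values of a, b, and c are arbitrary assumptions chosen for the example.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def evaluate(tree, env):
    """Evaluate a tree that is either a variable name or (op, left, right)."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))


left_tree = ("*", ("+", "a", "b"), "c")    # + below *: (a+b)*c
right_tree = ("+", "a", ("*", "b", "c"))   # * below +: a+(b*c)
env = {"a": 1, "b": 2, "c": 3}             # example values (assumed)

print(evaluate(left_tree, env))    # (1+2)*3
print(evaluate(right_tree, env))   # 1+(2*3)
```

The same sentence yields 9 with the left tree and 7 with the right tree, so an ambiguous grammar leaves the meaning of the program undecided.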
Putting Semantics into Grammar
<exp> → <exp> + <exp> | <exp> * <exp>
| (<exp>) | a | b | c
• To fix the precedence problem, we modify the
grammar so that it is forced to put * below +
in the parse tree
<exp> → <exp> + <exp> | <mulexp>
<mulexp> → <mulexp> * <mulexp>
| (<exp>) | a | b | c
Parse tree for a+b*c (single-child steps such as <exp> → <mulexp> are abbreviated):
            <exp>
          /   |   \
     <exp>    +    <exp>
       |             |
       a    <mulexp> * <mulexp>
               |         |
               b         c
Our new grammar generates the same language as before, but no longer generates parse
trees with incorrect precedence.
Semantics of Associativity
• A grammar can also handle the semantics of operator
associativity.
• When an expression includes two operators that have the
same precedence (as * and / usually have),
for example A / B * C, a semantic rule is required
to specify which operator is applied first.
[Figure: two parse trees rooted at <exp>, one grouping b and c first and the other grouping a and b first.]
Operator Associativity
• Applies when the order of evaluation is not
decided by parentheses or by precedence
• Left-associative operators group operands left
to right: a+b+c+d = ((a+b)+c)+d
• Right-associative operators group operands
right to left: a+b+c+d = a+(b+(c+d))
• Most operators in most languages are left-associative,
but there are exceptions, e.g., assignment in C is
right-associative: a=b=0 means a=(b=0)
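For + the grouping does not change the result, but for an operator such as - it does. The sketch below folds a list of operands from the left and from the right; the operand values and helper names are assumptions made for illustration.

```python
from functools import reduce


def eval_left(operands, op):
    # Left-associative grouping: ((x1 op x2) op x3) ...
    return reduce(op, operands)


def eval_right(operands, op):
    # Right-associative grouping: x1 op (x2 op (x3 ...))
    result = operands[-1]
    for x in reversed(operands[:-1]):
        result = op(x, result)
    return result


def subtract(x, y):
    return x - y


def add(x, y):
    return x + y


values = [10, 4, 3]
print(eval_left(values, subtract))    # (10 - 4) - 3
print(eval_right(values, subtract))   # 10 - (4 - 3)
print(eval_left(values, add), eval_right(values, add))
```

Subtraction gives 3 when grouped left to right and 9 when grouped right to left, while addition gives 17 either way.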
Dangling Else in Grammars
<stmt> → <if-stmt> | s1 | s2
<if-stmt> → if <expr> then <stmt> else <stmt>
| if <expr> then <stmt>
<expr> → e1 | e2
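Different Parse Trees
Sentence: if e1 then if e2 then s1 else s2
• First parse tree: the outer <if-stmt> uses if <exp> then <stmt> else <stmt>,
so the else belongs to the outer if: if e1 then (if e2 then s1) else s2
• Second parse tree: the outer <if-stmt> uses if <exp> then <stmt>, and the inner
<if-stmt> uses if <exp> then <stmt> else <stmt>, so the else belongs to the
inner if: if e1 then (if e2 then s1 else s2)
• Most languages that have this problem choose the second parse tree:
else goes with nearest unmatched then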
Eliminating the Ambiguity
<stmt> → <if-stmt> | s1 | s2
<if-stmt> → if <expr> then <stmt> else <stmt>
| if <expr> then <stmt>
<expr> → e1 | e2
• Requirement: if the <stmt> between then and else expands into an if,
that if must already have its own else.
• First, we make a new non-terminal <full-stmt> that generates
everything <stmt> generates, except that it cannot generate
if statements with no else:
<full-stmt> → <full-if> | s1 | s2
<full-if> → if <expr> then <full-stmt> else <full-stmt>
Eliminating the Ambiguity
<stmt> → <if-stmt> | s1 | s2
<if-stmt> → if <expr> then <full-stmt> else <stmt>
| if <expr> then <stmt>
<expr> → e1 | e2
The effect is that the new grammar can match an else part
with an if part only if all the nearer if parts are already
matched.
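The same "nearest unmatched then" behaviour can be sketched on the parser side. The recursive-descent parser below does not implement the rewritten grammar literally; it is a common alternative technique in which the parser, after the then-part, greedily takes an else if one follows. Tokens are assumed to be separated by spaces, and the tuple tree shape is an assumption for this example.

```python
EXPRS = {"e1", "e2"}
SIMPLE = {"s1", "s2"}


def parse_stmt(tokens, pos):
    """Parse one <stmt> at tokens[pos]; return (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok in SIMPLE:
        return tok, pos + 1
    if tok == "if":
        if (pos + 2 >= len(tokens) or tokens[pos + 1] not in EXPRS
                or tokens[pos + 2] != "then"):
            raise SyntaxError("expected: if <expr> then")
        cond = tokens[pos + 1]
        then_part, pos = parse_stmt(tokens, pos + 3)
        # Greedy choice: an else seen here is attached to this if,
        # which is the nearest if that has no else yet.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_part, else_part), pos
        return ("if", cond, then_part), pos
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    tokens = text.split()
    tree, pos = parse_stmt(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return tree


print(parse("if e1 then if e2 then s1 else s2"))
```

For the dangling-else sentence the result is an outer if without an else whose then-part is the complete inner if-then-else, i.e., the else is paired with if e2.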
Languages That Don’t Dangle
• Some languages define if-then-else in a way that forces the
programmer to be clearer
Extended BNF
• The following are pretty standard:
– [ ] enclose an optional part of the rule
<if_stmt> → if (<expression>) <statement> [else <statement>]
Without the use of the brackets, the syntactic description of this statement
would require the following two rules:
<if_stmt> → if (<expression>) <statement>
| if (<expression>) <statement> else <statement>
– { } mean the enclosed part can be repeated any number of times (including zero)
– ( ) enclose a list of multiple-choice options. When a single element must be chosen
from a group, the options are placed in parentheses and separated by the OR operator, |.
BNF
<expr> → <expr> + <term>
| <expr> - <term>
| <term>
<term> → <term> * <factor>
| <term> / <factor>
| <factor>
EBNF
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
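The EBNF form maps almost line by line onto a recursive-descent parser: each { } becomes a while loop and each ( | ) becomes a test on the next token. The sketch below evaluates expressions while parsing. The rule for <factor> is not given in the slides, so <factor> → number | ( <expr> ) is an assumption, as are the class and function names.

```python
import re


def tokenize(text):
    # Numbers, operators and parentheses; any other character is ignored.
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # <expr> -> <term> {(+ | -) <term>}
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # <term> -> <factor> {(* | /) <factor>}
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # Assumed rule: <factor> -> number | ( <expr> )
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected )")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value


print(evaluate("1 + 2 * 3"))
print(evaluate("(1 + 2) * 3"))
print(evaluate("10 - 4 - 3"))
```

Because the loops accumulate from the left, the operators come out left-associative, and because <term> is parsed below <expr>, * and / bind tighter than + and -.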
EBNF Descriptions and Rules
• Each Description is a list of Rules
• Rule Form: LHS ⇐ RHS (read ⇐ as “is defined as”)
• Rule Names (LHS) are italicized, hyphenated words
• Control Forms in RHS
– Sequence: items appear left to right; order is important
– Choice: alternatives separated by | (stroke); exactly
one item is chosen from the alternatives
– Option: optional item enclosed between [ and ]; it
can be included or discarded
– Repetition: repeatable item enclosed between { and };
it can be repeated 0 or more times
An EBNF Description of Integers
• A symbol (sequence of characters) is classified legal by an EBNF
rule if we can process all the characters in the symbol when we
reach the end of the right hand side of the EBNF rule.
digit ⇐ 0|1|2|3|4|5|6|7|8|9
integer ⇐ [+|-]digit{digit}
digit is defined as any of the alternatives 0 through 9
integer is defined as a sequence of three items:
(1)an optional sign (if it is included, it must be the alternative + or -),
followed by
(2) any digit, followed by
(3) a repetition of zero or more digits.
The integer RHS combines and illustrates all EBNF
control forms: sequence, option, alternative, repetition.
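The integer rule translates directly into a regular expression, which gives a quick way to test whether a symbol is legal. This is a sketch using Python's re module; the name is_integer is illustrative.

```python
import re

# [+|-]       optional sign     ->  [+-]?
# digit       exactly one digit ->  [0-9]
# {digit}     zero or more      ->  [0-9]*
INTEGER = re.compile(r"[+-]?[0-9][0-9]*")


def is_integer(symbol):
    # fullmatch requires the whole symbol to be consumed by the rule.
    return INTEGER.fullmatch(symbol) is not None


for symbol in ["+42", "-7", "007", "", "+", "4-2", "1.5"]:
    print(repr(symbol), is_integer(symbol))
```

Using fullmatch mirrors the requirement that all characters of the symbol are processed by the time the end of the right-hand side is reached.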
Semantics
Describing the meaning of a program or of a statement or
group of statements.
There is no single widely acceptable notation or formalism for
describing semantics
There are several needs for a methodology and notation for describing semantics.
Meeting these needs requires more formal methods of defining semantics, so we turn
to:
• Operational Semantics:
how the statement will be executed
• Axiomatic Semantics:
what results to expect from the statement
• Denotational Semantics:
a functional way of mapping the effects of a statement
Operational Semantics
• This can be thought of as “tracing” through a
program to see what effects an instruction will have
• Example: C for-loop
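The slide's own for-loop example is not preserved here, so the following is only a sketch of the usual idea: a C loop for (expr1; expr2; expr3) body is described by lower-level steps (run expr1 once, test expr2, run the body, run expr3, jump back). The Python function run_for and the dictionary used as machine state are assumptions made for illustration.

```python
def run_for(state, init, cond, step, body):
    """Trace for (init; cond; step) body using only lower-level steps."""
    init(state)                  #       expr1;
    while True:                  # loop:
        if not cond(state):      #       if expr2 == 0 goto out
            break
        body(state)              #       body
        step(state)              #       expr3; goto loop
    return state                 # out:


# C loop being modelled:  for (i = 0; i < 4; i++) total += i;
state = {"total": 0}
run_for(
    state,
    init=lambda s: s.update(i=0),
    cond=lambda s: s["i"] < 4,
    step=lambda s: s.update(i=s["i"] + 1),
    body=lambda s: s.update(total=s["total"] + s["i"]),
)
print(state)
```

Tracing the state after each step is exactly the kind of description operational semantics gives: the meaning of the loop is the sequence of state changes it causes.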
Axiomatic Semantics
• Used mainly to prove correctness of code
– Each statement in the language has associated assertions – what we
expect to be true before and after the statement executes
– We list these assertions as pre- and post-conditions that specify how
the machine changes (changes to variables)
– Given the state of the machine prior to executing a statement, we
can then determine what must be true afterward
• The basic form of an axiomatic semantics statement is {P} S {Q}, where P is the
pre-condition, S is the statement, and Q is the post-condition
Pre and Post-condition
• We will start with a given post-condition and derive the
weakest pre-condition
– We work backwards mainly because we will start with an
overall goal in mind for the given statement or program
– We want to derive the weakest pre-condition for a given post-
condition because this is the least restrictive pre-condition that
will guarantee validity
• Weakest means most general – what is the greatest range of values for a
given variable such that the result will be true?
• For example, consider the assignment statement
– sum = 2*x+1;
• with post-condition {sum > 1}
• Possible pre-conditions are {x > 10}, {x > 50} and {x > 1000}
• But the weakest pre-condition is {x > 0}
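The claim can be checked by brute force over a finite sample of integers. This is an illustrative sketch, not a proof: the sample range and the helper names are assumptions.

```python
def post_holds(x):
    total = 2 * x + 1    # sum = 2*x+1;  (named total to avoid Python's built-in sum)
    return total > 1     # post-condition {sum > 1}


SAMPLE = range(-100, 101)   # assumed finite sample of integer values for x


def is_valid(pre):
    # A pre-condition is valid if every x satisfying it leads to the post-condition.
    return all(post_holds(x) for x in SAMPLE if pre(x))


def is_weakest(pre):
    # The weakest pre-condition admits exactly the x values that work.
    return all(pre(x) == post_holds(x) for x in SAMPLE)


candidates = [
    ("x > 50", lambda x: x > 50),
    ("x > 10", lambda x: x > 10),
    ("x > 0", lambda x: x > 0),
    ("x > -1", lambda x: x > -1),
]
for name, pre in candidates:
    print(name, "valid:", is_valid(pre), "weakest:", is_weakest(pre))
```

On this sample {x > 50} and {x > 10} are valid but more restrictive than needed, {x > 0} is valid and weakest, and {x > -1} is not valid because x = 0 gives sum = 1.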
Assignment Statement Rule
• We will use the following notation for an assignment
statement axiomatic rule:
– {Q[x→E]} x = E {Q}
• This is read as follows:
– If Q[x→E] is true before the assignment, then Q is true afterward
• The notation Q[x→E] means Q with all instances of x replaced by E
– Examples:
• a = b / 2 - 1; {a < 10}
– We replace a in {a < 10} with b / 2 - 1 and solve for b, thus the pre-condition is
{b / 2 - 1 < 10} or {b < 22}
– So we have: {b < 22} a = b / 2 - 1; {a < 10}, that is, if b < 22 prior to the
assignment statement, then a will be less than 10 afterward
• x = 2 * y - 3; {x > 25}
– pre-condition is {2 * y - 3 > 25} or {y > 14}
• c = d * e - 4; {c > 0}
– pre-condition is {d * e - 4 > 0} or {d * e > 4}; this can be listed as {d > 4 / e}
only when e > 0 (or as {e > 4 / d} only when d > 0), because dividing by a
negative value reverses the inequality
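The substitution in the assignment rule is function composition: the pre-condition is the post-condition applied to the value of E. The sketch below checks the three examples on a finite sample of integers, treating / as ordinary real division (an assumption; C integer division would behave differently). The helper wp_assign is illustrative.

```python
def wp_assign(expr, post):
    """Pre-condition of `x = expr` for post-condition `post`: post with x replaced by expr."""
    return lambda *args: post(expr(*args))


SAMPLE = range(-50, 51)   # assumed finite sample of integer values

# a = b / 2 - 1;  {a < 10}
pre_a = wp_assign(lambda b: b / 2 - 1, lambda a: a < 10)
print(all(pre_a(b) == (b < 22) for b in SAMPLE))

# x = 2 * y - 3;  {x > 25}
pre_x = wp_assign(lambda y: 2 * y - 3, lambda x: x > 25)
print(all(pre_x(y) == (y > 14) for y in SAMPLE))

# c = d * e - 4;  {c > 0}
pre_c = wp_assign(lambda d, e: d * e - 4, lambda c: c > 0)
print(all(pre_c(d, e) == (d * e > 4) for d in SAMPLE for e in SAMPLE))

# Rewriting d * e > 4 as d > 4 / e is not safe for negative e:
d, e = -3, -2
print(pre_c(d, e), d > 4 / e)
```

The last line shows a case where d * e > 4 holds but d > 4 / e does not, which is why {d * e > 4} is the safer way to state that pre-condition.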
Sequences
• In general, a series of statements S1, S2, S3, ..., Sn can
be expressed as:
– {P} S1 {Q1}; {Q1} S2 {Q2}; {Q2} S3 {Q3};
... {Qn-1} Sn {Q}
– This can be simplified to {P} S1; S2; S3; ...; Sn {Q}
– Therefore, we can combine rules to show the axiomatic
semantics of a block of code
• Example:
– y = 3 * x + 1;
– x = y + 3;
• If our post-condition is {x < 10} then our pre-condition between the
two statements is {y+3 < 10} or {y < 7} and our pre-condition before
the first statement is {3 * x + 1 < 7} or {x < 2}
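A quick brute-force check of this example, as a sketch over an assumed finite sample of integers:

```python
def run(x):
    y = 3 * x + 1    # first statement
    x = y + 3        # second statement
    return x


SAMPLE = range(-100, 101)   # assumed finite sample of initial x values

# The post-condition {x < 10} should hold exactly when the pre-condition {x < 2} holds.
print(all((run(x) < 10) == (x < 2) for x in SAMPLE))
print(run(1), run(2))
```

With x = 1 the final value is 7, which satisfies the post-condition; with x = 2 it is 10, which does not, so {x < 2} cannot be weakened.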
Selection Axiomatic Semantics
• Given a statement: if (B) S1; else S2;
• The semantic rule is: from {B & P} S1 {Q} and {(!B) & P} S2 {Q},
conclude {P} if (B) S1; else S2; {Q}
– if Q is our post-condition, then there are two cases: when the if
statement’s condition is true, S1 runs under B & P, and when it is
false, S2 runs under !B & P, so we must derive a P that guarantees
the same post-condition whether B or !B is true
• Example:
– if (x > 0) y--;
else y++;
• Suppose the post-condition is {y > 0}
– the pre-condition for the if-clause is {y > 1}
– the pre-condition for the else-clause is {y > -1}
– the condition {y > 1} implies the condition {y > -1} (that is, if {y
> 1} is true, then {y > -1} must also be true)
• So, we select {y > 1} as our pre-condition, since it is the weakest
condition on y that works for both branches
– we cannot use {y > -1} because, if x > 0 and y = 1, the if-clause
leaves y = 0 and our post-condition is not true
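The choice of pre-condition for the selection example can be checked by running the statement on a small grid of values. This is a sketch; the grid bounds and function names are assumptions.

```python
def run_if(x, y):
    if x > 0:
        y -= 1     # y--;
    else:
        y += 1     # y++;
    return y


GRID = range(-5, 6)   # assumed sample of values for both x and y


def is_valid(pre):
    # Valid if the post-condition {y > 0} holds for every x and every y allowed by pre.
    return all(run_if(x, y) > 0 for x in GRID for y in GRID if pre(y))


print(is_valid(lambda y: y > 1))     # works for both branches
print(is_valid(lambda y: y > -1))    # fails when the if-clause runs
print(run_if(1, 1))                  # counterexample: x > 0, y = 1
```

{y > 1} guarantees the post-condition whichever branch runs, while {y > -1} admits x = 1, y = 1, for which the result is 0.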
Thank you