Standing on the shoulders of giants:
Learn from LL(1) to PEG parser the hard way
Kir Chou
1
2
https://www.youtube.com/watch?v=DZTLgVBxET4
About me
Presented at PyCon TW/JP since 2017
https://note35.github.io/about/
https://github.com/note35/Parser-Learning
3
Agenda
● Motivation
● What is parser in CPython?
● Parser 101 - CFG
● Parser 101 - Traditional parser (LL(1) / LR(0))
● Parser 102 - PEG and PEG parser
● Parser 102 - Packrat parser
● CPython’s PEG parser
● Take away
4
Motivation
5
Motivation
What’s New In Python 3.9?
PEP 617, CPython now uses a new parser based on PEG;
“IIRC, I took a Compiler class in school…”
6
Motivation (Cont.)
School taught us the brief concept of the Compiler’s frontend and backend.
School’s parser assignment used Bison + YACC.
And...
7
My motivation = Talk objectives
What is a PEG parser?
Why did Python use an LL(1) parser before?
Why did Guido choose a PEG parser?
What other parsers do we have?
What’s the difference between those parsers?
How to implement those parsers?
8
What is parser in CPython?
CPython DevGuide - Design of CPython’s Compiler
9
Compilation Steps
10
Source Code -(Lexer)-> Tokens -(Parser)-> Abstract Syntax Tree (AST) -(Compiler)-> Bytecode -(VM)-> Result
[The diagram also contains an "Import" box]
11
https://docs.python.org/3/library/tokenize.html#examples
Lexer
12
https://docs.python.org/3/library/ast.html
Parser
13
https://docs.python.org/3/library/dis.html#dis.disassemble
Compiler
= print(2*3+4)
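The three stages above can be tried on the slide's own input with the standard library. This is a minimal sketch: the variable names (SOURCE, tokens, tree, code) are made up for illustration, and the exact bytecode printed by dis differs between CPython versions.

```python
import ast
import dis
import io
import tokenize

SOURCE = "print(2*3+4)\n"

# Lexer: source code -> tokens
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
]

# Parser: source code -> AST (tokenization happens internally)
tree = ast.parse(SOURCE)

# Compiler: AST -> bytecode
code = compile(tree, filename="<example>", mode="exec")

if __name__ == "__main__":
    for name, text in tokens:
        print(f"{name:10} {text!r}")
    print(ast.dump(tree))
    dis.dis(code)
    exec(code)  # VM: bytecode -> result
```

In the AST, the argument of print is a BinOp whose operator is Add and whose left operand is the BinOp for 2*3, so operator precedence is already encoded in the tree.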
14
[Compilation Steps diagram again: Source Code, Tokens, Abstract Syntax Tree (AST), Bytecode, Result, connected by Lexer, Parser, Compiler, VM, plus Import]
Talk’s focus!
Parser 101 - CFG
Uncode - GATE Computer Science - Compiler Design Lecture
15
Grammar
Context Free Grammar (CFG)
16
rule: A -> B | a
Interpretation of this Grammar
“Both B and a can be derived from A”
Annotations on the rule:
-> : Derivation (*some papers write <-)
B : Non-terminal
| : AND (*supports ambiguous syntax)
a : Terminal
What is “Context Free”?
The left-hand side of every rule contains exactly one non-terminal and nothing else.
17
Valid CFG Example: S -> aSb
Invalid CFG Example: xSy -> axSyb
Syntax Analysis: Parse Tree
Concrete Syntax Tree (CST)
An ordered, rooted tree that represents
the syntactic structure of a string
according to some context-free
grammar.
Abstract Syntax Tree (AST)
A tree representation of the abstract
syntactic structure of source code
written in a programming language.
18
CFG Simplification
1. Ambiguous -> Unambiguous
2. Nondeterministic -> Deterministic
3. Left recursion -> No left recursion
19
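Ambiguous Definition
A grammar is ambiguous if its rules can generate more than one parse tree for the same input.
20
E -> E + E | E * E | Num
[Figure: two different parse trees built from this grammar, one with + at the root and one with * at the root]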
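To make the ambiguity concrete, the sketch below enumerates every parse tree of 2 + 3 * 4 under E -> E + E | E * E | Num and evaluates each one. The helper name parse_values is hypothetical, and brute-force enumeration is used only for illustration.

```python
def parse_values(tokens):
    """Return the value of every parse tree of `tokens` under
    E -> E + E | E * E | Num (one list entry per tree)."""
    if len(tokens) == 1:
        return [int(tokens[0])]            # E -> Num
    values = []
    # Try every operator as the root of the tree.
    for i in range(1, len(tokens), 2):
        op = tokens[i]
        for left in parse_values(tokens[:i]):
            for right in parse_values(tokens[i + 1:]):
                values.append(left + right if op == "+" else left * right)
    return values


results = parse_values("2 + 3 * 4".split())
print(results)
```

Two trees exist: with + at the root the value is 2 + (3 * 4) = 14, with * at the root it is (2 + 3) * 4 = 20. An unambiguous grammar must leave only one of them.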
Ambiguous -> Unambiguous
21
Step1
Rewrite Grammar
Before:
E -> E + E | E * E | Num
After:
E -> E + T | T
T -> T * F | F
F -> Num
Step2
Make sure the grammar only generates one tree
[Figure: the single parse tree produced by the rewritten grammar, built from E, T, F and N nodes]
Non-deterministic -> Deterministic
A grammar is non-deterministic when alternatives of the same rule share a common prefix.
22
Before:
A -> ab | ac
Rewrite Grammar (left factoring)
After:
A -> aA’
A’ -> b | c
*A non-deterministic grammar can be rewritten into more than one deterministic grammar.
Left recursion -> No left recursion
A grammar contains direct or indirect left recursion.
23
Before:
E -> E + T | T
T -> T * F | F
F -> Num
Rewrite Grammar
After:
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
In a top-down parser, the E in the first E + T derives a second E + T,
the E in that second E + T derives a third E + T,
and so on recursively, without ever consuming input.
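The rewrite above follows a mechanical pattern: A -> A x | y becomes A -> y A’, A’ -> x A’ | None. A minimal sketch of that pattern for direct left recursion only; the function name and the list-of-lists grammar encoding are assumptions made for this example.

```python
def remove_direct_left_recursion(name, alternatives):
    """Rewrite A -> A x | y into A -> y A', A' -> x A' | None.

    `alternatives` is a list of symbol lists; [] stands for the empty
    production (written None on the slide).
    """
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}        # nothing to rewrite
    new_name = name + "'"
    return {
        name: [alt + [new_name] for alt in others],
        new_name: [alt + [new_name] for alt in recursive] + [[]],
    }


grammar = remove_direct_left_recursion("E", [["E", "+", "T"], ["T"]])
print(grammar)
```

For E -> E + T | T this yields E -> T E' and E' -> + T E' | None, matching the rewritten grammar on the slide. Indirect left recursion needs an extra substitution step that this sketch does not cover.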
Recap: CFG Simplification
24
Before After
Ambiguous
Non-deterministic
Left Recursion
Parser 101 - Traditional parser
Uncode - GATE Computer Science - Compiler Design Lecture
25
Parser classification
26
[Figure: the same parse tree built in two ways.
Top-down Type: built from the root E down to the leaves N.
Bottom-up Type: built from the leaves N up to the root E.]
LL / LR Parser
LL(k) = Left-to-right, Leftmost derivation, k-token lookahead (k>0)
LR(k) = Left-to-right, Rightmost derivation, k-token lookahead (k>=0)
27
*Both LL and LR parsers scan the input string from left to right
Input String: 2 + 3 * 4
LL / LR Parser
28
*LL and LR parsers apply derivations in a different order.
[Figure: the parse tree of the input shown twice, one labeled "+ → *" and the other "* → +"]
LL / LR Parser
29
Input String: 2 + 3 * 4
I am "a token of number".
If I perform 1-token lookahead and meet "a token of +", what should I do next?
Top Down - Recursive descent parser
30
LL(k) - Implementation
31
Grammar
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
Input String: 2 + 3 * 4
Step1
*write function for each non-terminal
Step2
*parse from left to right
*perform k-lookahead
Step3
*recursively parse the input string, starting from the first rule parse_E()
[Figure: call sequence labeled parse_E(), parse_T(), parse_Tp(parse_F()), parse_Ep( )]
32
LL(1) - Example code
Grammar
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
*perform 1-lookahead
[Code screenshot with the derivations annotated]
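A recursive descent parser for this grammar can be sketched as below. It is not the code shown on the slide: the class layout, the lex helper and the choice to compute the value directly instead of building a tree are assumptions made for illustration.

```python
import re


def lex(text):
    """Split the input into number and operator tokens, plus an end marker."""
    return re.findall(r"\d+|[+*]", text) + ["$"]


class Parser:
    def __init__(self, text):
        self.tokens = lex(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]       # 1-token lookahead

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse_E(self):                     # E -> T E'
        return self.parse_Ep(self.parse_T())

    def parse_Ep(self, left):              # E' -> + T E' | None
        if self.peek() == "+":
            self.advance()
            return self.parse_Ep(left + self.parse_T())
        return left

    def parse_T(self):                     # T -> F T'
        return self.parse_Tp(self.parse_F())

    def parse_Tp(self, left):              # T' -> * F T' | None
        if self.peek() == "*":
            self.advance()
            return self.parse_Tp(left * self.parse_F())
        return left

    def parse_F(self):                     # F -> Num
        token = self.advance()
        if not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        return int(token)


def evaluate(text):
    parser = Parser(text)
    value = parser.parse_E()
    if parser.peek() != "$":
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"))
```

Each function needs only one token of lookahead (peek) to choose between its alternatives, which is what makes the grammar LL(1). The lexer in this sketch silently skips characters it does not know.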
Top Down - Non recursive descent parser
33
LL(1) - Parsing table
34
Step1
Build first/follow table for each non-terminal
Note: $ is the end marker (end of input)
Step2
Build parsing table based on first/follow table
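Both steps can be sketched for the LL(1) grammar used in this talk. The fixed-point iteration below is the textbook approach, not necessarily what the slide's screenshot shows; the dictionary encoding and the names first_of, compute_first_follow and build_table are hypothetical.

```python
EPS, END = "None", "$"   # "None" is the slide's empty production, "$" the end marker

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["Num"]],
}


def first_of(symbols, first):
    """FIRST set of a sequence of symbols."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)                        # every symbol can derive the empty string
    return result


def compute_first_follow(start="E"):
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:                         # repeat until no set grows any more
        changed = False
        for nt, alternatives in GRAMMAR.items():
            for alt in alternatives:
                before = len(first[nt])
                first[nt] |= first_of(alt, first)
                changed |= len(first[nt]) != before
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue           # FOLLOW is only defined for non-terminals
                    rest = first_of(alt[i + 1:], first)
                    before = len(follow[sym])
                    follow[sym] |= rest - {EPS}
                    if EPS in rest:
                        follow[sym] |= follow[nt]
                    changed |= len(follow[sym]) != before
    return first, follow


def build_table(first, follow):
    """Map (non-terminal, lookahead terminal) to the production to apply."""
    table = {}
    for nt, alternatives in GRAMMAR.items():
        for alt in alternatives:
            lookaheads = first_of(alt, first)
            if EPS in lookaheads:
                lookaheads = (lookaheads - {EPS}) | follow[nt]
            for terminal in lookaheads:
                table[nt, terminal] = alt
    return table


first, follow = compute_first_follow()
table = build_table(first, follow)
```

For this grammar the table has eight entries. The empty production of T’ is selected on + and $, which are exactly the members of FOLLOW(T’).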
LL(1) - Implementation
35
Step3
Implement with stack
(take shift/reduce action based on parsing table)
[Figure: parse tree of the input built from E and N nodes with + and *]
LL(1) - Example code
36
Grammar
E -> TE’
E’ -> +TE’ | None
T -> FT’
T’ -> *FT’ | None
F -> Num
[Code screenshot annotations: Non-terminal stack, Reduce (Derivation), Shift]
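A table-driven version of the same parser can be sketched with an explicit stack. The table is written out by hand for the grammar above, and the function name parse and the returned list of applied productions are assumptions for illustration. Expanding a non-terminal corresponds to the slide's "Reduce (Derivation)" label and matching a terminal to "Shift".

```python
import re

# (non-terminal, lookahead) -> right-hand side; [] is the "None" production.
TABLE = {
    ("E", "Num"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],
    ("T", "Num"): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [],
    ("T'", "$"): [],
    ("F", "Num"): ["Num"],
}
NON_TERMINALS = {"E", "E'", "T", "T'", "F"}


def parse(text):
    """Return the list of productions applied, or raise SyntaxError."""
    tokens = ["Num" if t.isdigit() else t for t in re.findall(r"\d+|\S", text)]
    tokens.append("$")
    stack = ["$", "E"]                     # start symbol on top of the end marker
    pos = 0
    derivation = []
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NON_TERMINALS:
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                raise SyntaxError(f"no rule for {top} on {lookahead!r}")
            derivation.append((top, rhs))
            stack.extend(reversed(rhs))    # leftmost symbol ends up on top
        elif top == lookahead:
            pos += 1                       # terminal matches: consume the token
        else:
            raise SyntaxError(f"expected {top!r}, got {lookahead!r}")
    return derivation


for lhs, rhs in parse("2 + 3 * 4"):
    print(lhs, "->", " ".join(rhs) or "None")
```

The printed productions form the leftmost derivation of the input, starting with E -> T E'.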
Bottom Up - LR(0) parser
37
LR(0) - Deterministic finite automaton
38
Step1
Build Deterministic Finite Automaton(DFA)
Item sets (the states S1 to S8 of the DFA):
{ E’ -> .E --- (1), E -> .E + T --- (2), E -> .T --- (3), T -> .T * Num --- (4), T -> .Num --- (5) }
{ E’ -> E. , E -> E. + T }
{ E -> T. , T -> T. * Num }
{ T -> Num. }
{ E -> E + .T , T -> .T * Num , T -> .Num }
{ T -> T * .Num }
{ E -> E + T. , T -> T. * Num }
{ T -> T * Num. }
Transitions between the states are labeled E, T, Num, + and *.
Left recursion support
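The item sets of this automaton can be computed with the usual closure and goto operations. A minimal sketch, assuming items are encoded as (production index, dot position) pairs; the function names are hypothetical, and the state numbering produced here need not match S1 to S8 on the slide.

```python
# Augmented grammar from the slide; production 0 is E' -> E.
PRODUCTIONS = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "Num")),
    ("T", ("Num",)),
]
NON_TERMINALS = {"E'", "E", "T"}


def closure(items):
    """Add X -> .rhs for every non-terminal X that appears right after a dot."""
    items = set(items)
    pending = list(items)
    while pending:
        prod, dot = pending.pop()
        rhs = PRODUCTIONS[prod][1]
        if dot < len(rhs) and rhs[dot] in NON_TERMINALS:
            for index, (lhs, _) in enumerate(PRODUCTIONS):
                if lhs == rhs[dot] and (index, 0) not in items:
                    items.add((index, 0))
                    pending.append((index, 0))
    return frozenset(items)


def goto(items, symbol):
    """Move the dot over `symbol` in every item where that is possible."""
    moved = {
        (prod, dot + 1)
        for prod, dot in items
        if dot < len(PRODUCTIONS[prod][1]) and PRODUCTIONS[prod][1][dot] == symbol
    }
    return closure(moved)


def build_states():
    """Return (states, transitions) of the LR(0) automaton."""
    states = [closure({(0, 0)})]
    transitions = {}
    for state in states:                   # `states` grows while we iterate
        symbols = {
            PRODUCTIONS[p][1][d] for p, d in state if d < len(PRODUCTIONS[p][1])
        }
        for symbol in sorted(symbols):
            target = goto(state, symbol)
            if target not in states:
                states.append(target)
            transitions[states.index(state), symbol] = states.index(target)
    return states, transitions


states, transitions = build_states()
print(len(states), "states,", len(transitions), "transitions")
```

For this grammar the construction yields eight states and nine transitions. Left recursion causes no trouble here because closure only adds items and never calls itself on the same rule.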
LR(0) - Parsing table
39
Step2
Build parsing table
(A parser like SLR(1) also requires the first/follow table)
[Table screenshot: Shift entries, Reduce (Derivation) entries and acc (accept)]
LR(0) - Implementation
40
Step3
Implement with stack
(take shift/reduce action based on parsing table)
[Figure: parse tree of the input built from E and N nodes with + and *]
LR(0) - Example code
41
Grammar
E -> E + T | T
T -> T * F | F
F -> Num
[Code screenshot annotations: Shift, Reduce (Derivation)]
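A shift/reduce loop driven by a hand-written table can be sketched as follows. Two assumptions are made loudly here: the table is for the smaller grammar of the DFA slide (E -> E + T | T, T -> T * Num | Num) rather than the one with F, and reductions are restricted by FOLLOW sets (SLR(1)), because the states containing both E -> T. and T -> T. * Num cannot be resolved by pure LR(0). State numbers are this sketch's own.

```python
import re

# Production number -> (left-hand side, right-hand side length, semantic action)
PRODUCTIONS = {
    1: ("E", 3, lambda e, _plus, t: e + t),    # E -> E + T
    2: ("E", 1, lambda t: t),                  # E -> T
    3: ("T", 3, lambda t, _star, n: t * n),    # T -> T * Num
    4: ("T", 1, lambda n: n),                  # T -> Num
}

# ACTION[state][terminal] = ("shift", state) | ("reduce", production) | ("accept",)
ACTION = {
    0: {"Num": ("shift", 3)},
    1: {"+": ("shift", 4), "$": ("accept",)},
    2: {"*": ("shift", 5), "+": ("reduce", 2), "$": ("reduce", 2)},
    3: {"+": ("reduce", 4), "*": ("reduce", 4), "$": ("reduce", 4)},
    4: {"Num": ("shift", 3)},
    5: {"Num": ("shift", 7)},
    6: {"*": ("shift", 5), "+": ("reduce", 1), "$": ("reduce", 1)},
    7: {"+": ("reduce", 3), "*": ("reduce", 3), "$": ("reduce", 3)},
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (4, "T"): 6}


def evaluate(text):
    tokens = re.findall(r"\d+|\S", text) + ["$"]
    states, values = [0], []               # parallel state and value stacks
    pos = 0
    while True:
        token = tokens[pos]
        kind = "Num" if token.isdigit() else token
        action = ACTION[states[-1]].get(kind)
        if action is None:
            raise SyntaxError(f"unexpected token {token!r}")
        if action[0] == "shift":
            states.append(action[1])
            values.append(int(token) if kind == "Num" else token)
            pos += 1
        elif action[0] == "reduce":
            lhs, size, semantic = PRODUCTIONS[action[1]]
            args = values[-size:]
            del states[-size:]             # pop the handle
            del values[-size:]
            values.append(semantic(*args))
            states.append(GOTO[states[-1], lhs])
        else:                              # accept
            return values[-1]


print(evaluate("2 + 3 * 4"))
```

For 2 + 3 * 4 the parser reduces 3 * 4 before it reduces the addition, so the * subtree is completed first even though + is read first.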
Parser 102 - PEG and PEG parser
42
Grammar
Parsing Expression Grammar (PEG)
43
rule: A -> B | a
*Difference from traditional CFG
A will try A -> B first.
Only after A -> B fails will A try A -> a.
Annotations on the rule:
-> : Derivation (*some papers write <-)
B : Non-Terminal
| : OR (if / elif / ...) (*disallows ambiguous syntax)
a : Terminal
*Introduced in 2002 (Packrat Parsing: Simple, Powerful, Lazy, Linear Time)
*Regular Expression support (EBNF grammar) comes from another paper
Example of difference
44
Grammar1: A -> a b | a
Grammar2: A -> a | a b
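● An LL/LR parser fails to complete when the input grammar is ambiguous.
● A PEG parser tries the alternatives in order and commits to the first one that matches. In Grammar2 the first alternative a always matches first, so the latter alternative a b will never succeed.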
“A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may
be arbitrary and lead to surprising parses.” (source)
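The effect of ordered choice on the two grammars can be checked with a tiny matcher. This is a sketch with hypothetical helper names, and it only handles alternatives made of terminals.

```python
def match_sequence(symbols, tokens, pos):
    """Match terminals one by one; return the new position or None on failure."""
    for symbol in symbols:
        if pos >= len(tokens) or tokens[pos] != symbol:
            return None
        pos += 1
    return pos


def match_ordered_choice(alternatives, tokens, pos=0):
    """PEG choice: commit to the first alternative that succeeds."""
    for alternative in alternatives:
        end = match_sequence(alternative, tokens, pos)
        if end is not None:
            return end
    return None


GRAMMAR1 = [["a", "b"], ["a"]]   # A -> a b | a
GRAMMAR2 = [["a"], ["a", "b"]]   # A -> a | a b

tokens = ["a", "b"]
consumed1 = match_ordered_choice(GRAMMAR1, tokens)
consumed2 = match_ordered_choice(GRAMMAR2, tokens)
print(consumed1, consumed2)
```

On the input a b, Grammar1 consumes both tokens while Grammar2 stops after a and leaves b unconsumed. The order of alternatives is therefore part of the meaning of a PEG.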
PEG Parser
PEG parser means “a parser generated from a PEG”.
A PEG parser can be a Packrat parser, or another traditional parser with a k-lookahead limitation. In most cases, “PEG parser” means Packrat parser.
45
[Diagram labels: CFG, EBNF grammar, PEG, Packrat parser, Traditional parser, PEG Parser]
Parser 102 - Packrat parser
46
Type of Packrat parser
47
[Figure: Top-down Type, the parse tree is built from the root E down to the leaves N]
A Packrat parser is a top-down parser.
Packrat Parsing - Implementation
48
Grammar (Left recursion support)
E -> E + T | T
T -> T * F | F
F -> Num
Input String: 2 + 3 * 4
Step1
*write function for each non-terminal (PEG rule)
Step2
*parse from left to right
*perform infinite lookahead + memoization
*Idea of memoization was introduced in 1970
Step3
*recursively parse the input string, starting from the first rule parse_E()
[Figure: call sequence labeled parse_E(), parse_E() and parse_T(), parse_T() and parse_F()]
Packrat Parsing - Example code
49
Grammar
E -> E + T | T
T -> T * F | F
F -> Num
Derivation
Memoization
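The core of a Packrat parser is a memo table keyed by (rule, position). The sketch below is not the slide's code: to keep it short it uses a right-recursive PEG (E <- T '+' E / T, T <- F '*' T / F), because plain memoization alone cannot handle the left-recursive grammar above. All names are hypothetical.

```python
import functools
import re


def memoize(rule):
    """Cache each rule result per input position."""
    @functools.wraps(rule)
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(self, pos)
        return self.memo[key]
    return wrapper


class PackratParser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[+*]", text)
        self.memo = {}                     # (rule name, position) -> result

    def expect(self, pos, text):
        return pos < len(self.tokens) and self.tokens[pos] == text

    # Each rule returns (value, next position) on success or None on failure.
    @memoize
    def parse_E(self, pos):                # E <- T '+' E / T
        left = self.parse_T(pos)
        if left is not None and self.expect(left[1], "+"):
            right = self.parse_E(left[1] + 1)
            if right is not None:
                return left[0] + right[0], right[1]
        return self.parse_T(pos)           # second alternative: memo hit

    @memoize
    def parse_T(self, pos):                # T <- F '*' T / F
        left = self.parse_F(pos)
        if left is not None and self.expect(left[1], "*"):
            right = self.parse_T(left[1] + 1)
            if right is not None:
                return left[0] * right[0], right[1]
        return self.parse_F(pos)           # second alternative: memo hit

    @memoize
    def parse_F(self, pos):                # F <- Num
        if pos < len(self.tokens) and self.tokens[pos].isdigit():
            return int(self.tokens[pos]), pos + 1
        return None


parser = PackratParser("2 + 3 * 4")
result = parser.parse_E(0)
print(result, len(parser.memo))
```

When the first alternative of a rule fails, the second alternative re-reads the result of its first sub-rule from the memo table instead of parsing again. That reuse is what keeps unlimited backtracking affordable.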
Packrat - what is memoization?
50
509. Fibonacci Number
fib(0) = 0
fib(1) = 1
fib(2) = fib(1) + fib(0) = 1
fib(3) = fib(2) + fib(1) = fib(1) + fib(0) + fib(1) = 2
...
[Figure: call tree of fib(4), with nodes 4, 3, 2, 2, 1, 1, 0, 1, 0]
if n = 4, we calculate
fib(2) and fib(0) twice, fib(1) thrice, fib(4) and fib(3) once
Time Complexity: O(2^n)
Packrat - what is memoization? (Cont.)
51
509. Fibonacci Number
if n = 4, we…
calculate fib(4), fib(3), fib(2), fib(1), fib(0) once
Time Complexity: O(2^n) => O(n)
Space Complexity: O(1) => O(n)
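The call counts above can be reproduced directly. This is a minimal sketch using functools.lru_cache as the memo table; the calls counter dictionary is added only for illustration.

```python
import functools

calls = {"plain": 0, "memo": 0}


def fib_plain(n):
    calls["plain"] += 1                    # every call recomputes its subtree
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)


@functools.lru_cache(maxsize=None)
def fib_memo(n):
    calls["memo"] += 1                     # runs only once per distinct n
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


result_plain = fib_plain(4)
result_memo = fib_memo(4)
print(result_plain, result_memo, calls)
```

For n = 4 the plain version runs 9 times and the memoized version 5 times, once for each of fib(4), fib(3), fib(2), fib(1) and fib(0).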
Left recursion in Packrat parser
52
Approach 1
if (count of operator) < (count function call):
return False
Approach 2
reverse the call stack (adopted in CPython!)
Source: Guido's Medium (Left-recursive PEG Grammars)
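One way to let a memoizing parser accept left-recursive rules is to seed the memo with a failure and then re-run the rule while each pass consumes more input. The sketch below is written in the spirit of the technique described in Guido's article; it is an assumption-laden illustration, not CPython's actual code, and all names are hypothetical.

```python
import re


def memoize_left_rec(rule):
    """Memoize a possibly left-recursive rule by growing a seed result."""
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key in self.memo:
            return self.memo[key]
        self.memo[key] = best = None       # seed: the recursive call fails at first
        while True:
            result = rule(self, pos)
            # Stop as soon as a pass no longer consumes more input.
            if result is None or (best is not None and result[1] <= best[1]):
                break
            self.memo[key] = best = result
        return best
    return wrapper


class Parser:
    def __init__(self, text):
        self.tokens = re.findall(r"\d+|[+*]", text)
        self.memo = {}

    def expect(self, pos, text):
        return pos < len(self.tokens) and self.tokens[pos] == text

    # Each rule returns (value, next position) on success or None on failure.
    @memoize_left_rec
    def parse_E(self, pos):                # E -> E + T | T
        left = self.parse_E(pos)
        if left is not None and self.expect(left[1], "+"):
            right = self.parse_T(left[1] + 1)
            if right is not None:
                return left[0] + right[0], right[1]
        return self.parse_T(pos)

    @memoize_left_rec
    def parse_T(self, pos):                # T -> T * F | F
        left = self.parse_T(pos)
        if left is not None and self.expect(left[1], "*"):
            right = self.parse_F(left[1] + 1)
            if right is not None:
                return left[0] * right[0], right[1]
        return self.parse_F(pos)

    def parse_F(self, pos):                # F -> Num
        if pos < len(self.tokens) and self.tokens[pos].isdigit():
            return int(self.tokens[pos]), pos + 1
        return None


result = Parser("2 + 3 * 4").parse_E(0)
print(result)
```

On the first pass the recursive call to parse_E at the same position returns the seeded failure, so only the alternative T can match. Each later pass reuses the previous result as its left operand, which builds the left-associative tree from the inside out.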
53
Normal Memoization
54
Left-recursion
Memoization
*perform
infinite-lookahead
Traditional parser V.S Packrat parser
55
Traditional parser vs Packrat parser
56
Columns: Packrat, Traditional
Scan: Packrat = Left-to-right (*Right-to-left memo); Traditional = Left-to-right
Left Recursion: Packrat = Supported (*not supported in the first paper); Traditional = LL needs to rewrite the grammar
Ambiguous: Packrat = Disallowed (determinism); Traditional = Allowed
Space Complexity: Packrat = O(Code Size) (space consumption); Traditional = O(Depth of Parse Tree)
Worst Time Complexity: Packrat = Super linear time (statelessness), *because of features like typedef in C; Traditional = Exponential time
Capability: Packrat = Basically covers all traditional cases (infinite lookahead); Traditional = No left recursion/ambiguity for LL, k-lookahead limitations for both (e.g. dangling else)
Red text: the 3 highlighted characteristics of the Packrat parser.
57
Parenthesized context managers
PEP 622/634/635/636 - Structural Pattern Matching
New rule in Python 3.10 based on PEG
CPython’s PEG parser
58
CPython Parser - Before/After
CPython 3.8 and earlier use an LL(1) parser written by Guido 30 years ago
The parser requires steps to generate a CST and convert the CST to an AST.
CPython 3.9 uses a PEG (Packrat) parser (infinite lookahead)
PEG rules support left recursion
No more CST to AST step - source
CPython 3.10 drops LL(1) parser support
59
This answers
“Why PEG?”
CPython Parser - Workflow
60
Meta Grammar
Tools/peg_generator/
pegen/metagrammar.gram
Grammar
Grammar/python.gram
Token
Grammar/Tokens
my_parser.py
my_parser.c
pegen
(PEG Parser)
Tools/peg_generator/
*CPython contains a PEG parser generator written in Python 3.8+ (because of the walrus operator)
Input: Meta Grammar Example
Syntax Directed Translation (SDT)
61
rule
non-Terminal
return type
PEG rule divider
PEG rule
action
(python code)
Parser header
(python code)
Output: Generated PEG Parser
(Partial code)
62
Recap: Benefit / Performance
Benefit
Grammar is more flexible: from LL(1) to LL(∞) (infinite lookahead)
Hardware supports Packrat’s memory consumption now
Skip intermediate parse tree (CST) construction
Performance
Within 10% of LL(1) parser both in speed and memory consumption (PEP 617)
63
Take away
64
Recap
● Parser 101 (Compiler class in school)
○ CFG
○ Traditional Parser
■ Top-down: LL(1)
■ Bottom-up: LR(0)
● Parser 102
○ PEG
○ Packrat Parser
● CPython
○ Parser in CPython
○ CPython’s PEG parser
65
66
Q. How to verify my understanding?
A. Get your hands dirty!
You can implement traditional parsers like LL(1) and LR(0), and a Packrat parser, from scratch!
Leetcode: 227. Basic Calculator II
Need Answer? note35/Parser-Learning
Q & A
67
Appendix
68
Related Articles
Guido van Rossum
PEG Parsing Series Overview
Bryan Ford
Packrat Parsing: Simple, Powerful, Lazy, Linear Time
Parsing Expression Grammars: A Recognition-Based Syntactic Foundation
69
Related Talks
Guido van Rossum @ North Bay Python 2019
Writing a PEG parser for fun and profit
Pablo Galindo and Lysandros Nikolaou @ Podcast.__init__
The Journey To Replace Python's Parser And What It Means For The Future
Emily Morehouse-Valcarcel @ PyCon 2018
The AST and Me
Alex Gaynor @ PyCon 2013
So you want to write an interpreter?
70
Thanks for listening!
71
