Lark Documentation
Erez Shinan
1 Philosophy
  1.1 Design Principles
  1.2 Design Choices
2 Features
  2.1 Main Features
  2.2 Extra features
3 Parsers
  3.1 Earley
  3.2 LALR(1)
  3.3 CYK Parser
7 Recipes
  7.1 Use a transformer to parse integer tokens
  7.2 Collect all comments with lexer_callbacks
  7.3 CollapseAmbiguities
  7.4 Keeping track of parents when visiting
8.2 Advanced Examples
9 Grammar Reference
  9.1 Definitions
  9.2 Terminals
  9.3 Rules
  9.4 Directives
11 API Reference
  11.1 Lark
  11.2 Tree
  11.3 Token
  11.4 Transformer, Visitor & Interpreter
  11.5 ForestVisitor, ForestTransformer, & TreeForestTransformer
  11.6 UnexpectedInput
  11.7 ParserPuppet
15 Install Lark
16 Syntax Highlighting
17 Resources
CHAPTER 1
Philosophy
Parsers are innately complicated and confusing. They’re difficult to understand, difficult to write, and difficult to use.
Even experts on the subject can become baffled by the nuances of these complicated state-machines.
Lark’s mission is to make the process of writing them as simple and abstract as possible, by following these design
principles:
1. Readability matters
2. Keep the grammar clean and simple
3. Don’t force the user to decide on things that the parser can figure out on its own
4. Usability is more important than performance
5. Performance is still very important
6. Follow the Zen of Python, whenever possible and applicable
In accordance with these principles, I arrived at the following design choices:
Grammars are the de-facto reference for your language, and for the structure of your parse-tree. For any non-trivial
language, the conflation of code and grammar always turns out convoluted and difficult to read.
The grammars in Lark are EBNF-inspired, so they are especially easy to read & work with.
The Earley algorithm can accept any context-free grammar you throw at it (i.e. any grammar you can write in EBNF,
it can parse). That makes it extremely friendly to beginners, who are not aware of the strange and arbitrary restrictions
that LALR(1) places on its grammars.
As the users grow to understand the structure of their grammar, the scope of their target language, and their perfor-
mance requirements, they may choose to switch over to LALR(1) to gain a huge performance boost, possibly at the
cost of some language features.
In short, “Premature optimization is the root of all evil.”
CHAPTER 2
Features
• Import rules and tokens from other Lark grammars, for code reuse and modularity.
• Support for external regex module (see here)
• Import grammars from Nearley.js (read more)
• CYK parser
• Visualize your parse trees as dot or png files (see example)
CHAPTER 3
Parsers
Lark implements the following parsing algorithms: Earley, LALR(1), and CYK
3.1 Earley
An Earley Parser is a chart parser capable of parsing any context-free grammar at O(n^3), and O(n^2) when the
grammar is unambiguous. It can parse most LR grammars at O(n). Most programming languages are LR, and can be
parsed in linear time.
Lark’s Earley implementation runs on top of a skipping chart parser, which allows it to use regular expressions, instead
of matching characters one-by-one. This is a huge improvement to Earley that is unique to Lark. This feature is used
by default, but can also be requested explicitly using lexer='dynamic'.
It’s possible to bypass the dynamic lexing, and use the regular Earley parser with a traditional lexer, that tokenizes as
an independent first step. Doing so will provide a speed benefit, but will tokenize without using Earley’s ambiguity-
resolution ability. So choose this only if you know why! Activate with lexer='standard'
SPPF & Ambiguity resolution
Lark implements the Shared Packed Parse Forest data-structure for the Earley parser, in order to reduce the space and
computation required to handle ambiguous grammars.
You can read more about SPPF here
As a result, Lark can efficiently parse and store every ambiguity in the grammar, when using Earley.
Lark provides the following options to combat ambiguity:
1. Lark will choose the best derivation for you (default). Users can choose between different disambiguation
strategies, and can prioritize (or demote) individual rules over others, using the rule-priority syntax.
2. Users may choose to receive the set of all possible parse-trees (using ambiguity='explicit'), and choose the best
derivation themselves. While simple and flexible, it comes at the cost of space and performance, and so it isn't
recommended for highly ambiguous grammars, or very long inputs.
3. As an advanced feature, users may use specialized visitors to iterate the SPPF themselves.
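To illustrate the second option above, here is a minimal sketch (hypothetical grammar, not taken from this document): the input "xxx" can be split as x + xx or as xx + x, so with ambiguity='explicit' the returned tree contains an _ambig node holding both derivations.

from lark import Lark

# The ! prefix keeps the matched tokens, so the two derivations produce distinct trees.
parser = Lark("""
    !start: a a
    !a: "x" | "xx"
""", ambiguity='explicit')

print(parser.parse("xxx").pretty())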
lexer="dynamic_complete"
Earley’s “dynamic” lexer uses regular expressions in order to tokenize the text. It tries every possible combination of
terminals, but it matches each terminal exactly once, returning the longest possible match.
That means, for example, that when lexer="dynamic" (which is the default), the terminal /a+/, when given the
text "aa", will return one result, aa, even though a would also be correct.
This behavior was chosen because it is much faster, and it is usually what you would expect.
Setting lexer="dynamic_complete" instructs the lexer to consider every possible regexp match. This ensures
that the parser will consider and resolve every ambiguity, even inside the terminals themselves. This lexer provides
the same capabilities as scannerless Earley, but with different performance tradeoffs.
Warning: This lexer can be much slower, especially for open-ended terminals such as /.*/
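A small sketch of the behavior described above (hypothetical grammar): under the default dynamic lexer the terminal A matches "aa" greedily and there is a single parse, while with dynamic_complete the same input becomes ambiguous, because A may also match a single "a".

from lark import Lark

parser = Lark("""
    start: A+
    A: /a+/
""", lexer='dynamic_complete', ambiguity='explicit')

# With the complete lexer, "aa" can be one A token or two, so an _ambig node appears.
print(parser.parse("aa").pretty())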
3.2 LALR(1)
LALR(1) is a very efficient, tried-and-tested parsing algorithm. It's incredibly fast and requires very little memory. It
can parse most programming languages (for example: Python and Java).
Lark comes with an efficient implementation that outperforms every other parsing library for Python (including PLY).
Lark extends the traditional YACC-based architecture with a contextual lexer, which automatically provides feedback
from the parser to the lexer, making the LALR(1) algorithm stronger than ever.
The contextual lexer communicates with the parser, and uses the parser’s lookahead prediction to narrow its choice of
terminals. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead
of all of the terminals. It’s surprisingly effective at resolving common terminal collisions, and allows one to parse
languages that LALR(1) was previously incapable of parsing.
(If you’re familiar with YACC, you can think of it as automatic lexer-states)
This is an improvement to LALR(1) that is unique to Lark.
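The contextual lexer is used automatically when parser='lalr', but it can also be named explicitly. A minimal sketch (the toy grammar is a placeholder, not from this document):

from lark import Lark

parser = Lark(r"""
    start: NAME "=" NUMBER

    %import common.CNAME -> NAME
    %import common.NUMBER
    %ignore " "
""", parser='lalr', lexer='contextual')

print(parser.parse("answer = 42"))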
CHAPTER 4
JSON parser - Tutorial
Lark is a parser - a program that accepts a grammar and text, and produces a structured tree that represents that text.
In this tutorial we will write a JSON parser in Lark, and explore Lark’s various features in the process.
It has 5 parts.
1. Writing the grammar
2. Creating the parser
3. Shaping the tree
4. Evaluating the tree
5. Optimizing
Knowledge assumed:
• Using Python
• A basic understanding of how to use regular expressions
Lark accepts its grammars in a format called EBNF. It basically looks like this:
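(The illustrative snippet is missing from this extraction. Roughly, and with hypothetical names, a Lark grammar is a list of definitions of this shape:)

rule_name : item_to_match another_item
          | an_alternative_sequence   -> optional_alias

TERMINAL_NAME : "string to match" | /regexp to match/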
In the case of JSON, the structure is simple: A json document is either a list, or a dictionary, or a string/number/etc.
The dictionaries and lists are recursive, and contain other json documents (or “values”).
Let’s write this structure in EBNF form:
value: dict
| list
| STRING
| NUMBER
| "true" | "false" | "null"
In %import statements, the arrow (->) can rename the imported terminals (for example, %import common.ESCAPED_STRING -> STRING). But that only adds obscurity in this case, so going forward we'll just use their
original names.
We’ll also take care of the white-space, which is part of the text.
%import common.WS
%ignore WS
We tell our parser to ignore whitespace. Otherwise, we’d have to fill our grammar with WS terminals.
By the way, if you’re curious what these terminals signify, they are roughly equivalent to this:
NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/
%ignore /[ \t\n\f\r]+/
Lark will accept this, if you really want to complicate your life :)
You can find the original definitions in common.lark. They don’t strictly adhere to json.org - but our purpose here is
to accept json, not validate it.
Notice that terminals are written in UPPER-CASE, while rules are written in lower-case. I’ll touch more on the
differences between rules and terminals later.
from lark import Lark

json_parser = Lark(r"""
    // ... (the value, list, dict and pair rules go here; the full grammar appears later in this chapter) ...

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS
    """, start='value')
As promised, Lark automagically creates a tree that represents the parsed text.
But something is suspiciously missing from the tree. Where are the curly braces, the commas and all the other
punctuation literals?
Lark automatically filters out literals from the tree, based on the following criteria:
• Filter out string literals without a name, or with a name that starts with an underscore.
• Keep regexps, even unnamed ones, unless their name starts with an underscore.
Unfortunately, this means that it will also filter out literals like “true” and “false”, and we will lose that information.
The next section, “Shaping the tree” deals with this issue, and others.
We now have a parser that can create a parse tree (or: AST), but the tree has some issues:
1. “true”, “false” and “null” are filtered out (test it out yourself!)
2. It has useless branches, like value, that clutter up our view.
I’ll present the solution, and then explain it:
?value: dict
| list
| string
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null
...
string : ESCAPED_STRING
1. Those little arrows signify aliases. An alias is a name for a specific part of the rule. In this case, we will name
the true/false/null matches, and this way we won’t lose the information. We also alias SIGNED_NUMBER to
mark it for later processing.
2. The question-mark prefixing value (“?value”) tells the tree-builder to inline this branch if it has only one member.
In this case, value will always have only one member, and will always be inlined.
3. We turned the ESCAPED_STRING terminal into a rule. This way it will appear in the tree as a branch. This is
equivalent to aliasing (like we did for the number), but now string can also be used elsewhere in the grammar
(namely, in the pair rule).
Here is the new grammar:
from lark import Lark

json_parser = Lark(r"""
    ?value: dict
          | list
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS
    """, start='value')
It’s nice to have a tree, but what we really want is a JSON object.
The way to do it is to evaluate the tree, using a Transformer.
A transformer is a class with methods corresponding to branch names. For each branch, the appropriate method will
be called with the children of the branch as its argument, and its return value will replace the branch in the tree.
So let’s write a partial transformer, that handles lists and dictionaries:
from lark import Transformer

class MyTransformer(Transformer):
    def list(self, items):
        return list(items)

    def pair(self, key_value):
        k, v = key_value
        return k, v

    def dict(self, items):
        return dict(items)

Running this transformer over the parse tree already gives us Python lists and dicts, but terminals such as true are still left in the result as Tree(true, []) nodes.
This is pretty close. Let’s write a full transformer that can handle the terminals too.
Also, our definitions of list and dict are a bit verbose. We can do better:
class TreeToJson(Transformer):
    def string(self, s):
        (s,) = s
        return s[1:-1]

    def number(self, n):
        (n,) = n
        return float(n)

    list = list
    pair = tuple
    dict = dict
Magic!
By now, we have a fully working JSON parser, that can accept a string of JSON, and return its Pythonic representation.
But how fast is it?
Now, of course there are JSON libraries for Python written in C, and we can never compete with them. But since this
is applicable to any parser you would write in Lark, let’s see how far we can take this.
The first step for optimizing is to have a benchmark. For this benchmark I'm going to take data from json-generator.com. I took their default suggestion and changed it to 5000 objects. The result is a 6.6MB sparse JSON file.
Our first program is going to be just a concatenation of everything we’ve done so far:
import sys
from lark import Lark, Transformer
json_grammar = r"""
    ?value: dict
          | list
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    list : "[" [value ("," value)*] "]"

    dict : "{" [pair ("," pair)*] "}"
    pair : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS
    """

json_parser = Lark(json_grammar, start='value')
class TreeToJson(Transformer):
    def string(self, s):
        (s,) = s
        return s[1:-1]

    def number(self, n):
        (n,) = n
        return float(n)

    list = list
    pair = tuple
    dict = dict
if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        tree = json_parser.parse(f.read())
        print(TreeToJson().transform(tree))
real 0m36.257s
user 0m34.735s
sys 0m1.361s
That's an unsatisfactory time for a 6MB file. It might be acceptable if we were parsing a configuration file or a small DSL, but we're trying to
handle a large amount of data here.
Well, turns out there’s quite a bit we can do about it!
So far we’ve been using the Earley algorithm, which is the default in Lark. Earley is powerful but slow. But it just so
happens that our grammar is LR-compatible, and specifically LALR(1) compatible.
So let’s switch to LALR(1) and see what happens:
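The one-line change is not shown in this extraction; a sketch, reusing the json_grammar defined above:

# Switching algorithms is a single keyword argument:
json_parser = Lark(json_grammar, start='value', parser='lalr')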
real 0m7.554s
user 0m7.352s
sys 0m0.148s
Ah, that’s much better. The resulting JSON is of course exactly the same. You can run it for yourself and see.
It’s important to note that not all grammars are LR-compatible, and so you can’t always switch to LALR(1). But
there’s no harm in trying! If Lark lets you build the grammar, it means you’re good to go.
So far, we’ve built a full parse tree for our JSON, and then transformed it. It’s a convenient method, but it’s not the
most efficient in terms of speed and memory. Luckily, Lark lets us avoid building the tree when parsing with LALR(1).
Here’s the way to do it:
json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson())

if __name__ == '__main__':
    with open(sys.argv[1]) as f:
        print(json_parser.parse(f.read()))
We’ve used the transformer we’ve already written, but this time we plug it straight into the parser. Now it can avoid
building the parse tree, and just send the data straight into our transformer. The parse() method now returns the
transformed JSON, instead of a tree.
Let’s benchmark it:
real 0m4.866s
user 0m4.722s
sys 0m0.121s
That’s a measurable improvement! Also, this way is more memory efficient. Check out the benchmark table at the end
to see just how much.
As a general practice, it’s recommended to work with parse trees, and only skip the tree-builder when your transformer
is already working.
PyPy is a JIT engine for running Python, and it’s designed to be a drop-in replacement.
Lark is written purely in Python, which makes it very suitable for PyPy.
Let’s get some free performance:
real 0m1.397s
user 0m1.296s
sys 0m0.083s
PyPy is awesome!
4.5.5 Conclusion
We’ve brought the run-time down from 36 seconds to 1.1 seconds, in a series of small and simple steps.
Now let’s compare the benchmarks in a nicely organized table.
I measured memory consumption using a little script called memusg
I added a few other parsers for comparison. PyParsing and funcparserlib fare pretty well in their memory usage (they
don't build a tree), but they can't compete with the run-time speed of LALR(1).
These benchmarks are for Lark’s alpha version. I already have several optimizations planned that will significantly
improve run-time speed.
Once again, shout-out to PyPy for being so effective.
4.6 Afterword
This is the end of the tutorial. I hoped you liked it and learned a little about Lark.
To see what else you can do with Lark, check out the examples.
For questions or any other subject, feel free to email me at erezshin at gmail dot com.
By default Lark silently resolves Shift/Reduce conflicts as Shift. To enable warnings pass debug=True. To get the
messages printed you have to configure the logger beforehand. For example:
import logging
from lark import Lark, logger
logger.setLevel(logging.DEBUG)
collision_grammar = '''
start: as as
as: a*
a: "a"
'''
p = Lark(collision_grammar, parser='lalr', debug=True)
Lark comes with an extensive set of tests. Many of the tests will run several times, once for each parser configuration.
To run the tests, just go to the lark project root, and run the command:
python -m tests
or
pypy -m tests
For a list of supported interpreters, you can consult the tox.ini file.
You can also run a single unittest using its class and method name, for example:
## test_package test_class_name.test_function_name
python -m tests TestLalrStandard.test_lexer_error_recovering
6.1.1 tox
To run all unit tests with tox, install tox and Python 2.7 up to the latest Python interpreter supported (consult the file
tox.ini). Then, run the command tox in the root of this project (where the main setup.py file is).
If, for example, you would like to only run the unit tests for Python 2.7, you can run the command tox -e py27.
6.1.2 pytest
pytest tests
Recipes

7.1 Use a transformer to parse integer tokens
Transformers are the common interface for processing matched rules and tokens.
They can be used during parsing for better performance.
from lark import Lark, Transformer

class T(Transformer):
    def INT(self, tok):
        "Convert the value of `tok` from string to int, while maintaining line number & column."
        return tok.update(value=int(tok))

parser = Lark("""
    start: INT*

    %import common.INT
    %ignore " "
    """, parser="lalr", transformer=T())

print(parser.parse('3 14 159'))
Prints out:
Tree(start, [Token(INT, 3), Token(INT, 14), Token(INT, 159)])
7.2 Collect all comments with lexer_callbacks

The lexer_callbacks option accepts a dictionary of the form {TOKEN_TYPE: callback}; each callback receives the matched token, which makes it easy to collect tokens (such as comments) that the grammar otherwise ignores. The %import and %ignore lines below are assumed, since they are missing from this extraction:

comments = []

parser = Lark("""
    start: INT*

    COMMENT: /#.*/

    %import common.INT
    %import common.WS
    %ignore COMMENT
    %ignore WS
""", parser="lalr", lexer_callbacks={'COMMENT': comments.append})

parser.parse("""
1 2 3  # hello
# world
4 5 6
""")

print(comments)
Prints out the two collected comment tokens: '# hello' and '# world'.
7.3 CollapseAmbiguities
Parsing ambiguous texts with earley and ambiguity='explicit' produces a single tree with _ambig nodes to
mark where the ambiguity occurred.
However, it’s sometimes more convenient instead to work with a list of all possible unambiguous trees.
Lark provides a utility transformer for that purpose:
from lark import Lark
from lark.visitors import CollapseAmbiguities

grammar = """
    !start: x y

    !x: "a" "b" | "ab" | "abc"
    !y: "c" "d" | "cd" | "d"
"""
parser = Lark(grammar, ambiguity='explicit')
t = parser.parse('abcd')
for x in CollapseAmbiguities().transform(t):
    print(x.pretty())
start
  x
    a
    b
  y
    c
    d

start
  x     ab
  y     cd

start
  x     abc
  y     d
While convenient, this should be used carefully, as highly ambiguous trees will soon create an exponential explosion
of such unambiguous derivations.
7.4 Keeping track of parents when visiting

The following visitor assigns a parent attribute to every node in the tree.
If your tree nodes aren't unique (if there is a shared Tree instance), the assert will fail.
from lark import Tree, Visitor

class Parent(Visitor):
    def __default__(self, tree):
        for subtree in tree.children:
            if isinstance(subtree, Tree):
                assert not hasattr(subtree, 'parent')
                subtree.parent = tree
CHAPTER 8
Examples
For example, the following will parse all the Python files in the standard library of your local installation:
A demonstration of parsing indentation (“whitespace significant” language) and the usage of the Indenter class.
Since indentation is context-sensitive, a postlex stage is introduced to manufacture INDENT/DEDENT tokens.
It is crucial for the indenter that the NL_type matches the spaces (and tabs) after the newline.
tree_grammar = r"""
?start: _NL* tree
class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types = []
    CLOSE_PAREN_types = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8
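The parser construction is omitted from this excerpt; a minimal sketch, assuming tree_grammar declares the _INDENT and _DEDENT terminals (via %declare) so that the indenter can produce them in the postlex stage:

parser = Lark(tree_grammar, parser='lalr', postlex=TreeIndenter())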
test_tree = """
a
    b
    c
        d
        e
    f
        g
"""
def test():
    print(parser.parse(test_tree).pretty())

if __name__ == '__main__':
    test()
import lark
from pathlib import Path
examples_path = Path(__file__).parent
lark_path = Path(lark.__file__).parent
grammar_files = [
examples_path / 'lark.lark',
examples_path / 'advanced/python2.lark',
examples_path / 'advanced/python3.lark',
examples_path / 'relative-imports/multiples.lark',
examples_path / 'relative-imports/multiple2.lark',
examples_path / 'relative-imports/multiple3.lark',
examples_path / 'tests/no_newline_at_end.lark',
examples_path / 'tests/negative_priority.lark',
lark_path / 'grammars/common.lark',
]
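The parser used below is not shown in this excerpt; a hypothetical reconstruction, loading Lark's own meta-grammar (lark.lark) and parsing the .lark files with it:

# Assumed construction; the actual example may differ.
parser = Lark(open(examples_path / 'lark.lark'), parser='lalr')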
def test():
    for grammar_file in grammar_files:
        tree = parser.parse(open(grammar_file).read())
    print("All grammars parsed successfully")

if __name__ == '__main__':
    test()
A demonstration of ambiguity
This example shows how to get explicit ambiguity from Lark's Earley parser.
import sys
from lark import Lark, tree
grammar = """
sentence: noun verb noun -> simple
| noun verb "like" noun -> comparative
%import common.WS
%ignore WS
"""
def make_png(filename):
    tree.pydot__tree_to_png(parser.parse(sentence), filename)

def make_dot(filename):
    tree.pydot__tree_to_dot(parser.parse(sentence), filename)

if __name__ == '__main__':
    print(parser.parse(sentence).pretty())

    # make_png(sys.argv[1])
    # make_dot(sys.argv[1])
# Output:
#
# _ambig
# comparative
# noun fruit
try:
    input = raw_input   # For Python2 compatibility
except NameError:
    pass
calc_grammar = """
?start: sum
| NAME "=" sum -> assign_var
?sum: product
| sum "+" product -> add
| sum "-" product -> sub
?product: atom
| product "*" atom -> mul
| product "/" atom -> div
%ignore WS_INLINE
"""
    # (method of the example's Transformer class, which is not shown in this excerpt)
    def __init__(self):
        self.vars = {}
def main():
    while True:
        try:
            s = input('> ')
        except EOFError:
            break
        print(calc(s))
def test():
    print(calc("a = 1+2"))
    print(calc("1+a*-3"))

if __name__ == '__main__':
    # test()
    main()
try:
    input = raw_input   # For Python2 compatibility
except NameError:
    pass
import turtle
turtle_grammar = """
start: instruction+
MOVEMENT: "f"|"b"|"l"|"r"
COLOR: LETTER+
%import common.LETTER
%import common.INT -> NUMBER
%import common.WS
%ignore WS
"""
parser = Lark(turtle_grammar)
def run_instruction(t):
    if t.data == 'change_color':
        turtle.color(*t.children)   # We just pass the color names as-is
    # ... (handling of the other instruction types is omitted in this excerpt)

def run_turtle(program):
    parse_tree = parser.parse(program)
    for inst in parse_tree.children:
        run_instruction(inst)
def main():
    while True:
        code = input('> ')
        try:
            run_turtle(code)
        except Exception as e:
            # (the except clause is reconstructed; the excerpt cuts off here)
            print(e)
def test():
    text = """
c red yellow
fill { repeat 36 {
    f200 l170
}}
"""
    run_turtle(text)

if __name__ == '__main__':
    # test()
    main()
The code is short and clear, and outperforms every other parser (that’s written in Python). For an explanation, check
out the JSON parser tutorial at /docs/json_tutorial.md
import sys

from lark import Lark, Transformer, v_args
json_grammar = r"""
    ?start: value

    ?value: object
          | array
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    array  : "[" [value ("," value)*] "]"
    object : "{" [pair ("," pair)*] "}"
    pair   : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS
"""
class TreeToJson(Transformer):
    @v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict

    number = v_args(inline=True)(float)

    null = lambda self, _: None
    true = lambda self, _: True
    false = lambda self, _: False
### Create the JSON parser with Lark, using the Earley algorithm
# json_parser = Lark(json_grammar, parser='earley', lexer='standard')
# def parse(x):
#     return TreeToJson().transform(json_parser.parse(x))

### Create the JSON parser with Lark, using the LALR algorithm
json_parser = Lark(json_grammar, parser='lalr',
                   # Using the standard lexer isn't required, and isn't usually recommended.
                   # But, it's good enough for JSON, and it's slightly faster.
                   lexer='standard',
                   # Disabling propagate_positions and placeholders slightly improves speed
                   propagate_positions=False,
                   maybe_placeholders=False,
                   # Using an internal transformer is faster and more memory efficient
                   transformer=TreeToJson())

parse = json_parser.parse
def test():
    test_json = '''
        {
            "empty_object" : {},
            "empty_array"  : [],
            "booleans"     : { "YES" : true, "NO" : false },
            "numbers"      : [ 0, 1, -2, 3.3, 4.4e5, 6.6e-7 ],
            "strings"      : [ "This", [ "And" , "That", "And a \\"b" ] ],
            "nothing"      : null
        }
    '''

    j = parse(test_json)
    print(j)
    import json
    assert j == json.loads(test_json)

if __name__ == '__main__':
    # test()
    with open(sys.argv[1]) as f:
        print(parse(f.read()))
parser = Lark(r"""
        start: _NL? section+
        section: "[" NAME "]" _NL item+
        item: NAME "=" VALUE? _NL
        VALUE: /./+

        %import common.CNAME -> NAME
        %import common.NEWLINE -> _NL
        %import common.WS_INLINE
        %ignore WS_INLINE
    """, parser="earley")
sample_conf = """
[bla]
a=Hello
this="that",4
empty=
"""
print(parser.parse(sample_conf).pretty())
8.2.2 Templates
This example shows how to use Lark’s templates to achieve cleaner grammars
from lark import Lark
grammar = r"""
start: list | dict
_seperated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'
parser = Lark(grammar)
parser = Lark(r"""
start: _NL? section+
section: "[" NAME "]" _NL item+
item: NAME "=" VALUE? _NL
VALUE: /./+
def test():
    sample_conf = """
[bla]
a=Hello
this="that",4
empty=
"""
    r = parser.parse(sample_conf)
    print(r.pretty())
if __name__ == '__main__':
test()
def ignore_errors(e):
    if e.token.type == 'COMMA':
        # Skip comma
        return True
    elif e.token.type == 'SIGNED_NUMBER':
        # Try to feed a comma and retry the number
        e.puppet.feed_token(Token('COMMA', ','))
        e.puppet.feed_token(e.token)
        return True
def main():
    s = "[0 1, 2,, 3,,, 4, 5 6 ]"
    res = json_parser.parse(s, on_error=ignore_errors)
    print(res)   # prints [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

main()
test_json = '''
    {
        "empty_object" : {},
        "empty_array"  : [],
        "booleans"     : { "YES" : true, "NO" : false },
        "numbers"      : [ 0, 1, -2, 3.3, 4.4e5, 6.6e-7 ],
        "strings"      : [ "This", [ "And" , "That", "And a \\"b" ] ],
        "nothing"      : null
    }
'''
def test_earley():
    new_json = Reconstructor(json_parser).reconstruct(tree)
    print(new_json)
    print(json.loads(new_json) == json.loads(test_json))

def test_lalr():
    new_json = Reconstructor(json_parser).reconstruct(tree)
    print(new_json)
    print(json.loads(new_json) == json.loads(test_json))

test_earley()
test_lalr()
class TypeLexer(Lexer):
    def __init__(self, lexer_conf):
        pass
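    # The lex method itself is omitted from this excerpt. The sketch below is a
    # hypothetical reconstruction (it assumes `from lark import Token` at the top,
    # and terminal names STR/INT matching the grammar below): a custom lexer only
    # needs to yield Token instances, here classifying plain Python values by type.
    def lex(self, data):
        for obj in data:
            if isinstance(obj, int):
                yield Token('INT', obj)
            elif isinstance(obj, str):
                yield Token('STR', obj)
            else:
                raise TypeError(obj)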
parser = Lark("""
start: data_item+
data_item: STR INT*
class ParseToDict(Transformer):
    @v_args(inline=True)
    def data_item(self, name, *numbers):
        return name.value, [n.value for n in numbers]

    start = dict
def test():
    data = ['alice', 1, 27, 3, 'bob', 4, 'carrie', 'dan', 8, 6]

    print(data)

    tree = parser.parse(data)
    res = ParseToDict().transform(tree)

    print('-->')
    print(res)   # prints {'alice': [1, 27, 3], 'bob': [4], 'carrie': [], 'dan': [8, 6]}

if __name__ == '__main__':
    test()
class CustomTransformer(TreeForestTransformer):
    @handles_ambiguity
    def sentence(self, trees):
        return next(tree for tree in trees if tree.data == 'simple')
grammar = """
    sentence: noun verb noun        -> simple
            | noun verb "like" noun -> comparative

    %import common.WS
    %ignore WS
"""
tree = CustomTransformer(resolve_ambiguity=False).transform(forest)
print(tree.pretty())
# Output:
#
# simple
# noun Flies
# verb Like
# noun Bananas
# .
#
The code is short and clear, and outperforms every other parser (that’s written in Python). For an explanation, check
out the JSON parser tutorial at /docs/json_tutorial.md
(this is here for use by the other examples)
import sys

from lark import Lark, Transformer, v_args
json_grammar = r"""
?start: value
?value: object
| array
| string
| SIGNED_NUMBER -> number
| "true" -> true
| "false" -> false
| "null" -> null
string : ESCAPED_STRING
%import common.ESCAPED_STRING
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS
"""
class TreeToJson(Transformer):
    @v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict

    number = v_args(inline=True)(float)
### Create the JSON parser with Lark, using the LALR algorithm
json_parser = Lark(json_grammar, parser='lalr',
                   # Using the standard lexer isn't required, and isn't usually recommended.
                   # But, it's good enough for JSON, and it's slightly faster.
                   lexer='standard',
                   # Disabling propagate_positions and placeholders slightly improves speed
                   propagate_positions=False,
                   maybe_placeholders=False,
                   # Using an internal transformer is faster and more memory efficient
                   transformer=TreeToJson())
This example demonstrates how to subclass ForestVisitor to make a custom SPPF node prioritizer to be used in
conjunction with TreeForestTransformer.
Our prioritizer will count the number of descendants of a node that are tokens. By negating this count, our prioritizer
will prefer nodes with fewer token descendants. Thus, we choose the more specific parse.
class TokenPrioritizer(ForestVisitor):
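    # NOTE: the method bodies below are a hypothetical sketch; the original
    # example's body is omitted from this excerpt. They use only the
    # ForestVisitor API documented later in this manual (visit_*_node_in/out,
    # on_cycle) and the node 'priority' and 'children' attributes.
    # Idea: give every node a priority equal to minus the number of its token
    # descendants, so nodes with fewer tokens win.

    def visit_symbol_node_in(self, node):
        # Returning the children schedules them to be visited.
        return node.children

    def visit_packed_node_in(self, node):
        return node.children

    def visit_symbol_node_out(self, node):
        # Tokens have no 'priority' attribute; count each as -1.
        node.priority = sum(getattr(child, 'priority', -1) for child in node.children)

    def visit_packed_node_out(self, node):
        node.priority = sum(getattr(child, 'priority', -1) for child in node.children)

    def on_cycle(self, node, path):
        raise Exception("Oops, we encountered a cycle.")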
grammar = """
start: hello " " world | hello_world
hello: "Hello"
world: "World"
hello_world: "Hello World"
"""
print("Default prioritizer:")
tree = TreeForestTransformer(resolve_ambiguity=True).transform(forest)
print(tree.pretty())
print("Custom prioritizer:")
tree = TreeForestTransformer(resolve_ambiguity=True, prioritizer=TokenPrioritizer()).
˓→transform(forest)
print(tree.pretty())
# Output:
#
# Default prioritizer:
# start
# hello Hello
#
# world World
#
# Custom prioritizer:
# start
# hello_world Hello World
A toy example that compiles Python directly to bytecode, without generating an AST. It currently only works for very
very simple Python code.
It requires the ‘bytecode’ library. You can get it using
$ pip install bytecode
class PythonIndenter(Indenter):
    NL_type = '_NEWLINE'
    OPEN_PAREN_types = ['LPAR', 'LSQB', 'LBRACE']
    CLOSE_PAREN_types = ['RPAR', 'RSQB', 'RBRACE']
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8
@v_args(inline=True)
class Compile(Transformer):
    def number(self, n):
        return [Instr('LOAD_CONST', int(n))]

    def string(self, s):
        return [Instr('LOAD_CONST', s[1:-1])]

    def var(self, n):
        return [Instr('LOAD_NAME', n)]

    @v_args(inline=False)
    def file_input(self, stmts):
        return sum(stmts, []) + [Instr("RETURN_VALUE")]
def compile_python(s):
    insts = python_parser3.parse(s + "\n")
    return Bytecode(insts).to_code()
code = compile_python("""
a = 3
b = 5
print("Hello World!")
print(a+(b+2))
print((a+b)+2)
""")
exec(code)
# -- Output --
# Hello World!
# 10
# 10
# __path__ = os.path.dirname(__file__)
class PythonIndenter(Indenter):
    NL_type = '_NEWLINE'
    OPEN_PAREN_types = ['LPAR', 'LSQB', 'LBRACE']
    CLOSE_PAREN_types = ['RPAR', 'RSQB', 'RBRACE']
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8
def _get_lib_path():
    if os.name == 'nt':
        if 'PyPy' in sys.version:
            return os.path.join(sys.prefix, 'lib-python', sys.winver)
        else:
            return os.path.join(sys.prefix, 'Lib')
    else:
        return [x for x in sys.path if x.endswith('%s.%s' % sys.version_info[:2])][0]
def test_python_lib():
    path = _get_lib_path()

    start = time.time()
    files = glob.glob(path + '/*.py')
    for f in files:
        print(f)
        chosen_parser.parse(_read(os.path.join(path, f)) + '\n')
    end = time.time()
    print("test_python_lib (%d files), time: %s secs" % (len(files), end - start))
def test_earley_equals_lalr():
    path = _get_lib_path()
    files = glob.glob(path + '/*.py')
    for f in files:
        print(f)
        tree1 = python_parser2.parse(_read(os.path.join(path, f)) + '\n')
        tree2 = python_parser2_earley.parse(_read(os.path.join(path, f)) + '\n')
        assert tree1 == tree2

if __name__ == '__main__':
    test_python_lib()
    # test_earley_equals_lalr()
    # python_parser3.parse(_read(sys.argv[1]) + '\n')
from _json_parser import json_grammar   # Using the grammar from the json_parser example
class JsonSyntaxError(SyntaxError):
    def __str__(self):
        context, line, column = self.args
        return '%s at line %s, column %s.\n\n%s' % (self.label, line, column, context)

class JsonMissingValue(JsonSyntaxError):
    label = 'Missing Value'

class JsonMissingOpening(JsonSyntaxError):
    label = 'Missing Opening'

class JsonMissingClosing(JsonSyntaxError):
    label = 'Missing Closing'

class JsonMissingComma(JsonSyntaxError):
    label = 'Missing Comma'

class JsonTrailingComma(JsonSyntaxError):
    label = 'Trailing Comma'
def parse(json_text):
    try:
        j = json_parser.parse(json_text)
    except UnexpectedInput as u:
        exc_class = u.match_examples(json_parser.parse, {
            JsonMissingOpening: ['{"foo": ]}',
                                 '{"foor": }}',
                                 '{"foo": }'],
            JsonMissingClosing: ['{"foo": [}',
                                 '{',
                                 '{"a": 1',
                                 '[1'],
            JsonMissingComma: ['[1 2]',
                               '[false 1]',
                               '["b" 1]',
                               '{"a":true 1:4}',
                               '{"a":1 1:4}',
                               '{"a":"b" 1:4}'],
            JsonTrailingComma: ['[,]',
                                '[1,]',
                                '[1,2,]',
                                '{"foo":1,}',
                                '{"foo":false,"bar":true,}']
        }, use_accepts=True)
        if not exc_class:
            raise
        raise exc_class(u.get_context(json_text), u.line, u.column)
def test():
    try:
        parse('{"example2": ] ')
    except JsonMissingOpening as e:
        print(e)

if __name__ == '__main__':
    test()
This example shows how to write a syntax-highlighted editor with Qt and Lark
Requirements:
PyQt5==5.10.1 QScintilla==2.10.4
import sys
import textwrap
class LexerJson(QsciLexerCustom):

    def create_styles(self):
        deeppink = QColor(249, 38, 114)
        khaki = QColor(230, 219, 116)
        mediumpurple = QColor(174, 129, 255)
        mediumturquoise = QColor(81, 217, 205)
        yellowgreen = QColor(166, 226, 46)
        lightcyan = QColor(213, 248, 232)
        darkslategrey = QColor(39, 40, 34)

        styles = {
            0: mediumturquoise,
            1: mediumpurple,
            2: yellowgreen,
            # ... (the remaining style assignments are omitted in this excerpt)
        }
        self.token_styles = {
            "COLON": 5,
            "COMMA": 5,
            "LBRACE": 5,
            "LSQB": 5,
            "RBRACE": 5,
            "RSQB": 5,
            "FALSE": 0,
            "NULL": 0,
            "TRUE": 0,
            "STRING": 4,
            "NUMBER": 1,
        }
    def create_parser(self):
        grammar = '''
                anons: ":" "{" "}" "," "[" "]"
                TRUE: "true"
                FALSE: "false"
                NULL: "NULL"
                %import common.ESCAPED_STRING -> STRING
                %import common.SIGNED_NUMBER -> NUMBER
                %import common.WS
                %ignore WS
        '''
        # (the Lark instance used below as self.lark is created here in the full example)

    def language(self):
        return "Json"
        try:
            for token in self.lark.lex(text):
                ws_len = token.pos_in_stream - last_pos
                if ws_len:
                    # ... (the rest of the styling loop is omitted in this excerpt)
class EditorAll(QsciScintilla):
    # (from the editor's __init__ in the full example)
    # self.setFolding(QsciScintilla.CircledFoldStyle)
    lexer = LexerJson(self)
EXAMPLE_TEXT = textwrap.dedent("""\
    {
        "_id": "5b05ffcbcf8e597939b3f5ca",
        "about": "Excepteur consequat commodo esse voluptate aute aliquip ad sint deserunt commodo eiusmod irure. Sint aliquip sit magna duis eu est culpa aliqua excepteur ut tempor nulla. Aliqua ex pariatur id labore sit. Quis sit ex aliqua
def main():
    app = QApplication(sys.argv)
    ex = EditorAll()
    ex.setWindowTitle(__file__)
    ex.setText(EXAMPLE_TEXT)
    ex.resize(800, 600)
    ex.show()
    sys.exit(app.exec_())

if __name__ == "__main__":
    main()
Grammar Reference
9.1 Definitions
a b? c -> (a c | a b c)
a: b* -> a: _b_tag
_b_tag: (_b_tag b)?
And so on.
Lark grammars are composed of a list of definitions and directives, each on its own line. A definition is either a named
rule, or a named terminal, with the following syntax, respectively:
Comments start with // and last to the end of the line (C++ style)
Lark begins the parse with the rule ‘start’, unless specified otherwise in the options.
Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has
practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer,
or scanner).
9.2 Terminals
Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
Syntax:
9.2.1 Templates
Use syntax:
Example:
_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'
num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.
9.2.2 Priority
Terminals can be assigned priority only when using a lexer (future versions may support Earley’s dynamic lexing).
Priority can be either positive or negative. If not specified for a terminal, it defaults to 1.
Highest priority terminals are always matched first.
SELECT: "select"i //# Will ignore case, and match SELECT or Select, etc.
MULTILINE_TEXT: /.+/s
SIGNED_INTEGER: /
[+-]? # the sign
(0|[1-9][0-9]*) # the digits
/x
Supported flags are one of: imslux. See Python’s regex documentation for more details on each one.
Regexps/strings of different flags can only be concatenated in Python 3.6+
When using a lexer (standard or contextual), it is the grammar-author’s responsibility to make sure the literals don’t
collide, or that if they do, they are matched in the desired order. Literals are matched according to the following
precedence:
1. Highest priority first (priority is specified as: TERM.number: . . . )
2. Length of match (for regexps, the longest theoretical match is used)
3. Length of literal / pattern definition
4. Name
Examples:
IF: "if"
INTEGER : /[0-9]+/
INTEGER2 : ("0".."9")+ //# Same as INTEGER
DECIMAL.2: INTEGER? "." INTEGER //# Will be matched before INTEGER
WHITESPACE: (" " | /\t/ )+
SQL_SELECT: "select"i
Each terminal is eventually compiled to a regular expression. All the operators and references inside it are mapped to
their respective expressions.
For example, in the following grammar, A1 and A2 are equivalent:
This means that inside terminals, Lark cannot detect or resolve ambiguity, even when using Earley.
For example, for this grammar:
start : (A | B)+
A : "a" | "ab"
B : "b"
>>> p.parse("ab")
Tree(start, [Token(A, 'a'), Token(B, 'b')])
This is happening because Python’s regex engine always returns the first matching option.
If you find yourself in this situation, the recommended solution is to use rules instead.
Example:
9.3 Rules
Syntax:
four_words: word ~ 4
9.3.1 Priority
Rules can be assigned priority only when using Earley (future versions may support LALR as well).
Priority can be either positive or negative. If not specified for a rule, it's assumed to be 1 (i.e. the default).
9.4 Directives
9.4.1 %ignore
All occurrences of the terminal will be ignored, and won’t be part of the parse.
Using the %ignore directive results in a cleaner grammar.
It’s especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous
elements) explicitly in the grammar, harms its predictive abilities, which are based on a lookahead of 1.
Syntax:
%ignore <TERMINAL>
Examples:
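(The examples are missing from this extraction; typical uses, matching the grammars shown elsewhere in this document, look like:)

%ignore WS
%ignore COMMENT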
9.4.2 %import
%import <module>.<TERMINAL>
%import <module>.<rule>
%import <module>.<TERMINAL> -> <NEWTERMINAL>
%import <module>.<rule> -> <newrule>
%import <module> (<TERM1>, <TERM2>, <rule1>, <rule2>)
If the module path is absolute, Lark will attempt to load it from the built-in directory (which currently contains
common.lark, python.lark, and unicode.lark).
If the module path is relative, such as .path.to.file, Lark will attempt to load it from the current working
directory. Grammars must have the .lark extension.
The rule or terminal can be imported under another name with the -> syntax.
Example:
%import common.NUMBER
Note that %ignore directives cannot be imported. Imported rules will abide by the %ignore directives declared in
the main grammar.
9.4.3 %declare

%declare declares a terminal without defining it. This is useful for terminals that are not produced by the standard lexer, but by a postlex stage (for example, the _INDENT and _DEDENT tokens manufactured by the Indenter in the Examples chapter).

Tree Construction
Lark builds a tree automatically based on the structure of the grammar, where each rule that is matched becomes a
branch (node) in the tree, and its children are its matches, in the order of matching.
For example, the rule node: child1 child2 will create a tree node with two children. If it is matched as part
of another rule (i.e. if it isn’t the root), the new rule’s tree node will become its parent.
Using item+ or item* will result in a list of items, equivalent to writing item item item ...
Using item? will return the item if it matched, or nothing.
If maybe_placeholders=False (the default), then [] behaves like ()?.
If maybe_placeholders=True, then using [item] will return the item if it matched, or the value None, if it
didn’t.
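A minimal sketch (hypothetical grammar) of the difference described above:

from lark import Lark

grammar = """
    start: "a" [NUMBER]
    %import common.NUMBER
"""

print(Lark(grammar, maybe_placeholders=True).parse("a").children)    # -> [None]
print(Lark(grammar, maybe_placeholders=False).parse("a").children)   # -> []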
10.1 Terminals
NAME: /\w+/
%ignore /\s+/
start
(Hello)
pname World
NAME: /\w+/
expr
expr
expr
"hello"
"world"
The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal.
Users can alter the automatic construction of the tree using a collection of grammar features.
• Rules whose name begins with an underscore will be inlined into their containing rule.
Example:
start
"hello"
"world"
• Rules that receive a question mark (?) at the beginning of their definition, will be inlined if they have a single
child, after filtering.
Example:
start
greet
"hello"
"world"
"planet"
• Rules that begin with an exclamation mark will keep all their terminals (they won’t get filtered).
expr
  (
  expr
    (
    expr
      hello
      world
    )
  )
Using the ! prefix is usually a “code smell”, and may point to a flaw in your grammar design.
• Aliases - options in a rule can receive an alias. It will be then used as the branch name for the option, instead of
the rule name.
Example:
start
greet
planet
API Reference
11.1 Lark
Example
Example
Python's builtin re module has a few persistent known bugs, and also won't parse advanced regex features such as
Unicode character classes (\p{...}). With pip install lark-parser[regex], the regex module will be installed alongside
lark and can act as a drop-in replacement to re.
Any instance of Lark instantiated with regex=True will use the regex module instead of re.
For example, we can use character classes to match PEP-3131 compliant Python identifiers:
>>> g.parse('')
''
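(The non-ASCII identifier in the example above was lost in this extraction.) A minimal sketch, with an assumed grammar, of what such usage looks like: the \p{...} Unicode properties below are supported by the regex module but not by the builtin re.

from lark import Lark

g = Lark(r"""
    ?start: NAME
    NAME: /[\p{L}_][\p{L}\p{Nd}_]*/
""", regex=True)

print(g.parse("переменная"))   # a non-ASCII identifier is accepted as a NAME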
11.2 Tree
Parameters
• data – The name of the rule or alias
• children – List of matched sub-rules and terminals
• meta – Line & Column numbers (if propagate_positions is enabled). meta at-
tributes: line, column, start_pos, end_line, end_column, end_pos
pretty(indent_str='  ')
Returns an indented string representation of the tree.
Great for debugging.
iter_subtrees()
Depth-first iteration.
Iterates over all the subtrees, never returning to the same node twice (Lark’s parse-tree is actually a DAG).
find_pred(pred)
Returns all nodes of the tree that evaluate pred(node) as true.
find_data(data)
Returns all nodes of the tree whose data equals the given data.
iter_subtrees_topdown()
Breadth-first iteration.
Iterates over all the subtrees, returning nodes in the same order as pretty() does.
11.3 Token
class lark.Token
A string with meta-information, that is produced by the lexer.
When parsing text, the resulting chunks of the input that haven’t been discarded, will end up in the tree as Token
instances. The Token class inherits from Python’s str, so normal string comparisons and operations will work
as expected.
type
Name of the token (as specified in grammar)
value
Value of the token (redundant, as token.value == token will always be true)
pos_in_stream
The index of the token in the text
line
The line of the token in the text (starting with 1)
column
The column of the token in the text (starting with 1)
end_line
The line where the token ends
end_column
The next column after the end of the token. For example, if the token is a single character with a column
value of 4, end_column will be 5.
end_pos
the index where the token ends (basically pos_in_stream + len(token))
11.6 UnexpectedInput
class lark.exceptions.UnexpectedInput
UnexpectedInput Error.
Used as a base class for the following exceptions:
• UnexpectedToken: The parser received an unexpected token
• UnexpectedCharacters: The lexer encountered an unexpected string
After catching one of these exceptions, you may call the following helper methods to create a nicer error mes-
sage.
get_context(text, span=40)
Returns a pretty string pinpointing the error in the text, with span amount of context characters around it.
Note: The parser doesn’t hold a copy of the text it has to parse, so you have to provide it again
11.7 ParserPuppet
Transformers & Visitors

Transformers & Visitors provide a convenient interface to process the parse-trees that Lark returns.
They are used by inheriting from the correct class (visitor or transformer), and implementing methods corresponding
to the rule you wish to process. Each method accepts the children as an argument. That can be modified using the
v_args decorator, which allows one to inline the arguments (akin to *args), or add the tree meta property as an
argument.
See: visitors.py
12.1 Visitor
Visitors visit each node of the tree, and run the appropriate method on it according to the node’s data.
They work bottom-up, starting with the leaves and ending at the root of the tree.
There are two classes that implement the visitor interface:
• Visitor: Visit every node (without recursion)
• Visitor_Recursive: Visit every node using recursion. Slightly faster.
Example:
class IncreaseAllNumbers(Visitor):
    def number(self, tree):
        assert tree.data == "number"
        tree.children[0] += 1

IncreaseAllNumbers().visit(parse_tree)
class lark.visitors.Visitor
Tree visitor, non-recursive (can handle huge trees).
Visiting a node calls its methods (provided by the user via inheritance) according to tree.data
visit(tree)
Visits the tree, starting with the leaves and finally the root (bottom-up)
visit_topdown(tree)
Visit the tree, starting at the root, and ending at the leaves (top-down)
__default__(tree)
Default function that is called if there is no attribute matching tree.data
Can be overridden. Defaults to doing nothing.
class lark.visitors.Visitor_Recursive
Bottom-up visitor, recursive.
Visiting a node calls its methods (provided by the user via inheritance) according to tree.data
Slightly faster than the non-recursive version.
visit(tree)
Visits the tree, starting with the leaves and finally the root (bottom-up)
visit_topdown(tree)
Visit the tree, starting at the root, and ending at the leaves (top-down)
__default__(tree)
Default function that is called if there is no attribute matching tree.data
Can be overridden. Defaults to doing nothing.
12.2 Interpreter
class lark.visitors.Interpreter
Interpreter walks the tree starting at the root.
Visits the tree, starting with the root and finally the leaves (top-down)
For each tree node, it calls its methods (provided by user via inheritance) according to tree.data.
Unlike Transformer and Visitor, the Interpreter doesn’t automatically visit its sub-branches. The user
has to explicitly call visit, visit_children, or use the @visit_children_decor. This allows the
user to implement branching and loops.
Example:
class IncreaseSomeOfTheNumbers(Interpreter):
    def number(self, tree):
        tree.children[0] += 1

IncreaseSomeOfTheNumbers().visit(parse_tree)
12.3 Transformer
class lark.visitors.Transformer(visit_tokens=True)
Transformers visit each node of the tree, and run the appropriate method on it according to the node’s data.
Calls its methods (provided by the user via inheritance) according to tree.data. The returned value replaces
the old one in the structure.
They work bottom-up (or depth-first), starting with the leaves and ending at the root of the tree. Transformers
can be used to implement map & reduce patterns. Because nodes are reduced from leaf to root, at any point the
callbacks may assume the children have already been transformed (if applicable).
Transformer can do anything Visitor can do, but because it reconstructs the tree, it is slightly less efficient.
All these classes implement the transformer interface:
• Transformer - Recursively transforms the tree. This is the one you probably want.
• Transformer_InPlace - Non-recursive. Changes the tree in-place instead of returning new instances
• Transformer_InPlaceRecursive - Recursive. Changes the tree in-place instead of returning new
instances
Parameters visit_tokens (bool, optional) – Should the transformer visit tokens in addi-
tion to rules. Setting this to False is slightly faster. Defaults to True. (For processing ignored
tokens, use the lexer_callbacks options)
class EvalExpressions(Transformer):
    def expr(self, args):
        return eval(args[0])
Example:

class T(Transformer):
    INT = int
    NUMBER = float

    def NAME(self, name):
        # ... (the method body is omitted in this excerpt)
        return name

T(visit_tokens=True).transform(tree)
class lark.visitors.Transformer_NonRecursive(visit_tokens=True)
Same as Transformer but non-recursive.
Like Transformer, it doesn’t change the original tree.
Useful for huge trees.
class lark.visitors.Transformer_InPlace(visit_tokens=True)
Same as Transformer, but non-recursive, and changes the tree in-place instead of returning new instances
Useful for huge trees. Conservative in memory.
class lark.visitors.Transformer_InPlaceRecursive(visit_tokens=True)
Same as Transformer, recursive, but changes the tree in-place instead of returning new instances
12.4 v_args
Example
@v_args(inline=True)
class SolveArith(Transformer):
    def add(self, left, right):
        return left + right


class ReverseNotation(Transformer_InPlace):
    @v_args(tree=True)
    def tree_node(self, tree):
        tree.children = tree.children[::-1]
12.5 Discard
class lark.visitors.Discard
When raising the Discard exception in a transformer callback, that node is discarded and won’t appear in the
parent.
When parsing with Earley, Lark provides the ambiguity='forest' option to obtain the shared packed parse
forest (SPPF) produced by the parser as an alternative to it being automatically converted to a tree.
Lark provides a few tools to facilitate working with the SPPF. Here are some things to consider when deciding whether
or not to use the SPPF.
Pros
• Efficient storage of highly ambiguous parses
• Precise handling of ambiguities
• Custom rule prioritizers
• Ability to handle infinite ambiguities
• Directly transform forest -> object instead of forest -> tree -> object
Cons
• More complex than working with a tree
• SPPF may contain nodes corresponding to rules generated internally
• Loss of Lark grammar features:
– Rules starting with ‘_’ are not inlined in the SPPF
– Rules starting with ‘?’ are never inlined in the SPPF
– All tokens will appear in the SPPF
13.1 SymbolNode
Symbol nodes are keyed by the symbol (s). For intermediate nodes s will be an LR0, stored as a tuple of (rule,
ptr). For completed symbol nodes, s will be a string representing the non-terminal origin (i.e. the left hand side
of the rule).
The children of a Symbol or Intermediate Node will always be Packed Nodes; with each Packed Node child
representing a single derivation of a production.
Hence a Symbol Node with a single child is unambiguous.
Variables
• s – A Symbol, or a tuple of (rule, ptr) for an intermediate node.
• start – The index of the start of the substring matched by this symbol (inclusive).
• end – The index of the end of the substring matched by this symbol (exclusive).
• is_intermediate – True if this node is an intermediate node.
• priority – The priority of the node’s symbol.
is_ambiguous
Returns True if this node is ambiguous.
children
Returns a list of this node’s children sorted from greatest to least priority.
13.2 PackedNode
13.3 ForestVisitor
class lark.parsers.earley_forest.ForestVisitor
An abstract base class for building forest visitors.
This class performs a controllable depth-first walk of an SPPF. The visitor will not enter cycles and will back-
track if one is encountered. Subclasses are notified of cycles through the on_cycle method.
Behavior for visit events is defined by overriding the visit*node* functions.
The walk is controlled by the return values of the visit*node_in methods. Returning a node(s) will sched-
ule them to be visited. The visitor will begin to backtrack if no nodes are returned.
visit_token_node(node)
Called when a Token is visited. Token nodes are always leaves.
visit_symbol_node_in(node)
Called when a symbol node is visited. Nodes that are returned will be scheduled to be visited. If
visit_intermediate_node_in is not implemented, this function will be called for intermediate
nodes as well.
visit_symbol_node_out(node)
Called after all nodes returned from a corresponding visit_symbol_node_in call have been visited.
If visit_intermediate_node_out is not implemented, this function will be called for intermediate
nodes as well.
visit_packed_node_in(node)
Called when a packed node is visited. Nodes that are returned will be scheduled to be visited.
visit_packed_node_out(node)
Called after all nodes returned from a corresponding visit_packed_node_in call have been visited.
on_cycle(node, path)
Called when a cycle is encountered.
Parameters
• node – The node that causes a cycle.
• path – The list of nodes being visited: nodes that have been entered but not exited. The
first element is the root in a forest visit, and the last element is the node visited most
recently. path should be treated as read-only.
get_cycle_in_path(node, path)
A utility function for use in on_cycle to obtain a slice of path that only contains the nodes that make
up the cycle.
13.4 ForestTransformer
class lark.parsers.earley_forest.ForestTransformer
The base class for a bottom-up forest transformation. Most users will want to use
TreeForestTransformer instead as it has a friendlier interface and covers most use cases.
Transformations are applied via inheritance and overriding of the transform*node methods.
transform_token_node receives a Token as an argument. All other methods receive the node that is
being transformed and a list of the results of the transformations of that node’s children. The return value of
these methods are the resulting transformations.
If Discard is raised in a node’s transformation, no data from that node will be passed to its parent’s transfor-
mation.
transform(root)
Perform a transformation on an SPPF.
transform_symbol_node(node, data)
Transform a symbol node.
transform_intermediate_node(node, data)
Transform an intermediate node.
transform_packed_node(node, data)
Transform a packed node.
transform_token_node(node)
Transform a Token.
13.5 TreeForestTransformer
class lark.parsers.earley_forest.TreeForestTransformer(tree_class=lark.tree.Tree, prioritizer=ForestSumVisitor(), resolve_ambiguity=True)
A ForestTransformer with a tree Transformer-like interface. By default, it will construct a tree.
Methods provided via inheritance are called based on the rule/symbol names of nodes in the forest.
Methods that act on rules will receive a list of the results of the transformations of the rule’s children. By default,
trees and tokens.
Methods that act on tokens will receive a token.
Alternatively, methods that act on rules may be annotated with handles_ambiguity. In this case, the
function will receive a list of all the transformations of all the derivations of the rule. By default, a list of trees
where each tree.data is equal to the rule name or one of its aliases.
Non-tree transformations are made possible by override of __default__, __default_token__, and
__default_ambig__.
Note: Tree shaping features such as inlined rules and token filtering are not built into the transformation.
Positions are also not propagated.
Parameters
• tree_class – The tree class to use for construction
• prioritizer – A ForestVisitor that manipulates the priorities of nodes in the
SPPF.
• resolve_ambiguity – If True, ambiguities will be resolved based on priorities.
__default__(name, data)
Default operation on tree (for override).
Returns a tree with name with data as children.
__default_ambig__(name, data)
Default operation on ambiguous rule (for override).
Wraps data in an '_ambig' node if it contains more than one element.
__default_token__(node)
Default operation on Token (for override).
Returns node.
13.6 handles_ambiguity
lark.parsers.earley_forest.handles_ambiguity(func)
Decorator for methods of subclasses of TreeForestTransformer. Denotes that the method should receive
a list of transformed derivations.
Lark comes with a tool to convert grammars from Nearley, a popular Earley library for JavaScript. It uses Js2Py to
convert and run the JavaScript postprocessing code segments.
14.1 Requirements
14.2 Usage
The Nearley converter also supports an experimental converter for newer JavaScript (ES6+), using the --es6 flag:
14.3 Notes
Install Lark
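(The body of this chapter is missing from this extraction. Based on the package name used elsewhere in this document, lark-parser, the basic installation is presumably:)

$ pip install lark-parser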
Syntax Highlighting
Resources
• Philosophy
• Features
• Examples
• Online IDE
• Tutorials
– How to write a DSL - Implements a toy LOGO-like language with an interpreter
– JSON parser - Tutorial - Teaches you how to use Lark
– Unofficial