Generator Hacking
USENIX
David Beazley
http://www.dabeaz.com
Introduction
• At PyCon'2008 (Chicago), I gave a popular
tutorial on generator functions
http://www.dabeaz.com/generators
Support Files
Part I
Introduction to Iterators and Generators
• Many operations consume an "iterable" object
• Constructors
list(s), tuple(s), set(s), dict(s)
• in operator
item in s
Iteration Protocol
• The reason why you can iterate over different
objects is that there is a specific protocol
>>> items = [1, 4, 5]
>>> it = iter(items)
>>> it.next()
1
>>> it.next()
4
>>> it.next()
5
>>> it.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
>>>
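• Under the hood, a for-loop does roughly this (a sketch; _iter is just an illustrative local name):

_iter = iter(obj)          # Get an iterator object
while 1:
    try:
        x = _iter.next()   # Get the next item
    except StopIteration:  # No more items
        break
    # statements use x ...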
Supporting Iteration
• User-defined objects can support iteration
• Example: a "countdown" object
>>> for x in countdown(10):
... print x,
...
10 9 8 7 6 5 4 3 2 1
>>>
Iteration Example
• Example use:
>>> c = countdown(5)
>>> for i in c:
... print i,
...
5 4 3 2 1
>>>
class countdown_iter(object):
def __init__(self,count):
self.count = count
def next(self):
if self.count <= 0:
raise StopIteration
r = self.count
self.count -= 1
return r
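• For countdown(5) above to work, countdown itself needs an __iter__() method that hands out a fresh iterator; a minimal sketch:

class countdown(object):
    def __init__(self,count):
        self.count = count
    def __iter__(self):
        # Each for-loop gets its own independent iterator
        return countdown_iter(self.count)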
Iteration Example
• Having a separate "iterator" allows for
nested iteration on the same object
>>> c = countdown(5)
>>> for i in c:
... for j in c:
... print i,j
...
5 5
5 4
5 3
5 2
...
1 3
1 2
1 1
>>>
Generators
• A generator is a function that produces a
sequence of results instead of a single value
def countdown(n):
while n > 0:
yield n
n -= 1
>>> for i in countdown(5):
... print i,
...
5 4 3 2 1
>>>
Generator Functions
• The function only executes on next()

def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1

>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.next()        # Function starts executing here
Counting down from 10
10
>>>
List Comprehensions
• General syntax
[expression for x in s if condition]
• What it means
result = []
for x in s:
if condition:
result.append(expression)
Generator Expressions
• A generator version of a list comprehension
>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8
>>>
• Example:
>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
>>> c = (2*x for x in a)
>>> c
<generator object at 0x58760>
>>>
Generator Expressions
• General syntax
(expression for x in s if condition)
• What it means
for x in s:
if condition:
yield expression
Interlude
• There are two basic building blocks for generators
• Generator functions:
def countdown(n):
while n > 0:
yield n
n -= 1
• Generator expressions
squares = (x*x for x in s)
Programming Problem
Find out how many bytes of data were
transferred by summing up the last column
of data in this Apache web server log
81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587
81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133
81.107.39.38 - ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903
81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238
81.107.39.38 - ... "GET /ply/example.html HTTP/1.1" 200 2359
66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447
A Non-Generator Soln
• Just do a simple for-loop
wwwlog = open("access-log")
total = 0
for line in wwwlog:
bytestr = line.rsplit(None,1)[1]
if bytestr != '-':
total += int(bytestr)
Generators as a Pipeline
• To understand the solution, think of it as a data
processing pipeline
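• Roughly, the stages are:

access-log --> read lines --> extract bytes column --> skip '-' --> sum() --> total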
Being Declarative
Performance
wwwlog = open("big-access-log")
bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
Commentary
• Not only was it not slow, it was 5% faster
• And it was less code
• And it was relatively easy to read
• And frankly, I like it a whole lot better...
"Back in the old days, we used AWK for this and
we liked it. Oh, yeah, and get off my lawn!"
Programming Problem
You have hundreds of web server logs scattered
across various directories. In addition, some of
the logs are compressed. Modify the last program
so that you can easily read all of these logs
foo/
access-log-012007.gz
access-log-022007.gz
access-log-032007.gz
...
access-log-012008
bar/
access-log-092007.bz2
...
access-log-022008
find
• Generate all filenames in a directory tree
that match a given filename pattern
import os
import fnmatch
def gen_find(filepat,top):
for path, dirlist, filelist in os.walk(top):
for name in fnmatch.filter(filelist,filepat):
yield os.path.join(path,name)
• Examples
pyfiles = gen_find("*.py","/")
logs = gen_find("access-log*","/usr/www/")
A File Opener
• Open a sequence of filenames
import gzip, bz2
def gen_open(filenames):
for name in filenames:
if name.endswith(".gz"):
yield gzip.open(name)
elif name.endswith(".bz2"):
yield bz2.BZ2File(name)
else:
yield open(name)
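• The examples below also use gen_cat, which concatenates a sequence of input sources into one long stream of items; a minimal sketch:

def gen_cat(sources):
    # Yield every item from every source, in order
    for s in sources:
        for item in s:
            yield item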
• Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
grep
• Generate a sequence of lines that contain
a given regular expression
import re
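# gen_grep is not shown in the original; a sketch consistent with its use below:
def gen_grep(pat, lines):
    patc = re.compile(pat)        # Compile the pattern once
    for line in lines:
        if patc.search(line):
            yield line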
• Example:
lognames = gen_find("access-log*", "/usr/www")
logfiles = gen_open(lognames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat, loglines)
filenames = gen_find("access-log*",logdir)
logfiles = gen_open(filenames)
loglines = gen_cat(logfiles)
patlines = gen_grep(pat,loglines)
bytecolumn = (line.rsplit(None,1)[1] for line in patlines)
bytes = (int(x) for x in bytecolumn if x != '-')
print "Total", sum(bytes)
Important Concept
• Generators decouple iteration from the
code that uses the results of the iteration
• In the last example, we're performing a
calculation on a sequence of lines
• It doesn't matter where or how those
lines are generated
• Thus, we can plug any number of
components together up front as long as
they eventually produce a line sequence
Programming Problem
Web server logs consist of different columns of
data. Parse each line into a useful data structure
that allows us to easily inspect the different fields.
logpats = r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) (\S+)" (\S+) (\S+)'
logpat = re.compile(logpats)
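• One way to produce the tuples discussed below (a sketch; loglines is the line source from earlier):

tuples = (logpat.match(line).groups() for line in loglines)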
Tuple Commentary
• I generally don't like data processing on tuples
('71.201.176.194', '-', '-', '26/Feb/2008:10:30:08 -0600',
'GET', '/ply/ply.html', 'HTTP/1.1', '200', '97238')
Field Conversion
• You might want to map specific dictionary fields
through a conversion function (e.g., int(), float())
def field_map(dictseq,name,func):
for d in dictseq:
d[name] = func(d[name])
yield d
• Creating dictionaries from the parsed tuples:

colnames = ('host','referrer','user','datetime','method',
            'request','proto','status','bytes')
log = (dict(zip(colnames,t)) for t in tuples)
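• Example use (hedged as typical usage; these two conversions match the fields in this log format):

log = field_map(log, "bytes",
                lambda s: int(s) if s != '-' else 0)
log = field_map(log, "status", int)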
Packaging
• Example : multiple pipeline stages inside a function
def lines_from_dir(filepat, dirname):
names = gen_find(filepat,dirname)
files = gen_open(names)
lines = gen_cat(files)
return lines
• Example : a complete Apache log parser

def apache_log(lines):
    groups = (logpat.match(line) for line in lines)
    tuples = (g.groups() for g in groups)
    colnames = ('host','referrer','user','datetime','method',
                'request','proto','status','bytes')
    log = (dict(zip(colnames,t)) for t in tuples)
    log = field_map(log,"bytes",
                    lambda s: int(s) if s != '-' else 0)
    log = field_map(log,"status",int)
    return log
Example Use
• It's easy
lines = lines_from_dir("access-log*","www")
log = apache_log(lines)
for r in log:
print r
A Query Language
• Now that we have our log, let's do some queries
• Find the set of all documents that 404
stat404 = set(r['request'] for r in log
if r['status'] == 404)
• Find all requests that transfer over a megabyte

large = (r for r in log
         if r['bytes'] > 1000000)
for r in large:
    print r['request'], r['bytes']
A Query Language
• Find out who has been hitting robots.txt
addrs = set(r['host'] for r in log
if 'robots.txt' in r['request'])
import socket
for addr in addrs:
try:
print socket.gethostbyaddr(addr)[0]
except socket.herror:
print addr
Some Thoughts
• I like the idea of using generator expressions as a
pipeline query language
• You can write simple filters, extract data, etc.
• If you pass dictionaries/objects through the
pipeline, it becomes quite powerful
• Feels similar to writing SQL queries
Question
• Have you ever used 'tail -f' in Unix?
% tail -f logfile
...
... lines of output ...
...
Tailing a File
• A Python version of 'tail -f'
import time
def follow(thefile):
thefile.seek(0,2) # Go to the end of the file
while True:
line = thefile.readline()
if not line:
time.sleep(0.1) # Sleep briefly
continue
yield line
Example
• Turn the real-time log file into records
logfile = open("access-log")
loglines = follow(logfile)
log = apache_log(loglines)
Part 6
Decoding Binary Records
Struct Example
• Suppose you had a file of binary records
encoded as follows
Byte offsets   Description       Encoding
------------   ---------------   ----------------------
0-7            Stock name        (8 byte string)
8-11           Price             (32-bit float)
12-15          Change            (32-bit float)
16-19          Volume            (32-bit unsigned int)
Incremental Parsing
• Example:
from genrecord import *
f = open("stockdata.bin","rb")
records = gen_records("8sffi",f)
for name, price, change, volume in records:
# Process data
...
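• gen_records comes from a support file not shown here; a minimal sketch consistent with the usage above:

import struct

def gen_records(fmt, thefile):
    recsize = struct.calcsize(fmt)   # Fixed size of each record
    while True:
        rec = thefile.read(recsize)
        if len(rec) < recsize:       # EOF or truncated record
            break
        yield struct.unpack(fmt, rec)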
Buffered Reading
• A generator that reads large chunks of data
def chunk_reader(thefile, chunksize):
while True:
chunk = thefile.read(chunksize)
if not chunk: break
yield chunk
Part 7
Flipping Everything Around
(from generators to coroutines)
Yield as an Expression
• In Python 2.5, you can use yield as an expression
• For example, on the right side of an assignment
def grep(pattern):
print "Looking for %s" % pattern
while True:
line = (yield)
if pattern in line:
print line,
Coroutine Execution
• Execution is the same as for a generator
• When you call a coroutine, nothing happens
• They only run in response to next() and send() methods

>>> g = grep("python")      # Notice that no output was produced
>>> g.next()                # On first operation, coroutine starts running
Looking for python
>>>
Using a Decorator
• The initial call to .next() is easy to forget
• Solved by wrapping coroutines with a decorator
def coroutine(func):
def start(*args,**kwargs):
cr = func(*args,**kwargs)
cr.next()
return cr
return start
@coroutine
def grep(pattern):
...
Catching close()
• close() can be caught (GeneratorExit)
@coroutine
def grep(pattern):
print "Looking for %s" % pattern
try:
while True:
line = (yield)
if pattern in line:
print line,
except GeneratorExit:
print "Going away. Goodbye"
Processing Pipelines
Pipeline Sinks
• The pipeline must have an end-point (sink)
... --send()--> coroutine --send()--> sink
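• In this part, follow() is reworked as a source that pushes lines into a target, and printer() is a simple sink; a sketch of both (assuming the time import and @coroutine decorator from earlier):

def follow(thefile, target):
    thefile.seek(0,2)            # Go to the end of the file
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)      # Sleep briefly and retry
            continue
        target.send(line)        # Push the line down the pipeline

@coroutine
def printer():
    while True:
        line = (yield)
        print line,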
An Example
• Hooking it together
f = open("access-log")
follow(f, printer())
• A picture

follow() --send()--> printer()
A Filter Example
• A grep filter coroutine
@coroutine
def grep(pattern,target):
while True:
line = (yield) # Receive a line
if pattern in line:
target.send(line) # Send to next stage
• Hooking it up
f = open("access-log")
follow(f,
grep('python',
printer()))
• A picture

follow() --send()--> grep() --send()--> printer()

• More generally, intermediate pipeline stages are coroutines:

source --send()--> coroutine --send()--> coroutine
Being Branchy
• With coroutines, you can send data to multiple
destinations
                               +--send()--> coroutine
source --send()--> coroutine --+
                               +--send()--> coroutine --send()--> coroutine
Example : Broadcasting
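• broadcast() fans each received item out to a list of target coroutines; a sketch consistent with the usage below:

@coroutine
def broadcast(targets):
    while True:
        item = (yield)
        for target in targets:
            target.send(item)    # Forward to every attached target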
• Example use:
f = open("access-log")
follow(f,
broadcast([grep('python',printer()),
grep('ply',printer()),
grep('swig',printer())])
)
follow() --> broadcast() --+--> grep('python') --> printer()
                           +--> grep('ply')    --> printer()
                           +--> grep('swig')   --> printer()
Interlude
• Coroutines provide more powerful data routing
possibilities than simple iterators
• If you built a collection of simple data processing
components, you can glue them together into
complex arrangements of pipes, branches,
merging, etc.
• Although you might not want to make it
excessively complicated (then again, that
might increase/decrease one's job security)
Event Handling
Some XML
<?xml version="1.0"?>
<buses>
  <bus>
    <id>7574</id>
    <route>147</route>
    <color>#3300ff</color>
    <revenue>true</revenue>
    <direction>North Bound</direction>
    <latitude>41.925682067871094</latitude>
    <longitude>-87.63092803955078</longitude>
    <pattern>2499</pattern>
    <patternDirection>North Bound</patternDirection>
    <run>P675</run>
    <finalStop><![CDATA[Paulina & Howard Terminal]]></finalStop>
    <operator>42493</operator>
  </bus>
  <bus>
    ...
  </bus>
</buses>
XML Parsing
• There are many possible ways to parse XML
• An old-school approach: SAX
• SAX is an event driven interface
XML Parser --events--> Handler Object

class Handler:
    def startElement():
        ...
    def endElement():
        ...
    def characters():
        ...
import xml.sax

class MyHandler(xml.sax.ContentHandler):
def startElement(self,name,attrs):
print "startElement", name
def endElement(self,name):
print "endElement", name
def characters(self,text):
print "characters", repr(text)[:40]
xml.sax.parse("somefile.xml",MyHandler())
class EventHandler(xml.sax.ContentHandler):
def __init__(self,target):
self.target = target
def startElement(self,name,attrs):
self.target.send(('start',(name,attrs._attrs)))
def characters(self,text):
self.target.send(('text',text))
def endElement(self,name):
self.target.send(('end',name))
Event type    Event values
----------    --------------------
'start'       ('direction', {})
'text'        'North Bound'
'end'         'direction'
Event Processing
• To do anything interesting, you have to
process the event stream
• Example: Convert bus elements into
dictionaries (XML sucks, dictionaries rock)
<bus>                                    {
  <id>7574</id>                            'id' : '7574',
  <route>147</route>                       'route' : '147',
  <revenue>true</revenue>                  'revenue' : 'true',
  <direction>North Bound</direction>       'direction' : 'North Bound',
  ...                                      ...
</bus>                                   }
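• A coroutine that performs this conversion (the "previous code" the next slide refers to); a sketch:

@coroutine
def buses_to_dicts(target):
    while True:
        event, value = (yield)
        # State A: look for the start of a <bus> element
        if event == 'start' and value[0] == 'bus':
            busdict = { }
            fragments = []
            # State B: capture text of inner elements into a dict
            while True:
                event, value = (yield)
                if event == 'start':
                    fragments = []
                elif event == 'text':
                    fragments.append(value)
                elif event == 'end':
                    if value != 'bus':
                        busdict[value] = "".join(fragments)
                    else:
                        target.send(busdict)
                        break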
State Machines
• The previous code works by implementing a
simple state machine
          ('start', ('bus', *))
    A --------------------------> B
      <--------------------------
          ('end', 'bus')

    A : Looking for a bus element
    B : Collecting bus attributes
Filtering Elements
• Let's filter on dictionary fields
@coroutine
def filter_on_field(fieldname,value,target):
while True:
d = (yield)
if d.get(fieldname) == value:
target.send(d)
• Examples:

filter_on_field("route","22",target)
filter_on_field("direction","North Bound",target)
Hooking it Together
• Find all locations of the North Bound #22 bus
(the slowest moving object in the universe)
xml.sax.parse("allroutes.xml",
EventHandler(
buses_to_dicts(
filter_on_field("route","22",
filter_on_field("direction","North Bound",
bus_locations())))
))
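• bus_locations() is the sink at the end of the pipeline above; a sketch that prints selected fields as CSV:

@coroutine
def bus_locations():
    while True:
        bus = (yield)
        print "%(route)s,%(id)s,\"%(direction)s\",%(latitude)s,%(longitude)s" % bus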
import xml.parsers.expat

def expat_parse(f,target):
parser = xml.parsers.expat.ParserCreate()
parser.buffer_size = 65536
parser.buffer_text = True
parser.returns_unicode = False
parser.StartElementHandler = \
lambda name,attrs: target.send(('start',(name,attrs)))
parser.EndElementHandler = \
lambda name: target.send(('end',name))
parser.CharacterDataHandler = \
lambda data: target.send(('text',data))
parser.ParseFile(f)
• Expat version (runs in 4.51s, an 83% speedup)

expat_parse(open("allroutes.xml"),
    buses_to_dicts(
        filter_on_field("route","22",
            filter_on_field("direction","North Bound",
                bus_locations()))))
Going Lower
• You can even drop send() operations into C
• A skeleton of how this works...
PyObject *
py_parse(PyObject *self, PyObject *args) {
    char *filename;
    PyObject *target;
    PyObject *send_method;
    if (!PyArg_ParseTuple(args,"sO",&filename,&target)) {
        return NULL;
    }
    send_method = PyObject_GetAttrString(target,"send");
    ...
    /* Invoke target.send(item) */
    args = Py_BuildValue("(O)",item);
    result = PyEval_CallObject(send_method,args);
    ...
Interlude
• Processing events is a situation that is well-
suited for coroutine functions
• With event driven systems, some kind of event
handling loop is usually in charge
• Because of that, it's really hard to twist it
around into a programming model based on
iteration
• However, if you just push events into
coroutines with send(), it works fine.
Basic Concurrency
• You can package generators inside threads or
subprocesses by adding extra layers
generator --> [ Thread ]

generator --> [ Host ] <---pipe---> [ Subprocess ]
Pickler/Unpickler
• Turn a generated sequence into pickled objects
import pickle

def gen_sendto(source,outfile):
for item in source:
pickle.dump(item,outfile)
def gen_recvfrom(infile):
while True:
try:
item = pickle.load(infile)
yield item
except EOFError:
return
• Example: Sender

lines = follow(open("access-log"))
log = apache_log(lines)
gen_sendto(log,p.stdin)
• Example: Receiver
# netcons.py
import sys
for r in gen_recvfrom(sys.stdin):
print r
A Subprocess Target
• Bridging coroutines over a pipe/socket
@coroutine
def co_sendto(f):
try:
while True:
item = (yield)
pickle.dump(item,f)
f.flush()
except GeneratorExit:
f.close()
def co_recvfrom(f,target):
try:
while True:
item = pickle.load(f)
target.send(item)
except EOFError:
target.close()
A Process Example
• A parent process
# Launch a child process
import subprocess
p = subprocess.Popen(['python','child.py'],
stdin=subprocess.PIPE)
• A child process
# child.py
import sys
...
co_recvfrom(sys.stdin,
filter_on_field("route","22",
filter_on_field("direction","North Bound",
bus_locations())))
A Picture
• Here is an overview of the last example
Main Program:
    xml.sax.parse --> EventHandler --> buses_to_dicts
                                            |
                                          (pipe)
                                            |
Subprocess:
    filter_on_field --> filter_on_field --> bus_locations
Wrap Up
Code Reuse
• There is an interesting reuse element
• You create a lot of small processing parts
and glue them together to build larger apps
• Personally, I like it a lot better than what I
see people doing with various OO patterns
involving callbacks (e.g., the strategy design
pattern and variants)
Shameless Plug
• Further details on useful applications of
generators and coroutines will be featured in
the "Python Essential Reference, 4th Edition"
• Look for it (Summer 2009)
• I also teach Python classes