The High Performance Python Landscape by Ian Ozsvald

The High Performance Python Landscape
- profiling and fast calculation
Ian Ozsvald @IanOzsvald MorConsulting.com

Ian@MorConsulting.com @IanOzsvald
PyDataLondon February 2014
What is “high performance”?
● Profiling to understand system behaviour
● We often ignore this step...
● Speeding up the bottleneck
● Keeps you on 1 machine (if possible)
● Keeping team speed high

“High Performance Python”
• “Practical Performant Programming for Humans”
• Please join the mailing list via IanOzsvald.com

cProfile
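The slide shows only a cProfile screenshot. As a minimal sketch of how a function-level profile of this kind might be collected from inside Python, the block below profiles a small pure-Python Julia kernel like the one listed on the following slides. The grid size, the constant `c`, and the helper name `build_inputs` are illustrative assumptions, not values from the talk.

```python
import cProfile
import io
import pstats


def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output


def build_inputs(width=50, c=complex(-0.62772, -0.42193)):
    """Hypothetical helper: a width x width grid over [-1.8, 1.8] squared."""
    step = 3.6 / (width - 1)
    coords = [-1.8 + k * step for k in range(width)]
    zs = [complex(x, y) for y in coords for x in coords]
    cs = [c] * len(zs)
    return zs, cs


zs, cs = build_inputs()

# Profile only the call of interest rather than the whole script.
profiler = cProfile.Profile()
profiler.enable()
output = calculate_z_serial_purepython(300, zs, cs)
profiler.disable()

# Sort by cumulative time and capture the report as a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

A whole script can also be profiled from the command line with `python -m cProfile -s cumulative script.py`, where `script.py` stands for your own file.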

line_profiler
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython(maxiter, zs, cs):
    12         1         6870   6870.0      0.0      output = [0] * len(zs)
    13   1000001       781959      0.8      0.8      for i in range(len(zs)):
    14   1000000       767224      0.8      0.8          n = 0
    15   1000000       843432      0.8      0.8          z = zs[i]
    16   1000000       786013      0.8      0.8          c = cs[i]
    17  34219980     36492596      1.1     36.2          while abs(z) < 2 and n < maxiter:
    18  33219980     32869046      1.0     32.6              z = z * z + c
    19  33219980     27371730      0.8     27.2              n += 1
    20   1000000       890837      0.9      0.9          output[i] = n
    21         1            4      4.0      0.0      return output
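The `@profile` decorator in the listing above is supplied by line_profiler's `kernprof` runner, so the same script fails with a `NameError` under a plain interpreter. One common workaround, sketched here as an assumption about how the script might be kept runnable both ways, is a no-op fallback decorator.

```python
import builtins

# kernprof (from line_profiler) injects `profile` into builtins when it runs
# the script. Under a plain interpreter the name is missing, so fall back to
# a decorator that returns the function unchanged.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func


@profile
def calculate_z_serial_purepython(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while abs(z) < 2 and n < maxiter:
            z = z * z + c
            n += 1
        output[i] = n
    return output


# Tiny illustrative input: one point that never escapes, one already outside.
result = calculate_z_serial_purepython(10, [0j, 2 + 2j], [0j, 0j])
print(result)
```

With line_profiler installed, `kernprof -l -v script.py` produces a per-line table like the one on the slide; `script.py` is a placeholder for your own file.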

memory_profiler
Line #    Mem usage    Increment   Line Contents
================================================
     9   89.934 MiB    0.000 MiB   @profile
    10                             def calculate_z_serial_purepython(maxiter, zs, cs):

    12   97.566 MiB    7.633 MiB       output = [0] * len(zs)
    13  130.215 MiB   32.648 MiB       for i in range(len(zs)):
    14  130.215 MiB    0.000 MiB           n = 0
    15  130.215 MiB    0.000 MiB           z = zs[i]
    16  130.215 MiB    0.000 MiB           c = cs[i]
    17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:
    18  130.215 MiB    0.000 MiB               z = z * z + c
    19  130.215 MiB    0.000 MiB               n += 1
    20  130.215 MiB    0.000 MiB           output[i] = n
    21  122.582 MiB  -7.633 MiB       return output
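The 7.633 MiB increment on line 12 is roughly what a list of 1,000,000 pointer-sized slots costs on a 64-bit build. A rough way to check that figure without memory_profiler is the standard library's `tracemalloc`. It is a different tool: it traces Python-level allocations rather than process memory, so its numbers will not match the listing exactly. The variable names are illustrative.

```python
import sys
import tracemalloc

n_items = 1_000_000

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
output = [0] * n_items          # same allocation as line 12 of the listing
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Difference in traced bytes across the allocation, converted to MiB.
increment_mib = (after - before) / 2**20
print(f"list of {n_items} slots: +{increment_mib:.3f} MiB traced")
print(f"sys.getsizeof reports {sys.getsizeof(output) / 2**20:.3f} MiB")
```

The list stores references only; every slot points at the same shared `0` object, which is why the cost is per slot and not per integer.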

memory_profiler mprof
https://github.com/scikit-learn/scikit-learn/pull/2248
Before & After an improvement

Transforming memory_profiler
into a resource profiler?

Profiling possibilities
● CPU (line by line or by function)
● Memory (line by line)
● Disk read/write (with some hacking)
● Network read/write (with some hacking)
● mmaps
● File handles
● Network connections
● Cache utilisation via libperf?
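Some of the resources listed above can already be sampled coarsely from the standard library. The sketch below assumes a Unix-like system (the `resource` module is not available on Windows) and reads per-process counters before and after a stand-in workload. The helper name `snapshot` and the workload are hypothetical, and this is per-process accounting, not the line-by-line resource profiler the slide asks for.

```python
import resource


def snapshot():
    """Return a few per-process counters from getrusage (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "cpu_user_s": usage.ru_utime,
        "cpu_system_s": usage.ru_stime,
        "max_rss": usage.ru_maxrss,       # units are platform dependent
        "block_reads": usage.ru_inblock,
        "block_writes": usage.ru_oublock,
    }


before = snapshot()
data = [i * i for i in range(200_000)]    # stand-in workload
after = snapshot()

# max_rss is a running peak, so report it directly instead of as a delta.
delta = {key: after[key] - before[key] for key in before if key != "max_rss"}
print("peak RSS so far:", after["max_rss"])
print("deltas:", delta)
```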

Cython 0.20 (pyx annotations)
#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output
Pure CPython lists code 12s
Cython lists runtime 0.19s
Cython numpy runtime 0.16s
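The Cython version replaces the `abs(z) < 2` test from the profiled listing with `z.real * z.real + z.imag * z.imag < 4`, which compares squared magnitudes and so avoids a square root on every iteration. A small pure-Python sketch can check that the two loop conditions give the same iteration counts on a sample grid. The grid, the constant `c`, and the function names are illustrative assumptions.

```python
def count_with_abs(maxiter, z, c):
    """Iteration count using the abs(z) < 2 escape test."""
    n = 0
    while n < maxiter and abs(z) < 2:
        z = z * z + c
        n += 1
    return n


def count_with_squares(maxiter, z, c):
    """Iteration count using the squared-magnitude < 4 escape test."""
    n = 0
    while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
        z = z * z + c
        n += 1
    return n


c = complex(-0.62772, -0.42193)           # illustrative constant
coords = [-1.8 + k * 0.4 for k in range(10)]
zs = [complex(x, y) for y in coords for x in coords]

abs_counts = [count_with_abs(50, z, c) for z in zs]
square_counts = [count_with_squares(50, z, c) for z in zs]
print("identical counts:", abs_counts == square_counts)
```

The two tests are mathematically equivalent; in floating point they could in principle disagree for a value lying almost exactly on the radius-2 boundary, which is one reason to verify on real inputs before swapping them.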

Cython + numpy + OMP nogil
#cython: boundscheck=False
from cython.parallel import parallel, prange
import numpy as np
cimport numpy as np
def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n = n + 1
            output[i] = n
    return output
Runtime 0.05s

ShedSkin 0.9.4 annotations
def calculate_z(maxiter, zs, cs):        # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    output = [0] * len(zs)               # [list(int)]
    for i in range(len(zs)):             # [__iter(int)]
        n = 0                            # [int]
        z = zs[i]                        # [complex]
        c = cs[i]                        # [complex]
        while n < maxiter and (… <4):    # [complex]
            z = z * z + c                # [complex]
            n += 1                       # [int]
        output[i] = n                    # [int]
    return output                        # [list(int)]
Couldn't we generate Cython pyx?
Runtime 0.22s

Pythran (0.40)
#pythran export calculate_z_serial_purepython(int, complex list, complex list)
def calculate_z_serial_purepython(maxiter, zs, cs):
…
Support for OpenMP on numpy arrays
Author Serge made an overnight fix, superb support!
List Runtime 0.4s
#pythran export calculate_z(int, complex[], complex[], int[])
…
#omp parallel for schedule(dynamic)
OMP numpy Runtime 0.10s

PyPy nightly (and numpypy)
● “It just works” on Python 2.7 code
● Clever list strategies (e.g. unboxed, uniform)
● Little support for pre-existing C extensions (e.g. the existing numpy)
● multiprocessing, IPython etc all work fine
● Python list code runtime: 0.3s
● (pypy)numpy support is incomplete, bugs are tackled (numpy runtime 5s [CPython+numpy 56s])

Numba 0.12
from numba import jit
@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output, had to pass it in
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        #while n < maxiter and abs(z) < 2:  # abs unrecognised
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    #return output
Runtime 0.4s
Some Python 3 support, some GPU
prange support missing (was in 0.11)?
0.12 introduces temp limitations

Tool Tradeoffs
● PyPy: no learning curve (pure Python only), an easy win?
● ShedSkin: easy (pure Python only) but fairly rare
● Cython on pure Python: hours to learn, team cost low (and lots of online help)
● Cython + numpy + OMP: days+ to learn, heavy team cost?
● Numba/Pythran: hours to learn, install a bit tricky (Anaconda easiest for Numba)
● Pythran OMP: very impressive result for little effort
● Numba: big toolchain which might hurt productivity?
● (numexpr not covered: great for numpy and easy to use)

Wrap up
● Our profiling options should be richer
● 4-12 physical CPU cores commonplace
● Cost of hand-annotating code is reduced agility
● JITs/AST compilers are getting fairly good, manual intervention still gives best results
BUT! CONSIDER:
● Automation should (probably) be embraced ($CPUs < $humans) as team velocity is probably higher

Thank You
• Ian@IanOzsvald.com
• @IanOzsvald
• MorConsulting.com
• Annotate.io
• GitHub/IanOzsvald
