Python for Data Analysis
Wes McKinney
Python for Data Analysis
by Wes McKinney
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or [email protected].
Editors: Julie Steele and Meghan Blanchette
Production Editor: Melanie Yarbrough
Copyeditor: Teresa Exley
Proofreader: BIM Publishing Services
Indexer: BIM Publishing Services
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Python for Data Analysis, the cover image of a golden-tailed tree shrew, and related
trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume
no responsibility for errors or omissions, or for damages resulting from the use of the information
contained herein.
ISBN: 978-1-449-31979-3
[LSI]
1349356084
Table of Contents
Preface
1. Preliminaries
What Is This Book About?
Why Python for Data Analysis?
Python as Glue
Solving the “Two-Language” Problem
Why Not Python?
Essential Python Libraries
NumPy
pandas
matplotlib
IPython
SciPy
Installation and Setup
Windows
Apple OS X
GNU/Linux
Python 2 and Python 3
Integrated Development Environments (IDEs)
Community and Conferences
Navigating This Book
Code Examples
Data for Examples
Import Conventions
Jargon
Acknowledgements
2. Introductory Examples
1.usa.gov data from bit.ly
Counting Time Zones in Pure Python
Counting Time Zones with pandas
MovieLens 1M Data Set
Measuring rating disagreement
US Baby Names 1880-2010
Analyzing Naming Trends
Conclusions and The Path Ahead
Operations between Arrays and Scalars
Basic Indexing and Slicing
Boolean Indexing
Fancy Indexing
Transposing Arrays and Swapping Axes
Universal Functions: Fast Element-wise Array Functions
Data Processing Using Arrays
Expressing Conditional Logic as Array Operations
Mathematical and Statistical Methods
Methods for Boolean Arrays
Sorting
Unique and Other Set Logic
File Input and Output with Arrays
Storing Arrays on Disk in Binary Format
Saving and Loading Text Files
Linear Algebra
Random Number Generation
Example: Random Walks
Simulating Many Random Walks at Once
Other pandas Topics
Integer Indexing
Panel Data
8. Plotting and Visualization
A Brief matplotlib API Primer
Figures and Subplots
Colors, Markers, and Line Styles
Ticks, Labels, and Legends
Annotations and Drawing on a Subplot
Saving Plots to File
matplotlib Configuration
Plotting Functions in pandas
Line Plots
Bar Plots
Histograms and Density Plots
Scatter Plots
Plotting Maps: Visualizing Haiti Earthquake Crisis Data
Python Visualization Tool Ecosystem
Chaco
mayavi
Other Packages
The Future of Visualization Tools?
10. Time Series
Date and Time Data Types and Tools
Converting between string and datetime
Time Series Basics
Indexing, Selection, Subsetting
Time Series with Duplicate Indices
Date Ranges, Frequencies, and Shifting
Generating Date Ranges
Frequencies and Date Offsets
Shifting (Leading and Lagging) Data
Time Zone Handling
Localization and Conversion
Operations with Time Zone-aware Timestamp Objects
Operations between Different Time Zones
Periods and Period Arithmetic
Period Frequency Conversion
Quarterly Period Frequencies
Converting Timestamps to Periods (and Back)
Creating a PeriodIndex from Arrays
Resampling and Frequency Conversion
Downsampling
Upsampling and Interpolation
Resampling with Periods
Time Series Plotting
Moving Window Functions
Exponentially-weighted functions
Binary Moving Window Functions
User-Defined Moving Window Functions
Performance and Memory Usage Notes
Rolling Correlation and Linear Regression
Index
Preface
The scientific Python ecosystem of open source libraries has grown substantially over
the last 10 years. By late 2011, I had long felt that the lack of centralized learning
resources for data analysis and statistical applications was a stumbling block for new
Python programmers engaged in such work. Key projects for data analysis (especially
NumPy, IPython, matplotlib, and pandas) had also matured enough that a book written
about them would likely not go out-of-date very quickly. Thus, I mustered the nerve
to embark on this writing project. This is the book that I wish existed when I started
using Python for data analysis in 2007. I hope you find it useful and are able to apply
these tools productively in your work.
This icon indicates a warning or caution.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/python_for_data_analysis.
To comment or ask technical questions about this book, send email to
[email protected].
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
CHAPTER 1
Preliminaries
Why Python for Data Analysis?
For many people (myself among them), the Python language is easy to fall in love with.
Since its first appearance in 1991, Python has become one of the most popular dynamic
programming languages, along with Perl, Ruby, and others. Python and Ruby have
become especially popular in recent years for building websites using their numerous
web frameworks, like Rails (Ruby) and Django (Python). Such languages are often
called scripting languages as they can be used to write quick-and-dirty small programs,
or scripts. I don’t like the term “scripting language” as it carries a connotation that they
cannot be used for building mission-critical software. Among interpreted languages,
Python is distinguished by its large and active scientific computing community.
Adoption of Python for scientific computing in both industry applications and academic
research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python
will inevitably draw comparisons with the many other domain-specific open source
and commercial programming languages and tools in wide use, such as R, MATLAB,
SAS, Stata, and others. In recent years, Python’s improved library support (primarily
pandas) has made it a strong alternative for data manipulation tasks. Combined with
Python’s strength in general-purpose programming, it is an excellent choice as a single
language for building data-centric applications.
Python as Glue
Part of Python’s success as a scientific computing platform is the ease of integrating C,
C++, and FORTRAN code. Most modern computing environments share a similar set
of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration,
fast Fourier transforms, and other such algorithms. The same story has held true for
many companies and national labs that have used Python to glue together 30 years’
worth of legacy software.
Most programs consist of small portions of code where most of the time is spent, with
large amounts of “glue code” that doesn’t run often. In many cases, the execution time
of the glue code is insignificant; effort is most fruitfully invested in optimizing the
computational bottlenecks, sometimes by moving the code to a lower-level language
like C.
In the last few years, the Cython project (http://cython.org) has become one of the
preferred ways of both creating fast compiled extensions for Python and also interfacing
with C and C++ code.
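To make the bottleneck idea concrete, here is a minimal sketch (not taken from the book's example code) that times a hand-written Python loop against the built-in sum, which CPython implements in C. The function name python_loop_sum and the input size are illustrative assumptions, and the timings will differ from machine to machine.

```python
import timeit


def python_loop_sum(values):
    """Sum numbers with an explicit interpreted loop (the 'bottleneck')."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(100_000))

# Both approaches must agree before it makes sense to compare their speed.
assert python_loop_sum(values) == sum(values)

# timeit.timeit runs each callable `number` times and returns total seconds.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)

print("interpreted loop: %.4f s" % loop_seconds)
print("built-in sum (C): %.4f s" % builtin_seconds)
```

The surrounding script is the glue code; only the inner summation is worth moving to compiled code, whether through a built-in, NumPy, or a Cython extension.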
ideas to be part of a larger production system written in, say, Java, C#, or C++. What
people are increasingly finding is that Python is a suitable language not only for doing
research and prototyping but also for building the production systems. I believe that
more and more companies will go down this path as there are often significant
organizational benefits to having both scientists and technologists using the same set of
programmatic tools.
NumPy
NumPy, short for Numerical Python, is the foundational package for scientific
computing in Python. The majority of this book will be based on NumPy and libraries built
on top of NumPy. It provides, among other things:
• A fast and efficient multidimensional array object ndarray
• Functions for performing element-wise computations with arrays or mathematical
operations between arrays
• Tools for reading and writing array-based data sets to disk
• Linear algebra operations, Fourier transform, and random number generation
• Tools for connecting C, C++, and Fortran code to Python
Beyond the fast array-processing capabilities that NumPy adds to Python, one of its
primary purposes with regard to data analysis is as the primary container for data to
be passed between algorithms. For numerical data, NumPy arrays are a much more
efficient way of storing and manipulating data than the other built-in Python data
structures. Also, libraries written in a lower-level language, such as C or Fortran, can
operate on the data stored in a NumPy array without copying any data.
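As a rough illustration of the first two items in the list above, the following sketch builds a small ndarray and applies element-wise operations and a reduction to it. The array values are made up for the example.

```python
import numpy as np

# A 2 x 3 array of floats; every element shares a single dtype.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Element-wise arithmetic: no explicit Python loop is needed.
scaled = data * 10
doubled = data + data

# A reduction along axis 0 collapses the rows, giving one mean per column.
column_means = data.mean(axis=0)

print(data.shape, data.dtype)   # (2, 3) float64
print(scaled)
print(column_means)             # [2.5 3.5 4.5]
```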
pandas
pandas provides rich data structures and functions designed to make working with
structured data fast, easy, and expressive. It is, as you will see, one of the critical
ingredients enabling Python to be a powerful and productive data analysis environment.
The primary object in pandas that will be used in this book is the DataFrame, a
two-dimensional tabular, column-oriented data structure with both row and column labels:
>>> frame
    total_bill  tip   sex     smoker  day  time    size
1   16.99       1.01  Female  No      Sun  Dinner  2
2   10.34       1.66  Male    No      Sun  Dinner  3
3   21.01       3.5   Male    No      Sun  Dinner  3
4   23.68       3.31  Male    No      Sun  Dinner  2
5   24.59       3.61  Female  No      Sun  Dinner  4
6   25.29       4.71  Male    No      Sun  Dinner  4
7   8.77        2     Male    No      Sun  Dinner  2
8   26.88       3.12  Male    No      Sun  Dinner  4
9   15.04       1.96  Male    No      Sun  Dinner  2
10  14.78       3.23  Male    No      Sun  Dinner  2
pandas combines the high performance array-computing features of NumPy with the
flexible data manipulation capabilities of spreadsheets and relational databases (such
as SQL). It provides sophisticated indexing functionality to make it easy to reshape,
slice and dice, perform aggregations, and select subsets of data. pandas is the primary
tool that we will use in this book.
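The following sketch rebuilds the first four rows of the table shown above (a subset of its columns) and then selects a subset and performs an aggregation. It assumes a reasonably recent pandas; the variable names big_parties and mean_tip_by_sex are illustrative.

```python
import pandas as pd

# First four rows of the table shown above, keyed by column name.
frame = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.50, 3.31],
        "sex": ["Female", "Male", "Male", "Male"],
        "size": [2, 3, 3, 2],
    },
    index=[1, 2, 3, 4],
)

# Select a subset of rows with a boolean condition on one column.
# frame["size"] is used because frame.size is a built-in DataFrame attribute.
big_parties = frame[frame["size"] >= 3]

# Aggregate: mean tip for each value of "sex".
mean_tip_by_sex = frame.groupby("sex")["tip"].mean()

print(big_parties)
print(mean_tip_by_sex)
```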
For financial users, pandas features rich, high-performance time series functionality
and tools well-suited for working with financial data. In fact, I initially designed pandas
as an ideal tool for financial data analysis applications.
For users of the R language for statistical computing, the DataFrame name will be
familiar, as the object was named after the similar R data.frame object. They are not
the same, however; the functionality provided by data.frame in R is essentially a strict
subset of that provided by the pandas DataFrame. While this is a book about Python, I
will occasionally draw comparisons with R as it is one of the most widely-used open
source data analysis environments and will be familiar to many readers.
The pandas name itself is derived from panel data, an econometrics term for
multidimensional structured data sets, and Python data analysis itself.
matplotlib
matplotlib is the most popular Python library for producing plots and other 2D data
visualizations. It was originally created by John D. Hunter (JDH) and is now maintained
by a large team of developers. It is well-suited for creating plots suitable for publication.
It integrates well with IPython (see below), thus providing a comfortable interactive
environment for plotting and exploring data. The plots are also interactive; you can
zoom in on a section of the plot and pan around the plot using the toolbar in the plot
window.
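A minimal plotting sketch follows. It assumes the non-interactive Agg backend so that it runs without a display, which means the zoom and pan toolbar described above is not available here; the labels and title are made up for the example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(10), marker="o")   # a straight line through the points 0..9
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first matplotlib plot")

# Write the figure as PNG bytes; pass a filename instead to save it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
png_bytes = buffer.getvalue()
plt.close(fig)
```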
IPython
IPython is the component in the standard scientific Python toolset that ties everything
together. It provides a robust and productive environment for interactive and
exploratory computing. It is an enhanced Python shell designed to accelerate the writing,
testing, and debugging of Python code. It is particularly useful for interactively working
with data and visualizing data with matplotlib. IPython is usually involved with the
majority of my Python work, including running, debugging, and testing code.
Aside from the standard terminal-based IPython shell, the project also provides:
• A Mathematica-like HTML notebook for connecting to IPython through a web
browser (more on this later)
• A Qt framework-based GUI console with inline plotting, multiline editing, and
syntax highlighting
• An infrastructure for interactive parallel and distributed computing
I will devote a chapter to IPython and how to get the most out of its features. I strongly
recommend using it while working through this book.
SciPy
SciPy is a collection of packages addressing a number of different standard problem
domains in scientific computing. Here is a sampling of the packages included:
• scipy.integrate: numerical integration routines and differential equation solvers
• scipy.linalg: linear algebra routines and matrix decompositions extending
beyond those provided in numpy.linalg
• scipy.optimize: function optimizers (minimizers) and root finding algorithms
• scipy.signal: signal processing tools
• scipy.sparse: sparse matrices and sparse linear system solvers
• scipy.special: wrapper around SPECFUN, a Fortran library implementing many
common mathematical functions, such as the gamma function
• scipy.stats: standard continuous and discrete probability distributions (density
functions, samplers, cumulative distribution functions), various statistical tests,
and more descriptive statistics
• scipy.weave: tool for using inline C++ code to accelerate array computations
Together NumPy and SciPy form a reasonably complete computational replacement
for much of MATLAB along with some of its add-on toolboxes.
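To give a feel for a few of the packages listed above, here is a short sketch using scipy.integrate, scipy.optimize, and scipy.special. The particular functions being integrated and solved are arbitrary choices made for illustration.

```python
import math

from scipy import integrate, optimize, special

# scipy.integrate: integrate sin(x) from 0 to pi (the exact answer is 2).
area, abs_error = integrate.quad(math.sin, 0.0, math.pi)

# scipy.optimize: find the root of cos(x) - x on the interval [0, 1].
root = optimize.brentq(lambda x: math.cos(x) - x, 0.0, 1.0)

# scipy.special: the gamma function, where gamma(n) == (n - 1)! for integers.
gamma_5 = special.gamma(5)

print(area, root, gamma_5)
```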
• Scientific Python base: NumPy, SciPy, matplotlib, and IPython. These are all
included in EPDFree.
• IPython Notebook dependencies: tornado and pyzmq. These are included in
EPDFree.
• pandas (version 0.8.2 or higher).
At some point while reading you may wish to install one or more of the following
packages: statsmodels, PyTables, PyQt (or equivalently, PySide), xlrd, lxml, basemap,
pymongo, and requests. These are used in various examples. Installing these optional
libraries is not necessary, and I would suggest waiting until you need them. For
example, installing PyQt or PyTables from source on OS X or Linux can be rather
arduous. For now, it’s most important to get up and running with the bare minimum:
EPDFree and pandas.
For information on each Python package and links to binary installers or other help,
see the Python Package Index (PyPI, http://pypi.python.org). This is also an excellent
resource for finding new Python packages.
Windows
To get started on Windows, download the EPDFree installer from http://www.enthought.com,
which should be an MSI installer named like epd_free-7.3-1-win-x86.msi.
Run the installer and accept the default installation location C:\Python27. If
you had previously installed Python in this location, you may want to delete it manually
first (or using Add/Remove Programs).
Next, you need to verify that Python has been successfully added to the system path
and that there are no conflicts with any previously installed Python versions. First, open a
command prompt by going to the Start Menu and starting the Command Prompt
application, also known as cmd.exe. Try starting the Python interpreter by typing
python. You should see a message that matches the version of EPDFree you installed:
C:\Users\Wes>python
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 14:30:37) on win32
Type "credits", "demo" or "enthought" for more information.
>>>
If you see a message for a different version of EPD or it doesn’t work at all, you will
need to clean up your Windows environment variables. On Windows 7 you can start
typing “environment variables” in the programs search field and select Edit environment
variables for your account. On Windows XP, you will have to go to Control
Panel > System > Advanced > Environment Variables. In the window that pops up,
you are looking for the Path variable. It needs to contain the following two directory
paths, separated by semicolons:
C:\Python27;C:\Python27\Scripts
If you installed other versions of Python, be sure to delete any other Python-related
directories from both the system and user Path variables. After making a path
alteration, you have to restart the command prompt for the changes to take effect.
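If you prefer to inspect the path from Python itself, the sketch below lists the Python-related entries on PATH and shows which executable a bare python command would resolve to. It is an assumption-laden helper rather than part of the installation procedure: it relies on shutil.which, which exists only in Python 3.3 and later, so it will not run on the Python 2.7 interpreter that EPDFree installs.

```python
import os
import shutil
import sys

# PATH is a single string; os.pathsep is ";" on Windows and ":" elsewhere.
path_entries = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]

# Entries whose name mentions "python", in the order they are searched.
python_dirs = [p for p in path_entries if "python" in p.lower()]

# The executable a bare `python` command would resolve to (None if not found).
resolved = shutil.which("python")

print("Running interpreter:", sys.executable)
print("`python` on PATH   :", resolved)
for entry in python_dirs:
    print("PATH entry         :", entry)
```

More than one entry in the output is a hint that an older installation may still be shadowing the one you want.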
Once you can launch Python successfully from the command prompt, you need to
install pandas. The easiest way is to download the appropriate binary installer from
http://pypi.python.org/pypi/pandas. For EPDFree, this should be pandas-0.9.0.win32-py2.7.exe.
After you run this, let’s launch IPython and check that things are installed
correctly by importing pandas and making a simple matplotlib plot:
C:\Users\Wes>ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)|
Type "copyright", "credits" or "license" for more information.
In [1]: import pandas
In [2]: plot(arange(10))
If successful, there should be no error messages and a plot window will appear. You
can also check that the IPython HTML notebook can be successfully run by typing:
$ ipython notebook --pylab=inline
EPDFree on Windows contains only 32-bit executables. If you want or need a 64-bit
setup on Windows, using EPD Full is the most painless way to accomplish that. If you
would rather install from scratch and not pay for an EPD subscription, Christoph
Gohlke at the University of California, Irvine, publishes unofficial binary installers for
all of the book’s necessary packages (http://www.lfd.uci.edu/~gohlke/pythonlibs/) for 32-
and 64-bit Windows.
Apple OS X
To get started on OS X, you must first install Xcode, which includes Apple’s suite of
software development tools. The necessary component for our purposes is the gcc C
and C++ compiler suite. The Xcode installer can be found on the OS X install DVD
that came with your computer or downloaded from Apple directly.
Once you’ve installed Xcode, launch the terminal (Terminal.app) by navigating to
Applications > Utilities. Type gcc and press enter. You should see something
like:
$ gcc
i686-apple-darwin10-gcc-4.2.1: no input files
Now you need to install EPDFree. Download the installer, which should be a disk image
named something like epd_free-7.3-1-macosx-i386.dmg. Double-click the .dmg file to
mount it, then double-click the .mpkg file inside to run the installer.
When the installer runs, it automatically appends the EPDFree executable path to
your .bash_profile file. This is located at /Users/your_uname/.bash_profile:
# Setting PATH for EPD_free-7.3-1
PATH="/Library/Frameworks/Python.framework/Versions/Current/bin:${PATH}"
export PATH
Should you encounter any problems in the following steps, you’ll want to inspect
your .bash_profile and potentially add the above directory to your path.
Now, it’s time to install pandas. Execute this command in the terminal:
$ sudo easy_install pandas
Searching for pandas
Reading http://pypi.python.org/simple/pandas/
Reading http://pandas.pydata.org
Reading http://pandas.sourceforge.net
Best match: pandas 0.9.0
Downloading http://pypi.python.org/packages/source/p/pandas/pandas-0.9.0.zip
Processing pandas-0.9.0.zip
Writing /tmp/easy_install-H5mIX6/pandas-0.9.0/setup.cfg
Running pandas-0.9.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-H5mIX6/
pandas-0.9.0/egg-dist-tmp-RhLG0z
Adding pandas 0.9.0 to easy-install.pth file
Installed /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/
site-packages/pandas-0.9.0-py2.7-macosx-10.5-i386.egg
Processing dependencies for pandas
Finished processing dependencies for pandas
To verify everything is working, launch IPython in Pylab mode and test importing
pandas, then making a plot interactively:
$ ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.
In [1]: import pandas
In [2]: plot(arange(10))
If this succeeds, a plot window with a straight line should pop up.
GNU/Linux
Linux details will vary a bit depending on your Linux flavor, but here I give details for
Debian-based GNU/Linux systems like Ubuntu and Mint. Setup is similar to OS X with
the exception of how EPDFree is installed. The installer is a shell script that must be
executed in the terminal. Depending on whether you have a 32-bit or 64-bit system,
you will either need to install the x86 (32-bit) or x86_64 (64-bit) installer. You will then
have a file named something similar to epd_free-7.3-1-rh5-x86_64.sh. To install it,
execute this script with bash:
$ bash epd_free-7.3-1-rh5-x86_64.sh
After accepting the license, you will be presented with a choice of where to put the
EPDFree files. I recommend installing the files in your home directory, say
/home/wesm/epd (substituting your own username for wesm).
Once the installer has finished, you need to add EPDFree’s bin directory to your
$PATH variable. If you are using the bash shell (the default in Ubuntu, for example), this
means adding the following line to your .bashrc:
export PATH=/home/wesm/epd/bin:$PATH
Obviously, substitute the installation directory you used for /home/wesm/epd/. After
doing this you can either start a new terminal process or execute your .bashrc again
with source ~/.bashrc.
You need a C compiler such as gcc to move forward; many Linux distributions include
gcc, but others may not. On Debian systems, you can install gcc by executing:
sudo apt-get install gcc
If you type gcc on the command line it should say something like:
$ gcc
gcc: no input files
If you installed EPDFree as root, you may need to add sudo to the pandas installation
command (easy_install pandas, as in the OS X section) and enter the sudo or root
password. To verify things are working, perform the same checks as in the OS X section.
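The manual checks in the installation sections above can also be scripted. The following is a minimal sketch, assuming a current Python 3 interpreter (importlib.util.find_spec is not available in Python 2.7); the function name check_packages and the REQUIRED list are illustrative, with the list taken from the minimum set of packages named earlier in this chapter.

```python
import importlib.util

# Packages this chapter treats as the bare minimum.
REQUIRED = ["numpy", "scipy", "matplotlib", "IPython", "pandas"]


def check_packages(names):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, found in status.items():
    print("%-12s %s" % (name, "ok" if found else "MISSING"))
```

Finding a package this way does not import it, so a broken installation can still fail later; importing pandas and drawing a plot remains the more thorough test.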
Community and Conferences
Outside of an Internet search, the scientific Python mailing lists are generally helpful
and responsive to questions. A few to take a look at are:
• pydata: a Google Group list for questions related to Python for data analysis and
pandas
• pystatsmodels: for statsmodels or pandas-related questions
• numpy-discussion: for NumPy-related questions
• scipy-user: for general SciPy or scientific Python questions
I deliberately did not post URLs for these in case they change. They can be easily located
via Internet search.
Each year many conferences are held all over the world for Python programmers. PyCon
and EuroPython are the two main general Python conferences in the United States and
Europe, respectively. SciPy and EuroSciPy are scientific-oriented Python conferences
where you will likely find many “birds of a feather” if you become more involved with
using Python for data analysis after reading this book.
I encourage you to download the data and use it to replicate the book’s code examples
and experiment with the tools presented in each chapter. I will happily accept
contributions, scripts, IPython notebooks, or any other materials you wish to contribute to
the book's repository for all to enjoy.
Code Examples
Most of the code examples in the book are shown with input and output as it would
appear executed in the IPython shell.
In [5]: code
Out[5]: output
At times, for clarity, multiple code examples will be shown side by side. These should
be read left to right and executed separately.
In [5]: code         In [6]: code2
Out[5]: output       Out[6]: output2
Import Conventions
The Python community has adopted a number of naming conventions for
commonly-used modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
This means that when you see np.arange, this is a reference to the arange function in
NumPy. This is done as it’s considered bad practice in Python software development
to import everything (from numpy import *) from a large package like NumPy.
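A small sketch of why the prefix helps: with the np. prefix it is always clear whether a name refers to NumPy or to a Python built-in, whereas a star import can replace built-ins such as sum with NumPy functions of the same name. The variable names below are illustrative.

```python
import numpy as np

values = np.arange(5)            # array([0, 1, 2, 3, 4])

# The np. prefix makes it obvious which sum is which.
builtin_total = sum([1, 2, 3])   # Python's built-in sum
numpy_total = np.sum(values)     # NumPy's array-aware sum

print(values, builtin_total, numpy_total)
```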
Jargon
I’ll use some terms common both to programming and data science that you may not
be familiar with. Thus, here are some brief definitions:
Munge/Munging/Wrangling
Describes the overall process of manipulating unstructured and/or messy data into
a structured or clean form. The word has snuck its way into the jargon of many
modern day data hackers. Munge rhymes with “lunge”.
Pseudocode
A description of an algorithm or process that takes a code-like form while likely
not being actual valid source code.
Syntactic sugar
Programming syntax which does not add new features, but makes something more
convenient or easier to type.
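Two of these terms can be shown in a few lines. The sketch below munges a small, made-up list of messy names, first with an explicit loop and then with a list comprehension, which is syntactic sugar for the same logic.

```python
raw = ["  Alice ", "BOB", "", " carol"]

# Explicit loop: strip whitespace, skip empty entries, normalize capitalization.
cleaned_loop = []
for name in raw:
    name = name.strip()
    if name:
        cleaned_loop.append(name.title())

# The same logic as a list comprehension: no new capability, just less typing.
cleaned_sugar = [name.strip().title() for name in raw if name.strip()]

assert cleaned_loop == cleaned_sugar
print(cleaned_sugar)   # ['Alice', 'Bob', 'Carol']
```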
Acknowledgements
It would have been difficult for me to write this book without the support of a large
number of people.
On the O’Reilly staff, I’m very grateful for my editors Meghan Blanchette and Julie
Steele who guided me through the process. Mike Loukides also worked with me in the
proposal stages and helped make the book a reality.
I received a wealth of technical review from a large cast of characters. In particular,
Martin Blais and Hugh White were incredibly helpful in improving the book’s
examples, clarity, and organization from cover to cover. James Long, Drew Conway,
Fernando Pérez, Brian Granger, Thomas Kluyver, Adam Klein, Josh Klein, Chang She, and
Stéfan van der Walt each reviewed one or more chapters, providing pointed feedback
from many different perspectives.
I got many great ideas for examples and data sets from friends and colleagues in the
data community, among them: Mike Dewar, Jeff Hammerbacher, James Johndrow,
Kristian Lum, Adam Klein, Hilary Mason, Chang She, and Ashley Williams.
I am of course indebted to the many leaders in the open source scientific Python
community who’ve built the foundation for my development work and gave encouragement
while I was writing this book: the IPython core team (Fernando Pérez, Brian Granger,
Min Ragan-Kelley, Thomas Kluyver, and others), John Hunter, Skipper Seabold, Travis
Oliphant, Peter Wang, Eric Jones, Robert Kern, Josef Perktold, Francesc Alted, Chris
Fonnesbeck, and too many others to mention. Several other people provided a great
deal of support, ideas, and encouragement along the way: Drew Conway, Sean Taylor,
Giuseppe Paleologo, Jared Lander, David Epstein, John Krowas, Joshua Bloom, Den
Pilsworth, John Myles-White, and many others I’ve forgotten.
I’d also like to thank a number of people from my formative years. First, my former
AQR colleagues who’ve cheered me on in my pandas work over the years: Alex
Reyfman, Michael Wong, Tim Sargen, Oktay Kurbanov, Matthew Tschantz, Roni Israelov,
Michael Katz, Chris Uga, Prasad Ramanan, Ted Square, and Hoon Kim. Lastly, my
academic advisors Haynes Miller (MIT) and Mike West (Duke).
On the personal side, Casey Dinkin provided invaluable day-to-day support during the
writing process, tolerating my highs and lows as I hacked together the final draft on
top of an already overcommitted schedule. Lastly, my parents, Bill and Kim, taught me
to always follow my dreams and to never settle for less.
CHAPTER 2
Introductory Examples
This book teaches you the Python tools to work productively with data. While readers
may have many different end goals for their work, the tasks required generally fall into
a number of different broad groups:
Interacting with the outside world
Reading and writing with a variety of file formats and databases.
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and
transforming data for analysis.
Transformation
Applying mathematical and statistical operations to groups of data sets to derive
new data sets. For example, aggregating a large table by group variables.
Modeling and computation
Connecting your data to statistical models, machine learning algorithms, or other
computational tools.
Presentation
Creating interactive or static graphical visualizations or textual summaries.
In this chapter I will show you a few data sets and some things we can do with them.
These examples are just intended to pique your interest and thus will only be explained
at a high level. Don’t worry if you have no experience with any of these tools; they will
be discussed in great detail throughout the rest of the book. In the code examples you’ll
see input and output prompts like In [15]:; these are from the IPython shell.
In the case of the hourly snapshots, each line in each file contains a common form of
web data known as JSON, which stands for JavaScript Object Notation. For example,
if we read just the first line of a file, we may see something like this:
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
Out[16]: '{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
(KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
"tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
Python has numerous built-in and third-party modules for converting a JSON string into
a Python dictionary object. Here I’ll use the json module and its loads function invoked
on each line in the sample file I downloaded:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
If you’ve never programmed in Python before, the last expression here is called a list
comprehension, which is a concise way of applying an operation (like json.loads) to a
collection of strings or other objects. Conveniently, iterating over an open file handle
gives you a sequence of its lines. The resulting object records is now a list of Python
dicts:
In [18]: records[0]
Out[18]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like
Gecko) Chrome/17.0.963.78 Safari/535.11',
u'al': u'en-US,en;q=0.8',
u'c': u'US',
u'cy': u'Danvers',
u'g': u'A6qOVH',
u'gr': u'MA',
u'h': u'wfLQtf',
u'hc': 1331822918,
u'hh': u'1.usa.gov',
u'l': u'orofrog',
u'll': [42.576698, -70.954903],
u'nk': 1,
u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
u't': 1331923247,
u'tz': u'America/New_York',
u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
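As a minimal, self-contained sketch of the same idea, the snippet below parses a few JSON lines held in memory instead of the downloaded file. The sample lines and the names sample and sample_records are made up for illustration; only json.loads and the list comprehension come from the text above.

```python
import io
import json

# Hypothetical stand-in for the downloaded file: one JSON object per line.
sample = io.StringIO(
    '{"a": "Mozilla/5.0", "tz": "America/New_York", "nk": 1}\n'
    '{"a": "GoogleMaps/RochesterNY", "tz": "America/Denver", "nk": 0}\n'
    '{"_heartbeat_": 1331923261}\n'
)

# Iterating over a file-like object yields its lines one at a time.
sample_records = [json.loads(line) for line in sample]

print(len(sample_records))
print(sample_records[0]['tz'])
```

The third made-up line has no tz key, which mirrors the situation that comes up shortly when every record is asked for its time zone.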
1. http://www.usa.gov/About/developer-resources/1usagov.shtml
Note that Python indices start at 0 and not 1 like some other languages (like R). It’s
now easy to access individual values within records by passing a string for the key you
wish to access:
In [19]: records[0]['tz']
Out[19]: u'America/New_York'
The u here in front of the quotation stands for unicode, a standard form of string en-
coding. Note that IPython shows the time zone string object representation here rather
than its print equivalent:
In [20]: print(records[0]['tz'])
America/New_York
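Suppose we want the time zone (the tz field) from every record. A first attempt with a
list comprehension fails:
In [25]: time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'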
Oops! Turns out that not all of the records have a time zone field. This is easy to handle
as we can add the check if 'tz' in rec at the end of the list comprehension:
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
u'America/Denver',
u'America/New_York',
u'America/Sao_Paulo',
u'America/New_York',
u'America/New_York',
u'Europe/Warsaw',
u'',
u'',
u'']
Just looking at the first 10 time zones we see that some of them are unknown (empty).
You can filter these out also but I’ll leave them in for now. Now, to produce counts by
time zone I’ll show two approaches: the harder way (using just the Python standard
library) and the easier way (using pandas). One way to do the counting is to use a dict
to store counts while we iterate through the time zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
If you know a bit more about the Python standard library, you might prefer to write
the same thing more briefly:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int) # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts
I put this logic in a function just to make it more reusable. To use it on the time zones,
just pass the time_zones list:
In [31]: counts = get_counts(time_zones)
In [32]: counts['America/New_York']
Out[32]: 1251
In [33]: len(time_zones)
Out[33]: 3440
If we want the top 10 time zones and their counts, we have to do a little bit of
dictionary acrobatics:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
We have then:
In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
(35, u'Europe/Madrid'),
(36, u'Pacific/Honolulu'),
(37, u'Asia/Tokyo'),
(74, u'Europe/London'),
(191, u'America/Denver'),
(382, u'America/Los_Angeles'),
(400, u'America/Chicago'),
(521, u''),
(1251, u'America/New_York')]
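The sort-and-slice step can also be written with heapq.nlargest from the standard library. This is only a sketch on made-up counts; top_counts_desc and sample_counts are hypothetical names that do not appear in the original text. Unlike top_counts above, it returns (key, count) pairs with the largest count first.

```python
import heapq

def top_counts_desc(count_dict, n=10):
    # Pick the n (key, count) pairs with the largest counts, largest first.
    return heapq.nlargest(n, count_dict.items(), key=lambda pair: pair[1])

# Made-up counts for illustration only.
sample_counts = {'America/New_York': 5, '': 2, 'Europe/London': 3, 'Asia/Tokyo': 1}
print(top_counts_desc(sample_counts, n=2))
```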
If you search the Python standard library, you may find the collections.Counter class,
which makes this task a lot easier:
In [49]: from collections import Counter
In [50]: counts = Counter(time_zones)
In [51]: counts.most_common(10)
Out[51]:
[(u'America/New_York', 1251),
(u'', 521),
(u'America/Chicago', 400),
(u'America/Los_Angeles', 382),
(u'America/Denver', 191),
(u'Europe/London', 74),
(u'Asia/Tokyo', 37),
(u'Pacific/Honolulu', 36),
(u'Europe/Madrid', 35),
(u'America/Sao_Paulo', 33)]
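The frame object displayed next is a pandas DataFrame. As a hedged sketch of how a list of record dicts can be turned into one, assuming pandas is installed and imported as pd (the records here are made up, and sample_frame is a hypothetical name):

```python
import pandas as pd

# Made-up records; the last one has no 'tz' key, like a heartbeat row.
sample_records = [
    {'a': 'Mozilla/5.0', 'tz': 'America/New_York'},
    {'a': 'GoogleMaps/RochesterNY', 'tz': ''},
    {'_heartbeat_': 1331923261},
]

# Each dict becomes a row; keys missing from a record show up as NA values.
sample_frame = pd.DataFrame(sample_records)
print(sample_frame['tz'])
```

This would explain why the summary below reports fewer non-null values for some columns than there are rows.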
In [292]: frame
Out[292]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns:
_heartbeat_ 120 non-null values
a 3440 non-null values
al 3094 non-null values
c 2919 non-null values
cy 2919 non-null values
g 3440 non-null values
gr 2919 non-null values
h 3440 non-null values
hc 3440 non-null values
hh 3440 non-null values
kw 93 non-null values
l 3440 non-null values
ll 2919 non-null values
nk 3440 non-null values
r 3440 non-null values
t 3440 non-null values
tz 3440 non-null values
u 3440 non-null values
dtypes: float64(4), object(14)
In [293]: frame['tz'][:10]
Out[293]:
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz
The output shown for the frame is the summary view, shown for large DataFrame ob-
jects. The Series object returned by frame['tz'] has a method value_counts that gives
us what we’re looking for:
In [294]: tz_counts = frame['tz'].value_counts()
In [295]: tz_counts[:10]
Out[295]:
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Then, we might want to make a plot of this data using the plotting library matplotlib. You
can do a bit of munging to fill in a substitute value for unknown and missing time zone
data in the records. The fillna function can replace missing (NA) values and unknown
(empty strings) values can be replaced by boolean array indexing:
In [296]: clean_tz = frame['tz'].fillna('Missing')
In [297]: clean_tz[clean_tz == ''] = 'Unknown'
In [298]: tz_counts = clean_tz.value_counts()
In [299]: tz_counts[:10]
Out[299]:
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
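The two kinds of cleanup described above can be tried on a tiny made-up Series. This is a sketch that assumes pandas is imported as pd; the values and the name sample_counts are hypothetical.

```python
import pandas as pd

# Made-up time zone column: None marks a missing value, '' an unknown one.
tz = pd.Series(['America/New_York', '', None, 'America/New_York', ''])

clean = tz.fillna('Missing')       # replace missing (NA) values
clean[clean == ''] = 'Unknown'     # boolean indexing replaces empty strings
sample_counts = clean.value_counts()
print(sample_counts)
```

Keeping "Missing" and "Unknown" as separate labels preserves the distinction between records with no tz field at all and records whose tz field is empty.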
Making a horizontal bar plot can be accomplished using the plot method on the
counts objects:
In [301]: tz_counts[:10].plot(kind='barh', rot=0)
See Figure 2-1 for the resulting figure. We’ll explore more tools for working with this
kind of data. For example, the a field contains information about the browser, device,
or application used to perform the URL shortening:
In [302]: frame['a'][1]
Out[302]: u'GoogleMaps/RochesterNY'
In [303]: frame['a'][50]
Out[303]: u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
In [304]: frame['a'][51]
Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (K
Parsing all of the interesting information in these “agent” strings may seem like a
daunting task. Luckily, once you have mastered Python’s built-in string functions and
regular expression capabilities, it is really not so bad. For example, we could split off
the first token in the string (corresponding roughly to the browser capability) and make
another summary of the user behavior:
In [305]: results = Series([x.split()[0] for x in frame.a.dropna()])
In [306]: results[:5]
Out[306]:
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0
In [307]: results.value_counts()[:8]
Out[307]:
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
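Where splitting on whitespace is too coarse, a regular expression can pull the leading token apart into a product and a version. The pattern below is an illustrative assumption rather than a complete user-agent parser, and the agent strings are made up in the style of the a field.

```python
import re
from collections import Counter

# Made-up agent strings in the style of the 'a' field.
agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11',
    'GoogleMaps/RochesterNY',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
]

# Capture "product/version" at the start of the string, when present.
token_re = re.compile(r'^(?P<product>[^/\s]+)/(?P<version>\S+)')

products = Counter()
for agent in agents:
    match = token_re.match(agent)
    if match:
        products[match.group('product')] += 1

print(products.most_common())
```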
Now, suppose you wanted to decompose the top time zones into Windows and non-
Windows users. As a simplification, let’s say that a user is on Windows if the string
'Windows' is in the agent string. Since some of the agents are missing, I’ll exclude these
from the data:
In [308]: cframe = frame[frame.a.notnull()]
In [310]: operating_system[:5]
Out[310]:
0 Windows
1 Not Windows
2 Windows
3 Not Windows
4 Windows
Name: a
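The statement that builds operating_system does not appear above. One way such labels might be computed, assuming NumPy is available and using the simplification just described, is sketched here on made-up agent strings (agents_frame, present, and os_labels are hypothetical names):

```python
import numpy as np
import pandas as pd

agents_frame = pd.DataFrame({'a': [
    'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'GoogleMaps/RochesterNY',
    None,
]})

# Drop rows whose agent string is missing, as in the text.
present = agents_frame[agents_frame['a'].notnull()]

# Label each remaining row by whether 'Windows' occurs in the agent string.
os_labels = pd.Series(
    np.where(present['a'].str.contains('Windows'), 'Windows', 'Not Windows'),
    index=present.index, name='a')
print(os_labels)
```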
Then, you can group the data by its time zone column and this new list of operating
systems:
In [311]: by_tz_os = cframe.groupby(['tz', operating_system])
The group counts, analogous to the value_counts function above, can be computed
using size. This result is then reshaped into a table with unstack:
In [312]: agg_counts = by_tz_os.size().unstack().fillna(0)
In [313]: agg_counts[:10]
Out[313]:
a Not Windows Windows
tz
245 276
Africa/Cairo 0 3
Africa/Casablanca 0 1
Africa/Ceuta 0 2
Africa/Johannesburg 0 1
Africa/Lusaka 0 1
America/Anchorage 4 1
America/Argentina/Buenos_Aires 1 0
America/Argentina/Cordoba 0 1
America/Argentina/Mendoza 0 1
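The size, unstack, and fillna chain is easier to follow on a handful of rows. The following sketch uses a made-up frame with a plain os column in place of the separate operating_system Series; the names visits and table are hypothetical.

```python
import pandas as pd

visits = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York', 'Europe/London', ''],
    'os': ['Windows', 'Not Windows', 'Windows', 'Windows'],
})

# size() counts the rows in each (tz, os) group; unstack() pivots the inner
# level into columns, and fillna(0) fills combinations that never occurred.
table = visits.groupby(['tz', 'os']).size().unstack().fillna(0)
print(table)
```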
Finally, let’s select the top overall time zones. To do so, I construct an indirect index
array from the row counts in agg_counts:
# Use to sort in ascending order
In [314]: indexer = agg_counts.sum(1).argsort()
In [315]: indexer[:10]
Out[315]:
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55
I then use take to select the rows in that order, then slice off the last 10 rows:
In [316]: count_subset = agg_counts.take(indexer)[-10:]
In [317]: count_subset
Out[317]:
a Not Windows Windows
tz
America/Sao_Paulo 13 20
Europe/Madrid 16 19
Pacific/Honolulu 0 36
Asia/Tokyo 2 35
Europe/London 43 31
America/Denver 132 59
America/Los_Angeles 130 252
America/Chicago 115 285
245 276
America/New_York 339 912
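The indirect sort can be checked on a three-row table. This is a sketch with made-up counts; it converts the row totals to a NumPy array before calling argsort so that the result is a plain array of positions, which is an assumption about how one might write this against a recent pandas release.

```python
import pandas as pd

table = pd.DataFrame(
    {'Not Windows': [1, 10, 3], 'Windows': [2, 20, 1]},
    index=['Europe/Madrid', 'America/New_York', 'Asia/Tokyo'])

# Row totals, then the positions that would sort them in ascending order.
totals = table.sum(axis=1)
order = totals.to_numpy().argsort()

# take() selects rows by position; the last rows are the largest groups.
top2 = table.take(order)[-2:]
print(top2)
```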
This can then be plotted as a bar plot in the same way as the time zone counts earlier;
I’ll make it a stacked bar plot by passing stacked=True (see Figure 2-2):
In [319]: count_subset.plot(kind='barh', stacked=True)
The plot doesn’t make it easy to see the relative percentage of Windows users in the
smaller groups, but the rows can easily be normalized to sum to 1 then plotted again
(see Figure 2-3):
In [321]: normed_subset = count_subset.div(count_subset.sum(1), axis=0)
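To see what the row normalization does, here is a small sketch on made-up counts, assuming pandas is imported as pd. The axis=0 argument tells div to match the row totals against the row labels rather than the column labels.

```python
import pandas as pd

table = pd.DataFrame(
    {'Not Windows': [1.0, 3.0], 'Windows': [3.0, 1.0]},
    index=['Asia/Tokyo', 'Europe/London'])

# Divide each row by its own total; axis=0 aligns the totals with the rows.
normed = table.div(table.sum(axis=1), axis=0)
print(normed)
```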
Figure 2-2. Top time zones by Windows and non-Windows users
Figure 2-3. Percentage Windows and non-Windows users in top-occurring time zones
All of the methods employed here will be examined in great detail throughout the rest
of the book.
early 2000s. The data provide movie ratings, movie metadata (genres and year), and
demographic data about the users (age, zip code, gender, and occupation). Such data
is often of interest in the development of recommendation systems based on machine
learning algorithms. While I will not be exploring machine learning techniques in great
detail in this book, I will show you how to slice and dice data sets like these into the
exact form you need.
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on
4000 movies. It’s spread across 3 tables: ratings, user information, and movie infor-
mation. After extracting the data from the zip file, each table can be loaded into a pandas
DataFrame object using pandas.read_table:
import pandas as pd
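The loading statements themselves are not shown above. As a sketch only, a "::"-delimited table with no header row could be read along these lines. The column names follow the users table displayed below, while the two sample rows, the dtype choice, the use of read_csv with an explicit separator, and the name sample_users are assumptions made for illustration.

```python
import io
import pandas as pd

# Two made-up rows in a "::"-delimited layout, standing in for a real file.
raw = io.StringIO("1::F::1::10::48067\n4::M::45::7::02460\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator is handled by the slower Python parsing engine.
# Reading zip as a string keeps leading zeros such as '02460'.
sample_users = pd.read_csv(raw, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})
print(sample_users)
```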
You can verify that everything succeeded by looking at the first few rows of each Da-
taFrame with Python's slice syntax:
In [334]: users[:5]
Out[334]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
In [335]: ratings[:5]
Out[335]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
In [336]: movies[:5]
Out[336]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
In [337]: ratings
Out[337]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
dtypes: int64(4)
Note that ages and occupations are coded as integers indicating groups described in
the data set’s README file. Analyzing the data spread across three tables is not a simple
task; for example, suppose you wanted to compute mean ratings for a particular movie
by sex and age. As you will see, this is much easier to do with all of the data merged
together into a single table. Using pandas’s merge function, we first merge ratings with
users, then merge that result with the movies data. pandas infers which columns to
use as the merge (or join) keys based on overlapping names:
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)
In [340]: data.ix[0]
Out[340]:
user_id 1
movie_id 1
rating 5
timestamp 978824268
gender F
age 1
occupation 10
zip 48067
title Toy Story (1995)
genres Animation|Children's|Comedy
Name: 0
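If relying on inferred keys feels too implicit, the join columns can be named with the on argument. The sketch below uses three tiny made-up tables. It also uses .iloc for positional row access, since the .ix indexer shown above is no longer available in current pandas releases.

```python
import pandas as pd

ratings = pd.DataFrame({'user_id': [1, 1, 2], 'movie_id': [10, 20, 10],
                        'rating': [5, 3, 4]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Toy Story (1995)', 'Jumanji (1995)']})

# Spell out the join keys instead of relying on overlapping column names.
merged = pd.merge(pd.merge(ratings, users, on='user_id'), movies, on='movie_id')

# Positional access to the first row of the merged table.
print(merged.iloc[0])
```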
In this form, aggregating the ratings grouped by one or more user or movie attributes
is straightforward once you build some familiarity with pandas. To get mean movie
ratings for each film grouped by gender, we can use the pivot_table method:
In [341]: mean_ratings = data.pivot_table('rating', rows='title',
.....: cols='gender', aggfunc='mean')
In [342]: mean_ratings[:5]
Out[342]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
This produced another DataFrame containing mean ratings with movie titles as row
labels and gender as column labels. First, I’m going to filter down to movies that re-
ceived at least 250 ratings (a completely arbitrary number); to do this, I group the data
by title and use size() to get a Series of group sizes for each title:
In [343]: ratings_by_title = data.groupby('title').size()
In [344]: ratings_by_title[:10]
Out[344]:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
In [345]: active_titles = ratings_by_title.index[ratings_by_title >= 250]
In [346]: active_titles
Out[346]:
Index(['burbs, The (1989), 10 Things I Hate About You (1999),
101 Dalmatians (1961), ..., Young Sherlock Holmes (1985),
Zero Effect (1998), eXistenZ (1999)], dtype=object)
The index of titles receiving at least 250 ratings can then be used to select rows from
mean_ratings above:
In [347]: mean_ratings = mean_ratings.ix[active_titles]
In [348]: mean_ratings
Out[348]:
<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
Data columns:
F 1216 non-null values
M 1216 non-null values
dtypes: float64(2)
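The pivot, count, and filter steps can be reproduced end to end on made-up data. This sketch is written against a recent pandas release, where the pivot_table keywords are index and columns rather than rows and cols, and where .loc takes the place of .ix for label-based selection. The threshold of 2 and the titles are illustrative only.

```python
import pandas as pd

# Made-up ratings; titles and values are for illustration only.
data = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [4, 2, 4, 5, 3, 5],
})

# Mean rating per title, with one column per gender.
mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')

# Count ratings per title and keep titles meeting a (made-up) threshold of 2.
ratings_by_title = data.groupby('title').size()
active_titles = ratings_by_title[ratings_by_title >= 2].index

# Select only the rows whose labels are in active_titles.
active_means = mean_ratings.loc[active_titles]
print(active_means)
```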
To see the top films among female viewers, we can sort by the F column in descending
order:
In [350]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
In [351]: top_female_ratings[:10]
Out[351]:
gender F M
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
Shawshank Redemption, The (1994) 4.539075 4.560625
Grand Day Out, A (1992) 4.537879 4.293255
To Kill a Mockingbird (1962) 4.536667 4.372611
Creature Comforts (1990) 4.513889 4.272277
Usual Suspects, The (1995) 4.513317 4.518248
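Suppose you wanted to find the movies that are most divisive between male and female
viewers. One way is to add a column to mean_ratings containing the difference in means
(M minus F, as the output below shows):
In [352]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
Sorting by 'diff' gives us the movies with the greatest rating difference and which were
preferred by women: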
In [353]: sorted_by_diff = mean_ratings.sort_index(by='diff')
In [354]: sorted_by_diff[:15]
Out[354]:
gender F M diff
Dirty Dancing (1987) 3.790378 2.959596 -0.830782
Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
Grease (1978) 3.975265 3.367041 -0.608224
Little Women (1994) 3.870588 3.321739 -0.548849
Steel Magnolias (1989) 3.901734 3.365957 -0.535777
Anastasia (1997) 3.800000 3.281609 -0.518391
Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885
Color Purple, The (1985) 4.158192 3.659341 -0.498851
Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
Free Willy (1993) 2.921348 2.438776 -0.482573
French Kiss (1995) 3.535714 3.056962 -0.478752
Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312
Guys and Dolls (1955) 4.051724 3.583333 -0.468391
Mary Poppins (1964) 4.197740 3.730594 -0.467147
Patch Adams (1998) 3.473282 3.008746 -0.464536
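In recent pandas releases, sorting a DataFrame by a column is done with sort_values rather than sort_index(by=...). A small sketch on made-up means (the film names and numbers are invented):

```python
import pandas as pd

mean_ratings = pd.DataFrame({'F': [4.0, 3.0, 2.5], 'M': [3.0, 3.5, 2.5]},
                            index=['Film A', 'Film B', 'Film C'])

# Positive values mean men rated the film higher; negative, women did.
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

# Ascending sort puts the films preferred by women first.
by_diff = mean_ratings.sort_values(by='diff')
print(by_diff)
```

Slicing the result with [::-1] reverses the row order in the same way as in the session that follows.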
Reversing the order of the rows and again slicing off the top 15 rows, we get the movies
preferred by men that women didn’t rate as highly:
# Reverse order of rows, take first 15 rows
In [355]: sorted_by_diff[::-1][:15]
Out[355]:
gender F M diff
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985
Hidden, The (1987) 3.137931 3.745098 0.607167
Rocky III (1982) 2.361702 2.943503 0.581801
Caddyshack (1980) 3.396135 3.969737 0.573602
For a Few Dollars More (1965) 3.409091 3.953795 0.544704
Porky's (1981) 2.296875 2.836364 0.539489
Animal House (1978) 3.628906 4.167192 0.538286
Exorcist, The (1973) 3.537634 4.067239 0.529605
Fright Night (1985) 2.973684 3.500000 0.526316
Barb Wire (1996) 1.585366 2.100386 0.515020
Suppose instead you wanted the movies that elicited the most disagreement among
viewers, independent of gender. Disagreement can be measured by the variance or
standard deviation of the ratings:
# Standard deviation of rating grouped by title
In [356]: rating_std_by_title = data.groupby('title')['rating'].std()
You may have noticed that movie genres are given as a pipe-separated (|) string. If you
wanted to do some analysis by genre, more work would be required to transform the
genre information into a more usable form. I will revisit this data later in the book to
illustrate such a transformation.
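As a taste of what working with the genre strings involves, here is a plain-Python sketch that splits them on the pipe character and tallies individual genres. It is not the transformation the book returns to later, only an illustration; the three strings are copied from the movies table shown earlier.

```python
from collections import Counter

# Genre strings copied from the movies table shown earlier.
genre_strings = [
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
]

# Split each string on the pipe and tally the individual genres.
genre_counts = Counter(
    genre for genres in genre_strings for genre in genres.split('|'))
print(genre_counts.most_common(2))
```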
US Baby Names 1880-2010
The United States Social Security Administration (SSA) has made available data on the
frequency of baby names from 1880 through the present. Hadley Wickham, an author
of several popular R packages, has often made use of this data set in illustrating data
manipulation in R.
In [4]: names.head(10)
Out[4]:
name sex births year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
5 Margaret F 1578 1880
6 Ida F 1472 1880
7 Alice F 1414 1880
8 Bertha F 1320 1880
9 Sarah F 1288 1880
There are many things you might want to do with the data set:
• Visualize the proportion of babies given a particular name (your own, or another
name) over time.
• Determine the relative rank of a name.
• Determine the most popular names in each year or the names with largest increases
or decreases.
• Analyze trends in names: vowels, consonants, length, overall diversity, changes in
spelling, first and last letters
• Analyze external sources of trends: biblical names, celebrities, demographic
changes
Using the tools we’ve looked at so far, most of these kinds of analyses are very straight-
forward, so I will walk you through many of them. I encourage you to download and
explore the data yourself. If you find an interesting pattern in the data, I would love to
hear about it.
As of this writing, the US Social Security Administration makes available data files, one
per year, containing the total number of births for each sex/name combination. The
raw archive of these files can be obtained here:
http://www.ssa.gov/oact/babynames/limits.html
In the event that this page has been moved by the time you’re reading this, it can most
likely be located again by Internet search. After downloading the “National data” file
names.zip and unzipping it, you will have a directory containing a series of files like
yob1880.txt. I use the UNIX head command to look at the first 10 lines of one of the
files (on Windows, you can use the more command or open it in a text editor):
In [367]: !head -n 10 names/yob1880.txt
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
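Because the file is comma-separated with no header row, pandas.read_csv can load it once column names are supplied. The sketch below reads the first three lines shown above from an in-memory string rather than from names/yob1880.txt; sample_names is a hypothetical name.

```python
import io
import pandas as pd

raw = io.StringIO("Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n")

# names= supplies the column labels; with it, the first line is read as data.
sample_names = pd.read_csv(raw, names=['name', 'sex', 'births'])
print(sample_names.groupby('sex')['births'].sum())
```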
In [370]: names1880
Out[370]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns:
name 2000 non-null values
sex 2000 non-null values
births 2000 non-null values
dtypes: int64(1), object(2)
These files only contain names with at least 5 occurrences in each year, so for simplic-
ity’s sake we can use the sum of the births column by sex as the total number of births
in that year:
In [371]: names1880.groupby('sex').births.sum()
Out[371]:
sex
F 90993
M 110493
Name: births
Since the data set is split into files by year, one of the first things to do is to assemble
all of the data into a single DataFrame and further to add a year field. This is easy to
do using pandas.concat:
# 2010 is the last available year right now
years = range(1880, 2011)
pieces = []
columns = ['name', 'sex', 'births']
for year in years:
    path = 'names/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces, ignore_index=True)
There are a couple of things to note here. First, remember that concat glues the DataFrame
objects together row-wise by default. Secondly, you have to pass ignore_index=True
because we’re not interested in preserving the original row numbers returned from
read_csv. So we now have a very large DataFrame containing all of the names data:
In [373]: names
Out[373]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1690784 entries, 0 to 1690783
Data columns:
name 1690784 non-null values
sex 1690784 non-null values
births 1690784 non-null values
year 1690784 non-null values
dtypes: int64(2), object(2)
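The effect of ignore_index is easy to see with two one-row frames. This sketch uses made-up birth counts and hypothetical names (frame_a, frame_b, kept, renumbered):

```python
import pandas as pd

columns = ['name', 'sex', 'births']
# Made-up per-year frames standing in for the yearly files.
frame_a = pd.DataFrame([['Mary', 'F', 10]], columns=columns)
frame_b = pd.DataFrame([['Mary', 'F', 20]], columns=columns)

pieces = []
for year, frame in [(1880, frame_a), (1881, frame_b)]:
    frame = frame.copy()
    frame['year'] = year
    pieces.append(frame)

kept = pd.concat(pieces)                           # index labels repeat
renumbered = pd.concat(pieces, ignore_index=True)  # fresh labels from 0
print(list(kept.index), list(renumbered.index))
```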
With this data in hand, we can already start aggregating the data at the year and sex
level using groupby or pivot_table (see Figure 2-4):
In [374]: total_births = names.pivot_table('births', rows='year',
.....: cols='sex', aggfunc=sum)
In [375]: total_births.tail()
Out[375]:
sex F M
year
2006 1896468 2050234
2007 1916888 2069242
2008 1883645 2032310
2009 1827643 1973359
2010 1759010 1898382
Next, let’s insert a column prop with the fraction of babies given each name relative to
the total number of births. A prop value of 0.02 would indicate that 2 out of every 100
babies were given a particular name. Thus, we group the data by year and sex, then add
the new column to each group:
def add_prop(group):
    # Integer division floors
    births = group.births.astype(float)
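The function above is cut off before the proportion is computed. As one possible way to obtain the same kind of column, not necessarily the original approach, a grouped transform can divide each row's births by the total for its year and sex group. The data below is made up, and group_totals is a hypothetical name.

```python
import pandas as pd

names = pd.DataFrame({
    'name': ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex': ['F', 'F', 'M', 'F', 'M'],
    'births': [30, 10, 50, 20, 40],
    'year': [1880, 1880, 1880, 1881, 1881],
})

# Total births within each (year, sex) group, broadcast back to every row.
group_totals = names.groupby(['year', 'sex'])['births'].transform('sum')

# Under Python 3 this is true division, so the result is a float column.
names['prop'] = names['births'] / group_totals
print(names)
```

A useful sanity check for a column like this is that prop sums to 1 within every year and sex group.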
Other documents randomly have
different content
’Midst light thou dost create the baffling gloom;
And feelst it when created by thyself.
Yet then thou ne’er canst feel thyself create.
Thou wouldst forget thy longing to create,
Which reigns unconsciously within thy soul.
Because thou art afraid to ray out light.
Thou wouldst enjoy this light that is thine own.
Thou wouldst enjoy therein thyself alone.
Thou seekst thyself, and seekest to forget.
Thou let’st thyself sink dreaming in thyself.
Ahriman:
Aye, list to her; thy riddles she can solve
But her solution solves them not for thee.
She gives thee wisdom—so that with its aid
Thou canst direct thy steps to foolishness.
Wisdom were good for thee—at other times,
When on thee spirit-day doth brightly shine.
But when Maria speaks thus in thy dreams
She slays thy riddle’s answer by her words.
Aye, list to her.
Strader:
What mean such words as these?
Maria, are they born from out the light?
From out my light? Or is my darkness that
From which they sound? O Benedictus, speak;
Who brought me counsel from the dark abyss?
Benedictus:
At thine abyss’s edge she sought thee out.
Thus spirits seek out men to shelter them,
From those who fashion phantoms for men’s souls
And so conceal the cosmic-spirit’s sway
With mazy darkness, that they only know
Themselves in truth in their own being’s net.
Look further yet within thy dark abyss.
Strader:
What now lives in the depths of mine abyss?
Benedictus:
Gaze on these shades; upon the right, blue-red
Enticing Felix—and the others see—
There on the left—where red with yellow blends;
Who are intent to reach Capesius.
They both do feel the might of these same shades;—
And each in loneliness creates the light
Which foils the shades who would deceive men’s souls.
Ahriman:
He would do better did he show to thee
Thy shades—yet this thing could he scarcely do;—
He hath the best intentions certainly.
He only sees not where to seek those shades.
They stand behind thee, critically near,—
Yet thou thyself dost hide them now from him.
Strader:
So now I hear in mine abyss these words
Which once I thought the prating of a fool,
When Hilary’s adviser uttered them.…
Maria:
Sire Felix tempers for himself the blade
That rids him of his danger; one who treads
The path thy soul takes needs another kind.
The sword Capesius doth fashion here,
And bravely wields in battle with his foes,
Would be for Strader but a shadow sword
Should he commence therewith the spirit-fight
Which powers of destiny ordain for souls
Who must change spirit-being, ripe for deeds
With mighty power, to earth activity.
Thou canst not use their weapons in thy fight;
Yet thou must know them, so that thou mayst forge
Thine own from out soul-substance thoughtfully.
Felix Balde:
Dear Strader, even now the spirit drove
Thee far from us—thus it appeared to me.
(He pauses a while in the expectation that Strader will say something, but
since the latter remains silent Felix continues.)
Dame Balde:
I surely have been silent long enough.
But speak I will, if thou art going to cast
Thy mystic mood upon my fairy sprites.
They would indeed enjoy to have their power
Drawn out of them, that they might be brought up
And suckled fresh with mysticism’s milk.
I honour mysticism; but I fain
Would keep it distant from my fairy realms.
Capesius:
Felicia, was it not thy fairy-tales
That set my feet first on the spirit-path?
Those stories of the air and water-sprites,
Called up so oft before my thirsting soul,
Were messengers to me from yonder world
Whereto I now the mystic entrance seek.
Dame Balde:
But since thou cam’st with this new mystic art
Into our house thou hast but seldom asked
What my fair magic beings are about.
More often thou hast only thought of worth
What wears a solemn air of dignity;
While those who caper out of sheer delight
Are uncongenial to thy mystic ways.
Capesius:
I do not doubt, Felicia, that I
Shall one day comprehend the meaning hid
Deep in the being of those wondrous elves
Who show their wisdom through a merry mask.
Yet now my power hath not advanced so far.
Felix Balde:
Felicia, thou knowest how I love
Those fairy beings who do visit thee;
But to conceive them as mechanical
Embodied dolls—this goes against the grain.
Dame Balde:
As yet I have not brought them to thee thus;
Thy fancy flies—too high; but I was glad
When Strader’s plan was told me, and, I heard,
Thomasius also strives to represent
The spirit cased in matter visible.
I saw in spirit dancing merrily
My fairy princes and my souls of fire
In thousand doll-games, beautified by art;
And there I left them, happy in the thought,
To find their own way to the nurseries.
Curtain
Scene 4
The Same.
Manager:
Thou know’st the mystic friends of Hilary,
And I perceive in thee a clever man
With power to give at all times judgment sure
Both in life’s work and in the mystic arts:
And so I value thy considered thought.
But how shall I make sense of what thou sayst?
That Strader’s friends should stay in spirit-realms
And not as yet use their clairvoyant powers
Upon the fashioning of things of sense
Seems right to thee. But will the selfsame path
For Strader not be just as dangerous?
His spirit methods seem to prove to me
That nature-spirits always blind his eyes
As soon as strong desire for personal deeds
Drives him to seek some outer work in life.
Within oneself, as all true mystics know,
Those forces must develop in their strength
In order to oppose these enemies;
But Strader’s sight, it seems, is not yet ripe
To see such foes upon his spirit-path.
Romanus:
Yet those good spirits who conduct such men,
As stand outside the spirit-realms entire,
Have not yet left his side, but guide his steps.
These spirits ever pass those mystics by
Who make a pact with beings to secure
Their service for their personal spirit mood.
In Strader’s methods I can plainly feel
How nature-spirits still give to his self
The fruits of their benign activity.
Manager:
So ’tis by feeling only thou art led
To think good spirits work in Strader’s case;
Thou off’rest little and dost ask full much.
These are the spirits I must henceforth ask
If I continue active in this place
Where for so long I have been privileged
To serve the work-plans and that spirit true
Which Hilary’s own father ever loved;
And which I still hear speaking from his grave,
E’en if his son hath no more ears for it.
What saith this spirit of that brave strong man
When he perceives these crazy spirits now
Which his son tries to bring within his house?
I know that spirit who for ninety years
Lived in his body. He it was who taught
To me the truest secrets of my work
In those old days when he could work himself,
The while his son crept off to mystic fanes.
Romanus:
My friend, canst thou indeed be unaware
How highly this same spirit I revere?
His servant certainly was that old man
Whom for a pattern thou didst rightly choose.
And I myself have striv’n to serve him too
From childhood’s days up to the present time.
But I too crept away to mystic fanes.
I planted truly deep within my soul
What they were willing to bestow on me.
But reason swept aside the temple mood
When at the door it entered into life.
I knew that in this way I best could bring
This mood’s strong forces into earthly life.
From out the temple none the less I brought
My soul into my work. And it is well
That soul by reason should not be disturbed.
Manager:
And dost thou find that Strader’s spirit-way
Is even distantly akin to thine?
I find myself at thy side ever free
From spirit-beings Strader brings to me.
I clearly feel, e’en in his random speech,
How elemental spirits, quick with life,
By word and nature pour themselves through him
Revealing things the senses cannot grasp.
It is just this that keeps me off from him.
Romanus:
This speech, my friend, doth strike me to the heart.
Since I drew nigh to Strader I have felt
Those very thoughts which come to me through him
To be endowed with quite peculiar power.
They cleft me just as if they were mine own.
And one day I reflected: What if I
Owe to his soul not to myself the power
Which let me ripen to maturity!
Hard on this feeling came a second one;
What if for all that makes me of some use
In life and work and service for mankind
I am indebted to some past earth-life?
Manager:
I feel precisely thus about him too.
When one draws near to him, the spirit which
Doth work through him moves powerfully one’s soul.
And if thy strong soul must succumb to him,
How shall I manage to protect mine own
If I unite with him in this his work?
Romanus:
It will depend on thee alone to find
The right relation ’twixt thyself and him.
I think that Strader’s power will not harm me
Since in my thought I have conceived a way
In which he may have made that power his own.
Manager:
Have made—his own—such power—and over thee—
A dreamer—over thee—the man of deeds!
Romanus:
If one might dare to make a guess that now
Some spirit lives its life in Strader’s frame
Who in some earlier earth-life had attained
To most unusual altitude of soul;
Who knew much which the men of his own time
Were still too undeveloped to conceive.
Then it were possible that in those days
Thoughts in his spirit did originate
Which by degrees could make their way to earth
And mingle in the common life of men;
And that from this source people like myself
Have drawn their capability for work—
The thoughts which in my youth I seized upon,
And which I found in my environment,
Might well have been this spirit’s progeny!
Manager:
And dost thou think it justifiable
To trace back thoughts to Strader and none else
That hold a value for mankind’s whole life?
Romanus:
I were a dreamer if I acted thus.
I spin no dreams about mankind’s whole life
With eyes fast closed. I ne’er had use for thoughts
That show themselves and forthwith fade away.
I look at Strader with wide-open eyes;
And see what this man’s nature proves to be,
What qualities he hath and how he acts,
And that wherein he fails;—and then I know
I have no option left me but to judge
Of his endowments as I have just done.
As if this man had stood before mine eyes
Already many hundred years ago,
So do I feel him in my spirit now.
And that I am awake—I know full well.
I shall lend my support to Hilary;
For that which must will surely come to pass.
So think his project over once again.
Manager:
It will be of more benefit to me
If I think over that which thou hast said.
Johannes:
I was astonished when Capesius
Made known to me how my soul’s inner self
Revealed itself unto his spirit’s eye.
I could so utterly forget a fact
Which years ago was clear as day to me:—
That all that lives within the human soul
Works further in the outer spirit-realms;
Long have I known it, yet I could forget.
When Benedictus was directing me
To my first spirit-vision, I beheld
Capesius and Strader by this means,
Clear as a picture, in another age.
I saw the potent pictures of their thoughts
Send circling ripples through the world’s expanse.
Well do I know all this—and knew it not
When I beheld it through Capesius.
The part of me which knows was not awake;
That in an earth-life of the distant past
Capesius and I were closely knit:
That also for a long time have I known,—
Yet at that instant I did know it not.
How can I keep my knowledge all the time?
Johannes:
‘And clairvoyant dreams
Make clear unto souls
The magical web
That forms their own life.’
The Double:
Johannes, thine awakening is but false
Until thou shalt thyself set free the shade
Whom thine offence doth lend a magic life.
Johannes:
This is the second time thou speakest thus.
I will obey thee. Point me out the way.
The Double:
Johannes, give life in the shadow-realm
To what is lost to thee in thine own self.
From out thy spirit’s light pour light on him
So that he will not have to suffer pain.
Johannes:
The shadow-being in me I have stunned
But not o’erthrown: wherefore he must remain
A shade enchanted amongst the other shades
Till I can re-unite myself with him.
The Double:
Then give to me that which thou owest him:
The power of love, that drives thee forth to him,
The heart’s hope, that was first begot by him,
The fresh life, that lies hidden deep in him,
The fruits of earth-lives in the distant past,
Which with his being now are lost to thee;
Oh, give them me; I’ll bring them safe to him.
Johannes:
Thou knowest the way to him?—Oh, show it me.
The Double:
I could get to him in the shadow-realm
When thou didst raise thyself to spirit-spheres;
But since, desire-powers tempting thee, thou didst
Avert thy mind to follow after him,
When now I seek him my strength ever fails.
But if thou wilt abide by my advice
My strength can then create itself anew.
Johannes:
I vowed to thee that I would follow thee—
And now, O spirit-counsellor, again
With all my soul’s strength I renew that vow.
But if thou canst thus find the way to him,
Then show it to me in this hour of fate.
The Double:
I find it now but cannot lead the way.
I can alone show to thine inward eye
The being whom thy longing now doth seek.
Johannes:
That spirit-counsellor—mine other self?
The Double:
Now follow me—thou hast so vowed to me—
For I must now conduct thee to my lord.
(The Guardian of the Threshold appears and stands beside the Double.)
The Guardian:
Johannes, wouldst thou tear this shade away
From those enchanted regions of the soul,
Then slay desire, which leads thee aye astray.
The trace which thou dost follow disappears
So long as thou dost seek it with desire.
It leads thee to my threshold and beyond.
But here, obeying lofty Being’s will,
I do confuse the inward sight of those
Within whose spirit-glance lives vain desire;
All these must meet me ere they are allowed
To penetrate to Truth’s pure radiant light.
I hold thyself fast prisoned in thy sight
So long as thou approachest with desire.
Myself too as illusion dost thou see,
So long as vain desire is joined with sight
And spirit-peacefulness of soul hath not
Become as yet thy being’s vehicle.
Make strong those words of power which thou dost know,
Their spirit-power will conquer fantasy.
Then recognise me, free from all desire,
And thou shalt see me as I really am.
And then I need no longer hinder thee
From gazing freely on the spirit-realm.
Johannes:
But as illusion dost thou too appear?
Thou too … whom I must ever see the first,
Of all the beings in the spirit-land.
How shall I know the truth when I must find
One truth alone confront mine onward steps—
That ever denser grows illusion’s veil.
Ahriman:
Let not thyself be quite confused by him.
He guards the threshold faithfully indeed
E’en if today thou see’st him wear the clothes
Which for thyself thou didst patch up before
Within thy spirit from old odds and ends.
And least of all shouldst thou behold in him
An actor in a poor dramatic show.
But thou wilt make it better later on.
Yet e’en this clownish form can serve thy soul.
It doth not have to spend much energy
In showing thee that which it now still is.
Pay close attention to the Guardian’s speech:
Its tone is mournful and its pathos marked,
Allow not this: for then he will disclose
From whom today he borrows to excess.
Johannes:
Then e’en the content of his speech deceives?
The Double:
Ask not of Ahriman, since he doth find
In contradictions aye his chief delight.
Johannes:
Of whom then shall I ask?
The Double:
Why, ask thyself.
With my power will I fortify thee well
So that awake thou mayst find the place
Whence thou canst gaze untrammelled by desire.
Increase thy power.
Johannes:
‘The magical web
That forms their own life.’
O magical web that forms mine own life
Make known to me where desire doth not burn.
Maria:
Myself too as illusion dost thou see
Since vain desire is still allied with sight.
Benedictus:
And spirit-peacefulness of soul hath not
Become as yet thy being’s vehicle.
Johannes:
Maria, Benedictus,—Guardians!
How can they as the Guardian come to me?
’Tis true I have spent many years with thee
And this forbids me now to seek thine aid—
The magical web that forms mine own self.
(Exit, right.)
Strader:
Thou gav’st, when joined in spirit unto me
Before the dark abyss of mine own self,
Wise counsel to direct mine inward sight,
Which at that time I could not understand,
But which will work such changes in my soul
As certainly will solve life’s problems, when
They seek to hinder what I strive to do.
I feel in me the power which thou dost give
To thy disciples on the spirit-path.
And so I shall be able to perform
The service thou dost ask for in this work
That Hilary to mankind will devote;
We shall, however, lack Capesius.
Whatever strength the rest bring to the work
Will not replace his keen activity;
But that which must will surely come to pass.
Benedictus:
Yea, that which must will surely come to pass.
This phrase expresseth thine own stage of growth.
But it awakes no answering response
In souls of all our other spirit-friends.
Thomasius is not as yet prepared
To carry spirit-power to worlds of sense,
So he too will withdraw from this same work.
Through him doth destiny give us a sign
That we must all now seek another plan.
Strader:
Will not Maria and thyself be there?
Benedictus:
Maria must Johannes take with her
If she would ever find in truth the road,
Which leads from spirit to the world of sense.
Thus wills the Guardian who with earnest eye
Unceasing guards the borders of both realms.
She cannot lend her aid to thee as yet.
And this may serve thee as a certain sign
That thou canst not at this time truly find
The way into the realm of earthly things.
Strader:
So I and all my aims are left alone!
O loneliness, didst thou then seek me out
When I did stand at Felix Balde’s side?
Benedictus:
The thing which hath just happened in our group
Hath taught me, as I look on thy career,
To read a certain word in spirit-light
Which hitherto hath hid itself from me.
I saw that thou wast bound to certain kinds
Of beings, who, if they should take a part
Creatively in mankind’s life today,
Would surely work for evil; now they live
As germs in certain souls, and will grow ripe
In future days to work upon the earth.
Such germs have I seen living in thy soul.
That thou dost know them not is for thy good.
Through thee they will first learn to know themselves.
But now the road is still close barred for them
Which leads into the realm of earthly things.
Strader:
Whatever else thy words may say to me,
They show me that my lot is loneliness.
And this it is must truly forge my sword.
Maria told me this at mine abyss.
(Benedictus and Maria retire a little way; Strader remains alone; the soul
of Theodora appears.)
Theodora’s Soul:
And Theodora in the worlds of light
Will make warmth for thee that thy spirit-sword
May keenly smite the foes of thine own soul.
Maria:
My learned teacher, ne’er yet did I hear
Thee tell disciples, who had reached the stage
Of Strader, in such tones the words of fate.
Will his soul run its course so speedily
That these words’ power will prove of use to him?
Benedictus:
Fate gave the order, and it was fulfilled.
Maria:
And if the power should prove no use to him,
Will not its evils also fall on thee?
Benedictus:
’Twill not be evil; yet I do not know
In what way it will manifest in him.
My gaze at present penetrates to realms
Where such advice illuminates my soul;
But I see not the scene of its result.
And if I try to see, my vision dies.
Maria:
Thy vision dies,—my guide and leader, thine?—
Who stays for thee thy seership’s certain gaze?
Benedictus:
Johannes flees therewith to cosmic space;
We must pursue;—for I can hear him call.
Maria:
He calls,—from spirit-space his call rings out;
There sounds within his tone a distant fear.
Benedictus:
So from the ever empty fields of ice
Our mystic friend’s call sounds in cosmic space.
Maria:
The ice’s cold is burning in my self,
And kindling tongues of flame in my soul-depths;
The flames are scorching all my power of thought.
Benedictus:
In thy soul-depths the fire doth blaze, which now
Johannes kindles in the cosmic frost.
Maria:
The flames fly off,—they fly off with my thought.
And there on distant cosmic shore of souls
A furious fight—my power of thought doth fight—
In stormy chaos—and cold spirit-light—
My thought-power reels;—the cold light—hammers out
Hot waves of darkness from my failing thought.
What now emergeth from this darkling heat?
Clad in red flames my self storms—to the light;—
To the cold light—of cosmic fields of ice.
Curtain
Scene 5
The Spirit Realm. The scene is set in floods of significant colour, reddish
deepening into fiery red above, blue merging into dark blue and violet
below. In the lower part there is an earth-globe which has the effect of
being a symbol. The figures that appear seem to blend into a complete
whole with the colours. On the left of the stage the group of gnomes as in
Scene 2, in front of them Hilary, and in the immediate foreground the soul-
forces.
Felix Balde’s Soul: (Seated at the extreme right of stage, having the form of
a penitent, but arrayed in a light violet robe girdled with gold.)
Felix Balde’s Soul: (gazing at the group of gnomes. From this moment, the
gnomes becoming conscious, keep swaying up and down, slightly raising
and lowering themselves, as if the group was breathing from above.)
Ahriman:
Thy speech is good. Swift will I seize thy words
That I may keep them for myself unharmed.
Thou canst not yet develop them thyself.
But on the earth they would fill thee with hate.
Strader’s Soul: (Toward the left of stage; only his head is visible; it is in a
yellowish-green aura with red and orange stars. At this moment on
Strader’s immediate left appears the Soul of Capesius. Similarly only his
head is to be seen. It is in a blue aura with red and yellow stars.)
The Other Philia: (Arrayed like a copy of Lucifer, though the radiance is
lacking. Instead of the sword she has a sort of dagger, and in place of the
planet a red ball like a fruit.)
Philia: (Figure like an angel, yellow merging into a sort of white, with
wings of a bright violet, a lighter shade than Maria has later on.—All three
soul-figures are near Strader’s soul and stand in the centre of the stage.)
Astrid: (Figure like an angel, robed in bright violet, with blue wings.)
Luna: (Figure like an angel, robe of blue and red, with orange wings.)
Strader’s Soul:
The three were speaking to me sunshine’s words,
They work for me where I can see them work.
Full many figures are they fashioning;
I feel an impulse by soul-power to change
Them with design, and make them one with me.
Awake in me, O royal solar power
That by resistance I may dim thy might;
Desire brought from moon ages moves me thus.
A golden glow now stirs, I feel its warmth,
And silver sheen, forth-spraying though yet cold;
Awake, Mercurial longing, once again
And wed my severed cosmic self to me.
Well do I feel that once again a part
Is formed from out that picture, which I here
From cosmic spirit forces must create.
Capesius’ Soul:
On that far shore of souls I see emerge
A picture that ne’er touched my being yet
Since I escaped the clutch of earthly life.
It rays out grace and soothes with soft appeal.
The warming glow of wisdom streams therefrom,
And clarifying light gives to my soul.
Could I but make this picture one with me
I should attain what I am thirsting for.
Yet know I not the power which could avail
To make this picture active in my sphere.
Luna:
That which two earth-lives gave thee thou must feel.
One, many years ago, slid gently by
In earnest effort; later on thou hadst
One by ambition soiled; which must be fed
With strengthening grace descending from the first,
That Jupiter’s fire-souls may be revealed
Within the circle of thy spirit-sight.
Then shalt thou feel that wisdom strengthens thee.
Then will the picture, which thou see’st afar
Upon the borders of thy soul’s expanse,
Be set at liberty to come to thee.
Capesius’ Soul:
I needs must be indebted to the soul
That now prepares for being, since it shows
A warning picture in my soul’s expanse.
Astrid:
Thou art indeed; but not as yet doth it
Demand a payment in thy next earth-life.
This picture serves to give thee powers of thought
That thou as man mayst recognize the man
Who shows his earthly future to thee here.
Capesius’ Soul:
I feel before what I shall owe to it
When I shall will to bring it near to me,
Yet can assert that I am free therefrom.
From Philia’s domain I now behold
In picture-sequences the energy
Which I shall gather from its near approach.
Philia:
When Saturn soon his many-coloured light
Shall ray on thee, use well the favour’d hour.
Then through his power in thy soul’s vehicle
That which in spirit is akin to thee
Will plant the roots of thought, which will disclose
The meaning of the cyclic life of earth
When thou dost tread again this star thyself.
Capesius:
Thy counsel shall become my monitor
As soon as Saturn pours his light on me.
Lucifer:
One more thing will I waken in these souls;
The view of worlds whose light will cause them pain,
Ere they can leave this sun-time fortified
With powers for later life upon the earth.
Pain must through doubt mature their fruit in them,
So will I summon up those spheres of soul
Which they have not the strength to look upon.
(The souls of Benedictus and Maria appear in the middle of the region.
Benedictus as a figure reproducing in miniature the configuration of the
entire scenery. Below, his robe, becoming broader, shades into blue-green;
around his head is an aura of red, yellow and blue; the blue blends into the
blue-green of the entire robe. Maria on his right as an angelic figure;
yellow shading into gold, without feet and with bright violet wings.)
Benedictus’ Soul:
Thou dost weigh heavy on my cosmic task
With these opaque earth-laden spheres of thine.
If thou dost give thine own self further power
Then wilt thou find that in this spirit-life
Mine own sun-nature will not shine on thee.
Maria:
He was unknown to thee, when thou didst last
A robe, of earthly matter woven, wear;
Yet doth it still bear fruit in thy soul sheath—
The sunshine’s word of power, with which he fed
Thee kindly in far distant times on earth.
Search out thy nature’s deepest impulses
And thou shalt feel him near thee then with power.
Strader’s Soul:
On spirit-shores illumination works,
Yet howsoe’er I strive to understand
The sense of these light-forces, they are dumb.
Dame Balde’s Soul: (Figure of a penitent with white coif, like that of a nun;
robe yellow-orange, with silver girdle; she appears quite close to Maria; on
her right and near Felix Balde.)
Capesius’ Soul:
The starry writing! this word wakens thoughts,
And bears them on the waves of soul to me.
Thoughts which in earth-lives in the distant past
Were to my being wondrously revealed
They lighten still, yet—as they grow, they fade;
Oblivion sheds its gloomy shade around.
Scene 6
A similar scene
The same characters are still in their places. The lighting is full of warm
shades, but not too bright. Toward the right of stage the sylphs keep
swaying to and fro. In front Philia, Astrid, and Luna.
Romanus’ Soul: (A figure showing all the upper part of the body down to
the hips; it has mighty red wings which extend round its head in such a way
as to change into a red aura, running into blue on the outer edge; it stands
on the left of Capesius’ soul, whilst close are the souls of Bellicosus and
Torquatus further still to left of stage, facing audience.)
Wake in thyself
The picture of the Jew who heard naught else
But hate and ridicule on every side,
Yet truly served the mystic brotherhood
Of which thou wast a member once on earth.
Capesius’ Soul:
Thought-pictures now begin to dawn in me,
And seek to seize me in their powerful grasp.
See Simon’s image rise from my soul-waves—
And see, another joins him—some soul-shape—
A penitent;—would I might keep him far!
Romanus’ Soul:
That which he here must do can but be done
In cosmic sunshine-time; in solitude
And robed in darkness he must wend his way
Whilst Saturn doth light up this spirit-realm.
Capesius’ Soul:
How doth this penitent bewilder me!
His soul’s irradiations burn and bore
Their way into mine own soul’s inmost core—
So work these souls who have attained the power
To see the inmost depths of other souls.
Felix Balde’s Soul: (From the extreme right of stage with hollow veiled
voice.)
Capesius’ Soul:
Myself—my very words—from out his mouth
Re-echoed—ringing out—in spirit-realms!
Here is a soul that I must try to meet.
It knows me well,—through it I’ll find myself.
(Capesius’ soul disappears; the ‘other Philia’ comes into view on the right
of stage with Theodora’s soul; behind her Dame Balde’s soul.)
Romanus’ Soul:
Two souls do there draw nigh the penitent;
The spirit whom through love souls ever choose
To be their leader goes ahead of them.
The light of meekness pours from one of them
And flows into the other, who appears
To us as penitent. The picture glows
With beauty’s light, which here as wisdom lives.
Torquatus’ Soul: (Figure visible as far as the breast, blue aura, green
wings.)
Bellicosus’ Soul: (Figure visible like that of Torquatus’ soul, but with blue-
violet aura and blue-green wings.)
Theodora’s Soul: (Angelic figure; white with yellow wings and blue-yellow
aura.)
Theodora’s Soul: