NumPy: Beginner's Guide - Third Edition - Sample Chapter
In today's world of science and technology, it's all about speed and flexibility. When it comes to scientific
computing, NumPy tops the list. NumPy will give you both speed and high productivity. This book will walk
you through NumPy with clear, step-by-step examples and just the right amount of theory. The book focuses
on the fundamentals of NumPy, including array objects, functions, and matrices, each of them explained
with practical examples. You will then learn about different NumPy modules while performing mathematical
operations such as calculating the Fourier transform, finding the inverse of a matrix, and determining eigenvalues,
among many others. This book is a one-stop solution to knowing the ins and outs of the vast NumPy library,
empowering you to use its wide range of mathematical features to build efficient, high-speed programs.
Ivan Idris
Preface
Scientists, engineers, and quantitative data analysts face many challenges nowadays. Data
scientists want to be able to perform numerical analysis on large datasets with minimal
programming effort. They also want to write readable, efficient, and fast code that is as close
as possible to the mathematical language they are used to. A number of accepted solutions
are available in the scientific computing world.
The C, C++, and Fortran programming languages have their benefits, but they are not
interactive and considered too complex by many. The common commercial alternatives,
such as MATLAB, Maple, and Mathematica, provide powerful scripting languages that are
nevertheless more limited than any general-purpose programming language. Other open source
tools similar to MATLAB exist, such as R, GNU Octave, and Scilab. Obviously, they too lack
the power of a language such as Python.
Python is a popular general-purpose programming language that is widely used in the
scientific community. You can access legacy C, Fortran, or R code easily from Python. It
is object-oriented and considered to be of a higher level than C or Fortran. It allows you
to write readable and clean code with minimal fuss. However, it lacks an out-of-the-box
MATLAB equivalent. That's where NumPy comes in. This book is about NumPy and related
Python libraries, such as SciPy and matplotlib.
What is NumPy?
NumPy (short for numerical Python) is an open source Python library for scientific
computing. It lets you work with arrays and matrices in a natural way. The library contains
a long list of useful mathematical functions, including some functions for linear algebra,
Fourier transformation, and random number generation routines. LAPACK, a linear algebra
library, is used by the NumPy linear algebra module if you have it installed on your system.
Otherwise, NumPy provides its own implementation. LAPACK is a well-known library,
originally written in Fortran, on which MATLAB relies as well. In a way, NumPy replaces some
of the functionality of MATLAB and Mathematica, allowing rapid interactive prototyping.
We will not be discussing NumPy from a contributor's perspective, but from a user's
perspective. NumPy is a very active project and has a lot of contributors. Maybe,
one day you will be one of them!
History
NumPy is based on its predecessor Numeric. Numeric was first released in 1995 and has
deprecated status now. Neither Numeric nor NumPy made it into the standard Python library
for various reasons. However, you can install NumPy separately, which will be explained in
Chapter 1, NumPy Quick Start.
In 2001, a number of people inspired by Numeric created SciPy, an open source scientific
computing Python library that provides functionality similar to that of MATLAB, Maple, and
Mathematica. Around this time, people were growing increasingly unhappy with Numeric.
Numarray was created as an alternative to Numeric. That is also deprecated now. It was
better in some areas than Numeric, but worked very differently. For that reason, SciPy kept
depending on the Numeric philosophy and the Numeric array object. As is customary
with the latest and greatest software, the arrival of Numarray led to the development of
an entire ecosystem around it, with a range of useful tools.
In 2005, Travis Oliphant, an early contributor to SciPy, decided to do something about this
situation. He tried to integrate some of Numarray's features into Numeric. A complete
rewrite took place, and it culminated in the release of NumPy 1.0 in 2006. At that time,
NumPy had all the features of Numeric and Numarray, and more. Tools were available to
facilitate the upgrade from Numeric and Numarray. The upgrade is recommended since
Numeric and Numarray are not actively supported any more.
Originally, the NumPy code was a part of SciPy. It was later separated and is now used by
SciPy for array and matrix processing.
The drawback of NumPy arrays is that they are more specialized than plain lists. Outside the
context of numerical computations, NumPy arrays are less useful. The technical details of
NumPy arrays will be discussed in later chapters.
Large portions of NumPy are written in C. This makes NumPy faster than pure Python code.
A NumPy C API exists as well, and it allows further extension of functionality with the help
of the C language. The C API falls outside the scope of the book. Finally, since NumPy is open
source, you get all the related advantages. The price is the lowest possible: free, as in
beer. You don't have to worry about licenses every time somebody joins your team or you
need an upgrade of the software. The source code is available for everyone. This of course is
beneficial to code quality.
Limitations of NumPy
If you are a Java programmer, you might be interested in Jython, the Java implementation of
Python. In that case, I have bad news for you. Unfortunately, Jython runs on the Java Virtual
Machine and cannot access NumPy because NumPy's modules are mostly written in C. You
could say that Jython and Python are two totally different worlds, though they do implement
the same specifications. There are some workarounds for this discussed in NumPy Cookbook
- Second Edition, Packt Publishing, written by Ivan Idris.
Chapter 6, Moving Further with NumPy Modules, discusses a number of basic NumPy modules
and universal functions. These functions can typically be mapped to their mathematical
counterparts, such as addition, subtraction, division, and multiplication.
Chapter 7, Peeking into Special Routines, describes some of the more specialized NumPy
functions. As NumPy users, we sometimes find ourselves having special requirements.
Fortunately, NumPy satisfies most of our needs.
Chapter 8, Assuring Quality with Testing, teaches you how to write NumPy unit tests.
Chapter 9, Plotting with matplotlib, covers matplotlib in depth, a very useful Python plotting
library. NumPy cannot be used on its own to create graphs and plots. matplotlib integrates
nicely with NumPy and has plotting capabilities comparable to MATLAB.
Chapter 10, When NumPy Is Not Enough - SciPy and Beyond, covers more details about
SciPy. We know that SciPy and NumPy are historically related. SciPy, as mentioned in the
History section, is a high-level Python scientific computing framework built on top of NumPy.
It can be used in conjunction with NumPy.
Chapter 11, Playing with Pygame, is the dessert of this book. You learn how to create fun
games with NumPy and Pygame. You also get a taste of artificial intelligence in this chapter.
Appendix A, Pop Quiz Answers, has the answers to all the pop quiz questions within
the chapters.
Appendix B, Additional Online Resources, contains links to Python, mathematics, and
statistics websites.
Appendix C, NumPy Functions' References, lists some useful NumPy functions and
their descriptions.
File I/O
First, we will learn about file I/O with NumPy. Data is usually stored in files. You would not
get far if you were not able to read from and write to files.
1.
The identity matrix is a square matrix with ones on the main diagonal and zeros for
the rest (see https://www.khanacademy.org/math/precalculus/precalcmatrices/zero-identity-matrix-tutorial/v/identity-matrix).
The identity matrix can be created with the eye() function. The only argument that
we need to give the eye() function is the number of ones. So, for instance, for a
two-by-two matrix, write the following code:
i2 = np.eye(2)
print(i2)
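The output shows the two-by-two identity matrix:
[[ 1.  0.]
 [ 0.  1.]]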
2.
Save the data in a plain text file with the savetxt() function. Specify the name of
the file that we want to save the data in and the array containing the data itself:
np.savetxt("eye.txt", i2)
A file called eye.txt should have been created in the same directory as the Python script.
You can check for yourself whether the contents are as expected. The code for this example
can be downloaded from the book support website: https://www.packtpub.com/books/content/support (see save.py):
import numpy as np
i2 = np.eye(2)
print(i2)
np.savetxt("eye.txt", i2)
For now, we are only interested in the close price and volume. In the preceding sample, that
will be 336.1 and 21144800. Store the close price and volume in two arrays as follows:
c,v=np.loadtxt('data.csv', delimiter=',', usecols=(6,7), unpack=True)
As you can see, data is stored in the data.csv file. We have set the delimiter to , (comma),
since we are dealing with a CSV file. The usecols parameter is set through a tuple to get
the seventh and eighth fields, which correspond to the close price and volume. The unpack
argument is set to True, which means that data will be unpacked and assigned to the c and
v variables that will hold the close price and volume, respectively.
The volume-weighted average price (VWAP) is often used in algorithmic trading and is calculated using volume values as weights.
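The steps for this section are missing from this extract; a minimal sketch of the computation, assuming c and v hold the close prices and volumes loaded above (np.average with the weights parameter computes the VWAP directly):
vwap = np.average(c, weights=v)
print("VWAP =", vwap)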
The arithmetic mean, as computed by the mean() function, is defined as
$$\frac{1}{n}\sum_{i=1}^{n} a_i$$
It sums the values in an array a and divides the sum by the number of elements n (see https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/e/mean_median_and_mode).
Pop quiz - returning the weighted average
Q1. Which function returns the weighted average of an array?
1. weighted average
2. waverage
3. average
4. avg
Value range
Usually, we don't only want to know the average or arithmetic mean of a set of values,
which lies somewhere in the middle; we also want the extremes, the full range: the highest
and lowest values. The sample data that we are using here already has those values per day:
the high and low price. However, we need to know the highest value of the high price and
the lowest value of the low price.
1.
First, read our file again and store the values for the high and low prices into arrays:
h,l=np.loadtxt('data.csv', delimiter=',', usecols=(4,5),
unpack=True)
The only thing that changed is the usecols parameter, since the high and low
prices are situated in different columns.
2.
Now, it's easy to get a midpoint, so it is left as an exercise for you to attempt.
3.
NumPy allows us to compute the spread of an array with a function called ptp().
The ptp() function returns the difference between the maximum and minimum
values of an array. In other words, it is equal to max(array) - min(array). Call
the ptp() function:
print("Spread high price", np.ptp(h))
print("Spread low price", np.ptp(l))
Statistics
Stock traders are interested in the most probable close price. Common sense says that
this should be close to some kind of an average as the price dances around a mean, due to
random fluctuations. The arithmetic mean and weighted average are ways to find the center
of a distribution of values. However, neither is robust; both are sensitive to outliers.
Outliers are extreme values that are much bigger or smaller than the typical values in a
dataset. Usually, outliers are caused by a rare phenomenon or a measurement error. For
instance, if we have a close price value of a million dollars, this will influence the outcome
of our calculations. A more robust measure is the median: half of the data is below it and the
other half is above it. For example, if we have the values of 1, 2, 3, 4, and 5, then the median
will be 3, since it is in the middle.
1.
Create a new Python script and call it simplestats.py. You already know how to
load the data from a CSV file into an array. So, copy that line of code and make sure
that it only gets the close price. The code should appear like this:
c=np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
2.
The function that will do the magic for us is called median(). We will call it and
print the result immediately. Add the following line of code:
print("median =", np.median(c))
3.
Since it is our first time using the median() function, we would like to check
whether this is correct. Obviously, we can do it by just going through the file and
finding the correct value, but that is no fun. Instead, we will just mimic the median
algorithm by sorting the close price array and printing the middle value of the sorted
array. The msort() function does the first part for us. Call the function, store the
sorted array, and then print it:
sorted_close = np.msort(c)
print("sorted =", sorted_close)
Yup, it works! Let's now get the middle value of the sorted array:
N = len(c)
print("middle =", sorted_close[(N - 1) // 2])
4.
Hey, that's a different value than the one the median() function gave us. How
come? Upon further investigation, we find that the median() function return value
doesn't even appear in our file. That's even stranger! Before filing bugs with the
NumPy team, let's have a look at the documentation:
$ python
>>> import numpy as np
>>> help(np.median)
This mystery is easy to solve. It turns out that our naive algorithm only works for
arrays with odd lengths. For even-length arrays, the median is calculated from the
average of the two array values in the middle. Therefore, type the following code:
print("average middle =", (sorted[N /2] + sorted[(N - 1) / 2]) /
2)
5.
Another statistical measure that we are concerned with is variance. Variance tells
us how much a variable varies (see https://www.khanacademy.org/math/
probability/descriptive-statistics/variance_std_deviation/e/
variance). In our case, it also tells us how risky an investment is, since a stock
price that varies too wildly is a riskier investment. Compute the variance of the
close price (the call itself is missing from this extract; np.var matches the
double-check in the next step):
print("variance =", np.var(c))
6.
Not that we don't trust NumPy or anything, but let's double-check using the
definition of variance, as found in the documentation. Mind you, this definition
might be different than the one in your statistics book, but that is quite common
in the field of statistics.
The population variance is defined as the sum of the squared deviations from the
mean, divided by the number of elements in the array:

$$\frac{1}{n}\sum_{i=1}^{n}\left(a_i - \mathrm{mean}\right)^2$$
Some books tell us to divide by the number of elements in the array minus one (this
is called a sample variance):
print("variance from definition =", np.mean((c - c.mean())**2))
Stock returns
In academic literature, it is more common to base analysis on stock returns and log returns
of the close price. Simple returns are just the rate of change from one value to the next.
Logarithmic returns, or log returns, are determined by taking the log of all the prices and
calculating the differences between them. In high school, we learned that:
$$\log(a) - \log(b) = \log\left(\frac{a}{b}\right)$$
Log returns, therefore, also measure the rate of change. Returns are dimensionless, since,
in the act of dividing, we divide dollar by dollar (or some other currency). Anyway, investors
are most likely to be interested in the variance or standard deviation of the returns, as this
represents risk.
1.
First, let's calculate simple returns. NumPy has the diff() function that returns an
array built up of the differences between consecutive array elements. This
is sort of like differentiation in calculus (the derivative of price with respect to time).
To get the returns, we also have to divide by the value of the previous day. We must
be careful though. The array returned by diff() is one element shorter than the
close prices array. After careful deliberation, we get the following code:
returns = np.diff( c ) / c[ : -1]
Notice that we don't use the last value in the divisor. The standard deviation is
equal to the square root of variance. Compute the standard deviation using the
std() function:
print("Standard deviation =", np.std(returns))
2.
The log return or logarithmic return is even easier to calculate. Use the log()
function to get the natural logarithm of the close price and then unleash the
diff() function on the result:
logreturns = np.diff(np.log(c))
Normally, we have to check that the input array doesn't have zeros or negative
numbers. If it does, we will get an error. Stock prices are, however, always positive,
so we didn't have to check.
3.
Quite likely, we will be interested in days when the return is positive. In the current
setup, we can get the next best thing with the where() function, which returns the
indices of an array that satisfy a condition. Just type the following code:
posretindices = np.where(returns > 0)
print("Indices with positive returns", posretindices)
This gives us the indices of the array elements with positive returns, as a tuple,
recognizable by the round brackets on both sides of the printout:
Indices with positive returns (array([ 0,  1,  4,  5,  6,  7,  9, 10, 11, 12, 16, 17, 18, 19, 21, 22, 23, 25, 28]),)
4.
4.
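The volatility calculation itself is missing from this extract; a sketch consistent with the surrounding text, assuming logreturns from the previous step and roughly 252 trading days in a year:
annual_volatility = np.std(logreturns) / np.mean(logreturns)
annual_volatility = annual_volatility / np.sqrt(1. / 252.)
print("Annual volatility", annual_volatility)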
Take notice of the division within the sqrt() function. Since, in Python, integer
division works differently than float division, we needed to use floats to make
sure that we get the proper results. The monthly volatility is similarly given by
the following code:
print("Monthly volatility", annual_volatility * np.sqrt(1./12.))
Dates
Do you sometimes have the Monday blues or Friday fever? Ever wondered whether
the stock market suffers from these phenomena? Well, I think this certainly warrants
extensive research.
1.
Obviously, NumPy tried to convert the dates into floats. What we have to do is tell
NumPy explicitly how to convert the dates. The loadtxt() function has a special
parameter for this purpose. The parameter is called converters and is a dictionary
that links columns with the so-called converter functions. It is our responsibility to
write the converter function. Write the function down:
# Monday 0
# Tuesday 1
# Wednesday 2
# Thursday 3
# Friday 4
# Saturday 5
# Sunday 6
import datetime

def datestr2num(s):
    return datetime.datetime.strptime(s, "%d-%m-%Y").date().weekday()
The weekday() method is called on the date to return a number. As you can read in the
comments, the number is between 0 and 6: 0 is Monday and 6 is Sunday, for instance. The
actual number, of course, is not important for our algorithm; it is only used as identification.
2.
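The loading call itself is missing from this extract; a sketch consistent with the steps below, assuming the date sits in column 1 and the close price in column 6 of data.csv (on Python 3, loadtxt may pass bytes to the converter, in which case s needs to be decoded first):
dates, close = np.loadtxt('data.csv', delimiter=',', usecols=(1, 6), converters={1: datestr2num}, unpack=True)
print("Dates", dates)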
No Saturdays and Sundays, as you can see. Exchanges are closed over the weekend.
3.
We will now make an array that has five elements for each day of the week. Initialize
the values of the array to 0:
averages = np.zeros(5)
4.
We already learned about the where() function that returns indices of the array
for elements that conform to a specified condition. The take() function can use
these indices and takes the values of the corresponding array items. We will use the
take() function to get the close prices for each weekday. In the following loop,
we go through the date values 0 to 4, better known as Monday to Friday. We get
the indices with the where() function for each day and store it in the indices
array. Then, we retrieve the values corresponding to the indices, using the take()
function. Finally, compute an average for each weekday and store it in the averages
array, like this:
for i in range(5):
    indices = np.where(dates == i)
    prices = np.take(close, indices)
    avg = np.mean(prices)
    print("Day", i, "prices", prices, "Average", avg)
    averages[i] = avg
The printout shows, for each day code 0 through 4, the array of close prices and their average.
5.
If you want, you can go ahead and find out which day has the highest average, and
which the lowest. However, it is just as easy to find this out with the max() and
min() functions, as shown here:
top = np.max(averages)
print("Highest average", top)
print("Top day of the week", np.argmax(averages))
bottom = np.min(averages)
print("Lowest average", bottom)
print("Bottom day of the week", np.argmin(averages))
1.
To learn about the datetime64 data type, start a Python shell and import NumPy
as follows:
$ python
>>> import numpy as np
Create a datetime64 from a string (you can use another date if you like):
>>> np.datetime64('2015-04-22')
numpy.datetime64('2015-04-22')
In the preceding code, we created a datetime64 for April 22, 2015, which happens
to be Earth Day. We used the YYYY-MM-DD format, where Y corresponds to the year,
M corresponds to the month, and D corresponds to the day of the month. NumPy
uses the ISO 8601 standard (see http://en.wikipedia.org/wiki/ISO_8601).
This is an international standard to represent dates and times. ISO 8601 allows the
YYYY-MM-DD, YYYY-MM, and YYYYMMDD formats. Check for yourself, as follows:
>>> np.datetime64('2015-04-22')
numpy.datetime64('2015-04-22')
>>> np.datetime64('2015-04')
numpy.datetime64('2015-04')
2.
By default, ISO 8601 uses the local time zone. Times can be specified using the
format T[hh:mm:ss]. For example, define January 1, 1677 at 8:19 p.m. as follows:
>>> local = np.datetime64('1677-01-01T20:19')
>>> local
numpy.datetime64('1677-01-01T20:19Z')
Additionally, a string in the format ±[hh:mm] specifies an offset relative to the
UTC time zone. Create a datetime64 with an offset of -9 hours, as follows:
>>> with_offset = np.datetime64('1677-01-01T20:19-0900')
>>> with_offset
numpy.datetime64('1677-01-02T05:19Z')
The Z at the end stands for Zulu time, which is how UTC is sometimes referred to.
Subtract the two datetime64 objects from each other:
>>> local - with_offset
numpy.timedelta64(-540,'m')
The subtraction creates a NumPy timedelta64 object, which in this case indicates
a 540-minute difference. We can also add or subtract a number of days to a
datetime64 object. For instance, April 22, 2015 happens to be a Wednesday. With
the arange() function, create an array holding all the Wednesdays from April 22,
2015 until May 22, 2015 as follows:
>>> np.arange('2015-04-22', '2015-05-22', 7, dtype='datetime64')
array(['2015-04-22', '2015-04-29', '2015-05-06', '2015-05-13',
'2015-05-20'], dtype='datetime64[D]')
Note that in this case, it is mandatory to specify the dtype argument, otherwise
NumPy thinks that we are dealing with strings.
Weekly summary
The data that we used in the previous Time for action section is end-of-day data. In essence,
it is summarized data compiled from the trade data for a certain day. If you are interested in
the market and have decades of data, you might want to summarize and compress the data
even further. Let's summarize the data of Apple stocks to give us weekly summaries.
1.
To simplify, just have a look at the first three weeks in the sample; later, you can
have a go at improving this:
close = close[:16]
dates = dates[:16]
We will be building on the code from the previous Time for action section.
2.
To begin, we will find the first Monday in our sample data. Recall that Mondays
have the code 0 in Python. This is what we will put in the condition of the where()
function. Then, we will need to extract the first element that has index 0. The result
will be a multidimensional array. Flatten this with the ravel() function:
# get first Monday
first_monday = np.ravel(np.where(dates == 0))[0]
print("The first Monday index is", first_monday)
3.
The next logical step is to find the Friday before the last Friday in the sample. The
logic is similar to the one for finding the first Monday, and the code for Friday is 4.
Additionally, we are looking for the second-to-last element, with index -2:
# get last Friday
last_friday = np.ravel(np.where(dates == 4))[-2]
print("The last Friday index is", last_friday)
4.
Next, create an array with the indices of all the days in the three weeks:
weeks_indices = np.arange(first_monday, last_friday + 1)
print("Weeks indices initial", weeks_indices)
5.
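Next, split the indices into three pieces, one per week. The splitting code is missing from this extract; a sketch consistent with the next step, which expects an array of three week-long groups of indices:
weeks_indices = np.split(weeks_indices, 3)
print("Weeks indices after split", weeks_indices)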
6.
In NumPy, array dimensions are called axes. Now, we will get fancy with the
apply_along_axis() function. This function calls another function, which we will provide,
to operate on each of the elements of an array. Currently, we have an array with three
elements. Each array item corresponds to one week in our sample and contains indices
of the corresponding items. Call the apply_along_axis() function by supplying
the name of our function, called summarize(), which we will define shortly.
Furthermore, specify the axis or dimension number (such as 1), the array to operate
on, and a variable number of arguments for the summarize() function, if any:
weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
print("Week summary", weeksummary)
7.
For each week, the summarize() function returns a tuple that holds the open,
high, low, and close price for the week, similar to end-of-day data:
def summarize(a, o, h, l, c):
    monday_open = o[a[0]]
    week_high = np.max( np.take(h, a) )
    week_low = np.min( np.take(l, a) )
    friday_close = c[a[-1]]
    return("APPL", monday_open, week_high, week_low, friday_close)
Notice that we used the take() function to get the actual values from indices.
Calculating the high and low values for the week was easily done with the max()
and min() functions. The open for the week is the open for the first day in the
week, Monday. Likewise, the close is the close for the last day of the week, Friday:
Week summary [['APPL' '335.8' '346.7' '334.3' '346.5']
['APPL' '347.89' '360.0' '347.64' '356.85']
['APPL' '356.79' '364.9' '349.52' '350.56']]
8.
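The call that writes the file is missing from this extract; a sketch consistent with the output shown at the end of this section, storing every field as a string with the %s specifier:
np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")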
As you can see, we have specified a filename, the array we want to store, a delimiter
(in this case a comma), and the format in which we want to store floating point numbers.
The format string starts with a percent sign. Second is an optional flag: - means
left justify, 0 means left pad with zeros, and + means precede with + or -.
Third is an optional width, which indicates the minimum number of characters.
Fourth, a dot is followed by a number linked to precision. Finally, there comes a
character specifier; in our example, the character specifier is a string. The character
codes are described as follows:
Character code    Description
c                 character
d or i            signed decimal integer
e or E            scientific notation with e or E
f                 decimal floating point number
g, G              use the shorter of e, E, or f
o                 signed octal
s                 string of characters
u                 unsigned decimal integer
x, X              unsigned hexadecimal integer
View the generated file in your favorite editor or type at the command line:
$ cat weeksummary.csv
APPL,335.8,346.7,334.3,346.5
APPL,347.89,360.0,347.64,356.85
APPL,356.79,364.9,349.52,350.56
Average true range
1.
The ATR is based on the low and high price of N days, usually the last 20 days.
N = 5
h = h[-N:]
l = l[-N:]
2.
We also need to know the close price of the previous day for each of those days. Its definition is missing from this extract; a sketch consistent with the code below: previousclose = c[-N - 1: -1]. The true range is based on three differences: the high minus the low (h - l), the high minus the previous close (h - previousclose), and the difference between the previous close and the low price (previousclose - l).
3.
The max() function returns the maximum of an array. Based on those three values,
we calculate the so-called true range, which is the maximum of these values. We are
now interested in the element-wise maxima across arraysmeaning the maxima of
the first elements in the arrays, the second elements in the arrays, and so on. Use
the NumPy maximum() function instead of the max() function for this purpose:
truerange = np.maximum(h - l, h - previousclose, previousclose l)
4.
We also need an array to hold the ATR values. Its initialization is missing from this extract; np.zeros matches the loop that follows:
atr = np.zeros(N)
5.
The first value of the array is just the average of the truerange array:
atr[0] = np.mean(truerange)
Each subsequent value combines the previous ATR (PATR) and the current true range (TR):

$$\mathrm{ATR}_i = \frac{(N-1)\cdot \mathrm{PATR} + \mathrm{TR}}{N}$$
for i in range(1, N):
    atr[i] = (N - 1) * atr[i - 1] + truerange[i]
    atr[i] /= N
print("ATR", atr)
In the following sections, we will learn better ways to calculate moving averages.
Simple moving average
The simple moving average (SMA) smooths the close price by convolving a window of equal weights with the data. Mathematically, the convolution of two functions f and g is defined as:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau = \int_{-\infty}^{\infty} f(t-\tau)\, g(\tau)\, d\tau$$

Convolution is described on Wikipedia at https://en.wikipedia.org/wiki/Convolution. Khan Academy also has a tutorial on convolution at https://www.khanacademy.org/math/differentialequations/laplace-transform/convolution-integral/v/introduction-to-the-convolution.
1.
Use the ones() function to create an array of size N and elements initialized to 1,
and then, divide the array by N to give us the weights:
N = 5
weights = np.ones(N) / N
print("Weights", weights)
The output shows five equal weights:
Weights [ 0.2  0.2  0.2  0.2  0.2]
2.
Using these weights, compute the SMA by convolving them with the close prices. The call is missing from this extract; it matches the slice discussed in the next step:
sma = np.convolve(weights, c)[N-1:-N+1]
3.
From the array returned by convolve(), we extracted the data in the center of size
N. The following code makes an array of time values and plots with matplotlib,
which we will cover in a later chapter:
import numpy as np
import matplotlib.pyplot as plt

# N and weights come from step 1
c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
sma = np.convolve(weights, c)[N-1:-N+1]
t = np.arange(N - 1, len(c))
plt.plot(t, c[N-1:], lw=1.0, label="Data")
plt.plot(t, sma, '--', lw=2.0, label="Moving average")
plt.title("5 Day Moving Average")
plt.xlabel("Days")
plt.ylabel("Price ($)")
plt.grid()
plt.legend()
plt.show()
In the following chart, the smooth dashed line is the 5 day SMA and the jagged thin
line is the close price:
Exponential moving average
1.
The exponential moving average (EMA) uses exponentially decreasing weights, based on the exp() function. The demonstration code for this step is missing from this extract; a sketch consistent with the output below:
x = np.arange(5)
print("Exp", np.exp(x))
The exp() function computes the exponential of each array element, giving:
Exp [  1.           2.71828183   7.3890561   20.08553692  54.59815003]
The linspace() function takes as parameters a start value, a stop value, and optionally an
array size. It returns an array of evenly spaced numbers. This is an example:
print("Linspace", np.linspace(-1, 0, 5))
-0.75 -0.5
-0.25
0.
1.
Now, back to the weights, calculate them with exp() and linspace():
N = 5
weights = np.exp(np.linspace(-1., 0., N))
2.
Normalize the weights so that they sum to one. The normalization code is missing from this extract; a sketch consistent with the output below:
weights /= weights.sum()
print("Weights", weights)
The output is:
Weights [ 0.11405072  0.14644403  0.18803785  0.24144538  0.31002202]
3.
After this, use the convolve() function that we learned about in the SMA section
and also plot the results:
c = np.loadtxt('data.csv', delimiter=',', usecols=(6,),
unpack=True)
ema = np.convolve(weights, c)[N-1:-N+1]
t = np.arange(N - 1, len(c))
plt.plot(t, c[N-1:], lw=1.0, label='Data')
plt.plot(t, ema, '--', lw=2.0, label='Exponential Moving Average')
plt.title('5 Days Exponential Moving Average')
plt.xlabel('Days')
plt.ylabel('Price ($)')
plt.legend()
plt.grid()
plt.show()
This gives us a nice chart where, again, the close price is the thin jagged line and the
EMA is the smooth dashed line:
Bollinger Bands
Bollinger Bands are yet another technical indicator. Yes, there are thousands of them. This
one is named after its inventor and indicates a range for the price of a financial security. It
consists of three parts:
1. A simple moving average.
2. An upper band, two standard deviations above the moving average (the standard deviation is derived from the same data as the moving average).
3. A lower band, two standard deviations below the moving average.
1.
Starting with an array called sma that contains the moving average values, we
will loop through all the datasets corresponding to those values. After forming
the dataset, calculate the standard deviation. Note that at a certain point, it
will be necessary to calculate the difference between each data point and the
corresponding average value. If we did not have NumPy, we would loop through these
points and subtract each of the values one by one from the corresponding average.
However, the fill() method allows us to construct an array that has
elements set to the same value. This enables us to save on one loop and subtract
arrays in one go:
deviation = []
C = len(c)
for i in range(N - 1, C):
    if i + N < C:
        dev = c[i: i + N]
    else:
        dev = c[-N:]
    averages = np.zeros(N)
    averages.fill(sma[i - N + 1])  # the moving average value aligned with this window
    dev = dev - averages
    dev = dev ** 2
    dev = np.sqrt(np.mean(dev))
    deviation.append(dev)
deviation = 2 * np.array(deviation)
print(len(deviation), len(sma))
upperBB = sma + deviation
lowerBB = sma - deviation
2.
To plot, we will use the following code (don't worry about it now; we will see how
this works in Chapter 9, Plotting with matplotlib):
t = np.arange(N - 1, C)
c_slice = c[N - 1:]  # close prices aligned with the bands (definition missing from this extract)
plt.plot(t, c_slice, lw=1.0, label='Data')
plt.plot(t, sma, '--', lw=2.0, label='Moving Average')
plt.plot(t, upperBB, '-.', lw=3.0, label='Upper Band')
plt.plot(t, lowerBB, ':', lw=4.0, label='Lower Band')
plt.title('Bollinger Bands')
plt.xlabel('Days')
plt.ylabel('Price ($)')
plt.grid()
plt.legend()
plt.show()
Following is a chart showing the Bollinger Bands for our data. The jagged thin line in
the middle represents the close price, and the dashed, smoother line crossing it is
the moving average:
Linear model
Many phenomena in science can be modeled with a linear relationship. The NumPy linalg
package deals with linear algebra computations. We will begin with the assumption that a
price value can be derived from N previous prices based on a linear relationship:

$$p_t = b + \sum_{i=1}^{N} a_i\, p_{t-i}$$
In linear algebra terms, this boils down to solving a least squares problem (see https://www.khanacademy.org/math/linear-algebra/alternate_bases/orthogonal_projections/v/linear-algebra-least-squares-approximation).
Independently of each other, the astronomers Legendre and
Gauss created the least squares method around 1805 (see
http://en.wikipedia.org/wiki/Least_squares).
The method was initially used to analyze the motion of celestial
bodies. The algorithm minimizes the sum of the squared residuals
(the differences between measured and predicted values):

$$\sum_{i=1}^{n}\left(\mathrm{measured}_i - \mathrm{predicted}_i\right)^2$$
1.
First, form a vector b with the N most recent close prices, in reverse order. The code for this step is missing from this extract; a sketch consistent with the matrix filled below (assuming c holds the close prices):
N = 5
b = c[-N:][::-1]
print("b", b)
2.
The tail of the printed vector b reads:
b [ ...  346.67  352.47  355.76  355.36]
3.
Second, pre-allocate the matrix A that will hold the N preceding prices for each element of b. Its initialization is missing from this extract; a sketch consistent with the printout below:
A = np.zeros((N, N), float)
print("A", A)
The printout shows a matrix of zeros:
A [[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
Third, fill the matrix A with N preceding price values for each value in b:
for i in range(N):
    A[i, ] = c[-N - 1 - i: - 1 - i]
print("A", A)
4.
The filled matrix A looks like the following:
A [[ 360.    355.36  355.76  352.47  346.67]
 [ 359.56  360.    355.36  355.76  352.47]
 [ 352.12  359.56  360.    355.36  355.76]
 [ 349.31  352.12  359.56  360.    355.36]
 [ 353.21  349.31  352.12  359.56  360.  ]]
The objective is to determine the coefficients that satisfy our linear model by solving
the least squares problem. Employ the lstsq() function of the NumPy linalg
package to do this:
(x, residuals, rank, s) = np.linalg.lstsq(A, b)
print(x, residuals, rank, s)
The tuple returned contains the coefficient vector x that we were after, an array comprising
residuals, the rank of matrix A, and the singular values of A.
5.
Once we have the coefficients of our linear model, we can predict the next
price value. Compute the dot product (with the NumPy dot() function) of the
coefficients and the last known N prices:
print(np.dot(b, x))
The dot product (see https://www.khanacademy.org/math/linearalgebra/vectors_and_spaces/dot_cross_products/v/vector-dotproduct-and-vector-length) is the linear combination of the coefficients x
and the prices b. As a result, we get:
357.939161015
I looked it up; the actual close price of the next day was 353.56. So, our estimate with N = 5
was not that far off.
Trend lines
A trend line is a line among a number of the so-called pivot points on a stock chart. As the
name suggests, the line's trend portrays the trend of the price development. In the past,
traders drew trend lines on paper but nowadays, we can let a computer draw it for us. In this
section, we shall use a very simple approach that probably won't be very useful in real life,
but should clarify the principle well.
1.
First, we need to determine the pivot points. We shall pretend they are equal to the
arithmetic mean of the high, low, and close price:
h, l, c = np.loadtxt('data.csv', delimiter=',', usecols=(4, 5,
6), unpack=True)
pivots = (h + l + c) / 3
print("Pivots", pivots)
From the pivots, we can deduce the so-called resistance and support levels.
The support level is the lowest level at which the price rebounds. The resistance
level is the highest level at which the price bounces back. These are not natural
phenomena; they are merely estimates. Based on these estimates, it is possible to
draw support and resistance trend lines. We will define the daily spread to be the
difference between the high and low price.
2.
Define a function to fit data to a line, where y = at + b. The function should
return a and b. This is another opportunity to apply the lstsq() function of the
NumPy linalg package. Rewrite the line equation to y = Ax, where A = [t 1]
and x = [a b]. Form A with the NumPy ones_like() function, which creates an array
where all the values are equal to 1, using an input array as a template for the
array dimensions:
def fit_line(t, y):
    A = np.vstack([t, np.ones_like(t)]).T
    return np.linalg.lstsq(A, y)[0]
3.
Assuming that support levels are one daily spread below the pivots, and that
resistance levels are one daily spread above the pivots, fit the support and
resistance trend lines:
t = np.arange(len(c))
sa, sb = fit_line(t, pivots - (h - l))
ra, rb = fit_line(t, pivots + (h - l))
support = sa * t + sb
resistance = ra * t + rb
4.
At this juncture, we have all the necessary information to draw the trend lines;
however, it is wise to check how many points fall between the support and resistance
levels. Obviously, if only a small percentage of the data is between the trend lines,
then this setup is of no use to us. Make up a condition for points between the bands
and select with the where() function, based on the following condition:
condition = (c > support) & (c < resistance)
print("Condition", condition)
between_bands = np.where(condition)
The printout of condition is a boolean array, with True marking the days where the close price lies between the bands.
The array returned by the where() function has rank 2, so call the ravel()
function before calling the len() function:
between_bands = len(np.ravel(between_bands))
print("Number points between bands", between_bands)
print("Ratio between bands", float(between_bands)/len(c))
As an extra bonus, we gained a predictive model. Extrapolate the next day's resistance
and support levels:
print("Tomorrow's support", sa * (t[-1] + 1) + sb)
print("Tomorrow's resistance", ra * (t[-1] + 1) + rb)
Another approach to figure out how many points are between the support and
resistance estimates is to use the [] operator and intersect1d(). Define selection criteria
in the [] operator and intersect the results with the intersect1d() function:
a1 = c[c > support]
a2 = c[c < resistance]
print("Number of points between bands 2nd approach", len(np.intersect1d(a1, a2)))
5.
In the following plot, we have the price data and the corresponding support and
resistance lines:
Methods of ndarray
The NumPy ndarray class has a lot of methods that work on the array. Most of the time,
these methods return an array. You may have noticed that many of the functions that are
part of the NumPy library have a counterpart with the same name and functionality in the
ndarray class. This is mostly due to the historical development of NumPy.
The list of ndarray methods is pretty long, so we cannot cover them all. The mean(),
var(), sum(), std(), argmax(), and argmin() functions that we saw earlier
are also ndarray methods.
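For instance, the mean can be computed either way with identical results; a quick sketch:
a = np.arange(5)
print(np.mean(a))  # 2.0, using the NumPy function
print(a.mean())    # 2.0, using the ndarray method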
1.
The clip() method returns a clipped array, so that all values above a maximum
value are set to the maximum and values below a minimum are set to the minimum
value. Clip an array with values 0 to 4 to 1 and 2:
a = np.arange(5)
print("a =", a)
print("Clipped", a.clip(1, 2))
Factorial
Many programming books have an example of calculating the factorial. We should not break
with this tradition.
1.
Calculate the factorial of 8. To do this, generate an array with values 1 to 8 and call
the prod() function on it:
b = np.arange(1, 9)
print("b =", b)
print("Factorial", b.prod())
This is nice, but what if we want to know all the factorials from 1 to 8?
2.
No problem! Call the cumprod() method, which computes the cumulative product
of an array:
print("Factorials", b.cumprod())
Factorials [    1     2     6    24   120   720  5040 40320]
Handling NaNs with nanmean(), nanvar(), and nanstd()
1.
First, set up an array to hold three estimates (from nanmean(), nanvar(), and nanstd()) for each iteration. The initialization is missing from this extract; a sketch consistent with the loop below:
estimates = np.zeros((len(c), 3))
2.
Loop through the values and generate a new dataset by setting one value to NaN at
each iteration of the loop. For each new set of values, compute the estimates:
for i in range(len(c)):
    a = c.copy()
    a[i] = np.nan
    estimates[i,] = [np.nanmean(a), np.nanvar(a), np.nanstd(a)]
3.
Print the variance for each estimate (you can also print the mean or standard
deviation if you prefer):
print("Estimates variance", estimates.var(axis=0))
The tail of the printout reads:
Estimates variance [ ...  3.63062943  0.01868965]
Summary
This chapter informed us about a great number of common NumPy functions. A few
common statistics functions were also mentioned.
After this tour through the common NumPy functions, we will continue covering
convenience NumPy functions such as polyfit(), sign(), and piecewise()
in the next chapter.