Fast Python for Data Science MEAP V10
1. MEAP VERSION 10
2. Welcome
3. 1 The need for efficient computing and data storage
4. 2 Extracting maximum performance from built-in features
5. 3 Concurrency, parallelism and asynchronous processing
6. 4 High performance NumPy
7. 5 Re-implementing critical code with Cython
8. 6 Memory hierarchy, storage and networking
9. 7 High performance Pandas and Apache Arrow
10. 8 Storing big data
11. 9 Data Analysis using GPU computing
12. 10 Analyzing big data with Dask
13. Appendix A. Setting up the environment
14. Appendix B. Using Numba to generate efficient low level code
Cover
MEAP VERSION 10
Welcome
Thank you for purchasing the MEAP for Fast Python. This is an advanced
book written for Python programmers who already have some practical
experience under their belt. You are probably already dealing with large
problems and would like to produce solutions that are more efficient:
faster, and using less CPU, storage, and network. You are at a stage where
you need to dig deeper into how Python works in order to write more
efficient solutions.
You know all the basic Python language features: most of its syntax and a
few of its built-in libraries. You are using, or have heard of, libraries like
NumPy, Pandas, or SciPy. You might have dabbled with the multiprocessing
module, but you would definitely like to know more. You know that you
can rewrite parts of your Python code in a lower-level language or system
like Cython, Numba, or C. You are keen on exploring new ways to make
your code more efficient, like offloading code to GPUs.
This book is concerned with writing Python code that delivers more
performance. Performance here means several things: speed of execution,
but also being as IO-frugal as possible, and ultimately reducing the overall
financial cost of our code by using fewer computers, less storage, and less
time. There are ways of achieving this, and I believe that we can do it in an
elegant way: more efficient code doesn't mean uglier code or less
maintainable code.
The approach we will be taking is multi-faceted. We tackle pure-Python
code, multiprocessing, and rewriting critical parts in faster languages. In
addition, we will be looking at the libraries that are the bread and butter of
data analysis in Python: how can we use libraries like NumPy or Pandas in a
more performant way? And because IO is a big bottleneck in our big-data
world, we will pay close attention to persistence: we will transform data into
more efficient representations and introduce modern libraries for storage and
IO.
It is quite important for me that all the above topics are contextualized in
their environment: the best solution to run on a single computer is
probably very different from the best solution to run on the cloud. There is
no single solution to rule them all. Therefore we will also be discussing the
impact of CPU, disks, network, and cloud architectures. You will have to
think differently as your platform changes, and this book, hopefully, will
help you with that.
The topics covered are complex and I know that your feedback will be
fundamental to improving this work substantially. Please be sure to post
any questions, comments, or suggestions you have about the book in the
liveBook discussion forum.
—Tiago Antão
In this book
It is difficult to think of a more common cliche than the one about how we
live in "a data deluge," but it happens that this cliche is also very true.
Software development professionals are tasked with dealing with immense
amounts of data, and Python has emerged as the language of choice to do—
or at least glue—all the heavy lifting around this deluge. Indeed Python’s
popularity in data science and data engineering is one of the main drivers of
the language’s growth, helping to push it to one of the top three most used
languages across most developer surveys. Python has its own unique set of
advantages and limitations for dealing with big data, and in this book we will
explore techniques for doing efficient data processing in Python. We will
examine a variety of angles and approaches which target software, hardware,
coding, and more. Starting with pure Python best practices for efficiency, we
then move on to how to best leverage multi-processing; improving our use of
data processing libraries; and re-implementing parts of the code in lower
level languages. We will look not only at CPU processing optimizations, but
also at storage and network efficiency gains. And we will look at all this in
the context of traditional single-computer architectures as well as newer
approaches like the cloud and GPU-based computing. By the end of this
book, you will have a toolbox full of reliable solutions for using fewer
resources and saving money, while still responding faster to computing
requirements.
In this chapter let's first take a look at a few specifics about the so-called
data deluge, to orient ourselves to what, exactly, we are dealing with. Then
we will sketch out why the old solutions, such as increasing CPU speed, are
no longer adequate. Next we'll look at the particular issues that Python faces
when dealing with big data, including Python's threading and CPython's
infamous Global Interpreter Lock (GIL). Once we've seen the need for new
approaches to making Python perform better, I'll explain what precisely I
mean by high-performance Python, and what you'll learn in this book.
Figure 1.1. The ratio between Moore’s Law and Edholm’s law suggests that hardware will
always lag behind the amount of data being generated. Moreover the gap will increase over time.
The situation described by this graph can be seen as a fight between what we
need to analyze (Edholm's law) and the power that we have to do that analysis
(Moore's law). The graph actually paints a rosier picture than what we have
in reality. We will see why in chapter 6 when we discuss Moore's law in the
context of modern CPU architectures.
Figure 1.2. The growth of Global Internet Traffic over the years measured in Petabytes per
month. (source: Wikipedia)
In addition, 90% of the data humankind has ever produced was generated in
the last two years (to read more about this, see
https://www.uschamberfoundation.org/bhq/big-data-and-what-it-means).
Whether the quality of this new data is proportional to its size is another
matter altogether. The point is that data produced will need to be processed
and that processing will require more resources.
The way all this new data is represented is also changing in nature. Some
projections put unstructured data at around 80% of all data by 2025 (for
details see https://www.aparavi.com/data-growth-statistics-blow-your-mind/).
Simply put, unstructured data makes data processing more demanding from
a computational perspective.
How do we deal with all this growth in data? Surprisingly and sadly, it turns
out that we mostly don’t. More than 99% of data produced is never
analyzed, according to an article published in The Guardian
(https://www.theguardian.com/news/datablog/2012/dec/19/big-data-study-
digital-universe-global-volume). Part of what holds us back from making use
of so much of our data is that we lack efficient procedures to analyze it.
The growth of data and the concomitant need for more processing have
given rise to one of the most pernicious mantras in computing, which
goes along these lines: "If you have more data, just throw more servers at it."
An alternative approach, when we need to increase the performance of an
existing system, is to have a look at the existing architecture and
implementation and find places where we can optimize for performance. I
have personally lost count of how many times I have been able to get ten-
fold increases in performance just by being mindful of efficiency issues
when reviewing existing code.
Your solution requires only a single computer, but suddenly you need
more machines. Adding machines means you will have to manage the
number of machines, distribute the workload across them, and make
sure the data is partitioned correctly. You might also need a file system
server to add to your list of machines. The cost of maintaining a server
farm—or just a cloud—is qualitatively much more than maintaining a
single computer.
Your solution works well in-memory but then the amount of data
increases and no longer fits your memory. To handle the new amount of
data stored in disk will normally entail a major re-write of your code.
And, of course, the code itself will grow in complexity. For instance, if
the main database is now on disk, you may need to create a cache
policy. Or you may need to do concurrent reads from multiple
processes. Or, even worse, concurrent writes.
You use a SQL database and suddenly you reach the maximum throughput
capacity of the server. If it's only a read capacity problem, then you
might survive by just creating a few read replicas. But if it is a write
problem, what do you do? Maybe you set up sharding [1]? Or do you
decide to completely change your database technology in favor of some
supposedly more performant NoSQL variant?
If you depend on a cloud-based system built on vendor-proprietary
technologies, you might discover that the ability to scale
indefinitely is more marketing talk than technological reality. In many
cases, if you hit performance limits, the only realistic solution is to
change the technology that you are using, a change that requires
enormous time, money, and human energy.
I hope these examples make the case that growing is not just a question of
“adding more machines,” but instead entails substantial work on several
fronts to deal with the increased complexity. Even something as "simple" as
a parallel solution implemented on a single computer can bring with it all the
problems of parallel processing (races, deadlocks, and more). These more
efficient solutions can have a dramatic effect on complexity, reliability and
cost.
Finally, we could make the case that even if we could scale our infrastructure
linearly (we can't, really), there would be ethical and ecological issues to
consider: forecasts put energy consumption related to a "tsunami of data" at
20% of global electricity production (for details see
https://www.theguardian.com/environment/2017/dec/11/tsunami-of-data-
could-consume-fifth-global-electricity-by-2025), and there is also the issue of
landfill disposal as we update hardware.
On the other hand, many of the solutions we’ll look at will have a
development cost and will add an amount of complexity themselves. When
you look at your data and forecasts for its growth, you will have to make a
judgment call on where to optimize, as there are no clear-cut recipes or one-
size-fits-all solutions. That being said, there might be just one rule that can
be applied across the board:
If the solution is good for Netflix, Google, Amazon, Apple or Facebook then
probably it is not good for you—unless, of course, you work for one of these
companies.
The amount of data that most of us will see will be substantially lower than
what the biggest technology companies handle. It will still be enormous, it
will still be hard, but it will probably be a few orders of magnitude lower. The
somewhat prevailing wisdom that what works for those companies is also a
good fit for the rest of us is, in my opinion, just wrong. Generally, less
complex solutions will be more appropriate for most of us.
As you can see, this new world with extreme growth—both in quantity and
complexity—of both data and algorithms requires more sophisticated
techniques to perform computation and storage in an efficient and cost-
conscious way. Don't get me wrong, sometimes you will need to scale up
your infrastructure. But when you architect and implement your solution,
you can still use the same mindset of focusing on efficiency. It's just that the
techniques will be different.
Now that we have a broad overview of the problem, let's see how to address
it. In the next section we will look at computing architectures in general:
from what is going on inside the computer all the way to the implications
of large clusters and cloud solutions. With these environments in mind we
can, in the section afterwards, start discussing the advantages and pitfalls of
Python for high-performance processing of large datasets.
[1] Sharding is the partitioning of data so that parts of it reside in different
servers.
Given the sheer amount of processing power in GPUs, there was an attempt
to try to use that power for other tasks with the appearance of General-
Purpose Computing on Graphics Processing Units (GPGPU). Because of the
way GPU architectures are organized, they are mostly applicable to tasks
that are massively parallel in nature. It turns out that many modern AI
algorithms, like ones based on neural networks, tend to be massively
parallel. So there was a natural fit between the two.
Unfortunately, the difference between CPUs and GPUs is not only in number
of cores and their complexity. GPU memory—especially on the most
computationally powerful—is separated from main memory. Thus there is
also the issue of transferring data between main memory and GPU memory.
So we have two massive issues to consider when targeting GPUs.
For reasons that will become clear in chapter 9, "GPU Computing with
Python," programming GPUs with Python is substantially more difficult and
less practical than targeting CPUs. Nonetheless, there is still more than
enough scope to make use of GPUs from Python.
There are many changes in the way we program modern CPUs, and as you
will see in chapter 6, "CPU and Memory Hierarchy," some of them are so
counter-intuitive that they are worth keeping an eye on from the onset. For
example, while CPU speeds have leveled off in recent years, CPUs are still
orders of magnitude faster than RAM. If CPU caches did not exist,
then CPUs would be mostly idle, as they would spend most of their time
waiting for RAM. This means that sometimes it is faster to work with
compressed data—including the cost of decompression—than with raw data.
Why? If you are able to put a compressed block in the CPU cache, then
the cycles that would otherwise be idle waiting for RAM access can be
used to decompress the data, with cycles still to spare for
computation! A similar argument works for compressed file systems:
they can sometimes be faster than raw file systems. There are direct
applications of this in the Python world: for example, by changing a simple
boolean flag governing the internal representation of NumPy arrays,
you can take advantage of cache locality and speed up your NumPy
processing considerably. Table 1.1 lists access times and sizes for different
kinds of memory, including CPU cache, RAM, local disk, and remote
storage. The key point here is not the precise numbers but the orders-of-
magnitude differences in both size and access time.
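To make the NumPy remark concrete, here is a minimal sketch (not the book's code) of the kind of flag being referred to: the order parameter selects row-major versus column-major layout, and which one you pick changes how well a row-wise operation lines up with the CPU cache.

import numpy as np

side = 5_000
c_array = np.ones((side, side), order="C")  # row-major: each row is contiguous in memory
f_array = np.ones((side, side), order="F")  # column-major: each column is contiguous

# Summing along rows reads memory sequentially for the C layout, so it is
# cache-friendly; the same call on the F layout keeps jumping across memory.
c_array.sum(axis=1)
f_array.sum(axis=1)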
Table 1.1. Memory hierarchy with sizes and access times for a hypothetical, but realistic,
modern desktop

Type                               Size      Access time
CPU
  L1 cache                         256 KB    2 ns
  L2 cache                         1 MB      5 ns
  L3 cache                         6 MB      30 ns
RAM
  DIMM                             8 GB      100 ns
Secondary storage
  SSD                              256 GB    50 µs
  HDD                              2 TB      5 ms
Tertiary storage
  NAS (Network-Attached Storage)   100 TB    Network dependent
  Cloud proprietary                1 PB      Provider dependent
Table 1.1 introduces tertiary storage, which happens outside the computer.
There have also been changes there, which we will address in the next section.
Using many computers and external storage brings a whole new class of
problems related to distributed computing: network topologies, sharing data
across machines, and managing processes running across the network. There
are many examples. For instance, what is the price of using REST APIs for
services that require high performance and low latency? How do we deal with
the penalties of remote file systems, and can we mitigate them?
We will be trying to optimize our usage of the network stack, and for that we
will have to be aware of it at all the levels shown in figure 1.3. Outside the
network we have our code and Python libraries, which make choices about
the layers below. At the top of the network stack, a typical choice for data
transport is HTTPS with a payload based on JSON. While this is a perfectly
reasonable choice for many applications, there are more performant alternatives
for cases where network speed and lag matter. For example, a binary
payload might be more efficient than JSON. Also, HTTP might be replaced
by a direct TCP socket. But there are more radical alternatives, like replacing
the TCP transport layer: most Internet application protocols use TCP, though
there are a few exceptions like DNS and DHCP, which are both UDP based.
The TCP protocol is highly reliable, but there is a performance penalty to be
paid for that reliability. There will be times when the smaller overhead of
UDP will be a more efficient alternative because the extra reliability is not
needed.
Below the transport protocols we have the Internet Protocol (IP) and the
physical infrastructure. The physical infrastructure can be important when
we design our solutions. For example, if we have a very reliable local
network, then UDP, which can lose data, becomes more of an alternative than
it would be on an unreliable network.
Figure 1.3. API calls via the network stack. Understanding the alternatives available for network
communication can dramatically increase the speed of Internet-based applications
1.2.3 The cloud
The cloud is not just about adding more computers or network storage. It's
also about a set of proprietary extensions on how to deal with storage and
compute resources, and those extensions have consequences in terms of
performance. Furthermore, virtual machines can throw a wrench into some
CPU optimizations. For example, on a bare-metal machine you can devise a
solution that is considerate of cache locality issues, but in a virtual machine
you have no way to know if your cache is being preempted by another
virtual machine being executed concurrently. How do we keep our
algorithms efficient in such an environment? Also, the cost model of cloud
computing is completely different—time is literally money—and as such
efficient solutions become even more important.
Many of the compute and storage solutions in the cloud are also proprietary
and have very specific APIs and behaviors. Using such proprietary solutions
also has consequences on performance that should be considered. As such,
and while most issues pertaining to traditional clusters also apply to
the cloud, sometimes there will be specific issues that need to be dealt
with separately.
Spoiler alert: CPython will not fare well. We have a language that is
naturally slow and a flagship implementation that does not seem to have
speed as its main consideration. Now, the good news is that most of these
problems can be overcome. Actually many people have produced
applications and libraries that will mitigate most performance issues. You
can still write code in Python that will perform very well with a small
memory footprint. You just have to write code while attending to Python’s
warts.
Note
In most of the book, when we talk about Python we are referring to the
CPython implementation. All exceptions to this rule will be explicitly called
out.
But, back to the main issue: does the GIL impose a serious performance
penalty? In most cases the answer is a surprising No. There are two main
reasons for this:
Most of the high-performance code, those tight inner loops, will
probably have to be written in a lower level language as we’ve
discussed.
Python provides mechanisms for lower level languages to release the
GIL.
This means that when you enter a part of the code rewritten in a lower level
language, you can instruct Python to continue with other Python threads in
parallel with your low-level implementation. You should only release the
GIL if that is safe: for example if you do not write to objects that may be in
use by other threads.
The book also addresses the widely used Python ecosystem of libraries for data
processing and analysis (such as Pandas and NumPy), with the aim of
improving how we use them. On the computing side, this is a lot of material,
so we will not discuss very high-level libraries. For example, we will not
talk about optimizing the usage of, say, TensorFlow, but we will discuss
techniques to make the underlying algorithms more efficient.
With regard to data storage and transformation, you will be able to look at a
data source and understand its drawbacks for efficient processing and
storage. Then you will be able to transform the data in such a way that all the
required information is maintained, but access patterns to it become
substantially more efficient.
Finally, you will also learn about Dask, a Python-based framework that
allows you to develop parallel solutions that can scale from a single machine
to very large clusters of computers or cloud computing solutions.
The reader of this book will probably have at least a couple of years of
Python experience and will know Python control structures and what lists,
sets, and dictionaries are. You will have used some of the Python standard
libraries like os, sys, pickle, or multiprocessing.
To take best advantage of the techniques I present here, you should also have
some level of exposure to standard data analysis libraries like NumPy—at
least minimal exposure to arrays—and Pandas, where you have had some
contact with data frames.
Experience dealing with IO in Python will also help you. Given that IO
libraries are less explored in the literature, we will take you from the very
beginning with formats like Apache Parquet or libraries like Zarr.
You should know the basic shell commands of Linux terminals (or MacOS
terminals). If you are on Windows, please either install a Unix-based
shell or know your way around the command line or PowerShell. And of
course, you need Python installed on your computer.
Sometimes we will provide tips for the cloud, but cloud access or
knowledge is not in any way a requirement for reading this book. If you are
interested in cloud approaches, then you are expected to know how to do
basic operations like creating instances or accessing the storage of your cloud
provider. The book presents examples using Amazon AWS, but they should
be easily transposable to other cloud providers.
While you do not have to be, at all, academically trained in the field, a basic
notion of complexity costs will be helpful. For example, the intuitive notion
that algorithms that scale linearly with data are better than algorithms that
scale exponentially.
If you plan on using GPU optimizations, you are not expected to know
anything at this stage.
Before you continue with this book, be sure to check appendix A for a
description of options to set up your environment.
1.6 Summary
Yes, the cliche is true: there is a lot of data, and we have to increase the
efficiency of processing it if we want to stand a chance of extracting the
most value from it.
Increased algorithm complexity adds extra strain to computation cost,
and we will have to find ways to mitigate its computational impact.
There is a large heterogeneity of computing architectures: the network
now also includes cloud-based approaches. Inside our computers there
are now powerful GPUs whose computing paradigm is substantially
different from CPUs. We need to be able to harness those.
Python is an amazing language for data analysis, surrounded by a
complete ecosystem of data processing libraries and frameworks. But it
also suffers from serious problems on the performance side. We will
need to be able to circumvent those problems in order to process lots of
data with sophisticated algorithms.
While some of the problems that we will be dealing with can be hard, they
are mostly solvable. The goal of this book is to introduce you to plenty
of alternative solutions, and teach you how and where each one is best
applied, so you can choose and implement the most efficient solution
for any problem you encounter.
2 Extracting maximum
performance from built-in features
This chapter covers
There are many tools and libraries to help us write more efficient Python.
But before we dive into all the external options to improve performance,
let’s first take a closer look at how we can write pure Python code that is
more efficient, in both computing and IO performance. Indeed many, though
certainly not all, Python performance problems can be solved by being more
mindful of Python’s limits and capabilities.
The first thing that you want to do is to profile the existing code that will
ingest the data. You know that the code that you already have is slow, but
before you try to optimize it you need to find empirical evidence of where
the bottlenecks are. Profiling is important because it allows us to search, in a
rigorous and systematic way, for bottlenecks in our code. The most common
alternative—guesstimating—is particularly ineffective here because many
slowdown points can be quite unintuitive.
Optimizing pure Python code is the low-hanging fruit, and also where most
problems tend to reside, so it will generally be very impactful. In this chapter
we will see what pure Python offers out of the box to help us develop more
performant code. We will start by profiling the code, using several profiling
tools, to detect problem areas. Then we will focus on Python's basic data
structures: lists, sets, and dictionaries. Our goal here will be to improve the
efficiency of these data structures and to allocate memory to them in the best
way for optimal performance. Finally, we will see how modern Python lazy
programming techniques might help us improve the performance of the
data pipeline.
This chapter will only discuss optimizing Python without external libraries,
but we will still use some external tools to help us optimize performance and
access data. We will be using SnakeViz to visualize the output of Python
profiling. We will also use line_profiler to profile code line by line. Finally,
we will use the requests library to download data from the Internet.
If you use Docker, the default image has all you need. If you used the
instructions for Anaconda Python from Appendix A, you are all set.
Let's now start by downloading our data from weather stations and studying
the temperature at each station.
Data on NOAA's site is organized as CSV files, one per year and per station.
For example, the file
https://www.ncei.noaa.gov/data/global-hourly/access/2021/01494099999.csv
has all entries for station 01494099999 for the year 2021. These include,
among other things, temperature, pressure, and wind readings, potentially
taken several times a day.
Let's develop a script to download the data for a set of stations over an interval
of years. After downloading the data of interest we will get the minimum
temperature for each station.
Our script will have a simple command line interface, where we pass a list of
stations and an interval of years of interest. Here is the code to parse the
input (the code below can be found in 02-python/sec1-io-cpu/load.py):
import collections
import csv
import datetime
import sys

import requests

stations = sys.argv[1].split(",")
years = [int(year) for year in sys.argv[2].split("-")]
start_year = years[0]
end_year = years[1]
Here is the code to download the data from the server. To ease the coding
part, we will be using the requests library to actually get the file:
TEMPLATE_URL = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
TEMPLATE_FILE = "station_{station}_{year}.csv"
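The downloading functions themselves are not reproduced above; a minimal sketch of what they do, using requests (the status-code check is an assumption, not necessarily the book's exact code):

def download_data(station, year):
    my_url = TEMPLATE_URL.format(station=station, year=year)
    req = requests.get(my_url)
    if req.status_code != 200:
        return  # not all station/year combinations exist on the server
    with open(TEMPLATE_FILE.format(station=station, year=year), "wt") as f:
        f.write(req.text)

def download_all_data(stations, start_year, end_year):
    for station in stations:
        for year in range(start_year, end_year + 1):
            download_data(station, year)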
The code above will write each downloaded file to disk for all the requested
stations across all years.
Let’s now get all temperatures and get the minimum temperature per station:
def get_all_temperatures(stations, start_year, end_year):
    temperatures = collections.defaultdict(list)
    for station in stations:
        for year in range(start_year, end_year + 1):
            for temperature in get_file_temperatures(TEMPLATE_FILE.format(station=station, year=year)):
                temperatures[station].append(temperature)
    return temperatures

def get_min_temperatures(all_temperatures):
    return {station: min(temperatures) for station, temperatures in all_temperatures.items()}
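The helper get_file_temperatures used above is not shown in this excerpt; here is a minimal sketch, assuming the NOAA global-hourly format in which the TMP column holds the reading in tenths of a degree Celsius followed by a quality code, with +9999 marking a missing value:

def get_file_temperatures(file_name):
    with open(file_name, "rt") as f:
        reader = csv.reader(f)
        header = next(reader)
        tmp_index = header.index("TMP")
        for row in reader:
            value, _quality = row[tmp_index].split(",")  # e.g. "+0061,1"
            if value == "+9999":  # missing-value marker
                continue
            yield int(value) / 10  # tenths of a degree Celsius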
Now we can tie everything together: download the data, get all temperatures,
compute the minimum per station and print the results.
download_all_data(stations, start_year, end_year)
all_temperatures = get_all_temperatures(stations, start_year, end_year)
min_temperatures = get_min_temperatures(all_temperatures)
print(min_temperatures)
For example to load the data for stations 01044099999 and 02293099999 for
the year 2021 we do:
python load.py 01044099999,02293099999 2021-2021
Now the real fun starts: as we want to be able to download lots of
stations for many years, we want to make the code as efficient as possible,
and for that we will use Python's built-in profiling machinery.
As we want to make sure our code is as efficient as possible, the first thing
we need to do is find the existing bottlenecks in that code. Our first port of
call will be profiling the code to check each function's time consumption. For
this we run the code via Python's cProfile module. This module is built
into Python and allows us to obtain profiling information from our code.
Make sure you do not use the profile module, as it is orders of magnitude
slower; it's only useful if you are developing profiling tools yourself.
We can run:
python -m cProfile -s cumulative load.py 01044099999,02293099999 2021-2021 > profile.txt
Remember that running python with the -m flag will execute the
module, so here we are running the cProfile module. This is Python's
recommended module to gather profiling information. We are asking for
profile statistics ordered by cumulative time. The easiest way to use the
module is by passing our script to the profiler in a module call like the one
above.
375402 function calls (370670 primitive calls) in 3.061 seconds #1
The output is ordered by cumulative time, which is all the time spent inside a
certain function. Another column is the number of calls per function. For
example, there is only a single call to download_all_data (which takes care
of downloading all data), but its cumulative time is almost equal to the total
time of the script. You will notice two columns called percall. The first one
states the time spent in the function excluding the time spent in all sub-
calls. The second one includes the time spent in sub-calls. In the case of
download_all_data it is clear that most time is actually consumed by some
of the sub-functions.
In many cases, when you have some intensive form of I/O like here, there is
a strong possibility that I/O dominates in terms of time needed. In our case
we have both network I/O—getting the data from NOAA—and disk I/O—
writing it to disk. Network costs can vary widely, even between runs, as they
are dependent on many connection points along the way.
As network costs are normally the biggest time sink, let’s try to mitigate
those.
To reduce network communication, we will save a copy for future use when
we download a file for the first time. We will build a local cache of data.
We will use the same code as above, save for the function
download_all_data. The implementation below can be found in 02-
python/sec1-io-cpu/load_cache.py.
import os

def download_all_data(stations, start_year, end_year):
    for station in stations:
        for year in range(start_year, end_year + 1):
            if not os.path.exists(TEMPLATE_FILE.format(station=station, year=year)):  #1
                download_data(station, year)
The first run of the code will take the same time as the solution above, but a
second run will not require any network access. For example, given the same
run as above, it goes from 2.8s to 0.26s: more than an order of magnitude
faster. Remember that, due to high variance in network access, the time to
download files can vary substantially in your case; this is yet another reason
to consider caching network data—having a more predictable execution
time.
python -m cProfile -s cumulative load_cache.py 01044099999,02293099999 2021-2021 > profile_cache.txt
While the time to run decreased by an order of magnitude, IO is still at the
top: now it's not the network, but disk access. This is mostly because the
amount of actual computation is low.
Warning
We are now going to consider a case where CPU is the limiting factor.
def get_locations():
    with open("locations.csv", "rt") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            station = row[header.index("STATION")]
            lat = float(row[header.index("LATITUDE")])
            lon = float(row[header.index("LONGITUDE")])
            yield station, (lat, lon)
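The distance code being profiled is not reproduced in this excerpt; here is a minimal haversine-style sketch of what the script computes (EARTH_RADIUS and the all-pairs loop are assumptions based on the surrounding text, which mentions get_distance and the trigonometric functions):

import math

EARTH_RADIUS = 6371  # kilometers

def get_distance(p1, p2):
    lat1, lon1 = p1
    lat2, lon2 = p2
    lat_dist = math.radians(lat2 - lat1)
    lon_dist = math.radians(lon2 - lon1)
    a = (math.sin(lat_dist / 2) ** 2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(lon_dist / 2) ** 2)
    dist = 2 * EARTH_RADIUS * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return dist

# Computing the distance between every pair of stations is what makes this
# CPU-bound (and memory-hungry, as all pairs are kept around):
locations = dict(get_locations())
distances = {(s1, s2): get_distance(p1, p2)
             for s1, p1 in locations.items()
             for s2, p2 in locations.items()}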
The code above will take a long time to run. It also takes a lot of memory. If
you have memory issues, limit the number of stations that you are
processing.
Let’s now use Python’s profiling infrastructure to see where most time is
spent.
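The exact command is not shown in this excerpt, but it is along these lines (the script name distance_cache.py is an assumption; the .prof file name matches the one analyzed below):

python -m cProfile -o distance_cache.prof distance_cache.py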
The -o parameter specifies the file where the profiling information will be
stored; after that we have the call to our code as usual.
Python provides the pstats module to analyze traces written to disk. You
can run python -m pstats distance_cache.prof, which will start a
command line interface to analyze the cost of our script. You can find more
information about this module in the Python documentation or in the
profiling section of chapter 5.
To analyze this information we will use a web-based visualization tool called
SnakeViz. You just need to do snakeviz distance_cache.prof. This will
start an interactive browser window (Figure 2.1 shows a screenshot).
This would be a good time to play with the interface a bit. For example, you
can change the style from Icicle to Sunburst (arguably cuter, but with less
information, as the file name disappears). Re-order the table at the bottom.
Check the Depth and Cutoff entries. Do not forget to click on some of the
colored blocks, and finally return to the main view by clicking on Call Stack
and choosing the 0 entry.
Most of the time is spent inside the function get_distance, but exactly
where? We are able to see the cost of some of the math functions, but
Python's profiling doesn't allow us to have a fine-grained view of what
happens inside each function. We only get aggregate views for each
trigonometric function: yes, there is some time spent in math.sin, but given
that we use it in several lines, where exactly are we paying a steep price? For
that we need to recruit the help of the line profiling module.
Built-in profiling, like we used above, allowed us to find the piece of code
that was causing a massive delay. But there are limits to what we can do with
it. We are going to discuss those limits here and introduce line profiling as a
way to find further performance bottlenecks in our code.
You might have noticed that we have not imported the profile annotation
from anywhere. This is because we will be using the convenience script
kernprof from the line_profiler package that will take care of this.
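The decorated code itself is not shown in this excerpt; the idea is simply that the function we want line-by-line data for is annotated with a profile decorator that kernprof injects at run time, roughly like this:

@profile  # made available by kernprof at run time; there is no import for it
def get_distance(p1, p2):
    ...  # body unchanged from the undecorated version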
kernprof -l lprofile_distance_cache.py
Be prepared for the instrumentation required by the line profiler to slow the
code substantially, by several orders of magnitude. Let it run for a minute or
so, and after that interrupt it: kernprof would probably run for many hours
if you let it complete. If you interrupt it, you will still have a trace.
After the profiler ends, you can have a look at the results with:
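(The exact invocation is not given in this excerpt; kernprof -l stores its trace in a file named after the script with a .lprof suffix, which the line_profiler module can print, so it would be along these lines:)

python -m line_profiler lprofile_distance_cache.py.lprof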
If you look at the output below, you can see that it has many calls that
take quite some time, so we will probably want to optimize that code. At
this stage, as we are discussing only profiling, we will stop here, but
afterwards we would need to optimize those lines (and we will do so later in
this chapter). If you are interested in optimizing this exact piece of code, have
a look at the Cython chapter or the Numba appendix, as they provide the
most straightforward avenues to increase the speed.
Listing 2.1. The output of the line_profiler package for our code
There are many other utilities that can be useful if you are profiling code, but
a profiling section would not be complete without a reference to one of
them: the timeit module. This is probably the most common approach that
newcomers take to profile code, and you can find endless examples using the
timeit module on the Internet. The easiest way to use the timeit module is
via IPython or Jupyter Notebook, as these systems make timeit very
streamlined. Just add the %timeit magic to what you want to profile, for
example inside IPython:
In [1]: %timeit list(range(1000000))
27.4 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops
each)
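Only the list timing appears above; the lazy counterpart mentioned in the comparison below would be timed the same way (its output is omitted here, but it is dramatically lower, since range only builds a small lazy object):

In [2]: %timeit range(1000000)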
This gives you the run time over several runs of the code that you are
profiling. The magic will decide how many times to run it and report basic
statistical information. Above you have the difference between a
range(1000000) and a list(range(1000000)). In this specific case, timeit
shows that the lazy version of range is two orders of magnitude faster than
the eager one.
You will be able to find much more detail in the documentation of the
timeit module, but for most use cases the %timeit magic of IPython will be
enough to access its functionality. You are encouraged to use IPython and its
magics, but in most of the rest of the book we will use the standard
interpreter. You can read more about the %timeit magic here:
https://ipython.readthedocs.io/en/stable/interactive/magics.html.
Now that we have introduced profiling tools, let's direct our attention to a
different subject: optimizing the usage of Python data structures.
We will re-use the code from the first section of the chapter to read the data.
The code can be found in 02-python/sec3-basic-
ds/exists_temperature.py.
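The timing calls themselves are not reproduced in this excerpt; they are along these lines, where first_all_temperatures is the list of readings loaded by exists_temperature.py, -10.7 is a value known to be present, and the absent value is a hypothetical one that forces a scan of the whole list:

In [1]: %timeit -10.7 in first_all_temperatures
In [2]: %timeit -100.0 in first_all_temperatures  # absent value: scans the whole list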
This is roughly one order of magnitude slower than our search for -10.7.
At this stage we have no number to compare against, but it's safe to assume
that the millisecond, or even microsecond, range is not very encouraging.
This should be doable in orders of magnitude less time.
But can we do even better? Let's convert our ordered list into a set and try to
do a search:
set_first_all_temperatures = set(first_all_temperatures)
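The corresponding lookup on the set would be timed the same way (output omitted):

In [3]: %timeit -10.7 in set_first_all_temperatures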
This is several orders of magnitude faster than the solutions above! Why is
this? There are two main reasons: one related to set size and another related
to complexity. The complexity part will be discussed in the next sub-section.
With regard to size, remember that the original list had 141,082
elements. But with a set, all repeated values are collapsed into a single
value, and there are plenty of repeated elements in the original list. The set
size, given by len(set_first_all_temperatures), is just 400
elements: roughly 350 times less. No wonder searching is so much faster, as
the structure is much smaller.
The performance from the example above was mostly based on the de facto
reduction in size of the data structure when we switched from a list to a set.
What would happen if there was no repetition and hence list and set were the
same size? We can simulate this trivially with a range as all elements will be
different:
a_list_range = list(range(100000))
a_set_range = set(a_list_range)
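The membership tests compared in the next paragraph look like this; 50000 is just an arbitrary element, and the outputs are omitted:

In [4]: %timeit 50000 in a_list_range
In [5]: %timeit 50000 in a_set_range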
The set implementation that we used has much better performance. That
is because in Python (to be more precise, in CPython) a set is implemented
with a hash table. Finding an element thus has the cost of a hash lookup.
Hash functions come in many flavors and have to deal with many design
issues, but when comparing lists and sets we can mostly assume that set
lookup is constant time—it will perform as well with a collection of size
10 as with 10 million. This is not strictly correct, but as an intuition to compare
against list lookups it is reasonable.
A set is mostly implemented like a dictionary without values, which means
that when you search on a dictionary key, you get the same performance as
searching in a set.
However, sets and dictionaries are not the silver bullet that they might seem
here. For example, if you want to search an interval, then an ordered list is
substantially more efficient: in an ordered list you can find the lowest
element and then traverse from that point up until you find the first element
above the interval, and then stop. In a set or dictionary you would have to do
a lookup for each possible element in the interval. So, if you know the value
you are searching for, then a dictionary can be extremely fast. But if you are
looking within an interval, then it suddenly stops being a reasonable option:
an ordered list with a bisection algorithm would perform much better.
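As a sketch of that last point, the standard library's bisect module provides the bisection search on a sorted list; the function below (the name is mine, not the book's) returns every value inside an interval with two binary searches and one slice:

import bisect

def values_in_interval(sorted_values, low, high):
    start = bisect.bisect_left(sorted_values, low)   # first index >= low
    end = bisect.bisect_right(sorted_values, high)   # first index > high
    return sorted_values[start:end]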
Given that lists are so pervasive and easy to use in Python, there are many
cases where a more appropriate data structure exists, but it is worth stressing
that lists are a fundamental data structure with many good use cases. The
point is to be mindful of choices, not to banish lists.
Tip
Be careful when using in to search inside large lists. If you browse through
Python code, the pattern of using in to find elements in a list (the index
method of list objects is in practice the same thing) is quite common.
This is not a problem for small lists, as the time penalty is quite small and it's
perfectly reasonable, but it can be serious with large lists.
You can find the time complexity of many operations over many existing
Python data structures in https://wiki.python.org/moin/TimeComplexity .
Now that we have mostly discussed time performance, let's discuss another
important performance issue with big datasets: conserving memory.
Going back to our scenario, we decided that we want to reduce the disk
consumption of our data. For that we are going to start with a study of the
content of the data files. Our objective is to load a few of them and do some
statistics on character distributions.
def download_all_data(stations, start_year, end_year):
    for station in stations:
        for year in range(start_year, end_year + 1):
            if not os.path.exists(TEMPLATE_FILE.format(station=station, year=year)):
                download_data(station, year)

stations = ['01044099999']
start_year = 2005
end_year = 2021
download_all_data(stations, start_year, end_year)
all_files = get_all_files(stations, start_year, end_year)
all_files now holds a dictionary where each item contains the contents of
all the files related to a station. Let's study the memory usage of this data
structure with sys.getsizeof.
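The measurement code is not reproduced in this excerpt; here is a sketch of the three calls being discussed (the exact expressions are assumptions, chosen to be consistent with the description that follows: a dictionary, an iterator, and a list):

import sys

print(sys.getsizeof(all_files))                  # the dictionary container
print(sys.getsizeof(iter(all_files.values())))   # an iterator over its values
station_content = all_files[stations[0]]
print(sys.getsizeof(station_content))            # the list holding one station's content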
getsizeof might not return what you expect. The files on disk are in the
MB range, so estimates below 1 KB sound quite suspicious. getsizeof is
actually returning the size of the containers (the first is a dictionary, the
second is an iterator, and the third is a list) without accounting for the
content. So we have to account for two things occupying memory: the
content and the container itself.
Note
For us, the intricacies of getsizeof are mostly a starting point to discuss
CPython memory allocation in depth.
The length is 1,303,981, corresponding to the size of the file. We get a size
of 10,431,904. This is around 8 times the size of the underlying file. Why 8
times? Because each entry is a pointer to a character, and a pointer is 8 bytes
in size. At this stage this looks quite bad, as we have a large data structure
and we haven't yet accounted for the data proper. Let's have a look at a single
character:
print(sys.getsizeof(station_content[0]))
print(type(station_content[0]))
Fortunately, the situation is not as bad as it seems, and we can also do much
better. We will start by seeing that Python—or rather CPython—is quite
smart with memory allocation.
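The snippet being described next is not shown in this excerpt; it is essentially a count of how many distinct objects back those 1.3 million characters, something along these lines:

print(len(set(map(id, station_content))))  # number of distinct character objects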
The code above gets the unique object identifier for every character in the
content. In CPython, that happens to be the memory location. CPython is
smart enough to see that the same string content is being used over and over
again—remember that each ASCII character is represented by an integer
between 0 and 127—and as such the output of the code above is 46.
Another reason is that CPython makes no promises about most of
its allocation policies from version to version. What works for your specific
version might change in a different version.
Finally, even if you have a fixed version, how things work might not be
completely obvious. Consider this code in Python 3.7.3 (this might vary in
other versions):
s1 = 'a' * 2
s2 = 'a' * 2 #1
s = 2
s3 = 'a' * s #2
s4 = 'a' * s
print(id(s1))
print(id(s2))
print(id(s3))
print(id(s4))
print(s1 == s4) #3
This was now the 10th of July and we had reached Camp 51. We
were well repaid for our decision, for the following morning was
perfectly glorious—not a cloud, not a breath of wind was there to
mar the quietude that man and beast at this time so much needed.
To commemorate the occasion, I photographed Malcolm enjoying his
breakfast just outside the tent, with Esau standing by the other side
of the table, holding in his hand a dish of luxuries!
About midday, Tokhta, Sulloo, and the pony walked slowly into
camp. They persisted that nothing on earth would induce them to
travel onwards another step; poor fellows, they had reached what
seemed to them a perfect haven of rest; they must have felt
thoroughly worn out, for all they wanted to do was to remain where
they were and quietly die. It was quite certain that it would have
been madness for us to remain with them, for only a few more days'
rations remained, and our only chance of getting through the
country at all lay in our coming across nomads from whom by hook
or crook we could get supplies. We did think of leaving some men
behind, while a small party marched on as fast as possible with light
loads in search of people, but these men did not relish being left,
and supposing there were no people to find, our situation would
have been still more critical. We ended our problem by leaving the
two sick men with a pony and a supply of food and drinking utensils,
etc., so that if they felt inclined they might follow after, for they
would have found no difficulty in tracking us. We buoyed them up,
too, with the hopes we entertained of shortly finding people, when
we would at once send back assistance to them. We also
endeavoured to persuade them to make an effort in reaching a fresh
camp each day, by marching and halting according to their
inclination, for we told them we should only make short marches,
and at each camp we would leave a supply of food for them and
some grain for the pony. It was a sad thing having to leave these
men and the pony as we did, and when we halted for the night and
the sun began to set calmly over these vast solitudes, there was no
sign of their coming, look back as we might to the far-off hills for
some tiny, distant, yet moving, speck. The darkness of night soon
gathered around, and we could only wonder how close they might
be to us. The next day we saw new life, for Malcolm had a shot at a
wild dog, while I saw two eagles; such sights as these at once set
our imagination at work, for we argued as to how could these
creatures exist unless people were living somewhere close. At the
same time it brought encouragement to all.
At our midday halt the men's spirits were more cheerful. We had
stopped in a fine broad nullah, running nearly due east, with
pleasant-looking grassy hills sloping down on either side, and, with a
cloudless sky and no wind, we were glad to sit in our shirt-sleeves,
whilst our twelve veteran mules, with their saddles off, rolled in the
sand before enjoying the rich grass and water. We began to pick
fresh additions to our flower collection, the specimens being chiefly
of a mauve or white colour, and up to the present time we had only
found one yellow flower. At 7.30 p.m., in Camp 61, at a height of over
16,000 feet, the temperature was forty degrees Fahrenheit, and
during the night there were nineteen degrees of frost. Fine grass
and fine weather still favoured us, while the presence of a number of
sand-grouse indicated that water was at no great distance off.
Just after leaving Camp 62, we were all struck with wonderment
at finding a track running almost at right angles to our own route. It
was so well defined, and bore such unmistakable signs of a
considerable amount of traffic having gone along it, that we
concluded it could be no other than a high road from Turkistan to
the mysterious Lhassa, yet the track was not more than a foot
broad. Our surmises, too, were considerably strengthened when one
of the men picked up the entire leg bone of some baggage animal,
probably a mule, for still adhering to the leg was a shoe. This was a
sure proof that the road had been made use of by some merchant or
explorer, and that it could not have been merely a kyang or yak
track, or one made use of only by nomads, for they never shoe their
animals in this part of the world.
Such a startling discovery as this bore weight with the men, and
nothing would have suited their spirits better than to have stuck to
the track and march northwards, and they evidently thought us
strange mortals for not following this course; therefore, instead of
being elated with joy, they became more despondent than ever
when they found we were still bent upon blundering along in our
eastern route. But it was our strong belief that we should for a
certainty find people in a very few days' time, and this being the
case, we did not see the force of travelling in a wrong direction, and
put aside the objects for which we had set out, just to suit the
passing whim of a few craven-hearted men, especially when we
knew that the cause of their running short of food and consequent
trouble was entirely due to their own dishonest behaviour. We did,
however, send one man, Mahomed Rahim, supplied with food, with
instructions to follow the road north as far as he had courage to go,
thinking that when he had crossed a certain range of hills he would
discover the whereabouts of people. Furthermore we explained to
him the way we intended going, so that there could be no chance of
his losing himself.
We had halted, and were expecting the arrival of these two men,
when Mahomed Rahim, who had been sent to follow up the track,
rejoined us, and as he approached we could see he was weeping
bitterly. On asking the man what ailed him, he sobbed out that he
had lost his way. He was a ludicrous sight, for he was a great, big,
strong fellow, and we asked him, if he wept like this at finding us
again after only being absent a day and a night, how would he weep
had he not found us at all? We fed up the great baby with some
unleavened bread, which he ate voraciously amidst his sobs. Some
kyang came trotting up to camp with a look of wonderment at our
being present there, and as we were about to move off some
antelopes also came to inspect us.
During the night our tent had great difficulty in withstanding the
wind, that blew with much violence, while the temperature fell to
twenty-one degrees of frost. As we had run short of iron pegs, we
found a most efficient substitute in fastening the ropes to our tin
boxes of ammunition. On other occasions, too, the ground was so
sandy that pegs were entirely useless, and each rope had to be
fastened to a yakdan, or to one of our bags of grain.
On the 26th July we left Camp 66, moving off by moonlight, for
the going was easy. On halting for breakfast, two antelopes ventured
to come and have a look at us, and, of course, paid the penalty of
death. Such an opportunity as this was not to be thrown away, and
laying them together, I photographed them, and afterwards cut
them up, carrying as much meat as we possibly could manage—
enough for three or four days' consumption. The afternoon was hot,
like a summer's day in England. Some yak, resembling big black
dots, could be seen in several of the grassy nullahs: a trying
temptation to have a stalk after them, for the ground was of such a
nature that with care one might have come up to within a hundred
yards of some of them without being seen. But then it would have
been useless to slaughter them, so we contented ourselves with
watching their movements, and with making out what we could have
done had we been merely on an ordinary shooting trip, or had we
been hard up for meat.
Having breakfasted off our antelope meat and some good tea,
we were busy with our maps, and drying flowers, etc. Everything
was spread out—for such frail specimens it was a splendid
opportunity; the men were sleeping, too. The mules, having eaten
their fill, were standing still enjoying the rest and perfect peace; all
was absolute silence, with the exception of our own chatting to each
other, as we amused ourselves with our hobbies, when without a
moment's notice a powerful blast of wind caught us with such
violence that the tent was blown down and many things were
carried completely away, and our camp, which only a second ago
had been the most peaceful scene imaginable, became a turbulent
one of utter confusion, as every one jumped up in an instant,
anxious to save anything he could lay hold of, or to run frantically
after whatever had escaped—for some things were being carried
along at a terrific rate. Fortunately the loss, compared to the
excitement, was trifling; but we made up our minds not to be caught
napping in this way again.
As soon as one of the men had come up, I told him to look
sharp and cut its throat for it was not quite dead, although in reality
it had breathed its last some ten minutes ago. He at once set to
work, but so tough was the hide, and so blunt his knife, that he
could not cut through it, and merely first pricked it with the point;
and although no blood exuded, he nevertheless told the other men
that he had properly hallaled the brute, and they by this time having
become less scrupulous with regard to their religious custom, made
no bones about arguing as to the meat being unfit for them to eat.
As a matter of fact they were beginning to learn what real hunger
was. Some of them came to help cut off the meat in a business-like
sort of way, pretending not to examine the throat at all.
It was now the 28th of July, and we had reached a spot between
our night encampments 69 and 70, the day camps not being
recorded in the map. Since leaving Lanak La on the 31st of May, we
had been daily finding our own way across country, over mountains
and valleys, along nullahs and beds of rivers, etc., and at last we had
found a track we could follow. Such a sensation was novel to us. We
could scarcely grasp that there was no need to go ahead to find a
way. We had simply to follow our nose. We thought that our troubles
were nearly finished, and for the rest of our journey that there
would be easy marching, and every moment we quite expected that
the dwellings of mankind would heave into sight. Especially, too,
when one of the men picked up a stout stick, three or four feet long,
which must have been carried there by somebody or other, for since
leaving Niagzu the highest species of vegetation we had seen was
the wild onion. Some of the men also declared that they had found a
man's footprint. Personally we did not see this sign of civilization, but
the men maintained there could be no mistake about it, for they said
it was the footprint of a cripple!
Besides all this comforting news, there was no need to be
tramping over the hills in search of game for food. The antelope,
yak, and kyang were plentiful and easily shot in all the valleys, and,
had we been so disposed, we might have shot a dozen yak during
the afternoon's march. When we halted for the night one of the wild
yak actually came and grazed amongst our mules!
We had been marching uphill, and at the top of the valley found
a fast-running rivulet taking its rise from the snow mountains that
lay south of us, the same range that had blocked our way and
compelled us to make the detour. Added to the work of once more
having to find our own way, the country took a change for the
worse. Although there was no difficulty about the water, still there
was less grass, the soil became slatey, and in places barren. Storms
began to brew around us, but we were lucky in being favoured with
only some of the outlying drops. We had a perfectly still night with
one degree of frost.
It had been our custom, especially on dark nights, to make the
men take their turn of guard over the mules, to watch and see that
they did not stray. They were far too precious to lose, and by
marching in the early morning they felt less fatigue.
Another inducement for doing so was that of late there had been
little difficulty in keeping all well supplied with meat. It thus
happened that when everything was in our favour, we were sanguine
of accomplishing our journey without any further mishap. We
crossed over several cols and saw fresh-water lakes, while yak,
kyang, antelope, and sand-grouse were plentiful.
We were sorry to find that Shahzad Mir had not come in, though
very shortly the man who was carrying the plane-table walked up,
saying that Shahzad Mir had stopped the other side of the stream
with a pain in his stomach. We knew quite well what was the cause
of this. He had been taking some chlorodyne and afterwards had
eaten enormous quantities of meat. As there was nothing to be
gained by getting anybody else soaked, we sent back the same man
to fetch him in. The night was very dark and the rain turned to
snow, still neither of them came. Fearing that on account of the
darkness they had gone astray, we popped outside and fired off our
gun at intervals; still the ammunition was wasted. Nothing but
daybreak brought them back, when it turned out that they had been
so ridiculous as to sleep in a nullah only a few yards from our camp.
They had even heard the shots, but still could not find us. Neither of
them was any the worse for the outing, in fact the result had been
beneficial, for the stomach-ache from which Shahzad Mir had been
suffering was completely cured. They caused a good deal of
merriment amongst us all, and we all thought they might have
selected a more suitable night for sleeping out of camp.
The day was fine and warm, and as I went on ahead to explore,
I saw below me some grassy hillocks, and, grazing in their midst, a
fine yak. I thought it would be interesting to make a stalk just to see
how close it was possible to get without disturbing him. I walked
down the hill I was on and dodged in and out between the hillocks,
always keeping out of sight, still getting closer and closer, till at last
there was only one small hillock that separated us, not more than
half a dozen yards. But when I stood up before him and he raised
his head, for he was intent upon grazing, and saw me, his look of
utter bewilderment was most amusing to see. He was so filled with
astonishment, as the chances are he had never seen a human form
before, that it was some moments before he could collect his
thoughts sufficiently to make up his mind and be off.
Our men that morning had behaved in a peculiar way, for each
of them had come to make his salaam to us; not that we attached
much importance at the time to it, still it flitted across our minds that
they were becoming very faithful muleteers all at once, and perhaps
intended doing better work for the future. That evening we
impressed upon them the necessity of making double marches
again, as the last two days we had only made single, and told them
how impossible it was to march much further without meeting
somebody, and gave orders for them to commence loading at 3.30
a.m.
It was some time before we could collect all the mules again.
Some of them seemed to know there was something up, and there
was every chance of their being deserters. One little black chap in
fact was so clever at evading our united efforts to catch him, that we
had to give him up as a bad job, and load eleven animals instead of
twelve.
Soon after dark, when everything was in readiness for the night,
rain began to fall; it rained, as the saying goes, cats and dogs, such
as we had never seen it rain before. All five of us were snug, dry,
and warm in our little tents, from which we could watch the mules,
whilst the deserters must have spent a most miserable night without
any shelter or food, or the hot tea which they all loved so much.
We felt that they were being deservedly punished for their sins. Esau
and Lassoo soon realized how much they had already gained by
following us, and they swore to stick to us through thick and thin,
and this for evermore they undoubtedly did.
It rained during the greater part of the night, so that the sodden
condition of the ground put all idea of early marching out of our
head.
In order to lessen our work, and to make the marching easier for
the mules, we decided to load only ten of them, and let two always
go spare. We made a pile of the things we should not require, such
as the muleteers' big cooking pot, and their tent, etc., and left them
at the camp. By thus lightening our loads we reckoned we should be
able to march sixteen or eighteen miles a day, an astounding fact for
the muleteers, who had imagined we could not move without their
aid. We drew comparisons between the welfare of the men with us
and that of the deserters. The latter were possessors of all the flour
and most of the cooked meat and the tobacco, but no cooking pots,
while the former had three days' rice and plenty of tea, cooking
utensils, and shelter at night, an advantage they were already fully
aware of.
CHAPTER XIII.
RETURN OF THE DESERTERS—SHUKR ALI—LONG MARCHES—DEATH OF EIGHT
MULES AND A PONY—A CHEERING REPAST.