
6

Making Things
Interactive with Bokeh

Overview
In this chapter, we will design interactive plots using the Bokeh library. By
the end of this chapter, you will be able to use Bokeh to create insightful
web-based visualizations and explain the difference between two interfaces
for plotting. You will identify when to use the Bokeh server and create
interactive visualizations.

Introduction
Bokeh is an interactive visualization library focused on modern browsers and the
web. Unlike Matplotlib or geoplotlib, the plots and visualizations we are going to
create in this chapter will be based on JavaScript widgets. Bokeh allows us to create
visually appealing plots and graphs nearly out of the box without much styling. In
addition to that, it helps us construct performant interactive dashboards based on
large static datasets or even streaming data.

Bokeh has been around since 2013, with version 1.4.0 being released in November
2019. It targets modern web browsers to present interactive visualizations to users
rather than static images. The following are some of the features of Bokeh:

• Simple visualizations: Through its different interfaces, it targets users of many
skill levels, providing an API for quick and straightforward visualizations as well
as more complex and extremely customizable ones.

• Excellent animated visualizations: It provides high performance and can,
therefore, work on large or even streaming datasets, which makes it the go-to
choice for animated visualizations and data analysis.

• Inter-visualization interactivity: Because Bokeh is web-based, it's easy to
combine several plots and create unique and impactful dashboards with
visualizations that can be interconnected to create
inter-visualization interactivity.

• Supports multiple languages: Unlike Matplotlib and geoplotlib, Bokeh has
libraries for both Python and JavaScript, in addition to several other
popular languages.

• Multiple ways to perform a task: Adding interactivity to Bokeh visualizations
can be done in several ways. The simplest built-in way is the ability to zoom and
pan in and out of your visualization. This gives users better control over what
they want to see. It also allows users to filter and transform the data.

• Beautiful chart styling: The tech stack is based on Tornado in the backend
and is powered by BokehJS, Bokeh's own JavaScript library, in the frontend.
Using the underlying BokehJS visuals allows us to create beautiful plots
without much custom styling.

Since we are using Jupyter Notebook throughout this book, it's worth mentioning that
Bokeh, including its interactivity, is natively supported in Notebook.

Concepts of Bokeh
The basic concept of Bokeh is, in some ways, comparable to that of Matplotlib. In
Bokeh, we have a figure as our root element, which has sub-elements such as a title,
axes, and glyphs. Glyphs, which can take on different shapes such as circles, bars,
and triangles, have to be added to the figure. The following hierarchy shows the
different concepts of Bokeh:

Figure 6.1: Concepts of Bokeh



Interfaces in Bokeh
The interface-based approach provides different levels of complexity for users that
either want to create some basic plots with very few customizable parameters or
want full control over their visualizations to customize every single element of their
plots. This layered approach is divided into two levels:

• Plotting: This mid-level layer comes with sensible defaults and remains customizable.

• Models interface: This low-level layer is more complex and provides an
open-ended approach to designing charts.

Note
The models interface is the basic building block for all plots.

The following are the two levels of the layered approach to interfaces:

• bokeh.plotting

This mid-level interface has an API somewhat comparable to Matplotlib's. The
workflow is to create a figure and then enrich this figure with different glyphs
that render data points in the figure. As in Matplotlib, the composition of
sub-elements such as axes, grids, and the inspector (which provides basic ways
of exploring your data through zooming, panning, and hovering) is done without
additional configuration.

The vital thing to note here is that even though their setup is done automatically,
we can still configure the sub-elements. When using this interface, the creation of
the scene graph used by BokehJS is handled automatically too.

• bokeh.models

This low-level interface is composed of two libraries: the JavaScript library called
BokehJS, which is used for displaying the charts in the browser, and the core
plot creation Python code, which provides the developer interface. Internally, the
definitions created in Python are turned into JSON objects that hold the declaration
for the JavaScript representation in the browser.

The models interface provides complete control over how Bokeh plots and
widgets (elements that enable users to interact with the data displayed) are
assembled and configured. This means that it is up to the developer to ensure
the correctness of the scene graph (a collection of objects describing
the visualization).
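To make the contrast concrete, here is a minimal sketch of the models interface. This is illustrative only and assumes a recent Bokeh installation (2.x or newer; exact model names can vary between versions): everything that the plotting interface would set up for us is assembled by hand.

```python
# a minimal sketch of the low-level bokeh.models interface: the scene graph
# (data source, glyph, axes) is assembled explicitly, piece by piece
from bokeh.models import ColumnDataSource, DataRange1d, LinearAxis, Plot, Scatter

# a data source that the glyph will be bound to
source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 6, 5]))

# Plot() starts empty: no axes, no grid, no tools
plot = Plot(x_range=DataRange1d(), y_range=DataRange1d(),
            width=300, height=300)

# glyphs are explicit model objects tied to the data source
renderer = plot.add_glyph(source, Scatter(x='x', y='y', size=10))

# axes must be added by hand, too
plot.add_layout(LinearAxis(), 'below')
plot.add_layout(LinearAxis(), 'left')
```

Compared to the few lines that bokeh.plotting needs for the same result, this shows why the models interface is reserved for cases where full control is required.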

Output
Outputting Bokeh charts is straightforward. There are three ways this can be done:

• The .show() method: The primary option is to display the plot in an HTML page
using this method.

• The inline .show() method: When using inline plotting with a Jupyter
Notebook, the .show() method will display the chart inside
your Notebook.

• The .output_file() method: You're also able to directly save the
visualization to a file without any overhead using the .output_file()
method. This will create a new file at the given path with the given name.

The most powerful way of providing your visualization is through the use of the
Bokeh server.
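The file-based path can be sketched in a few lines. This is a hedged example with a hypothetical filename; save() (imported alongside output_file) writes the standalone HTML document without trying to open a browser, which show() would do:

```python
# sketch: writing a plot to a standalone HTML file (hypothetical filename)
from bokeh.io import output_file, save
from bokeh.plotting import figure

plot = figure(title='Cache per Hardware',
              x_axis_label='Hardware index',
              y_axis_label='Cache Memory')
plot.line([0, 1, 2, 3], [8, 16, 16, 32], line_width=5)

# register the target file, then write the self-contained HTML page
output_file('cache_per_hardware.html', title='Cache per Hardware')
save(plot)
```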

Bokeh Server
Bokeh creates scene graph JSON objects that are interpreted by the BokehJS
library to create the visualization output. This gives us a unified format through
which other languages can create the same Bokeh plots and visualizations,
independent of the language used.
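The scene graph JSON can be inspected directly. As a sketch (the exact JSON layout differs between Bokeh versions), json_item() from bokeh.embed serializes a plot into the structure that BokehJS interprets in the browser:

```python
# sketch: serializing a plot into the scene-graph JSON that BokehJS consumes
import json
from bokeh.embed import json_item
from bokeh.plotting import figure

plot = figure(title='Scene graph demo')
plot.line([1, 2, 3], [4, 6, 5])

# a plain dict with the keys 'target_id', 'root_id', and 'doc'
item = json_item(plot)
print(json.dumps(item)[:60])
```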

To create more complex visualizations and leverage the tooling provided by Python,
we need a way to keep our visualizations in sync with one another. This way, we can
not only filter data but also do calculations and operations on the server-side, which
updates the visualizations in real-time.

In addition to that, since we will have an entry point for data, we can create
visualizations that get fed by streams instead of static datasets. This design provides a
way to develop more complex systems with even greater capabilities.

Looking at the scheme of this architecture, we can see that the documents are
provided on the server side and then moved over to the browser, which inserts
them into the BokehJS library. This insertion triggers interpretation by BokehJS,
which then creates the visualization. The following diagram describes how the
Bokeh server works:

Figure 6.2: The Bokeh server

Presentation
In Bokeh, presentations help make the visualization more interactive by using
different features, such as interactions, styling, tools, and layouts.

Interactions

Probably the most exciting feature of Bokeh is its interactions. There are two types of
interactions: passive and active.

Passive interactions are actions that users can take that don't change the
dataset. In Bokeh, this is called the inspector. As we mentioned before, the inspector
contains attributes such as zooming, panning, and hovering over data. This tooling
allows users to inspect the data in more detail and might provide better insights
by letting them observe a zoomed-in subset of the visualized data points. The
elements highlighted with a box in the following figure show the essential passive
interaction elements provided by Bokeh. They include zooming, panning, and
clipping data.

Figure 6.3: Example of passive interaction zooming

Active interactions are actions that directly change the displayed data. This includes
actions such as selecting subsets of data or filtering the dataset based on parameters.
Widgets are the most prominent active interactions since they allow users to
manipulate the displayed data with handlers. Examples of available widgets are
buttons, sliders, and checkboxes.

Referring back to the subsection about output options, these widgets can be
used both in so-called standalone applications in the browser and with the Bokeh
server. This will help us consolidate the theoretical concepts we've just covered and
make things more concrete. Some of the interactions in Bokeh are tab panes,
dropdowns, multi-selects, radio groups, text inputs, check button groups, data tables,
and sliders. The elements highlighted with a red box in the following figure show a
custom active interaction widget for the same plot we looked at in the example of
passive interaction.

Figure 6.4: Example of custom active interaction widgets



Integrating
Embedding Bokeh visualizations can take two forms:

• HTML document: These are standalone HTML documents. They are
self-contained, which means that all the necessary Bokeh dependencies
are part of the generated HTML document. This format is simple to
generate and can be sent to clients or quickly displayed on a web page.

• Bokeh applications: Backed by a Bokeh server, these provide the possibility of
connecting to, for example, Python tooling for more advanced visualizations.
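The standalone form can be sketched with file_html() from bokeh.embed, which bundles the plot and links to its BokehJS dependencies (here via the CDN resources) into one self-contained page; treat this as an illustrative sketch rather than the only embedding route:

```python
# sketch: generating a self-contained HTML document for a plot
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

plot = figure(title='Embedded plot')
plot.line([1, 2, 3], [4, 6, 5])

# a complete HTML page as a string, ready to write to disk or send to a client
html = file_html(plot, CDN, 'Embedded plot')
print(html[:15])
```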

Bokeh is a little more complicated than Matplotlib with Seaborn and, like every
other library, has its drawbacks. Once you have the basic workflow down, however,
you're able to quickly extend basic visualizations with interactivity features to give
power to the user.

Note
One interesting feature is the to_bokeh method, which allows you to
plot Matplotlib figures with Bokeh without configuration overhead. Further
information about this method is available at https://bokeh.pydata.org/
en/0.12.3/docs/user_guide/compat.html.

In the following exercises and activities, we'll consolidate the theoretical knowledge
and build several simple visualizations to explain Bokeh and its two interfaces.
After we've covered the basic usage, we will compare the plotting and models
interfaces and work with widgets that add interactivity to the visualizations.

Basic Plotting
As mentioned before, the plotting interface of Bokeh gives us a higher-level
abstraction, which allows us to quickly visualize data points on a grid.

To create a new plot, we have to define our imports to load the
necessary dependencies:

# importing the necessary dependencies
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()

Before we can create a plot, we need to import the dataset. In the examples in this
chapter, we will work with a computer hardware dataset. It can be imported by using
pandas' read_csv method.

# loading the dataset with pandas
dataset = pd.read_csv('../../Datasets/computer_hardware.csv')

The basic flow when using the plotting interface is comparable to that of
Matplotlib. We first create a figure. This figure is then used as a container to define
elements and call methods on:

# adding an index column to use it for the x-axis
dataset['index'] = dataset.index

# plotting the cache memory levels as a line
plot = figure(title='Cache per Hardware', \
              x_axis_label='Hardware index', \
              y_axis_label='Cache Memory')
plot.line(dataset['index'], dataset['cach'], line_width=5)

show(plot)

Once we have created a new figure instance using the imported figure() method,
we can use it to draw lines, circles, or any other glyph objects that Bokeh offers. Note
that the first two arguments of the plot.line method are the x and y data, which
must contain an equal number of elements.

To display the plot, we then call the show() method we imported from the
bokeh.plotting interface earlier on. The following figure shows the output of the
preceding code:

Figure 6.5: Line plot showing the cache memory of different hardware

Since the interface of different plotting types is unified, scatter plots can be created in
the same way as line plots:

# plotting the hardware cache as dots
plot = figure(title='Cache per Hardware', \
              x_axis_label='Hardware', \
              y_axis_label='Cache Memory')
plot.scatter(dataset['index'], dataset['cach'], size=5, color='red')
show(plot)

The following figure shows the output of the preceding code:

Figure 6.6: Scatter plot showing the cache memory of different hardware

In many cases, a visualization will have several attributes of a dataset plotted. A
legend helps users understand which attributes they are looking at. Legends
display a mapping between, for example, the lines in the plot and the corresponding
information, such as the hardware's cache memory.

By adding a legend_label argument to plot calls such as plot.line(), we get a
small box containing this information in the top-right corner (by default):

# plotting cache memory and cycle time with a legend
plot = figure(title='Attributes per Hardware', \
              x_axis_label='Hardware index', \
              y_axis_label='Attribute Value')
plot.line(dataset['index'], dataset['cach'], \
          line_width=5, legend_label='Cache Memory')
plot.line(dataset['index'], dataset['myct'], line_width=5, \
          color='red', legend_label='Cycle time in ns')

show(plot)

The following figure shows the output of the preceding code:

Figure 6.7: Line plots displaying the cache memory and cycle time per
hardware with the legend

When looking at the preceding example, we can see that once we have several lines,
the visualization can get cluttered.

We can give the user the ability to mute, meaning defocus, the clicked element in
the legend.

Adding a muted_alpha argument to the line plotting calls and adding a
click_policy of mute to our legend element are the only two steps needed:

# adding mutability to the legend
plot = figure(title='Attributes per Hardware', \
              x_axis_label='Hardware index', \
              y_axis_label='Attribute Value')
plot.line(dataset['index'], dataset['cach'], line_width=5, \
          legend_label='Cache Memory', muted_alpha=0.2)
plot.line(dataset['index'], dataset['myct'], line_width=5, \
          color='red', legend_label='Cycle time in ns', \
          muted_alpha=0.2)

plot.legend.click_policy="mute"

show(plot)

The following figure shows the output of the preceding code:

Figure 6.8: Line plots displaying the cache memory and cycle time per hardware with a
mutable legend; cycle time is also muted

In the next section, we will create interactive visualizations that allow the user to
modify the data that is displayed.

Adding Widgets
One of the most powerful features of Bokeh is the ability to use widgets to
interactively change the data that's displayed in a visualization. To understand the
importance of interactivity in your visualizations, imagine seeing a static visualization
about stock prices that only shows data for the last year.

If you're interested in seeing the current year, or even visually comparing it to
previous years, static plots won't be suitable. You would need to create one plot
for every year, or overlay different years in one visualization, which would make it
much harder to read.

Comparing this to a simple plot that lets the user select the date range they want, we
can already see the advantages. You can guide the user by restricting values and only
displaying what you want them to see. Developing a story behind your visualization is
very important, and doing this is much easier if the user has ways of interacting with
the data.

Bokeh widgets work best when used in combination with the Bokeh server. However,
the Bokeh server approach is beyond the scope of this book, since we would
need to work with plain Python files. Instead, we will use a hybrid approach that
works only within a Jupyter Notebook.

We will look at the different widgets and how to use them before going in and
building a basic plot with one of them. There are a few different options regarding
how to trigger updates, which are also explained in this section. The widgets that will
be covered in the following exercise are explained in the following table:

Figure 6.21: Some of the basic widgets with examples

The general way to create a new widget visible in a Jupyter Notebook is to define
a new method and wrap it in an interact widget. We'll be using the "syntactic
sugar" way of adding a decorator to a method, that is, by using annotations. This will
give us an interactive element that is displayed after the executed cell, as in
the following example:

# importing the widgets
from ipywidgets import interact, interact_manual

# creating an input text
@interact(Value='Input Text')
def text_input(Value):
    print(Value)

The following screenshot shows the output of the preceding code:

Figure 6.22: Interactive text input

In the preceding example, we first import the interact element from the
ipywidgets library. This then allows us to define a new method and annotate it
with the @interact decorator.

The Value attribute tells the interact element which widget to use based on the
data type of the argument. In our example, we provide a string, which will give us a
TextBox widget. We can refer to the preceding table to determine which Value
data type will return which widget.

The print statement in the preceding code prints whatever has been entered in the
textbox below the widget.

Note
The methods that we use with interact always have the same structure.
We will look at several examples in the following exercise.

Exercise 6.03: Building a Simple Plot Using Basic Interactivity Widgets

This first exercise of the Adding Widgets topic will give you a gentle introduction to the
different widgets and the general concept of how to use them. We will quickly go over
the most common widgets (sliders, checkboxes, and dropdowns) to understand
their structure.

1. Create an Exercise6.03.ipynb Jupyter Notebook within the
Chapter06/Exercise6.03 folder to implement this exercise.
Chapter 12

Networked programs

While many of the examples in this book have focused on reading files and looking
for data in those files, there are many different sources of information when one
considers the Internet.
In this chapter we will pretend to be a web browser and retrieve web pages using
the Hypertext Transfer Protocol (HTTP). Then we will read through the web page
data and parse it.

12.1 Hypertext Transfer Protocol - HTTP


The network protocol that powers the web is actually quite simple, and there is
built-in support in Python called socket, which makes it very easy to make network
connections and retrieve data over those sockets in a Python program.
A socket is much like a file, except that a single socket provides a two-way connection
between two programs. You can both read from and write to the same socket.
If you write something to a socket, it is sent to the application at the other end
of the socket. If you read from the socket, you are given the data which the other
application has sent.
But if you try to read from a socket¹ when the program on the other end of the socket
has not sent any data, you just sit and wait. If the programs on both ends of
the socket simply wait for some data without sending anything, they will wait for
a very long time, so an important part of programs that communicate over the
Internet is to have some sort of protocol.
A protocol is a set of precise rules that determine who is to go first, what they are
to do, and then what the responses are to that message, and who sends next, and
so on. In a sense the two applications at either end of the socket are doing a dance
and making sure not to step on each other’s toes.
There are many documents that describe these network protocols. The Hypertext
Transfer Protocol is described in the following document:
¹ If you want to learn more about sockets, protocols, or how web servers are developed, you
can explore the course at https://www.dj4e.com.


https://www.w3.org/Protocols/rfc2616/rfc2616.txt
This is a long and complex 176-page document with a lot of detail. If you find
it interesting, feel free to read it all. But if you take a look around page 36 of
RFC2616 you will find the syntax for the GET request. To request a document
from a web server, we make a connection, e.g. to the data.pr4e.org server on port
80, and then send a line of the form
GET http://data.pr4e.org/romeo.txt HTTP/1.0
where the second parameter is the web page we are requesting, and then we also
send a blank line. The web server will respond with some header information about
the document and a blank line followed by the document content.
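The request/response framing just described can be sketched without touching the network: build the GET line plus the blank line, then split a canned response (an assumed, shortened example) at the first blank line, just as a browser would.

```python
# the GET request line followed by a blank line, as the protocol requires
request = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()

# a canned, shortened response for illustration (not fetched from a server)
canned = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'\r\n'
          b'But soft what light through yonder window breaks\n')

# headers end at the first blank line: two EOL sequences in a row
header, _, body = canned.partition(b'\r\n\r\n')
print(header.decode().splitlines()[0])   # the status line
print(body.decode().strip())             # the document content
```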

12.2 The world’s simplest web browser


Perhaps the easiest way to show how the HTTP protocol works is to write a very
simple Python program that makes a connection to a web server and follows the
rules of the HTTP protocol to request a document and display what the server
sends back.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

# Code: https://www.py4e.com/code3/socket1.py

First the program makes a connection to port 80 on the server data.pr4e.org.
Since our program is playing the role of the “web browser”, the HTTP protocol
says we must send the GET command followed by a blank line. \r\n signifies
an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences.
That is the equivalent of a blank line.
Once we send that blank line, we write a loop that receives data in 512-character
chunks from the socket and prints the data out until there is no more data to read
(i.e., the recv() returns an empty string).
The program produces the following output:

HTTP/1.1 200 OK
Figure 12.1: A Socket Connection

Date: Wed, 11 Apr 2018 18:52:55 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

The output starts with headers which the web server sends to describe the
document. For example, the Content-Type header indicates that the document is a
plain text document (text/plain).
After the server sends us the headers, it adds a blank line to indicate the end of
the headers, and then sends the actual data of the file romeo.txt.
This example shows how to make a low-level network connection with sockets.
Sockets can be used to communicate with a web server or with a mail server or
many other kinds of servers. All that is needed is to find the document which
describes the protocol and write the code to send and receive the data according
to the protocol.
However, since the protocol that we use most commonly is the HTTP web protocol,
Python has a special library specifically designed to support the HTTP protocol
for the retrieval of documents and data over the web.
One of the requirements for using the HTTP protocol is the need to send and
receive data as bytes objects, instead of strings. In the preceding example, the
encode() and decode() methods convert strings into bytes objects and back again.

The next example uses b'' notation to specify that a variable should be stored as
a bytes object. encode() and b'' are equivalent.

>>> b'Hello world'
b'Hello world'
>>> 'Hello world'.encode()
b'Hello world'

12.3 Retrieving an image over HTTP

In the above example, we retrieved a plain text file which had newlines in the file,
and we simply copied the data to the screen as the program ran. We can use a
similar program to retrieve an image across the network using HTTP. Instead of
copying the data to the screen as the program runs, we accumulate the data in a
string, trim off the headers, and then save the image data to a file as follows:

import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

# Code: https://www.py4e.com/code3/urljpeg.py

When the program runs, it produces the following output:

$ python urljpeg.py
5120 5120
5120 10240
4240 14480
5120 19600
...
5120 214000
3200 217200
5120 222320
5120 227440
3167 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:54:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

You can see that for this URL, the Content-Type header indicates that the body of
the document is an image (image/jpeg). Once the program completes, you can view
the image data by opening the file stuff.jpg in an image viewer.
As the program runs, you can see that we don't get 5120 characters each time
we call the recv() method. We get as many characters as have been transferred
across the network to us by the web server at the moment we call recv(). In this
example, we get as few as 3200 characters when we request up to 5120 characters
of data.
Your results may be different depending on your network speed. Also note that on
the last call to recv() we get 3167 bytes, which is the end of the stream, and in
the next call to recv() we get a zero-length string that tells us that the server has
called close() on its end of the socket and there is no more data forthcoming.
We can slow down our successive recv() calls by uncommenting the call to
time.sleep(). This way, we wait a quarter of a second after each call so that
the server can “get ahead” of us and send more data to us before we call recv()
again. With the delay in place, the program executes as follows:

$ python urljpeg.py
5120 5120
5120 10240
5120 15360
...
5120 225280

5120 230400
207 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 21:42:08 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

Now, other than the first and last calls to recv(), we get 5120 characters each
time we ask for new data.
There is a buffer between the server making send() requests and our application
making recv() requests. When we run the program with the delay in place, at
some point the server might fill up the buffer in the socket and be forced to pause
until our program starts to empty the buffer. The pausing of either the sending
application or the receiving application is called “flow control.”

12.4 Retrieving web pages with urllib


While we can manually send and receive data over HTTP using the socket library,
there is a much simpler way to perform this common task in Python by using the
urllib library.
Using urllib, you can treat a web page much like a file. You simply indicate
which web page you would like to retrieve and urllib handles all of the HTTP
protocol and header details.
The equivalent code to read the romeo.txt file from the web using urllib is as
follows:

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

# Code: https://www.py4e.com/code3/urllib1.py

Once the web page has been opened with urllib.request.urlopen, we can treat
it like a file and read through it using a for loop.
When the program runs, we only see the output of the contents of the file. The
headers are still sent, but the urllib code consumes the headers and only returns
the data to us.

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

As an example, we can write a program to retrieve the data for romeo.txt and
compute the frequency of each word in the file as follows:

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

# Code: https://www.py4e.com/code3/urlwords.py

Again, once we have opened the web page, we can read it like a local file.
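Once the counts dictionary is built, a natural follow-up is to list the most frequent words. A short sketch of the sort-by-count idiom (the helper top_words is our own, shown here on a tiny hand-made dictionary):

```python
def top_words(counts, n=3):
    """Return the n (word, count) pairs with the largest counts,
    sorted from most to least frequent."""
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]

# A tiny hand-made counts dictionary to demonstrate
counts = {'the': 3, 'sun': 2, 'and': 3, 'soft': 1}
print(top_words(counts, 2))
```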

12.5 Reading binary files using urllib


Sometimes you want to retrieve a non-text (or binary) file such as an image or
video file. The data in these files is generally not useful to print out, but you can
easily make a copy of a URL to a local file on your hard disk using urllib.
The pattern is to open the URL and use read to download the entire contents of
the document into a string variable (img) then write that information to a local
file as follows:

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

# Code: https://www.py4e.com/code3/curl1.py

This program reads all of the data in at once across the network and stores it in the
variable img in the main memory of your computer, then opens the file cover3.jpg
and writes the data out to your disk. The wb argument for open() opens a binary
file for writing only. This program will work if the size of the file is less than the
size of the memory of your computer.
However if this is a large audio or video file, this program may crash or at least
run extremely slowly when your computer runs out of memory. In order to avoid
running out of memory, we retrieve the data in blocks (or buffers) and then write
each block to your disk before retrieving the next block. This way the program can
read any size file without using up all of the memory you have in your computer.

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

# Code: https://www.py4e.com/code3/curl2.py

In this example, we read only 100,000 characters at a time and then write those
characters to the cover3.jpg file before retrieving the next 100,000 characters of
data from the web.
This program runs as follows:

python curl2.py
230210 characters copied.

12.6 Parsing HTML and scraping the web


One of the common uses of the urllib capability in Python is to scrape the web.
Web scraping is when we write a program that pretends to be a web browser and
retrieves pages, then examines the data in those pages looking for patterns.
As an example, a search engine such as Google will look at the source of one web
page and extract the links to other pages and retrieve those pages, extracting links,
and so on. Using this technique, Google spiders its way through nearly all of the
pages on the web.
Google also uses the frequency of links from pages it finds to a particular page as
one measure of how “important” a page is and how high the page should appear
in its search results.

12.7 Parsing HTML using regular expressions


One simple way to parse HTML is to use regular expressions to repeatedly search
for and extract substrings that match a particular pattern.
Here is a simple web page:

<h1>The First Page</h1>


<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link
values from the above text as follows:

href="http[s]?://.+?"

Our regular expression looks for strings that start with “href="http://” or
“href="https://”, followed by one or more characters (.+?), followed by another
double quote. The question mark in [s]? indicates that the pattern matches
“http” followed by zero or one “s”.
The question mark added to the .+? indicates that the match is to be done in
a “non-greedy” fashion instead of a “greedy” fashion. A non-greedy match tries
to find the smallest possible matching string and a greedy match tries to find the
largest possible matching string.
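The distinction matters as soon as a line contains more than one link: with a greedy .+ the match swallows everything up to the last double quote. A quick illustration using made-up URLs:

```python
import re

text = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'

# Greedy: .+ runs to the last double quote, merging the two links
print(re.findall('href="(.+)"', text))

# Non-greedy: .+? stops at the first double quote after each link
print(re.findall('href="(.+?)"', text))
```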
We add parentheses to our regular expression to indicate which part of our matched
string we would like to extract, and produce the following program:

# Search for link values within URL input


import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

# Code: https://www.py4e.com/code3/urlregex.py

The ssl library allows this program to access web sites that strictly enforce HTTPS.
The read method returns HTML source code as a bytes object instead of returning
an HTTPResponse object. The findall regular expression method will give us a
list of all of the strings that match our regular expression, returning only the link
text between the double quotes.
When we run the program and input a URL, we get the following output:

Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/

Regular expressions work very nicely when your HTML is well formatted and
predictable. But since there are a lot of “broken” HTML pages out there, a solution
only using regular expressions might either miss some valid links or end up with
bad data.
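For example, HTML also allows attribute values in single quotes, which our double-quote pattern silently skips (a made-up snippet):

```python
import re

# A perfectly valid anchor tag that happens to use single quotes
page = "<a href='http://www.dr-chuck.com/page2.htm'>Second Page</a>"
links = re.findall('href="(http[s]?://.*?)"', page)
print(links)   # an empty list: the link is valid but the pattern misses it
```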
This can be solved by using a robust HTML parsing library.

12.8 Parsing HTML using BeautifulSoup


Even though HTML looks like XML2 and some pages are carefully constructed to
be XML, most HTML is generally broken in ways that cause an XML parser to
reject the entire page of HTML as improperly formed.
There are a number of Python libraries which can help you parse HTML and
extract data from the pages. Each of the libraries has its strengths and weaknesses
and you can pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using
the BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still
lets you easily extract the data you need. You can download and install the
BeautifulSoup code from:
https://pypi.python.org/pypi/beautifulsoup4
Information on installing BeautifulSoup with the Python Package Index tool pip
is available at:
https://packaging.python.org/tutorials/installing-packages/
We will use urllib to read the page and then use BeautifulSoup to extract the
href attributes from the anchor (a) tags.

# To run this, download the BeautifulSoup zip file


# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

2 The XML format is described in the next chapter.



import urllib.request, urllib.parse, urllib.error


from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags


tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: https://www.py4e.com/code3/urllinks.py

The program prompts for a web address, then opens the web page, reads the data
and passes the data to the BeautifulSoup parser, and then retrieves all of the
anchor tags and prints out the href attribute for each tag.
When the program runs, it produces the following output:

Enter - https://docs.python.org
genindex.html
py-modindex.html
https://www.python.org/
#
whatsnew/3.6.html
whatsnew/index.html
tutorial/index.html
library/index.html
reference/index.html
using/index.html
howto/index.html
installing/index.html
distributing/index.html
extending/index.html
c-api/index.html
faq/index.html
py-modindex.html
genindex.html
glossary.html
search.html
contents.html
bugs.html
about.html
license.html
copyright.html
download.html

https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
genindex.html
py-modindex.html
https://www.python.org/
#
copyright.html
https://www.python.org/psf/donations/
bugs.html
http://sphinx.pocoo.org/

This list is much longer because some HTML anchor tags are relative paths (e.g.,
tutorial/index.html) or in-page references (e.g., ‘#’) that do not include “http://”
or “https://”, which was a requirement in our regular expression.
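If you need absolute URLs for those relative links, the standard library's urllib.parse.urljoin will resolve each one against the URL of the page it came from:

```python
from urllib.parse import urljoin

base = 'https://docs.python.org/'

# Relative paths are resolved against the base URL
print(urljoin(base, 'tutorial/index.html'))

# Already-absolute URLs pass through unchanged
print(urljoin(base, 'https://www.python.org/'))
```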
You can also use BeautifulSoup to pull out various parts of each tag:

# To run this, download the BeautifulSoup zip file


# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen


from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags


tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

# Code: https://www.py4e.com/code3/urllink2.py

python urllink2.py

Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents:
Second Page
Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}

html.parser is the HTML parser included in the standard Python 3 library. In-
formation on other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
These examples only begin to show the power of BeautifulSoup when it comes to
parsing HTML.
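If installing a third-party package is not an option, the html.parser module in the standard library can handle simple jobs like link extraction on its own, though far less forgivingly than BeautifulSoup. A minimal sketch using a different, more manual technique (the LinkExtractor class is our own):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = '<p><a href="http://www.dr-chuck.com/page2.htm">Second Page</a></p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```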

12.9 Bonus section for Unix / Linux users

If you have a Linux, Unix, or Macintosh computer, you probably have commands
built into your operating system that retrieve both plain text and binary files
using the HTTP or File Transfer Protocol (FTP). One of these commands is
curl:

$ curl -O http://www.py4e.com/cover.jpg

The command curl is short for “copy URL” and so the two examples listed earlier
to retrieve binary files with urllib are cleverly named curl1.py and curl2.py
on www.py4e.com/code3 as they implement similar functionality to the curl com-
mand. There is also a curl3.py sample program that does this task a little more
effectively, in case you actually want to use this pattern in a program you are
writing.
A second command that functions very similarly is wget:

$ wget http://www.py4e.com/cover.jpg

Both of these commands make retrieving webpages and remote files a simple task.

12.10 Glossary

BeautifulSoup A Python library for parsing HTML documents and extracting
data from HTML documents that compensates for most of the imperfections
in the HTML that browsers generally ignore. You can download the
BeautifulSoup code from www.crummy.com.
port A number that generally indicates which application you are contacting when
you make a socket connection to a server. As an example, web traffic usually
uses port 80 while email traffic uses port 25.
