
6

Making Things
Interactive with Bokeh

Overview
In this chapter, we will design interactive plots using the Bokeh library. By
the end of this chapter, you will be able to use Bokeh to create insightful
web-based visualizations and explain the difference between two interfaces
for plotting. You will identify when to use the Bokeh server and create
interactive visualizations.

Introduction
Bokeh is an interactive visualization library focused on modern browsers and the
web. Unlike Matplotlib or geoplotlib, the plots and visualizations we are going to
create in this chapter will be based on JavaScript widgets. Bokeh allows us to create
visually appealing plots and graphs nearly out of the box without much styling. In
addition to that, it helps us construct performant interactive dashboards based on
large static datasets or even streaming data.

Bokeh has been around since 2013, with version 1.4.0 being released in November
2019. It targets modern web browsers to present interactive visualizations to users
rather than static images. The following are some of the features of Bokeh:

• Simple visualizations: Through its different interfaces, it targets users of many
skill levels, providing an API for quick and straightforward visualizations as well
as more complex and extremely customizable ones.

• Excellent animated visualizations: It provides high performance and can,
therefore, work on large or even streaming datasets, which makes it the go-to
choice for animated visualizations and data analysis.

• Inter-visualization interactivity: Because Bokeh is web-based, it's easy to
combine several plots and create unique and impactful dashboards with
visualizations that can be interconnected to create
inter-visualization interactivity.

• Supports multiple languages: Unlike Matplotlib and geoplotlib, Bokeh has
libraries for both Python and JavaScript, in addition to several other
popular languages.

• Multiple ways to perform a task: Adding interactivity to Bokeh visualizations
can be done in several ways. The simplest built-in way is the ability to zoom and
pan in and out of your visualization. This gives users better control over what
they want to see. It also allows users to filter and transform the data.

• Beautiful chart styling: The tech stack is based on Tornado in the backend
and is powered by BokehJS, Bokeh's own JavaScript library, in the frontend.
Using the underlying BokehJS visuals allows us to create beautiful plots
without much custom styling.

Since we are using Jupyter Notebook throughout this book, it's worth mentioning that
Bokeh, including its interactivity, is natively supported in Notebook.

Concepts of Bokeh
The basic concept of Bokeh is, in some ways, comparable to that of Matplotlib. In
Bokeh, we have a figure as our root element, which has sub-elements such as a title,
axes, and glyphs. Glyphs, which can take on different shapes such as circles, bars,
and triangles, have to be added to the figure. The following hierarchy shows the
different concepts of Bokeh:

Figure 6.1: Concepts of Bokeh



Interfaces in Bokeh
The interface-based approach provides different levels of complexity for users that
either want to create some basic plots with very few customizable parameters or
want full control over their visualizations to customize every single element of their
plots. This layered approach is divided into two levels:

• Plotting: This mid-level layer comes with sensible defaults and remains customizable.

• Models interface: This low-level layer is more complex and provides an
open-ended approach to designing charts.

Note
The models interface is the basic building block for all plots.

The following are the two levels of the layered approach to interfaces:

• bokeh.plotting

This mid-level interface has an API somewhat comparable to Matplotlib's. The
workflow is to create a figure and then enrich this figure with different glyphs
that render data points in the figure. As in Matplotlib, the composition of
sub-elements such as axes, grids, and the inspector (which provides basic ways
of exploring your data through zooming, panning, and hovering) is done without
additional configuration.

The vital thing to note here is that even though their setup is done automatically,
we can still configure the sub-elements. When using this interface, the creation of
the scene graph used by BokehJS is handled automatically too.

• bokeh.models

This low-level interface is composed of two libraries: the JavaScript library called
BokehJS, which is used for displaying the charts in the browser, and the core
plot creation Python code, which provides the developer interface. Internally, the
definitions created in Python are turned into JSON objects that hold the declaration
for the JavaScript representation in the browser.

The models interface provides complete control over how Bokeh plots and
widgets (elements that enable users to interact with the data displayed) are
assembled and configured. This means that it is up to the developer to ensure
the correctness of the scene graph (a collection of objects describing
the visualization).
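To make the contrast concrete, here is a minimal sketch of the models interface. This is illustrative only and assumes a recent Bokeh installation (2.x or newer; exact model names can vary between versions): everything that the plotting interface would set up for us is assembled by hand.

```python
# a minimal sketch of the low-level bokeh.models interface: the scene graph
# (data source, glyph, axes) is assembled explicitly, piece by piece
from bokeh.models import ColumnDataSource, DataRange1d, LinearAxis, Plot, Scatter

# a data source that the glyph will be bound to
source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 6, 5]))

# Plot() starts empty: no axes, no grid, no tools
plot = Plot(x_range=DataRange1d(), y_range=DataRange1d(),
            width=300, height=300)

# glyphs are explicit model objects tied to the data source
renderer = plot.add_glyph(source, Scatter(x='x', y='y', size=10))

# axes must be added by hand, too
plot.add_layout(LinearAxis(), 'below')
plot.add_layout(LinearAxis(), 'left')
```

Compared to the few lines that bokeh.plotting needs for the same result, this shows why the models interface is reserved for cases where full control is required.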

Output
Outputting Bokeh charts is straightforward. There are three ways this can be done:

• The .show() method: The primary option is to display the plot in an HTML page
using this method.

• The inline .show() method: When using inline plotting with a Jupyter
Notebook, the .show() method will display the chart inside
your Notebook.

• The .output_file() method: You're also able to directly save the
visualization to a file without any overhead using the .output_file()
method. This will create a new file at the given path with the given name.

The most powerful way of providing your visualization is through the use of the
Bokeh server.
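The file-based path can be sketched in a few lines. This is a hedged example with a hypothetical filename; save() (imported alongside output_file) writes the standalone HTML document without trying to open a browser, which show() would do:

```python
# sketch: writing a plot to a standalone HTML file (hypothetical filename)
from bokeh.io import output_file, save
from bokeh.plotting import figure

plot = figure(title='Cache per Hardware',
              x_axis_label='Hardware index',
              y_axis_label='Cache Memory')
plot.line([0, 1, 2, 3], [8, 16, 16, 32], line_width=5)

# register the target file, then write the self-contained HTML page
output_file('cache_per_hardware.html', title='Cache per Hardware')
save(plot)
```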

Bokeh Server
Bokeh creates scene graph JSON objects that are interpreted by the BokehJS
library to create the visualization output. This gives us a unified format through
which other languages can create the same Bokeh plots and visualizations,
independent of the language used.
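The scene graph JSON can be inspected directly. As a sketch (the exact JSON layout differs between Bokeh versions), json_item() from bokeh.embed serializes a plot into the structure that BokehJS interprets in the browser:

```python
# sketch: serializing a plot into the scene-graph JSON that BokehJS consumes
import json
from bokeh.embed import json_item
from bokeh.plotting import figure

plot = figure(title='Scene graph demo')
plot.line([1, 2, 3], [4, 6, 5])

# a plain dict with the keys 'target_id', 'root_id', and 'doc'
item = json_item(plot)
print(json.dumps(item)[:60])
```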

To create more complex visualizations and leverage the tooling provided by Python,
we need a way to keep our visualizations in sync with one another. This way, we can
not only filter data but also do calculations and operations on the server-side, which
updates the visualizations in real-time.

In addition to that, since we will have an entry point for data, we can create
visualizations that get fed by streams instead of static datasets. This design provides a
way to develop more complex systems with even greater capabilities.

Looking at the scheme of this architecture, we can see that the documents are
provided on the server side and then moved over to the browser, which inserts
them into the BokehJS library. This insertion triggers interpretation by BokehJS,
which then creates the visualization. The following diagram describes how the
Bokeh server works:

Figure 6.2: The Bokeh server

Presentation
In Bokeh, presentations help make the visualization more interactive by using
different features, such as interactions, styling, tools, and layouts.

Interactions

Probably the most exciting feature of Bokeh is its interactions. There are two types of
interactions: passive and active.

Passive interactions are actions that users can take that don't change the
dataset. In Bokeh, this is called the inspector. As we mentioned before, the inspector
contains attributes such as zooming, panning, and hovering over data. This tooling
allows users to inspect the data in more detail and might provide better insights
by letting them observe a zoomed-in subset of the visualized data points. The
elements highlighted with a box in the following figure show the essential passive
interaction elements provided by Bokeh. They include zooming, panning, and
clipping data.

Figure 6.3: Example of passive interaction zooming

Active interactions are actions that directly change the displayed data. This includes
actions such as selecting subsets of data or filtering the dataset based on parameters.
Widgets are the most prominent active interactions since they allow users to
manipulate the displayed data with handlers. Examples of available widgets are
buttons, sliders, and checkboxes.

Referring back to the subsection about output options, these widgets can be
used both in so-called standalone applications in the browser and with the Bokeh
server. This will help us consolidate the theoretical concepts we've just covered and
make things more concrete. Some of the interactions in Bokeh are tab panes,
dropdowns, multi-selects, radio groups, text inputs, check button groups, data tables,
and sliders. The elements highlighted with a red box in the following figure show a
custom active interaction widget for the same plot we looked at in the example of
passive interaction.

Figure 6.4: Example of custom active interaction widgets



Integrating
Embedding Bokeh visualizations can take two forms:

• HTML document: These are standalone HTML documents. They are
self-contained, which means that all the necessary Bokeh dependencies
are part of the generated HTML document. This format is simple to
generate and can be sent to clients or quickly displayed on a web page.

• Bokeh applications: Backed by a Bokeh server, these provide the possibility of
connecting to, for example, Python tooling for more advanced visualizations.
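The standalone form can be sketched with file_html() from bokeh.embed, which bundles the plot and links to its BokehJS dependencies (here via the CDN resources) into one self-contained page; treat this as an illustrative sketch rather than the only embedding route:

```python
# sketch: generating a self-contained HTML document for a plot
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

plot = figure(title='Embedded plot')
plot.line([1, 2, 3], [4, 6, 5])

# a complete HTML page as a string, ready to write to disk or send to a client
html = file_html(plot, CDN, 'Embedded plot')
print(html[:15])
```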

Bokeh is a little more complicated than Matplotlib with Seaborn and, like every
other library, has its drawbacks. Once you have the basic workflow down, however,
you're able to quickly extend basic visualizations with interactivity features to give
power to the user.

Note
One interesting feature is the to_bokeh method, which allows you to
plot Matplotlib figures with Bokeh without configuration overhead. Further
information about this method is available at https://bokeh.pydata.org/
en/0.12.3/docs/user_guide/compat.html.

In the following exercises and activities, we'll consolidate the theoretical knowledge
and build several simple visualizations to explain Bokeh and its two interfaces.
After we've covered the basic usage, we will compare the plotting and models
interfaces and work with widgets that add interactivity to the visualizations.

Basic Plotting
As mentioned before, the plotting interface of Bokeh gives us a higher-level
abstraction, which allows us to quickly visualize data points on a grid.

To create a new plot, we have to define our imports to load the
necessary dependencies:

# importing the necessary dependencies
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()

Before we can create a plot, we need to import the dataset. In the examples in this
chapter, we will work with a computer hardware dataset. It can be imported by using
pandas' read_csv method.

# loading the dataset with pandas
dataset = pd.read_csv('../../Datasets/computer_hardware.csv')

The basic flow when using the plotting interface is comparable to that of
Matplotlib. We first create a figure. This figure is then used as a container to define
elements and call methods on:

# adding an index column to use it for the x-axis
dataset['index'] = dataset.index

# plotting the cache memory levels as a line
plot = figure(title='Cache per Hardware', \
              x_axis_label='Hardware index', \
              y_axis_label='Cache Memory')
plot.line(dataset['index'], dataset['cach'], line_width=5)

show(plot)

Once we have created a new figure instance using the imported figure() method,
we can use it to draw lines, circles, or any other glyph objects that Bokeh offers. Note
that the first two arguments of the plot.line method are the x and y data, which
must contain an equal number of elements.

To display the plot, we then call the show() method we imported from the
bokeh.plotting interface earlier on. The following figure shows the output of the
preceding code:

Figure 6.5: Line plot showing the cache memory of different hardware

Since the interface of different plotting types is unified, scatter plots can be created in
the same way as line plots:

# plotting the hardware cache as dots
plot = figure(title='Cache per Hardware', \
              x_axis_label='Hardware', \
              y_axis_label='Cache Memory')
plot.scatter(dataset['index'], dataset['cach'], size=5, color='red')
show(plot)

The following figure shows the output of the preceding code:

Figure 6.6: Scatter plot showing the cache memory of different hardware

In many cases, a visualization will have several attributes of a dataset plotted. A
legend helps users understand which attributes they are looking at. Legends
display a mapping between, for example, the lines in the plot and the corresponding
information, such as the hardware's cache memory.

By adding a legend_label argument to plot calls such as plot.line(), we get a
small box containing this information in the top-right corner (by default):

# plotting cache memory and cycle time with a legend
plot = figure(title='Attributes per Hardware', \
              x_axis_label='Hardware index', \
              y_axis_label='Attribute Value')
plot.line(dataset['index'], dataset['cach'], \
          line_width=5, legend_label='Cache Memory')
plot.line(dataset['index'], dataset['myct'], line_width=5, \
          color='red', legend_label='Cycle time in ns')

show(plot)

The following figure shows the output of the preceding code:

Figure 6.7: Line plots displaying the cache memory and cycle time per
hardware with the legend

When looking at the preceding example, we can see that once we have several lines,
the visualization can get cluttered.

We can give the user the ability to mute, meaning defocus, the clicked element in
the legend.

Adding a muted_alpha argument to the line plotting calls and adding a
click_policy of mute to our legend element are the only two steps needed:

# adding mutability to the legend
plot = figure(title='Attributes per Hardware', \
              x_axis_label='Hardware index', \
              y_axis_label='Attribute Value')
plot.line(dataset['index'], dataset['cach'], line_width=5, \
          legend_label='Cache Memory', muted_alpha=0.2)
plot.line(dataset['index'], dataset['myct'], line_width=5, \
          color='red', legend_label='Cycle time in ns', \
          muted_alpha=0.2)

plot.legend.click_policy="mute"

show(plot)

The following figure shows the output of the preceding code:

Figure 6.8: Line plots displaying the cache memory and cycle time per hardware with a
mutable legend; cycle time is also muted

In the next section, we will create interactive visualizations that allow the user to
modify the data that is displayed.

Adding Widgets
One of the most powerful features of Bokeh is the ability to use widgets to
interactively change the data that's displayed in a visualization. To understand the
importance of interactivity in your visualizations, imagine seeing a static visualization
about stock prices that only shows data for the last year.

If you're interested in seeing the current year, or even visually comparing it to
previous years, static plots won't be suitable. You would need to create one plot
for every year, or overlay different years in one visualization, which would make it
much harder to read.

Comparing this to a simple plot that lets the user select the date range they want, we
can already see the advantages. You can guide the user by restricting values and only
displaying what you want them to see. Developing a story behind your visualization is
very important, and doing this is much easier if the user has ways of interacting with
the data.

Bokeh widgets work best when used in combination with the Bokeh server. However,
the Bokeh server approach is beyond the scope of this book, since we would
need to work with plain Python files. Instead, we will use a hybrid approach that
works only within a Jupyter Notebook.

We will look at the different widgets and how to use them before going in and
building a basic plot with one of them. There are a few different options regarding
how to trigger updates, which are also explained in this section. The widgets that will
be covered in the following exercise are explained in the following table:

Figure 6.21: Some of the basic widgets with examples

The general way to create a new widget visible in a Jupyter Notebook is to define
a new method and wrap it in an interact widget. We'll be using the "syntactic
sugar" way of adding a decorator to a method, that is, by using annotations. This will
give us an interactive element that is displayed after the executed cell, as in
the following example:

# importing the widgets
from ipywidgets import interact, interact_manual

# creating an input text
@interact(Value='Input Text')
def text_input(Value):
    print(Value)

The following screenshot shows the output of the preceding code:

Figure 6.22: Interactive text input

In the preceding example, we first import the interact element from the
ipywidgets library. This then allows us to define a new method and annotate it
with the @interact decorator.

The Value attribute tells the interact element which widget to use based on the
data type of the argument. In our example, we provide a string, which will give us a
TextBox widget. We can refer to the preceding table to determine which Value
data type will return which widget.

The print statement in the preceding code prints whatever has been entered in the
textbox below the widget.

Note
The methods that we use with interact always have the same structure.
We will look at several examples in the following exercise.

Exercise 6.03: Building a Simple Plot Using Basic Interactivity Widgets

This first exercise of the Adding Widgets topic will give you a gentle introduction to the
different widgets and the general concept of how to use them. We will quickly go over
the most common widgets (sliders, checkboxes, and dropdowns) to understand
their structure.

1. Create an Exercise6.03.ipynb Jupyter Notebook within the
Chapter06/Exercise6.03 folder to implement this exercise.
Chapter 12

Networked programs

While many of the examples in this book have focused on reading files and looking
for data in those files, there are many different sources of information when one
considers the Internet.
In this chapter we will pretend to be a web browser and retrieve web pages using
the Hypertext Transfer Protocol (HTTP). Then we will read through the web page
data and parse it.

12.1 Hypertext Transfer Protocol - HTTP


The network protocol that powers the web is actually quite simple, and there is
built-in support in Python called socket, which makes it very easy to make network
connections and retrieve data over those sockets in a Python program.
A socket is much like a file, except that a single socket provides a two-way connection
between two programs. You can both read from and write to the same socket.
If you write something to a socket, it is sent to the application at the other end
of the socket. If you read from the socket, you are given the data which the other
application has sent.
But if you try to read from a socket¹ when the program on the other end of the socket
has not sent any data, you just sit and wait. If the programs on both ends of
the socket simply wait for some data without sending anything, they will wait for
a very long time, so an important part of programs that communicate over the
Internet is to have some sort of protocol.
A protocol is a set of precise rules that determine who is to go first, what they are
to do, and then what the responses are to that message, and who sends next, and
so on. In a sense the two applications at either end of the socket are doing a dance
and making sure not to step on each other’s toes.
There are many documents that describe these network protocols. The Hypertext
Transfer Protocol is described in the following document:
¹ If you want to learn more about sockets, protocols, or how web servers are developed, you
can explore the course at https://www.dj4e.com.


https://www.w3.org/Protocols/rfc2616/rfc2616.txt
This is a long and complex 176-page document with a lot of detail. If you find
it interesting, feel free to read it all. But if you take a look around page 36 of
RFC2616 you will find the syntax for the GET request. To request a document
from a web server, we make a connection, e.g. to the data.pr4e.org server on port
80, and then send a line of the form
GET http://data.pr4e.org/romeo.txt HTTP/1.0
where the second parameter is the web page we are requesting, and then we also
send a blank line. The web server will respond with some header information about
the document and a blank line followed by the document content.
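The request/response framing just described can be sketched without touching the network: build the GET line plus the blank line, then split a canned response (an assumed, shortened example) at the first blank line, just as a browser would.

```python
# the GET request line followed by a blank line, as the protocol requires
request = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()

# a canned, shortened response for illustration (not fetched from a server)
canned = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'\r\n'
          b'But soft what light through yonder window breaks\n')

# headers end at the first blank line: two EOL sequences in a row
header, _, body = canned.partition(b'\r\n\r\n')
print(header.decode().splitlines()[0])   # the status line
print(body.decode().strip())             # the document content
```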

12.2 The world’s simplest web browser


Perhaps the easiest way to show how the HTTP protocol works is to write a very
simple Python program that makes a connection to a web server and follows the
rules of the HTTP protocol to request a document and display what the server
sends back.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

# Code: https://www.py4e.com/code3/socket1.py

First the program makes a connection to port 80 on the server data.pr4e.org.
Since our program is playing the role of the “web browser”, the HTTP protocol
says we must send the GET command followed by a blank line. \r\n signifies
an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences.
That is the equivalent of a blank line.
Once we send that blank line, we write a loop that receives data in 512-character
chunks from the socket and prints the data out until there is no more data to read
(i.e., the recv() returns an empty string).
The program produces the following output:

HTTP/1.1 200 OK
Figure 12.1: A Socket Connection

Date: Wed, 11 Apr 2018 18:52:55 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

The output starts with headers which the web server sends to describe the
document. For example, the Content-Type header indicates that the document is a
plain text document (text/plain).
After the server sends us the headers, it adds a blank line to indicate the end of
the headers, and then sends the actual data of the file romeo.txt.
This example shows how to make a low-level network connection with sockets.
Sockets can be used to communicate with a web server or with a mail server or
many other kinds of servers. All that is needed is to find the document which
describes the protocol and write the code to send and receive the data according
to the protocol.
However, since the protocol that we use most commonly is the HTTP web protocol,
Python has a special library specifically designed to support the HTTP protocol
for the retrieval of documents and data over the web.
One of the requirements for using the HTTP protocol is the need to send and
receive data as bytes objects, instead of strings. In the preceding example, the
encode() and decode() methods convert strings into bytes objects and back again.

The next example uses b'' notation to specify that a variable should be stored as
a bytes object. encode() and b'' are equivalent.

>>> b'Hello world'
b'Hello world'
>>> 'Hello world'.encode()
b'Hello world'

12.3 Retrieving an image over HTTP

In the above example, we retrieved a plain text file which had newlines in the file,
and we simply copied the data to the screen as the program ran. We can use a
similar program to retrieve an image across the network using HTTP. Instead of
copying the data to the screen as the program runs, we accumulate the data in a
string, trim off the headers, and then save the image data to a file as follows:

import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

# Code: https://www.py4e.com/code3/urljpeg.py

When the program runs, it produces the following output:

$ python urljpeg.py
5120 5120
5120 10240
4240 14480
5120 19600
...
5120 214000
3200 217200
5120 222320
5120 227440
3167 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:54:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

You can see that for this URL, the Content-Type header indicates that the body of
the document is an image (image/jpeg). Once the program completes, you can view
the image data by opening the file stuff.jpg in an image viewer.
As the program runs, you can see that we don't get 5120 characters each time
we call the recv() method. We get as many characters as have been transferred
across the network to us by the web server at the moment we call recv(). In this
example, we get as few as 3200 characters when we request up to 5120 characters
of data.
Your results may be different depending on your network speed. Also note that on
the last call to recv() we get 3167 bytes, which is the end of the stream, and in
the next call to recv() we get a zero-length string that tells us that the server has
called close() on its end of the socket and there is no more data forthcoming.
We can slow down our successive recv() calls by uncommenting the call to
time.sleep(). This way, we wait a quarter of a second after each call so that
the server can “get ahead” of us and send more data to us before we call recv()
again. With the delay in place, the program executes as follows:

$ python urljpeg.py
5120 5120
5120 10240
5120 15360
...
5120 225280

5120 230400
207 230607
Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 21:42:08 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

Now, other than the first and last calls to recv(), we get 5120 characters each
time we ask for new data.
There is a buffer between the server making send() requests and our application
making recv() requests. When we run the program with the delay in place, at
some point the server might fill up the buffer in the socket and be forced to pause
until our program starts to empty the buffer. The pausing of either the sending
application or the receiving application is called “flow control.”

12.4 Retrieving web pages with urllib


While we can manually send and receive data over HTTP using the socket library,
there is a much simpler way to perform this common task in Python by using the
urllib library.
Using urllib, you can treat a web page much like a file. You simply indicate
which web page you would like to retrieve and urllib handles all of the HTTP
protocol and header details.
The equivalent code to read the romeo.txt file from the web using urllib is as
follows:

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

# Code: https://www.py4e.com/code3/urllib1.py

Once the web page has been opened with urllib.request.urlopen, we can treat
it like a file and read through it using a for loop.
When the program runs, we only see the output of the contents of the file. The
headers are still sent, but the urllib code consumes the headers and only returns
the data to us.

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

As an example, we can write a program to retrieve the data for romeo.txt and
compute the frequency of each word in the file as follows:

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

# Code: https://www.py4e.com/code3/urlwords.py

Again, once we have opened the web page, we can read it like a local file.
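Once the counts dictionary is built, a natural follow-up is to list the most frequent words. A short sketch of the sort-by-count idiom (the helper top_words is our own, shown here on a tiny hand-made dictionary):

```python
def top_words(counts, n=3):
    """Return the n (word, count) pairs with the largest counts,
    sorted from most to least frequent."""
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]

# A tiny hand-made counts dictionary to demonstrate
counts = {'the': 3, 'sun': 2, 'and': 3, 'soft': 1}
print(top_words(counts, 2))
```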

12.5 Reading binary files using urllib


Sometimes you want to retrieve a non-text (or binary) file such as an image or
video file. The data in these files is generally not useful to print out, but you can
easily make a copy of a URL to a local file on your hard disk using urllib.
The pattern is to open the URL and use read to download the entire contents of
the document into a string variable (img) then write that information to a local
file as follows:

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

# Code: https://www.py4e.com/code3/curl1.py

This program reads all of the data in at once across the network and stores it in the
variable img in the main memory of your computer, then opens the file cover3.jpg
and writes the data out to your disk. The wb argument for open() opens a binary
file for writing only. This program will work if the size of the file is less than the
size of the memory of your computer.
However if this is a large audio or video file, this program may crash or at least
run extremely slowly when your computer runs out of memory. In order to avoid
running out of memory, we retrieve the data in blocks (or buffers) and then write
each block to your disk before retrieving the next block. This way the program can
read any size file without using up all of the memory you have in your computer.

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

# Code: https://www.py4e.com/code3/curl2.py

In this example, we read only 100,000 characters at a time and then write those
characters to the cover3.jpg file before retrieving the next 100,000 characters of
data from the web.
This program runs as follows:

python curl2.py
230210 characters copied.

12.6 Parsing HTML and scraping the web


One of the common uses of the urllib capability in Python is to scrape the web.
Web scraping is when we write a program that pretends to be a web browser and
retrieves pages, then examines the data in those pages looking for patterns.
As an example, a search engine such as Google will look at the source of one web
page and extract the links to other pages and retrieve those pages, extracting links,
and so on. Using this technique, Google spiders its way through nearly all of the
pages on the web.
Google also uses the frequency of links from pages it finds to a particular page as
one measure of how “important” a page is and how high the page should appear
in its search results.

12.7 Parsing HTML using regular expressions


One simple way to parse HTML is to use regular expressions to repeatedly search
for and extract substrings that match a particular pattern.
Here is a simple web page:

<h1>The First Page</h1>


<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link
values from the above text as follows:

href="http[s]?://.+?"

Our regular expression looks for strings that start with “href="http://” or
“href="https://”, followed by one or more characters (.+?), followed by another
double quote. The question mark in [s]? indicates that the pattern matches
“http” followed by zero or one “s”.
The question mark added to the .+? indicates that the match is to be done in
a “non-greedy” fashion instead of a “greedy” fashion. A non-greedy match tries
to find the smallest possible matching string and a greedy match tries to find the
largest possible matching string.
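The distinction matters as soon as a line contains more than one link: with a greedy .+ the match swallows everything up to the last double quote. A quick illustration using made-up URLs:

```python
import re

text = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>'

# Greedy: .+ runs to the last double quote, merging the two links
print(re.findall('href="(.+)"', text))

# Non-greedy: .+? stops at the first double quote after each link
print(re.findall('href="(.+?)"', text))
```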
We add parentheses to our regular expression to indicate which part of our matched
string we would like to extract, and produce the following program:

# Search for link values within URL input


import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

# Code: https://www.py4e.com/code3/urlregex.py

The ssl library allows this program to access web sites that strictly enforce HTTPS.
The read method returns HTML source code as a bytes object instead of returning
an HTTPResponse object. The findall regular expression method will give us a
list of all of the strings that match our regular expression, returning only the link
text between the double quotes.
When we run the program and input a URL, we get the following output:

Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/

Regular expressions work very nicely when your HTML is well formatted and
predictable. But since there are a lot of “broken” HTML pages out there, a solution
only using regular expressions might either miss some valid links or end up with
bad data.
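For example, HTML also allows attribute values in single quotes, which our double-quote pattern silently skips (a made-up snippet):

```python
import re

# A perfectly valid anchor tag that happens to use single quotes
page = "<a href='http://www.dr-chuck.com/page2.htm'>Second Page</a>"
links = re.findall('href="(http[s]?://.*?)"', page)
print(links)   # an empty list: the link is valid but the pattern misses it
```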
This can be solved by using a robust HTML parsing library.

12.8 Parsing HTML using BeautifulSoup


Even though HTML looks like XML2 and some pages are carefully constructed to
be XML, most HTML is generally broken in ways that cause an XML parser to
reject the entire page of HTML as improperly formed.
There are a number of Python libraries which can help you parse HTML and
extract data from the pages. Each of the libraries has its strengths and weaknesses
and you can pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using
the BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still
lets you easily extract the data you need. You can download and install the
BeautifulSoup code from:
https://pypi.python.org/pypi/beautifulsoup4
Information on installing BeautifulSoup with the Python Package Index tool pip
is available at:
https://packaging.python.org/tutorials/installing-packages/
We will use urllib to read the page and then use BeautifulSoup to extract the
href attributes from the anchor (a) tags.

# To run this, download the BeautifulSoup zip file


# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

2 The XML format is described in the next chapter.



import urllib.request, urllib.parse, urllib.error


from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags


tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: https://www.py4e.com/code3/urllinks.py

The program prompts for a web address, then opens the web page, reads the data
and passes the data to the BeautifulSoup parser, and then retrieves all of the
anchor tags and prints out the href attribute for each tag.
When the program runs, it produces the following output:

Enter - https://docs.python.org
genindex.html
py-modindex.html
https://www.python.org/
#
whatsnew/3.6.html
whatsnew/index.html
tutorial/index.html
library/index.html
reference/index.html
using/index.html
howto/index.html
installing/index.html
distributing/index.html
extending/index.html
c-api/index.html
faq/index.html
py-modindex.html
genindex.html
glossary.html
search.html
contents.html
bugs.html
about.html
license.html
copyright.html
download.html

https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
genindex.html
py-modindex.html
https://www.python.org/
#
copyright.html
https://www.python.org/psf/donations/
bugs.html
http://sphinx.pocoo.org/

This list is much longer because some HTML anchor tags are relative paths (e.g.,
tutorial/index.html) or in-page references (e.g., ‘#’) that do not include “http://”
or “https://”, which was a requirement in our regular expression.
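If you need absolute URLs for those relative links, the standard library's urllib.parse.urljoin will resolve each one against the URL of the page it came from:

```python
from urllib.parse import urljoin

base = 'https://docs.python.org/'

# Relative paths are resolved against the base URL
print(urljoin(base, 'tutorial/index.html'))

# Already-absolute URLs pass through unchanged
print(urljoin(base, 'https://www.python.org/'))
```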
You can also use BeautifulSoup to pull out various parts of each tag:

# To run this, download the BeautifulSoup zip file


# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen


from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors


ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')


html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags


tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

# Code: https://www.py4e.com/code3/urllink2.py

python urllink2.py

Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents:
Second Page
Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}

html.parser is the HTML parser included in the standard Python 3 library. In-
formation on other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
These examples only begin to show the power of BeautifulSoup when it comes to
parsing HTML.
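If installing a third-party package is not an option, the html.parser module in the standard library can handle simple jobs like link extraction on its own, though far less forgivingly than BeautifulSoup. A minimal sketch using a different, more manual technique (the LinkExtractor class is our own):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = '<p><a href="http://www.dr-chuck.com/page2.htm">Second Page</a></p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```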

12.9 Bonus section for Unix / Linux users

If you have a Linux, Unix, or Macintosh computer, you probably have commands
built into your operating system that retrieve both plain text and binary files
using the HTTP or File Transfer Protocol (FTP). One of these commands is
curl:

$ curl -O http://www.py4e.com/cover.jpg

The command curl is short for “copy URL” and so the two examples listed earlier
to retrieve binary files with urllib are cleverly named curl1.py and curl2.py
on www.py4e.com/code3 as they implement similar functionality to the curl com-
mand. There is also a curl3.py sample program that does this task a little more
effectively, in case you actually want to use this pattern in a program you are
writing.
A second command that functions very similarly is wget:

$ wget http://www.py4e.com/cover.jpg

Both of these commands make retrieving webpages and remote files a simple task.

12.10 Glossary

BeautifulSoup A Python library for parsing HTML documents and extracting
data from HTML documents that compensates for most of the imperfections
in the HTML that browsers generally ignore. You can download the
BeautifulSoup code from www.crummy.com.
port A number that generally indicates which application you are contacting when
you make a socket connection to a server. As an example, web traffic usually
uses port 80 while email traffic uses port 25.
