Conference paper in ACM SIGARCH Computer Architecture News, December 2000. DOI: 10.1145/356989.357001


Thread-level Parallelism and Interactive
Performance of Desktop Applications

Krisztián Flautner Rich Uhlig Steve Reinhardt Trevor Mudge


[email protected] [email protected] [email protected] [email protected]

University of Michigan Intel Microprocessor Research Lab


1301 Beal Ave. 5350 NE Elam Young Parkway
Ann Arbor, MI 48109-2122 Hillsboro, OR 97123
+1-734-764-0203 +1-503-696-3154

Abstract

Multiprocessing is already prevalent in servers where multiple clients present an obvious source of thread-level parallelism. However, the case for multiprocessing is less clear for desktop applications. Nevertheless, architects are designing processors that count on the availability of thread-level parallelism. Unlike server workloads, the primary requirement of interactive applications is to respond to user events under human perception bounds rather than to maximize end-to-end throughput. In this paper we report on the thread-level parallelism and interactive response time of a variety of desktop applications. By tracking the communication between tasks, we can focus our measurements on the portions of the benchmark's execution that have the greatest impact on the user. We find that running our benchmarks on a dual-processor machine improves response time of mouse-click events by as much as 36%, and 22% on average—out of a maximum possible 50%. The benefits of multiprocessing are even more apparent when background tasks are considered. In our experiments, running a simple MP3 playback program in the background increases response time by 14% on a uniprocessor while it only increases the response time on a dual processor by 4%. When response times are fast enough for further improvements to be imperceptible, the increased idle time after interactive episodes could be exploited to build systems that are more power efficient.

1. Introduction

Does multiprocessing make sense on the desktop? There is anecdotal evidence regarding the positive effect of multiprocessing on the "responsiveness" of interactive applications. Intuitively, the premise makes sense: sudden bursts of background activity can be handled concurrently with the foreground task and individual processes can be sped up if they are composed of multiple threads. In this paper we investigate whether multiprocessing can indeed affect the user-perceived response time—the time it takes for the computer to respond to user initiated events—of interactive desktop applications. The primary questions that we deal with are the following:

• How much do threads run concurrently in interactive desktop applications?
• Does concurrency translate into improved interactive performance (response time)?

These questions are particularly important for processor designers who are considering techniques that exploit thread-level parallelism, such as simultaneous multithreading (SMT) [13] and single chip multiprocessing (CMP) [7]. To date, most research in this area has used either synthetic workloads (e.g., concurrently running multiple SPEC benchmarks) or server workloads [1][10][12]. Our results show that the performance characteristics of these benchmarks are very different from those of desktop workloads. Although our measurements were made on a multiprocessor, we look at thread-level parallelism at the system (application and OS) level, not at the microarchitecture level. Thus our results are not particular to any specific architecture for exploiting thread-level parallelism.

Over the past years, multiprocessors have moved from the server segment to workstation users and are now entering the desktop arena as well. Recently, Apple Computer made dual PowerPC based machines standard across most of its desktop PowerMac line [16]. Moreover, processors capable of executing multiple instruction streams concurrently will be readily available in the near future. SMT- and CMP-based products have already been announced [3][4][15] and the cost of existing multichip multiprocessors has been decreasing steadily [16].

Figure 1 illustrates the machine utilization and idle time characteristics of 52 desktop workloads across three operating systems (Windows NT, BeOS, and Linux) running on a quad-processor machine (using data from [6]). Machine utilization is a measure of how effectively a machine's computing resources are exploited and is 100% if all processors

Thread level parallelism and Interactive performance of desktop applications - ASPLOS 2000 August 21, 2000 1 of 10
FIGURE 1. Machine utilization and idle time of 52 workloads

[Bar chart: machine utilization and idle time (0% to 100%) per benchmark, split into automated benchmark runs and "realistic" benchmark runs.]

The figure shows machine utilization and idle time of 52 workloads on a quad-processor machine. The automated benchmarks were driven by a GUI automation tool (e.g., Visual Test), while the realistic workloads were run by a human.

in the system are utilized. It would be difficult to make a case for multiprocessing for desktop workloads based on the presented machine utilization data. Very few of the workloads exceed 25% machine utilization, suggesting that for the most part, only one out of the four available processors is exercised. Some of the benchmarks exhibit even lower utilization in the 5% to 10% range, suggesting that a single processor is more than powerful enough for running these workloads. What is the need for multiprocessing when a single processor seems to be adequate?

The problem with the machine utilization metric is that it weighs all parts of the benchmark's execution equally. From the metric's point of view, generating a page of output on the screen is as important as the idle period when the user is consuming the data (think time) and the processor is doing no useful work. We define idle time as the percentage of time all processors in the machine are idle simultaneously. Machine utilization can only be used to accurately measure the effects of concurrent execution if idle time during the benchmark run is close to zero.

As the figure shows, the amount of idle time in desktop applications can be very large. In some cases idle time amounts to more than 90% of total execution time. This high ratio should not be unexpected since interactive applications run at the rate at which the user interacts with them, which is determined by human cognition and motor skills and includes significant think time. The high proportion of idle time can be obscured by the use of automated benchmarks, such as Sysmark 98 [17] or Winstone 99 [18], that perform each operation as soon as the previous operation completes without taking think time into account. The automated benchmarks in Figure 1 have an average of 12% idle time versus 64% for the realistic benchmark runs.

To get around the limitations of the machine utilization metric, we define a new metric called thread-level parallelism (TLP). TLP is the machine utilization over the non-idle portions of the benchmark's execution. This definition sidesteps the problems of the original metric and allows us to use more realistic desktop workloads that can include a lot of idle time. Intuitively, TLP is a metric of speedup due to concurrent execution on the non-idle portions of the workload (Section 3).

The most relevant metric for interactive applications is not the overall throughput but response time: the amount of time it takes for the computer to respond to a user initiated event. These periods are also referred to as interactive episodes. We focus our measurements on the interactive episodes by tracking communication between tasks in the kernel (Section 3.1). Figure 2 illustrates a sample TLP trace where the ranges corresponding to interactive episodes have been highlighted.

Our response time measurements are detailed in Section 4. We find that, while desktop applications can incur more than 90% idle time during execution, running our benchmarks on a dual-processor machine provides a 22% average improvement in application response times. In Section 4.4, we investigate the effects on response time of a concurrently running MP3 playback application. Here, the dual-processor machine improves response time by 29% on average. Thus, multiprocessing on the desktop can be a viable means of improving the user experience.

2. Previous work

Various papers have dealt with the characterization of desktop applications [9][2][5]. In [5], Endo et al. performed a detailed analysis of interactive performance in a uniprocessor Windows NT environment. Our methodology has been influenced by their design choices and their definitions of think and wait time.

Hauser et al. [8] have approached the role of threads in interactive systems by analyzing the design patterns in two threaded object-oriented environments. Their analysis focused on the use of threads for program structuring instead of run-time statistics. However, their conclusion that most threads are used for programmers' convenience and few for exploiting concurrency is echoed by our results.

In their study of the characteristics of desktop applications [9] Lee et al. observed that most of the instructions are executed from a single dominant thread. In our experience TLP can vary greatly based on the choice of OS and workloads. In this study we show that multiprocessing can have beneficial effects even on standard desktop workloads.

Our previous investigations into the concurrency characteristics of desktop applications have provided a high-level view of a broad set of workloads on three operating systems: Windows NT, Linux, and BeOS [6]. These results showed that while most workloads under BeOS and Windows NT use a relatively large number of threads, the actual concurrency derived from them is limited and heavily dependent on the workload and the operating system. In this study we expand on these results by focusing in more detail on interactive applications under Linux. On a previous version of Linux that we studied, applications exhibited very little concurrency (TLP of 1.0-1.13). This was partially due to the fact that many of the applications did not use kernel threads to work around reentrancy bugs in the C library, and to the heavy use of the global kernel lock in the 2.2.13 kernel. However, less than a year later, due to a more recent

FIGURE 2. Segment of the TLP trace of the Ghostview benchmark

[Scatter plot: measured TLP (y axis, 0 to 2) versus elapsed time (x axis, 5.7s to 6.3s), with the ranges corresponding to interactive episodes highlighted.]

This figure contains a portion of the TLP trace of the Ghostview benchmark. The light areas correspond to the interactive episodes. In this
trace, there are two interactive episodes: a short one from 5.772s to 5.775s and a long one from 5.802s to 6.205s. The y axis corresponds to
the measured TLP, while the x axis specifies the elapsed time in seconds.
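A trace like this can be reconstructed offline from timestamped scheduler events. As an illustration only (the per-CPU busy-interval format below is hypothetical; the paper's actual tooling instruments the kernel scheduler directly), the following Python sketch derives the fraction of time c_i during which exactly i CPUs are busy, and from those fractions the machine utilization and TLP metrics defined in Section 3:

```python
# Sketch: derive machine utilization (EQ 1) and thread-level
# parallelism (EQ 2) from per-CPU busy intervals, as a trace
# postprocessor might. The interval format is a hypothetical
# simplification of the kernel's thread-switch trace.

def concurrency_fractions(busy, n_cpus, t_start, t_end):
    """Return c[i]: fraction of time exactly i CPUs are busy.
    `busy` maps cpu id -> list of (start, end) busy intervals."""
    edges = sorted({t_start, t_end} |
                   {t for ivs in busy.values() for iv in ivs for t in iv})
    total = t_end - t_start
    c = [0.0] * (n_cpus + 1)
    for lo, hi in zip(edges, edges[1:]):
        mid = (lo + hi) / 2
        active = sum(1 for ivs in busy.values()
                     if any(s <= mid < e for s, e in ivs))
        c[active] += (hi - lo) / total
    return c

def mu(c, n):    # EQ 1: machine utilization
    return sum(i * ci for i, ci in enumerate(c)) / n

def tlp(c, n):   # EQ 2: utilization over the non-idle fraction
    return sum(i * ci for i, ci in enumerate(c)) / (1.0 - c[0])

# Two CPUs over [0s, 10s): CPU 0 busy for 4s, CPU 1 for 2s of that.
c = concurrency_fractions({0: [(0.0, 4.0)], 1: [(2.0, 4.0)]}, 2, 0.0, 10.0)
print(round(mu(c, 2), 3), round(tlp(c, 2), 3))   # 0.3 1.5
```

Note how 60% idle time drags machine utilization down to 0.3 while TLP, computed only over the busy 40%, reports the speedup-relevant concurrency of 1.5; EQ 3 (TLP = n·MU / (1 − c0)) holds on these numbers.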

kernel (2.3.99-pre3) and an updated C library, the average TLP of our Linux benchmarks has increased to 1.27.

3. Metrics and methodology

The principal metrics that we are interested in are idle time (Idle), thread-level parallelism (TLP), and response time (TR). Idle time is the fraction of measurement time when all of the processors in the system are executing the idle task. Thread-level parallelism, on the other hand, is a measure of how many threads are executing concurrently when the machine is not idle. These two quantities provide a more accurate insight into workload characteristics than machine utilization (MU) alone. While machine utilization can give an accurate picture of the concurrency of the system if idle time is close to zero, it can obscure the presence of concurrent execution in the case of interactive applications, where idle time is high. Response time (TR) is the length of time between the initiation and completion of an interactive event, which we also refer to as the length of an interactive episode.

The following equations precisely define machine utilization (MU), the TLP metric, and their relationship.

    MU = (Σ_{i=1}^{n} c_i · i) / n                          (EQ 1)

We use the variable c_i to denote the fraction of time that exactly i threads execute concurrently. The value of i ranges from 0 to n, where n is the number of processors in the machine. Equation 1 shows the formula for machine utilization, while Equation 2 shows the computation for TLP. The variable c_0 represents the fraction of time that the machine was idle. Correspondingly, idle time is defined as the fraction of time when all processors in the machine were idle simultaneously.

    TLP = (Σ_{i=1}^{n} c_i · i) / (1 − c_0)                 (EQ 2)

Equation 3 relates machine utilization to TLP, which in essence is the machine utilization over the non-idle portions of the program execution. Note that the result is scaled by n since machine utilization ranges from 0 to 1 while TLP ranges from 1 to n.

    TLP = n · MU / (1 − c_0)                                (EQ 3)

In certain cases, we use a subscript to show the range on which our metrics were computed. In particular, TLPie gives TLP for the interactive episodes and TLPrun shows the thread-level parallelism for the entire benchmark.

Our measurement technique relies on intercepting thread switch events in the OS kernel and keeping an accurate trace of the executing threads on all of the CPUs in the system. Since the timestamp counters are synchronized across all processors in the machine, we use the time stamps associated with thread switch events to compute how much the processors in the system execute concurrently. We have confirmed that the counters under Linux are synchronized to within 100 cycles, well below the microsecond resolution we desired. This methodology allows us to monitor the execution of all threads in the system, not just the ones in the running benchmark, and takes all scheduling and synchronization overhead into account.

Figure 3 shows the hardware and software configuration of our benchmarking environment.

FIGURE 3. Benchmarking environment configuration

Hardware configuration            Software configuration
Dell Precision WorkStation 410    Linux Mandrake 7
Two 450Mhz Pentium II             Modified 2.3.99-pre3 kernel
512K L2 Cache                     XFree86 3.3.6
512M RAM                          Helix GNOME 1.2
Matrox Millennium II AGP 4M       glibc 2.1.3 C library

3.1 Detecting interactive episodes

The beginning of an interactive episode is initiated by the user and is usually signified by a GUI event, such as
FIGURE 4. Trace fragments illustrating tasks and communication events

[Two trace fragments, each showing the tasks scheduled on CPU 0 and CPU 1 over time (pids 757, 778, 889, 895, 2088, 2090), annotated with R/W state letters and arrows marking communication events.]

Two typical trace fragments are shown in this figure from the Ghostview benchmark. The first picture shows communication events
between tasks after a mouse button was pushed, while the picture on the right corresponds to a complex image being rendered on the
screen. The letter R to the right of the pid shows that the task was preempted (the task is ready to execute), a W indicates that it is waiting
for an event and gave up time on its own (it is waiting for an event to complete). The arrows indicate the communication flow between the
tasks. Task pids in the figure correspond to the following: 757 - X server, 778 - sawmill (window manager), 895 - gnome-terminal, 889 -
tasklist_applet, 2088 - Ghostview (gv), 2090 - Ghostscript (gs). Note that Ghostview uses Ghostscript to render pdf and postscript data. In
these examples, when a task is waiting for an event, it is blocked in the select or the poll system call.
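The communication arrows in such fragments can be recovered by matching instrumented write and read events through socket pairs, as described in Section 3.1. A minimal Python sketch of that matching step, using a hypothetical simplified event format (pid, operation, socket id; the paper logs a socket's pair on a write and the socket itself on a read, so matching events share an id):

```python
# Sketch: recover directed communication edges between tasks from
# a trace of socket operations. The event tuples are hypothetical
# simplifications of the instrumented kernel's trace records.

def communication_edges(events):
    """Return directed (writer_pid, reader_pid) edges."""
    pending = {}          # socket id -> pid of the last writer
    edges = []
    for pid, op, sock in events:
        if op == "write":
            pending[sock] = pid
        elif op == "read" and sock in pending:
            edges.append((pending[sock], pid))
    return edges

# Mouse click: X server (pid 757) writes to gv (2088), which replies.
trace = [(757, "write", 11), (2088, "read", 11),
         (2088, "write", 12), (757, "read", 12)]
print(communication_edges(trace))   # [(757, 2088), (2088, 757)]
```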

pressing a mouse button or a key on the keyboard. Finding the end of an episode is more difficult since there is no event that automatically gets generated when the computer is done responding. One approach is to assume that user initiated events are CPU bound and to define the end of an episode as the beginning of a relatively long idle section [11]. The length of an interactive episode is thus the elapsed time between a user initiated event (e.g., a mouse click) and the beginning of the next idle period that is longer than a predefined threshold. There are two problems with this approach:

• Episodes that are I/O bound may be terminated prematurely if the wait time exceeds the idle threshold.
• There is a significant latency between the end of an interactive episode and its classification, complicating on-line use of episode information (e.g., for scheduling).

We developed a more robust episode detection mechanism to alleviate these problems. To find interactive episodes, we keep track of the set of tasks that communicate with each other as a result of a user-initiated GUI event.

The start of an interactive episode is signified by the GUI controller (X server in our case) sending a message through a socket to another task. When this happens both the GUI controller and the receiver of the message are added to what we refer to as the task set of the episode. If the members of the task set communicate with non-member tasks, then those tasks are also added. The end of the episode is reached when all the following conditions are met for tasks in the task set:

• No tasks are executing.
• Data written by the tasks have been consumed.
• No task was preempted the last time it ran (i.e., all gave up time on their own by blocking in a system call).
• No tasks are blocked on I/O.

Figure 4 illustrates two typical trace fragments from the Ghostview benchmark. In the first case, four processes communicate with the server as the result of a mouse-click event. Note that when a task gives up time it usually does so in the poll or select system calls, which signifies that the task is ready to process more data.

The second trace fragment illustrates the interaction between the X server and a client that is continuously sending data to be displayed. In this case, the Ghostscript renderer is running continuously on CPU 0 and sends the data to the display whenever it has completed rendering a segment. The communication is unidirectional and asynchronous and the resulting parallelism is asymmetric. In our experiments symmetric utilization of both processors was rare (i.e., when the TLP is 2 for an extended period of time).

Most applications under UNIX communicate using sockets, signals, and pipes. In particular, the X server uses sockets to communicate with its clients. We do not track interactions via other methods such as System V IPC and shared memory since our benchmarks do not use them. By

TABLE 1. Linux benchmark descriptions and characteristics

Benchmark    Version        Description                          TLPie   TLPrun  Idlerun  Idlerun
                                                                 (dual)  (dual)  (dual)   (uni)
Acroread     4.0            Acrobat PDF file viewer              1.20    1.19    88%      87%
FrameMaker   5.5.6beta      Document editor                      1.35    1.33    93%      93%
Ghostview    3.5.8          PostScript and PDF file viewer       1.42    1.39    84%      84%
GIMP         1.1.22         The GNU Image Manipulation Program   1.26    1.24    88%      84%
Netscape     4.7            Web browser                          1.34    1.28    90%      89%
Xemacs       21.1 patch 8   Text editor                          1.26    1.21    93%      92%
Average                                                          1.31    1.27    89%      88%
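The episode-tracking rules of Section 3.1 can be sketched as follows. This Python model is an illustration only, not the paper's implementation (which lives in the instrumented kernel); the Task fields are hypothetical simplifications of the state the kernel actually records:

```python
# Sketch: grow a task set from communication events that start at
# the GUI controller, and declare the episode over only when every
# member is blocked voluntarily with no unconsumed data.

class Task:
    def __init__(self, pid):
        self.pid = pid
        self.running = False
        self.preempted = False      # lost the CPU involuntarily
        self.blocked_on_io = False

class Episode:
    def __init__(self, gui_pid):
        self.members = {gui_pid}    # task set starts with the GUI controller
        self.unconsumed = 0         # writes not yet matched by reads

    def on_send(self, src, dst):
        if src.pid in self.members:
            self.members.add(dst.pid)   # non-members join on communication
            self.unconsumed += 1

    def on_receive(self, dst):
        if dst.pid in self.members and self.unconsumed > 0:
            self.unconsumed -= 1

    def finished(self, tasks):
        """All four termination conditions of Section 3.1."""
        ts = [tasks[p] for p in self.members]
        return (not any(t.running for t in ts)
                and self.unconsumed == 0
                and not any(t.preempted for t in ts)
                and not any(t.blocked_on_io for t in ts))

tasks = {757: Task(757), 2088: Task(2088)}   # X server and a client
episode = Episode(757)
episode.on_send(tasks[757], tasks[2088])     # click forwarded to client
print(episode.finished(tasks))               # False: data unconsumed
episode.on_receive(tasks[2088])
print(episode.finished(tasks))               # True: all quiet
```

Because the termination test depends only on state that is current at each scheduler event, the end of an episode is recognized immediately, which is the property the paper contrasts with idle-threshold detection.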

tracking the communications between the tasks, we are able to determine which tasks have an effect on interactive performance. Unlike other operating systems (e.g., Windows NT), Linux does not differentiate between threads and processes. Threads are implemented using regular processes and the clone system call. We use the name "task" as a synonym for both threads and processes.

The implementation that performs the tracking is as non-invasive as possible. The difficulty was not in the actual implementation but in finding all the parts of the kernel that needed to be tracked. Currently we track communications through the following system calls:

kill, pread, pwrite, read, readv, recv, recvfrom, recvmsg, send, sendmsg, sendto, write, writev

We instrumented each of these system calls to emit a trace of the signals, inodes, and sockets that they are accessing. The socket information is output instead of the inode number when a socket is accessed through an inode. To be able to match read and write requests through socket pairs, we use the socket's pair (sock->sk->pair) on a write and the read socket itself on a read event. Currently we track only communications through UNIX sockets since this is the only socket type that is local to the machine. One could extend this methodology to track communications through other types of sockets if the communicating programs are all local to the machine. However, we have seen no need for this extension so far.

The primary reason for tracking signals is that the thread library (LinuxThreads) uses signals to implement synchronization between threads. By looking at the signal activity we can determine how threads communicate through condition variables, mutexes, and locks. The two functions that needed to be instrumented are handle_signal and send_sig_info. An alternative to this approach would have been to instrument the thread library; however, our current approach is more generic and has lower overhead.

To determine when tasks are blocked on I/O, we instrumented the schedule function to record the reason why it was called. If it is called from a part of the kernel that is related to I/O (such as the read and write system calls), then we assume that the task is blocked while waiting for an I/O event to complete. Since there is no predefined way in Linux to find which system call caused a transition to the kernel, we instrumented key system calls to put their id in a field of the executing task's task_struct. Once execution gets to the schedule function, our code looks at this field and outputs the task's reason for giving up time.

An attractive feature of our methodology is that the ends of interactive episodes can be found immediately, without having to wait first for an arbitrary amount of time to elapse. This information could be used on line by the kernel to make better scheduling and service quality decisions.

3.2 Benchmarks

Table 1 gives a short description of our benchmarks and summarizes their high-level characteristics. The data presented in this paper are averages of seven benchmark runs in each configuration. All benchmarks were run by a live user. While we aimed to repeat each run as accurately as possible, there are slight variations between the runs. All the significant events (e.g., mouse clicks, text entry) were performed in the same order during each benchmark run. However, the exact path of mouse movement (and therefore the interactive episodes corresponding to them) and the amount of time between events varies from one run to the other.

All our applications show significant amounts of TLP and a high fraction of idle time. It is likely that the idle time of the application executing on an actual user's desktop would be higher since, although we interacted manually, we made no effort to consume all the information presented by the program. The overall TLP of our applications is similar to what we measured during the interactive episodes. However, in all cases the TLP in interactive episodes was higher than the average for the entire run of the benchmarks.

4. Response time results

One way of quantifying an application's performance is to measure the time it takes to complete a run of the benchmark. This approach works for throughput-oriented benchmarks; however, it runs into difficulties when one tries to measure interactive applications. Nonetheless, benchmarks such as Sysmark 98 [17] and Winstone 99 [18] attempt to quantify the performance of interactive applications by turning them into throughput-oriented benchmarks. They accomplish this by using a software driver that clicks

TABLE 2. Response time on a dual-processor and a uniprocessor machine

For each benchmark: the average response-time improvement of mouse-click events, followed by selected episodes. Episode columns: TR improvement, TLPie, dual-processor TR, uniprocessor measured TR, uniprocessor predicted TR (times in seconds).

Acroread (15% average) - Displaying successive pages of a pdf file:
  14%   1.21   0.119   0.138   0.144
  17%   1.21   0.125   0.150   0.151
  12%   1.15   0.231   0.261   0.264

FrameMaker (22% average) - Visually manipulating a FrameMaker document:
  30%   1.43   0.296   0.425   0.422
  20%   1.29   0.040   0.050   0.051
  25%   1.41   0.022   0.029   0.031
  27%   1.38   0.021   0.029   0.029

Ghostview (34% average) - Displaying successive pages of a pdf file:
  36%   1.51   0.223   0.346   0.336
  19%   1.29   0.455   0.562   0.586
  36%   1.53   0.225   0.352   0.345
  30%   1.47   0.403   0.578   0.593
  32%   1.52   0.331   0.484   0.502

GIMP (19% average):
  Pixelize              18%   1.37   0.456   0.553   0.623
  Motion blur            8%   1.15   1.206   1.312   1.390
  Sharpen               20%   1.33   0.408   0.511   0.541
  Laplace edge-detect   15%   1.19   0.982   1.156   1.164
  Undo                  22%   1.38   0.118   0.152   0.164

Netscape (21% average) - Displaying simple HTML pages from a machine-local web server:
  24%   1.34   0.252   0.331   0.338
  25%   1.42   0.096   0.127   0.137
  20%   1.31   0.064   0.079   0.084
  28%   1.44   0.085   0.118   0.123

Overall average improvement: 22%

through these applications as quickly as possible and by measuring the length of the end-to-end execution.

The problem with this approach is that in interactive applications not all parts of the program's execution are equally important. The relevant metric is improvement in response time (the time it takes to respond to a user-initiated event) of critical episodes [5]. This time has also been called "wait time," to refer to the fact that during these periods the user is actively waiting for the computer to complete the task.

This section presents measured response times from uni- and dual-processor machines. The first subsection compares the performance of individual interactive episodes while the second analyzes and compares aggregate statistics from full benchmark runs. Section 4.3 further analyzes these results in the context of user-perceptible improvement. Finally, Section 4.4 repeats the initial experiments with an added active background process (an MP3 player).

4.1 Individual episodes

We have noted that while the runs of our benchmarks are similar to each other, they are not completely identical. This poses problems when we attempt to compare two different runs to each other. We cannot just compare the average lengths of interactive episodes from one run to the other, since the set of episodes in each run could be slightly different. We overcame this problem by comparing only individual episodes that occur after the same mouse click in each trace. These episodes are more repeatable than those caused by other events, such as focus changes, and tend to have longer response times.

To correlate mouse-click events with interactive episodes we modified the X server to write an entry into the trace every time a mouse button is pressed. The postprocessor can then correlate interactive episodes that occur after a marker. Table 2 summarizes the results of the response time measurements for these episodes. The first column includes the overall response time improvement for the mouse-click episodes in the benchmark, while the rest of the columns show the collected data and the response time improvement for a few selected episodes. Note that, since our Xemacs

TABLE 3. Episode distribution (dual processor)

For each episode-length bin: percentage of episodes, then percentage of total interactive-episode time.

             [0ms, 1ms)         [1ms, 10ms)        [10ms, 100ms)      [100ms, inf)
Benchmark    episodes  time     episodes  time     episodes  time     episodes  time
Acroread     92.69%    5.75%    4.11%     3.52%    1.13%     11.89%   2.08%     78.85%
FrameMaker   72.10%    2.87%    17.60%    6.98%    8.58%     42.11%   1.72%     48.04%
Ghostview    89.87%    2.24%    6.73%     2.22%    0.76%     6.70%    2.64%     88.85%
GIMP         87.93%    2.70%    10.19%    5.89%    0.32%     0.64%    1.57%     90.77%
Netscape     89.98%    3.56%    8.62%     13.88%   0.98%     31.43%   0.42%     51.13%
Xemacs       65.01%    4.78%    34.36%    86.01%   0.63%     9.21%    0%        0%
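The marker-correlation step described in Section 4.1 (matching each mouse-button marker written by the modified X server to the episode that follows it) can be sketched in a few lines of Python. The timestamps and record formats are hypothetical, loosely based on the two Figure 2 episodes:

```python
# Sketch: match each mouse-click marker to the first interactive
# episode that begins at or after it. Episode tuples are (start, end)
# in seconds; the trace formats are hypothetical.

from bisect import bisect_left

def correlate(markers, episodes):
    """episodes must be sorted by start time; returns marker -> episode."""
    starts = [s for s, _ in episodes]
    matched = {}
    for m in markers:
        i = bisect_left(starts, m)      # first episode starting at/after m
        if i < len(episodes):
            matched[m] = episodes[i]
    return matched

eps = [(5.772, 5.775), (5.802, 6.205)]  # the two Figure 2 episodes
print(correlate([5.770, 5.800], eps))
# {5.77: (5.772, 5.775), 5.8: (5.802, 6.205)}
```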

workload was driven solely by keyboard interactions, we were unable to compute response time improvement for it.

The results indicate that on our benchmarks response time improved on a dual-processor machine by an average of 22% (36% in the best case, 8% in the worst). The average TLP for the episodes is 1.31, higher than the average for the complete benchmarks, which is 1.27. Idle time during the interactive episodes in all cases was zero or very close to it (a few tenths of a percent).

To check our results we used the dual-processor numbers to estimate the uniprocessor run-time (Equation 4) and then checked them against actual measured values. DP refers to the response time (TR) and idle time (TIdle) on a dual processor, while UP refers to the same measurements on a uniprocessor machine.

TR(UP) = (TR(DP) - TIdle(DP)) x TLP + TIdle(DP)    (EQ 4)

The equation scales the non-idle portions of the episode by the measured TLP and assumes that no scaling occurs on the idle portions. This simple model predicts the uniprocessor episode lengths to within 4% on average (one run had an error of 11%; for all others the error was under 7%). Given that we made no special provisions to reduce experimental variations (e.g., by turning off background daemons) and that all traces were driven by a real user (instead of an automated script), we think that the error is within a reasonable margin.

Most Linux applications are not threaded; concurrency emerges from simultaneously running multiple processes. The only applications from our benchmarks that actually used threads (through the LinuxThreads API) were Netscape and GIMP. These applications derived some TLP by running intra-application threads concurrently. However, most of the TLP was achieved by running the application thread concurrently with the user interface threads: mostly with the X server but also with the other GUI tasks (such as the window manager, desktop applets, etc.). This contrasts with our experience under Windows NT, where application threads ran concurrently primarily with threads from the System process, which includes device drivers and other operating system threads [6].

4.2 All interactive episodes

In the previous section we looked at select episodes that come after mouse clicks to figure out the response time improvement. These episodes usually represent the heavyweight episodes during the benchmark runs and, while these episodes usually make up the largest percentage of time during the run, the number of short episodes dominates.

Table 3 shows the episode length distribution of our benchmarks. Due to the large variance of episode lengths, we separated the results into four categories. For each category, the number of episodes that fall into it is given (percentage of episodes) along with the total amount of time spent (as a percentage of total time in all interactive episodes). While the majority of the episodes are very short and are in the few-tenths-of-a-millisecond range, only a
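The uniprocessor estimate of Equation 4 is easy to sanity-check in code (a sketch with invented numbers, not the paper's data; the function name is ours):

```python
# EQ 4: estimate a uniprocessor episode length from dual-processor
# measurements. The busy portion of the episode scales by the measured
# TLP; idle time is assumed not to shrink when moving to one processor.

def estimate_uniprocessor_time(t_r_dp, t_idle_dp, tlp):
    return (t_r_dp - t_idle_dp) * tlp + t_idle_dp

# Hypothetical episode: 100 ms response time, 10 ms idle, TLP of 1.3.
t_up = estimate_uniprocessor_time(100.0, 10.0, 1.3)
# (100 - 10) * 1.3 + 10 = 127 ms
```

Note that when an episode has no idle time, the estimate reduces to multiplying the dual-processor response time by the TLP, which is why episodes with near-zero idle time benefit most from the second processor.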

TABLE 4. TLP and episode length distribution (dual processor); average lengths in ms

                   [0ms, 1ms)            [1ms, 10ms)           [10ms, 100ms)         [100ms, inf)
Benchmark     TLPie  avg. length    TLPie  avg. length    TLPie  avg. length    TLPie  avg. length
Acroread       1.38      0.25        1.19      3.47        1.20     42.76        1.18    153.66
FrameMaker     1.38      0.30        1.20      2.94        1.36     36.42        1.37    207.72
Ghostview      1.30      0.23        1.18      3.10        1.33     83.39        1.43    315.85
GIMP           1.43      0.17        1.22      3.28        1.35     11.49        1.26    328.83
Netscape       1.70      0.07        1.16      2.73        1.41     55.59        1.33    210.60
Xemacs         1.87      0.07        1.24      2.52        1.14     14.68         N/A       N/A

TABLE 5. Episodes above the perception threshold

                        100ms threshold                           50ms threshold
                Dual processor     Uniprocessor          Dual processor     Uniprocessor
Benchmark     % time  # episodes  % time  # episodes   % time  # episodes  % time  # episodes
Acroread        79%        8        85%        9         89%        9        90%        9
FrameMaker      48%        2        49%        2         69%        5        75%        6
Ghostview       89%       12        95%       15         96%       15        96%       15
GIMP            91%        9        92%        9         91%        9        92%        9
Netscape        51%        4        63%        7         72%       10        73%       10
Xemacs           0%        0         0%        0          0%        0         0%        0
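The per-threshold tallies of Table 5 can be reproduced from raw episode lengths with a simple filter (illustrative only; the episode list below is made up):

```python
# Fraction of interactive-episode time, and episode count, at or above a
# perception threshold, as tallied in Table 5.

def above_threshold(episodes_ms, threshold_ms):
    long_eps = [e for e in episodes_ms if e >= threshold_ms]
    total = sum(episodes_ms)
    pct_time = 100.0 * sum(long_eps) / total if total else 0.0
    return pct_time, len(long_eps)

# Hypothetical episode lengths (ms), checked against the 100 ms threshold:
episodes = [0.3, 0.5, 2.0, 40.0, 120.0, 250.0]
pct, count = above_threshold(episodes, 100.0)
```

Even with only two of six episodes above the threshold, those two account for almost all of the time, which mirrors the pattern in the table.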

small portion of the time is spent in episodes corresponding to them. This makes sense given the orders-of-magnitude variance in episode lengths. Examples of the short episodes include:
• Moving the mouse and updating its position.
• Updating the appearance of the cursor.
• Handling window focus changes.
• Handling keyboard events.

Towards the right-hand side of the table is the data corresponding to the heavyweight episodes that were the subject of our investigations in the previous section. Since in the majority of our benchmarks most of the time is spent executing these kinds of episodes and most of them fall above the perception threshold of the user, these are of primary importance for speeding up.

Table 4 gives the TLP and average length of interactive episodes. The most significant observation is that in all cases TLP is higher in the interactive episodes than the average for the entire run of the benchmark (see Table 1). This matches our observation in the previous section. Moreover, in all cases the interactive episodes appear to be very CPU bound, with zero or close to zero idle time.

Short episodes (less than one millisecond long) tend to have the highest TLP. This should not be surprising since these episodes usually perform an update of a few GUI objects on the screen, which requires the tight interaction of both the X server and the client. The TLP in the most used category varies from one benchmark to the other. It is never smaller than the overall average for the program, but in some cases it is smaller than the average for the interactive episodes.

While most time is spent executing episodes that fall in the tenth-of-a-second to over-a-second range, some last for only a fraction of a millisecond. With episodes that are so short, the question comes up whether there is any perceptible improvement in responsiveness using two processors.

4.3 The perception threshold

We have shown that TLP can be exploited successfully to reduce the response time of interactive applications. Once the response time reaches a certain threshold, the user is not able to detect any further improvement. What exactly that threshold is depends on the event type and can vary from one user to another. While its actual value is hard to quantify, the perception threshold sets an upper bound for the required performance.

Determining the exact length of an interactive episode can also be problematic. Does an episode begin when the user clicks the mouse button, or when that event is delivered by the X server? We have taken the position that interactive episodes begin when the event is delivered, since the X server may need to wait for additional events—such as extra mouse clicks to distinguish a double-click from a single-click, or for a button release—before delivering the event. Our measurements show that the delay between the hardware event and event dispatch can vary from a few tenths of a millisecond to hundreds of milliseconds. When considering an appropriate perception threshold for a user, this extra delay may need to be accounted for. We do not address this problem in this paper.

Table 5 shows the number and fraction of time spent in episodes that are above the perception threshold. The fraction of time is expressed as the percentage of time in all interactive episodes. Data is computed for two threshold values, which were selected based on data from [2]. Given either threshold, most of the time is spent executing episodes that fall above the perception threshold. Based on this data, FrameMaker, Netscape, and Xemacs are the most responsive applications. This correlates well with our experience; all three of these applications seemed to be very responsive and we could not perceive any qualitative difference between the dual-processor and uniprocessor runs of these applications. We must note, however, that while interactive episodes in Netscape were under the perception threshold when accessing a web server on the local machine, accessing servers on the Internet would certainly show more episodes in the perceptible range due to network latency. However, network latency is not something that multiprocessing in the client can reduce.

While exploiting TLP reduces the average length of the interactive episodes, it only causes a shift of an interactive episode from above to below the perception threshold if its length on a uniprocessor does not greatly exceed the threshold. None of the benchmarks had a significant shift in the number of episodes when the perception threshold is set to 50ms, and only a few of our benchmarks' episodes moved

TABLE 6. Response time improvement on dual-processor and uniprocessor machines with MP3 playback in background

                                  Response time     Response time increase due to MP3 playback
Benchmark     TLPie   TLPrun      improvement        Dual processor       Uniprocessor
Acroread       1.25    1.19           23%                  4%                  15%
FrameMaker     1.40    1.20           29%                  1%                  13%
GhostView      1.46    1.34           38%                  4%                  10%
GIMP           1.32    1.23           23%                  4%                  14%
Netscape       1.39    1.24           31%                  5%                  16%
Xemacs         1.35    1.18           N/A                  N/A                 N/A

Average        1.36    1.23           29%                  4%                  14%
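The response-time improvements reported here and in Section 4.1 can be related to the ideal dual-processor reduction (a perfect two-way speedup halves response time, i.e. a 50% reduction) with a simple ratio; a sketch, using the paper's reported best- and worst-case mouse-click figures (the function name is ours):

```python
# At best, a second processor halves an episode's response time.
IDEAL_DP_REDUCTION = 0.5

def fraction_of_ideal(measured_reduction):
    """Share of the ideal dual-processor reduction actually realized."""
    return measured_reduction / IDEAL_DP_REDUCTION

best = fraction_of_ideal(0.36)   # best-case 36% reduction -> 0.72 of ideal
worst = fraction_of_ideal(0.08)  # worst-case 8% reduction -> 0.16 of ideal
```

This is how the conclusions arrive at the "16% to 72% of the maximum reduction achievable" figure.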

below the cutoff at the 100ms threshold (Acroread, GhostView, and Netscape).

4.4 Effects of background activity

To round out our investigations, we wanted to know what happens when a background process is executing along with the interactive application. To gain some insight into such workloads, we repeated our experiments with an MP3 player running in the background. We used a very simple MP3 player called mpg123 (version 0.59r) along with the esd sound daemon. The player lacks a graphical interface and only does music playback—no visual effects. This application is light-weight and exhibits very little concurrency. We measured 1.02 TLP and 95% idle time when running music playback by itself.

The results are presented in Table 6. The response times are averages over all mouse-click events, as in Table 2. The performance improvement due to two processors is more significant than in our previous measurements. The average improvement is 29%, in contrast to 22% without the background task. On a dual processor the work required for MP3 playback is mostly absorbed by the extra processor. However, on a uniprocessor the extra work cannot be off-loaded and must be performed during the critical path, thus extending the lengths of the interactive episodes. Compared to our previous results, the dual-processor episode lengths are increased by an average of 4%, while the uniprocessor episode lengths are increased by an average of 14%.

The average TLP within the interactive episodes increases to 1.36 while the average for the entire benchmark run decreases to 1.23. The trend is the same as in our previous measurements without MP3 playback. However, in this case the difference between TLPie and TLPrun is significantly greater (the average TLPie is 1.31 and TLPrun is 1.27 when there is no MP3 playback in the background). The reason for the greater difference is that MP3 playback is periodic and has no inherent concurrency (it is not threaded, just a single task), which affects TLP in two ways:
• It reduces idle time, and the new non-idle portions have a TLP of one.
• On existing non-idle periods, it increases TLP.

Since interactive episodes have very little idle time, TLP goes up due to the concurrently running MP3 playback process. On the non-interactive portions, the background application reduces idle time and replaces the idle thread with a single running application. Since all of our benchmarks are dominated by idle time, the TLPrun of the applications is the same or lower with MP3 playback in the background than without.

5. Conclusions and future work

The fundamental question that this paper attempts to illuminate is whether it is beneficial for a desktop user to use a multiprocessor machine for everyday tasks. We have shown that existing Linux workloads exhibit thread-level parallelism, which translates into improvement of the user-perceived response time of the applications. Using two processors instead of one is a straightforward way to reduce execution length in the critical path in our benchmarks by 8% to 36% (22% on average). Moreover, these improvements represent 16% to 72% of the maximum reduction achievable on a dual-processor machine (50%). Using two processors can thus be an effective and efficient way of improving interactive performance.

The average response time improvement on a dual-processor machine increases to 29% with an MP3 player executing in the background. Although the extra processor eliminates most of the overhead of a background task, it does not absorb it all. The average length of an interactive episode increased by 4% due to audio playback (vs. 14% on the uniprocessor). This result is consistent with the level of TLP we found in interactive applications: we should expect the response time to remain unchanged only if the second processor is completely unused by the foreground application.

For most of our applications, using more than two processors is not likely to yield great improvements. This conclusion is supported by our previous experience on a quad-processor machine, where the only workloads that had a TLP of 2 or more were hand-parallelized or were batch jobs [6]. Our current results show that most workloads have TLP under 1.4, which implies that increasing the number of processors would only be beneficial in less than 40% of the time (i.e., the proportion of the episode where TLP is 1

is not reduced). Even by making the optimistic assumption that in all these episodes four threads could run concurrently 40% of the time, we can only expect an overall speedup of 20% on a quad-processor over a dual-processor machine (the parallel 40% of the run takes half as long on four processors, leaving 0.6 + 0.4/2 = 0.8 of the dual-processor time).

In our opinion, our results indicate that the Linux kernel and associated software have come of age as an SMP platform. While thread-level parallelism can certainly be further improved by removing uses of the global kernel lock, we believe that the emphasis should now be on application writers to refactor their programs with multithreading in mind. Multiprocessing would become even more compelling if TLP could be increased above 1.5.

Historically, uniprocessor performance has doubled every 18 months. Given a TLP of 1.5, this means that the lead time of a dual processor over an equivalent-performing uniprocessor is about three quarters of a year. Given that most interactive episodes in our measurements were shorter than 500ms, in about three years these episodes will fall under the 100ms perception threshold on a dual-processor machine, and in about four years a uniprocessor will be sufficiently fast. On the other hand, the software four years from now will likely increase its processing requirements.

In this paper we focused on the performance effects of multiprocessing on interactive applications. However, at some point in the future the decision whether to have multiple thread contexts in hardware may have to be made on factors other than performance, such as energy and power efficiency. If response time is below the threshold of user perception, one can reduce energy and power consumption by running each CPU at a lower frequency and voltage, without degrading the user experience. Exploiting TLP to reduce critical paths could enable further frequency and voltage reductions.

Dynamic voltage and frequency scaling [11][14] also requires algorithms that determine a priori which episodes are fast enough and how fast to execute them. This is a direction of our ongoing research. Our methodology of tracking communications between tasks can be used to identify the performance-critical parts of the workload and to estimate the required level of performance.

In our experience most of the concurrency was achieved by overlaying the execution of the GUI controller (the X server) with an application task that is communicating with the GUI. We have noted that the utilization of processors is usually unbalanced. This leaves some room for software designers to repartition interfaces in order to utilize the hardware more efficiently. In particular, a higher-level API for rendering images in the X server could improve the balance between the server and the clients. A more general approach to balancing would be to run the CPUs in the system at different levels of performance depending on the particular workload. This optimization would increase TLP and decrease energy consumption.

6. Acknowledgments

This work was supported by an Intel Graduate Fellowship, by an equipment grant from Intel, and by DARPA contract number F33615-00-C-1678.

References

[1] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[2] J. B. Chen, Y. Endo, K. Chan, D. Mazieres, A. Dias, M. Seltzer, and M. D. Smith. The Measured Performance of Personal Computer Operating Systems. Proceedings of the 15th ACM Symposium on Operating System Principles, pp. 299-313, December 1995.
[3] K. Diefendorff. Power4 Focuses on Memory Bandwidth: IBM Confronts IA-64, Says ISA Not Important. Microprocessor Report, Volume 13, Number 13, October 6, 1999.
[4] K. Diefendorff. Compaq Chooses SMT for Alpha: Simultaneous Multithreading Exploits Instruction- and Thread-Level Parallelism. Microprocessor Report, Volume 13, Number 16, December 6, 1999.
[5] Y. Endo, Z. Wang, J. B. Chen, and M. I. Seltzer. Using Latency to Evaluate Interactive System Performance. 2nd Symposium on Operating Systems Design and Implementation, pp. 185-199, October 1996.
[6] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge. Thread-level Parallelism of Desktop Applications. Proceedings of the Workshop on Multi-threaded Execution, Architecture and Compilation, Toulouse, France, January 2000.
[7] L. Hammond and K. Olukotun. Considerations in the Design of Hydra: a Multiprocessor-on-a-Chip Microarchitecture. Stanford University Technical Report No. CSL-TR-98-749.
[8] C. Hauser, C. Jacobi, M. Theimer, B. Welch, and M. Weiser. Using Threads in Interactive Systems: A Case Study. Proceedings of the 14th ACM Symposium on Operating Systems Principles, pp. 94-105, December 1993.
[9] D. C. Lee, P. J. Crowley, J. Baer, T. E. Anderson, and B. N. Bershad. Characteristics of Desktop Applications on Windows NT. Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
[10] J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
[11] T. Pering, T. Burd, and R. Brodersen. The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms. Proceedings of the International Symposium on Low Power Electronics and Design 1998, pp. 76-81, June 1998.
[12] J. S. Seng, D. M. Tullsen, and G. Z. N. Cai. Power-Sensitive Multithreaded Architecture. Proceedings of the International Conference on Computer Design 2000, September 2000.
[13] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-chip Parallelism. Proceedings of the 22nd International Symposium on Computer Architecture, pp. 206-218, June 1995.
[14] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for Reduced CPU Energy. Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994.
[15] Microprocessor Architecture for Java Computing. http://www.sun.com/microelectronics/MAJC, Sun Microsystems, 1999.
[16] Press release: Apple Debuts New PowerMac G4s with Dual Processors. http://www.apple.com/pr/library/2000/jul/19g4.html
[17] http://www.bapco.com/sys98k.htm
[18] http://www.zdnet.com/zdbop
