Phillip G. Ezolt, Optimizing Linux Performance: A Hands-On Guide to Linux Performance Tools (Prentice Hall, 2005)

Contents

Copyright
Hewlett-Packard® Professional Books
Preface
    Why Is Performance Important?
    Linux: Strengths and Weakness
    How Can This Book Help You?
    Why Learn How to Use Performance Tools?
    Can I Tune for Performance?
    Who Should Read This Book?
    How Is This Book Organized?
Acknowledgments
About the Author
Chapter 1. Performance Hunting Tips
    Section 1.1. General Tips
    Section 1.2. Outline of a Performance Investigation
    Section 1.3. Chapter Summary
Chapter 2. Performance Tools: System CPU
    Section 2.1. CPU Performance Statistics
    Section 2.2. Linux Performance Tools: CPU
    Section 2.3. Chapter Summary
Chapter 3. Performance Tools: System Memory
    Section 3.1. Memory Performance Statistics
    Section 3.2. Linux Performance Tools: CPU and Memory
    Section 3.3. Chapter Summary
Chapter 4. Performance Tools: Process-Specific CPU
    Section 4.1. Process Performance Statistics
    Section 4.2. The Tools
    Section 4.3. Chapter Summary
Chapter 5. Performance Tools: Process-Specific Memory
    Section 5.1. Linux Memory Subsystem
    Section 5.2. Memory Performance Tools
    Section 5.3. Chapter Summary
Chapter 6. Performance Tools: Disk I/O
    Section 6.1. Introduction to Disk I/O
    Section 6.2. Disk I/O Performance Tools
    Section 6.3. What's Missing?
    Section 6.4. Chapter Summary
Chapter 7. Performance Tools: Network
    Section 7.1. Introduction to Network I/O
    Section 7.2. Network Performance Tools
    Section 7.3. Chapter Summary
Chapter 8. Utility Tools: Performance Tool Helpers
    Section 8.1. Performance Tool Helpers
    Section 8.2. Tools
    Section 8.3. Chapter Summary
Chapter 9. Using Performance Tools to Find Problems
    Section 9.1. Not Always a Silver Bullet
    Section 9.2. Starting the Hunt
    Section 9.3. Optimizing an Application
    Section 9.4. Optimizing a System
    Section 9.5. Optimizing Process CPU Usage
    Section 9.6. Optimizing Memory Usage
    Section 9.7. Optimizing Disk I/O Usage
    Section 9.8. Optimizing Network I/O Usage
    Section 9.9. The End
    Section 9.10. Chapter Summary
Chapter 10. Performance Hunt 1: A CPU-Bound Application (GIMP)
    Section 10.1. CPU-Bound Application
    Section 10.2. Identify a Problem
    Section 10.3. Find a Baseline/Set a Goal
    Section 10.4. Configure the Application for the Performance Hunt
    Section 10.5. Install and Configure Performance Tools
    Section 10.6. Run Application and Performance Tools
    Section 10.7. Analyze the Results
    Section 10.8. Jump to the Web
    Section 10.9. Increase the Image Cache
    Section 10.10. Hitting a (Tiled) Wall
    Section 10.11. Solving the Problem
    Section 10.12. Verify Correctness?
    Section 10.13. Next Steps
    Section 10.14. Chapter Summary
Chapter 11. Performance Hunt 2: A Latency-Sensitive Application (nautilus)
    Section 11.1. A Latency-Sensitive Application
    Section 11.2. Identify a Problem
    Section 11.3. Find a Baseline/Set a Goal
    Section 11.4. Configure the Application for the Performance Hunt
    Section 11.5. Install and Configure Performance Tools
    Section 11.6. Run Application and Performance Tools
    Section 11.7. Compile and Examine the Source
    Section 11.8. Using gdb to Generate Call Traces
    Section 11.9. Finding the Time Differences
    Section 11.10. Trying a Possible Solution
    Section 11.11. Chapter Summary
Chapter 12. Performance Hunt 3: The System-Wide Slowdown (prelink)
    Section 12.1. Investigating a System-Wide Slowdown
    Section 12.2. Identify a Problem
    Section 12.3. Find a Baseline/Set a Goal
    Section 12.4. Configure the Application for the Performance Hunt
    Section 12.5. Install and Configure Performance Tools
    Section 12.6. Run Application and Performance Tools
    Section 12.7. Simulating a Solution
    Section 12.8. Reporting the Problem
    Section 12.9. Testing the Solution
    Section 12.10. Chapter Summary
Chapter 13. Performance Tools: What's Next?
    Section 13.1. The State of Linux Tools
    Section 13.2. What Tools Does Linux Still Need?
    Section 13.3. Performance Tuning on Linux
    Section 13.4. Chapter Summary
Appendix A. Performance Tool Locations
Appendix B. Installing oprofile
    B.1 Fedora Core 2 (FC2)
    B.2 Enterprise Linux 3 (EL3)
    B.3 SUSE 9.1
Index
Copyright
www.hp.com/hpbooks
The author and publisher have taken care in the preparation of this book, but make no
expressed or implied warranty of any kind and assume no responsibility for errors or
omissions. No liability is assumed for incidental or consequential damages in
connection with or arising out of the use of the information or programs contained
herein.
The publisher offers excellent discounts on this book when ordered in quantity for
bulk purchases or special sales, which may include electronic versions and/or custom
covers and content particular to your business, training goals, marketing focus, and
branding interests. For more information, please contact:
(800) 382-3419
International Sales
All rights reserved. Printed in the United States of America. This publication is
protected by copyright, and permission must be obtained from the publisher prior to
any prohibited reproduction, storage in a retrieval system, or transmission in any form
or by any means, electronic, mechanical, photocopying, recording, or likewise. For
information regarding permissions, write to:
Text printed in the United States on recycled paper at RR Donnelley & Sons Company
in Crawfordsville, IN
Dedication
This book is dedicated to my wife Sarah (the best in the world), who gave up so many
weekends to make this book possible. Thank you, Thank you, Thank you!
Preface
Why Is Performance Important?
If you are a system administrator, you have a responsibility to the users of the system
to make sure that it runs at an adequate performance level. If the system runs slowly,
users complain. If you can determine the problem and fix it quickly, they stop
complaining. As a bonus, if you can solve their problem by tuning the application or
operating system (and thus keep them from having to buy new hardware), you make
company bean counters happy. Knowing how to effectively use performance tools can
mean the difference between spending days or spending hours on a performance
problem.
Even with these impressive benefits, the Linux ecosystem still has challenges to
overcome. Linux performance tools are scattered everywhere. Different groups with
different aims develop the tools, and as a result, the tools are not necessarily in a
centralized location. Some tools are included in standard Linux distributions, such as
Red Hat, SUSE, and Debian; others are scattered throughout the Internet. If you're
trying to solve a performance problem, you first have to know that the tools you need
exist, and then figure out where to find them. Because no single Linux performance
tool solves every type of performance problem, you also must figure out how to use
them jointly to determine what is broken. This can be a bit of an art, but becomes
easier with experience. Although most of the general strategies can be documented,
Linux does not have any guide that tells you how to aggregate performance tools to
actually solve a problem. Most of the tools or subsystems have information about
tuning the particular subsystem, but not how to use them with other tools. Many
performance problems span several areas of the system, and unless you know how to
use the tools collectively, you will not be able to fix the problem.
Using the methods in this book, you can make a well-organized and diagnosed
problem description that you can pass on to the original developers. If you're lucky,
they will solve the problem for you.
If you know how to effectively diagnose performance problems, you can take a
targeted approach to solving the problem instead of just taking a shot in the dark and
hoping that it works. If you are an application developer, this means that you can
quickly figure out what piece of code is causing the problem. If you are a system
administrator, it means that you can figure out what part of the system needs to be
tuned, or upgraded, without wasting time unsuccessfully trying different solutions. If
you are an end user, you can figure out which applications are lagging and report the
problem to the developers (or update your hardware, if necessary).
Linux has reached a crossroads. Most of the functionality for a highly productive
system is already complete. The next evolutionary step is for Linux and its applications
to be tuned to compete with and surpass the performance of other operating systems.
Some of this performance optimization has already begun. For example, the SAMBA,
Apache, and TUX Web server projects have, through significant time investments,
tuned and optimized the system and code. Other performance optimizations, such as
the Native POSIX Thread Library (NPTL), which dramatically improves threading
performance, and object prelinking, which improves application startup time, are just
starting to be integrated into Linux. Linux is ripe for performance improvements.
To get to the voilà, you must understand the powerful but sometimes confusing world
of Linux performance tools. This takes some work, but in the end, it is worth it. The
tools can show you aspects of your application and system that you never expected to
see.
Software developers learn how to pinpoint the exact line of source code that causes a
performance problem. System administrators who are performance tuning a system
learn about the tools that show why a system is slowing down, and they can then use
that information to tune the system. Finally, although not the primary focus of the
book, end users learn the basic skills necessary to figure out which applications are
consuming all the system resources.
Chapters 2 through 8 (the bulk of this book) cover the various tools available to
measure different performance statistics on a Linux system. These chapters explain
what various tools measure, how they are invoked, and provide an example of each
tool being used. Each chapter demonstrates tools that measure aspects of different
Linux subsystems, such as system CPU, user CPU, memory, network I/O, and disk I/O.
If a tool measures aspects of more than one subsystem, it is presented in more than
one chapter. Each chapter describes multiple tools, but only the appropriate tool
options for a particular subsystem are presented in a given chapter. The descriptions
follow this format:
1. Introduction This section explains what the tool is meant to measure and how
it operates.
2. Performance tool options This section does not just rehash the tool's
documentation. Instead, it explains which options are relevant to the current
topic and what those options mean. For example, some performance tool man
pages identify the events that a tool measures but do not explain what the
events mean. This section explains the meaning of the events and how they are
relevant to the current subsystem.
3. Example This section provides one or more examples of the tool being used to
measure performance statistics. This section shows the tool being invoked and
any output that it generates.
Chapter 9 is Linux specific and contains a series of steps to use when confronted with
a slow-performing Linux system. It explains how to use the previously described Linux
performance tools in concert to pinpoint the cause of the performance problem. This
chapter is the most useful if you want to start with a misbehaving Linux system and
just diagnose the problem without necessarily understanding the details of the tools.
Chapters 10 through 12 present case studies in which the methodologies and tools
previously described are used together to solve real-world problems. The case studies
highlight Linux performance tools used to find and fix different types of performance
problems: a CPU-bound application, a latency-sensitive application, and an I/O bound
application.
Chapter 13 overviews the performance tools and the opportunities Linux has for improvement.
This book also has two appendixes. Appendix A contains a table of the performance
tools discussed in this book and includes a URL to the latest version of each tool.
Appendix A also identifies which Linux distributions support each particular tool.
Finally, Appendix B explains how to install oprofile, a very powerful but
hard-to-install tool, on a few major Linux distributions.
Acknowledgments
First, I want to thank the good people at Prentice Hall, including Jill Harry, Brenda
Mulligan, Gina Kanouse, and Keith Cline.
Second, I want to thank all the people who reviewed the initial book proposal and
added valuable technical reviews and suggestions, including Karel Baloun, Joe
Brazeal, Bill Carr, Jonathan Corbet, Matthew Crosby, Robert Husted, Paul Lussier,
Scott Mann, Bret Strong, and George Vish II. I also want to thank all the people who
taught me what I know about performance and let me optimize Linux even though the
value of Linux optimization was uncertain at the time, including John Henning, Greg
Tarsa, Dave Stanley, Greg Gaertner, Bill Carr, and the whole BPE tools group (which
supported and encouraged my work on Linux).
In addition, I want to thank the good folks of SPEC who took me in and taught me why
benchmarks, when done well, help the entire industry. I especially want to thank
Kaivalya Dixit, whose passion and integrity for benchmarking will be sorely missed.
Thanks also to all the people who helped me keep my sanity with many games of
Carcassonne and Settlers of Catan, including Sarah Ezolt, Dave and Yoko Mitzel, Tim
and Maureen Chorma, Ionel and Marina Vasilescu, Joe Doucette, and Jim Zawisza.
Finally, I want to thank my family, including Sasha and Mischief, who remind me that
we always have time for a walk or to chase dental floss; Ron and Joni Elias, who cheer
me on; Russell, Carol, and Tracy Ezolt, who gave their support and encouragement as
I worked on this; and to my wife, Sarah, who is the most understanding and supportive
person you can imagine.
If you have never investigated a performance problem, the first steps can be
overwhelming. However, by following a few obvious and nonobvious tips, you can save
time and be well on your way to finding the cause of a performance problem. The goal
of this chapter is to provide you with a series of tips and guidelines to help you hunt a
performance problem. These tips show you how to avoid some of the common traps
when investigating what is wrong with your system or application. Most of these tips
were hard-learned lessons that resulted from wasted time and frustrating dead ends.
These tips help you solve your performance problem quickly and efficiently.
Although no performance investigation is flawless (you will almost always say, "If only
I would have thought of that first"), these tips help you to avoid some of the common
mistakes of a performance investigation.
Probably the most important thing that you can do when investigating a performance
problem is to record every output that you see, every command that you execute, and
every piece of information that you research. A well-organized set of notes allows you
to test a theory about the cause of a performance problem by simply looking at your
notes rather than rerunning tests. This saves a huge amount of time. Write it down to
create a permanent record.
Example: Save the output of cat /proc/pci, dmesg, and uname -a for each test.
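A minimal sketch of saving that configuration snapshot for a test run (the file names here are arbitrary):

uname -a > config_uname.txt       # record the kernel version and architecture
dmesg > config_dmesg.txt          # record the boot and driver messages
cat /proc/pci > config_proc_pci.txt   # record the PCI devices present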
• Save and organize performance results It can be valuable to review
performance results a long time after you run them. Record the results of a test
with the configuration of the system. This allows you to compare how different
configurations affect the performance results. It would be possible just to rerun
the test if needed, but usually testing a configuration is a time-consuming
process. It is more efficient just to keep your notes well organized and avoid
repeating work.
• Write down the command-line invocations As you run performance tools, you
will often create complicated and complex command lines that measure the
exact areas of the system that interest you. If you want to rerun a test, or run
the same test on a different application, reproducing these command lines can
be annoying and hard to do right on the first try. It is better just to record
exactly what you typed. You can then reproduce the exact command line for a
future test, and when reviewing past results, you can also see exactly what you
measured. The Linux command script (described in detail in Chapter 8, "Utility
Tools: Performance Tool Helpers") or "cut and paste" from a terminal is a good
way to do this (see the sketch following this list).
• Record research information and URLs As you investigate a performance
problem, it is important to record relevant information you found on the Internet,
through e-mail, or through personal interactions. If you find a Web site that
describes a similar problem or a possible solution, save its URL and the relevant
details in your notes.
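The script command mentioned above can capture an entire terminal session, commands and output alike; a sketch (the log file name is arbitrary):

script perf_session.txt     # start recording everything typed and printed
vmstat 1 10                 # any tool invocations are captured in the log
exit                        # stop recording; the session is saved in perf_session.txt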
As you collect and record all this information, you may wonder why it is worth the
effort. Some information may seem useless or misleading now, but it might be useful
in the future. (A good performance investigation is like a good detective show:
Although the clues are confusing at first, everything becomes clear in the end.) Keep
the following in mind as you investigate a problem:
All information is useful information (which is why you save it). It might
not be immediately clear why you save information about what tests you
have run or the configuration of the system. It can prove immensely
useful when you try to explain to a developer or manager why a system is
performing poorly. By recording and organizing everything you have seen
during your investigation, you have proof to support a particular theory
and a large base of test results to prove or disprove other theories.
Periodically reviewing your notes can provide new insights. When you
have a big pool of information about your performance problem, review it
periodically. Taking a fresh look allows you to concentrate on the results,
rather than the testing. When many test results are aggregated and
reviewed at the same time, the cause of the problem may present itself.
Looking back at the data you have collected allows you to test theories
without actually running any tests.
Although it is inevitable that you will have to redo some work as you investigate a
problem, the less time that you spend redoing old work, the more efficient you will be.
If you take copious notes and have a method to record the information as you discover
it, you can rely on the work that you have already done and avoid rerunning tests and
redoing research. To save yourself time and frustration, keep reliable and consistent
notes.
For example, if you investigate a performance problem and eventually determine the
cause to be a piece of hardware (slow memory, slow CPU, and so on), you will
probably want to test this theory by upgrading that slow hardware and rerunning the
test. It often takes a while to get new hardware, and a large amount of time might
pass before you can rerun your test. When you are finally able, you want to be able to
run an identical test on the new and old hardware. If you have saved your old test
invocations and your test results, you will know immediately how to configure the test
for the new hardware, and will be able to compare the new results with the old results
that you have stored.
As you start to tweak the system to improve performance, it can become easy to make
mistakes when typing complicated commands. Inadvertently using incorrect
parameters or configurations can generate misleading performance information. It is a
good idea to automate performance tool invocations and application tests.
If you automate as much as you can, you will reduce mistakes. Automation with
scripting can save time and help to avoid misleading information caused by improper
tool and test invocations.
For example, if you are trying to monitor a system during a particular workload or
length of time, you might not be present when the test finishes. It proves helpful to
have a script that, after the test has completed, automatically collects, names, and
saves all the generated performance data and places it automatically in a "Results"
directory. After you have this piece of infrastructure in place, you can rerun your tests
with different optimizations and tunings without worrying about whether the data will
be saved. Instead, you can turn your full attention to figuring out the cause of the
problem rather than managing test results.
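As a sketch of that kind of automation (the monitoring tool, the run_test.sh script, and the directory layout here are assumptions for illustration, not a script from the book):

#!/bin/bash
# Run a workload under a low-overhead monitor and file the results automatically.
RESULTS_DIR=results/$(date +%Y%m%d_%H%M%S)
mkdir -p "$RESULTS_DIR"
uname -a > "$RESULTS_DIR/uname.txt"            # record the system configuration
vmstat 5 > "$RESULTS_DIR/vmstat.txt" &         # start the monitor in the background
MONITOR_PID=$!
./run_test.sh > "$RESULTS_DIR/test_output.txt" 2>&1   # run the workload (hypothetical test script)
kill "$MONITOR_PID"                            # stop the monitor when the test completes
echo "Results saved in $RESULTS_DIR"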
In general, the act of observing a system modifies its behavior. (For you physics buffs,
this is known as the Heisenberg uncertainty principle.)
Specifically, performance tools change the way that the system behaves when you
use them. When you investigate a problem, you want to see how the application
performs and must deal with the error introduced by performance tools. This is a
necessary evil, but you must know that it exists and try to minimize it. Some
performance tools provide a highly accurate view of the system, but use a
high-overhead way of retrieving the information. Tools with a very high overhead
change system behavior more than tools with lower overhead. If you only need a
coarse view of the system, it is better to use the tools with lower overhead even
though they are not as accurate.
Although it would be extraordinarily convenient if you needed only one tool to figure
out the cause of a performance problem, this is rarely the case. Instead, each tool you
use provides a hint of the problem's cause, and you must use several tools in concert
to really understand what is happening. For example, one performance tool may tell
you that the system has a high amount of disk I/O, and another tool may show that the
system is using a large amount of swap. If you base your solution only on the results of
the first tool, you may simply purchase a faster disk drive (and find that the
performance problem has only improved slightly). Using the tools together, however,
you determine that the high amount of disk I/O results from the high amount of swap
that is being used. In this case, you might reduce the swapping by buying more
memory (and thus cause the high disk I/O to disappear).
Using multiple performance tools together often gives you a much clearer picture of
the performance problem than is possible with any single tool.
This is like the old parable of the blind men who each touched a different part of
an elephant and came away with a completely different description of the animal.
Obviously, not one of them had the correct answer. If they had shared and
combined their impressions, however, they might have discovered the truth
about the elephant. Don't be like the blind men with the elephant. Use
multiple performance tools together to verify the cause of a problem.
One of the most exciting and frustrating times during a performance hunt is when a
tool shows an "impossible" result. Something that "cannot" happen has clearly
happened. The first instinct is to believe that the tools are broken. Do not be fooled.
The tools are impartial. Although they can be incorrect, it is more likely that the
application is doing what it should not be doing. Use the tools to investigate the
problem.
When investigating any performance problem, you may find the task overwhelming.
Do not go it alone. Ask the developers whether they have seen similar problems. Try to
find someone else who has already solved the problem that you are experiencing.
Search the Web for similar problems and, hopefully, solutions. Send e-mail to user
lists and to developers.
This piece of advice comes with a word of warning: Even the developers who think
that they know their applications are not always right. If the developer disagrees with
the performance tool data, the developer might be wrong. Show developers your data
and how you came to a particular conclusion. They will usually help you to reinterpret
the data or fix the problem. Either way, you will be a little bit further along in your
investigation. Do not be afraid to disagree with developers if your data shows
something happening that should not be happening.
For example, you can often solve performance problems by following instructions you
find from a Google search of similar problems. When investigating a Linux problem,
many times, you will find that others have run into it before (even if it was years ago)
and have reported a solution on a public mailing list. It is easy to use Google, and it
can save you days of work.
To figure out when you have finished, you must create or use an already established
metric of your system's performance. A metric is an objective measurement that
indicates how the system is performing. For example, if you are optimizing a Web
server, you could choose "serviced Web requests per second." If you do not have an
objective way to measure the performance, it can be nearly impossible to determine
whether you are making any progress as you tune the system.
After you figure out how you are going to measure the performance of a particular
system or application, it is important to determine your current performance levels.
Run the application and record its performance before any tuning or optimization; this
is called the baseline value, and it is the starting point for the performance
investigation.
After you pick a metric and baseline for the performance, it is important to pick a
target. This target guides you to the end of the performance hunt. You can indefinitely
tweak a system, and you can always get it just a little better with more and more time.
If you pick your target, you will know when you have finished. To pick a reasonable goal,
the following are good starting points:
• Find others with a similar configuration and ask for their performance
measurements This is an ideal situation. If you can find someone with a similar
system that performs better, not only will you be able to pick a target for your
system, you may also be able to work with that person to determine why your
configuration is slower and how your configurations differ. Using another
system as a reference can prove immensely useful when investigating a
problem.
• Find results of industry standard benchmarks Many Web sites compare
benchmark results of various aspects of computing systems. Some of the
benchmark results can be achieved only with a heroic effort, so they might not
represent realistic use. However, many benchmark sites have the configuration
used for particular results. These configurations can provide clues to help you
tune the system.
• Use your hardware with a different OS or application It may be possible to run
a different application on your system with a similar function. For example, if
you have two different Web servers, and one performs slowly, try a different one
to see whether it performs any better. Alternatively, try running the same
application on a different operating system. If the system performs better in
either of these cases, you know that your original application has room for
improvement.
If you use existing performance information to guide your target goal, you have a
much better chance of picking a target that is aggressive but not impossible to reach.
Use the performance tools to take a first cut at determining the cause of the problem.
By taking an initial rough cut at the problem, you get a high-level idea of the problem.
The goal of the rough cut is to gather enough information to pass along to the other
users and developers of this program, so that they can provide advice and tips. It is
vitally important to have a well-written explanation of what you think the problem
might be and what tests led you to that conclusion.
Your next goal should be to determine whether others have already solved the
problem. A performance investigation can be a lengthy and time-consuming affair. If
you can just reuse the work of others, you will be done before you start. Because your
goal is simply to improve the performance of the system, the best way to solve a
performance problem is to rely on what someone else has already done.
Although you must take specific advice regarding performance problems with a grain
of salt, the advice can be enlightening, enabling you to see how others may have
investigated a similar problem, how they tried to solve the problem, and whether they
succeeded.
• Search the Web for similar error messages/problems This is usually my first
line of investigation. Web searches often reveal lots of information about the
application or the particular error condition that you are seeing. They can also
lead to information about another user's attempt to optimize the systems, and
possibly tips about what worked and what did not. A successful search can yield
pages of information that directly applies to your performance problem.
Searching with Google or Google groups is a particularly helpful way to find
people with similar performance problems.
• Ask for help on the application mailing lists Most popular or publicly developed
software has an e-mail list of people who use that software. This is a perfect
place to find answers to performance questions. The readers and contributors
are usually experienced at running the software and making it perform well.
Search the archive of the mailing list, because someone may have asked about a
similar problem. Subsequent replies to the original message might describe a
solution. If they do not, send an e-mail to the person who originally wrote about
the problem and ask whether he or she figured out how to resolve it. If that
fails, or no one else had a similar problem, send an e-mail describing your
problem to the list; if you are lucky, someone may have already solved your
problem.
• Send an e-mail to the developer Most Linux software includes the e-mail
address of the developer somewhere in the documentation. If an Internet search
and the mailing list fails, you can try to send an e-mail to the developer directly.
Developers are usually very busy, so they might not have time to answer.
However, they know the application better than anyone else. If you can provide
the developer with a coherent analysis of the performance problem, and are
willing to work with the developer, he or she might be able to help you.
Although his idea of the cause of the performance problem might not be
correct, the developer might point you in a fruitful direction.
• Talk to the in-house developers Finally, if this is a product being developed
in-house, you can call or e-mail the in-house developers. This is pretty much the
same as contacting the external developers, but the in-house people might be
able to devote more time to your problem or point you to an internal knowledge
base.
By relying on the work of others, you might be able to solve your problem before you
even begin to investigate. At the very least, you will most likely be able to find some
promising avenues to investigate, so it is always best to see what others have found.
Now that you have exhausted the possibility of someone else solving the problem, the
performance investigation must begin. Later chapters describe the tools and methods
in detail, but here are a few tips to make things work better:
Following these tips can help you avoid false leads and help to determine the cause of
a performance problem.
As mentioned previously, it is really important to document what you are doing so that
you can go back at a later date and review it. If you have hunted down the
performance problem, you will have a big file of notes and URLs fresh in your mind.
They may be a jumbled, disorganized mess, but as of now, you understand what they
mean and how they are organized. After you solve the problem, take some time to
rewrite what you have discovered and why you think that it is true. Include the test
results and data that led you to that conclusion.
This chapter provided a basic background for a performance investigation, and the
following chapters cover the Linux-specific performance tools themselves. You learn
how to use the tools, what type of information they can provide, and how to use them
in combination to find performance problems on a particular system.
The performance tools commonly show the number of processes that are runnable and
the number of processes that are blocked waiting for I/O. Another common system
statistic is that of load average. The load on a system is the total number of running
and runnable processes. For example, if two processes were running and three were
available to run, the system's load would be five. The load average is the amount of
load over a given amount of time. Typically, the load average is taken over 1 minute, 5
minutes, and 15 minutes. This enables you to see how the load changes over time.
Most modern processors can run only one process or thread at a time. Although some
processors, such as hyperthreaded processors, can actually run more than one process
simultaneously, Linux treats them as multiple single-threaded processors. To create
the illusion that a given single processor runs multiple tasks simultaneously, the Linux
kernel constantly switches between different processes. The switch between different
processes is called a context switch, because when it happens, the CPU saves all the
context information from the old process and retrieves all the context information for
the new process. The context contains a large amount of information that Linux tracks
for each process, including, among others, which instruction the process is executing,
which memory it has allocated, and which files the process has open. Switching these
contexts can involve moving a large amount of information, and a context switch can
be quite expensive. It is a good idea to minimize the number of context switches if
possible.
To avoid context switches, it is important to know how they can happen. First, context
switches can result from kernel scheduling. To guarantee that each process receives a
fair share of processor time, the kernel periodically interrupts the running process
and, if appropriate, the kernel scheduler decides to start another process rather than
let the current process continue executing. It is possible that your system will context
switch every time this periodic interrupt or timer occurs. The number of timer
interrupts per second varies per architecture and kernel version. One easy way to
check how often the interrupt fires is to use the /proc/interrupts file to determine
the number of interrupts that have occurred over a known amount of time. This is
demonstrated in Listing 2.1.
Listing 2.1.
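A sketch of this kind of check (the exact /proc/interrupts layout varies with kernel version and architecture):

grep timer /proc/interrupts   # note the current timer interrupt count
sleep 10                      # wait a known amount of time
grep timer /proc/interrupts   # note it again; the difference divided by 10 is interrupts/sec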
In Listing 2.1, we ask the kernel to show us how many times the timer has fired, wait
10 seconds, and then ask again. That means that on this machine, the timer fires at a
rate of (24,070,093 - 24,060,043) interrupts / (10 seconds), or ~1,000 interrupts/sec. If
you have significantly more context switches than timer interrupts, the context
switches are most likely caused by an I/O request or some other long-running system
call (such as a sleep). When an application requests an operation that cannot
complete immediately, the kernel starts the operation, saves the requesting process,
and tries to switch to another process if one is ready. This allows the processor to
keep busy if possible.
2.1.3. Interrupts
CPU utilization is a straightforward concept. At any given time, the CPU can be doing
one of seven things. First, it can be idle, which means that the processor is not
actually doing any work and is waiting for something to do. Second, the CPU can be
running user code, which is specified as "user" time. Third, the CPU can be executing
code in the Linux kernel on behalf of the application, which is "system" time. The
remaining states are "nice" (user code running at a lowered priority), "irq" and
"softirq" (time spent servicing hardware and software interrupts), and "iowait" (time
spent idle while waiting for an I/O operation to complete).
Most performance tools specify these values as a percentage of the total CPU time.
These times can range from 0 percent to 100 percent, but together they total 100 percent.
A system with a high "system" percentage is spending most of its time in the kernel.
Tools such as oprofile can help determine where this time is being spent. A system
that has a high "user" time spends most of its time running applications. The next
chapter shows how to use performance tools to track down problems in these cases. If
a system is spending most of its time iowait when it should be doing work, it is most
likely waiting for I/O from a device. It may be a disk, network card, or something else
causing the slowdown.
vmstat stands for virtual memory statistics, which indicates that it will give you
information about the virtual memory system performance of your system.
Fortunately, it actually does much more than that. vmstat is a great command to get a
rough idea of how your system performs as a whole. It tells you, among other things,
how many processes are running or blocked on I/O, how the CPU is dividing its time,
and how many interrupts and context switches are occurring, making it an excellent
tool to use to get a rough idea of how the system performs.
vmstat can be run in two modes: sample mode and average mode. If no parameters
are specified, vmstat runs in average mode, where vmstat displays the average
value for all the statistics since system boot. However, if a delay is specified, the first
sample will be the average since system boot, but after that vmstat samples the
system every delay seconds and prints out the statistics. Table 2-1 describes the
options that vmstat accepts.
Option   Explanation

-n       By default, vmstat periodically prints out the column headers for each
         performance statistic. This option disables that feature so that after the
         initial header, only performance data displays. This proves helpful if you
         want to import the output of vmstat into a spreadsheet.

-s       This displays a one-shot details output of system statistics that vmstat
         gathers. The statistics are the totals since the system booted.

delay    The number of seconds to wait between samples. If a delay is specified,
         vmstat runs in sample mode rather than average mode.
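Putting the modes and options together, typical invocations look like the following sketch:

vmstat          # average mode: one line of averages since system boot
vmstat 5        # sample mode: a new line of statistics every 5 seconds until interrupted
vmstat 5 10     # sample mode: 10 samples, 5 seconds apart
vmstat -n 5     # sample mode, printing the column headers only once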
vmstat provides a variety of different output statistics that enable you to track
different aspects of the system performance. Table 2-2 describes those related to CPU
performance. The next chapter covers those related to memory performance.
Column   Explanation

r        This is the number of currently runnable processes. These processes are not
         waiting on I/O and are ready to run. Ideally, the number of runnable processes
         would match the number of CPUs available.

b        This is the number of processes blocked and waiting for I/O to complete.

forks    This is the number of new processes that have been created (forked).

in       This is the number of interrupts that have occurred.

cs       This is the number of context switches that have occurred.

us       This is the total CPU time as a percentage spent on user processes (including
         "nice" time).

sy       This is the total CPU time as a percentage spent in system code. This includes
         time spent in the system, irq, and softirq state.

wa       This is the total CPU time as a percentage spent waiting for I/O.

id       This is the total CPU time as a percentage that the system is idle.
vmstat provides a good low-overhead view of system performance. Because all the
performance statistics are in text form and are printed to standard output, it is easy to
capture the data generated during a test and process or graph it later. Because
vmstat is such a low-overhead tool, it is practical to keep it running on a console or in
a window even on a very heavily loaded server when you need to monitor the health of
the system at a glance.
Listing 2.2.
Although vmstat's statistics since system boot can be useful to determine how heavily
loaded the machine has been, vmstat is most useful when it runs in sampling mode, as
shown in Listing 2.3. In sampling mode, vmstat prints the system statistics every delay
seconds, and repeats this count times. The first line of statistics in Listing 2.3 contains
the system averages since boot, as before, but the periodic samples continue after that.
This example shows that there is very little activity on the system. We can see that no
processes were blocked during the run by looking at the 0 in the b column. We can also
see, by looking at the r column, that fewer than one process, on average, was running
when vmstat sampled its data.
vmstat is an excellent way to record how a system behaves under a load or during a
test condition. You can use vmstat to display how the system is behaving and, at the
same time, save the result to a file by using the Linux tee command. (Chapter 8,
"Utility Tools: Performance Tool Helpers," describes the tee command in more detail.)
If you only pass in the delay parameter, vmstat will sample indefinitely. Just start it
before the test, and interrupt it after the test has completed. The output file can be
imported into a spreadsheet, and used to see how the system reacts to the load or
various system events. Listing 2.4 shows the output of this technique. In this example,
we can look at the interrupt and context switches that the system is generating. We
can see the total number of interrupts and context switches in the in and cs columns
respectively.
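A sketch of that technique (the output file name is arbitrary):

vmstat 1 | tee vmstat_run1.txt   # watch the samples live and save them; press Ctrl-C when the test is done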
The number of context switches looks good compared to the number of interrupts. The
scheduler is switching processes less than the number of timer interrupts that are
firing. This is most likely because the system is nearly idle, and most of the time when
the timer interrupt fires, the scheduler does not have any work to do, so it does not
switch from the idle process.
(Note: There is a bug in the version of vmstat that generated the following output. It
causes the system average line of output to display incorrect values. This bug has
been reported to the maintainer of vmstat and will be fixed soon, hopefully.)
Listing 2.4.
More recent versions of vmstat can even extract more detailed information about a
grab bag of different system statistics, as shown in Listing 2.5.
The next chapter discusses the memory statistics, but we look at the CPU statistics
now. The first group of statistics, or "CPU ticks," shows how the CPU has spent its
time since system boot, where a "tick" is a unit of time. Although the condensed
vmstat output only showed four CPU states (us, sy, id, and wa), this shows how all the
CPU ticks are distributed. In addition, we can see the total number of interrupts and
context switches. One new addition is that of forks, which is basically the number of
new processes that have been created since system boot.
Listing 2.5.
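The detailed totals shown in Listing 2.5 are most likely produced with the -s option described in Table 2-1; a sketch of the invocation (output omitted):

vmstat -s   # one-shot totals since boot: CPU ticks, forks, interrupts, context switches, and more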
top is the Swiss army knife of Linux system-monitoring tools. It does a good job of
putting a very large amount of system-wide performance information in a single
screen. What you display can also be changed interactively; so if a particular problem
creeps up as your system runs, you can modify what top is showing you.
2.2.2.1 CPU Performance-Related Options
top actually takes options in two modes: command-line options and runtime options.
The command-line options determine how top displays its information. Table 2-3
shows the command-line options that influence the type and frequency of the
performance statistics that top displays.
Option         Explanation

d delay        The number of seconds between updates of the displayed statistics.

n iterations   Number of iterations before exiting. top updates the statistics iterations
               times.

Other options control whether all the individual threads of an application are shown
rather than just a total for each application and whether, on a hyperthreaded or SMP
system, top displays the summed CPU statistics rather than the statistics for each CPU.
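As Table 2-3 suggests, this older version of top takes its options without leading dashes; an invocation sketch:

top d 5         # update the display every 5 seconds
top d 5 n 12    # update every 5 seconds and exit after 12 iterations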
As you run top, you might want to fine-tune what you are observing to investigate a
particular problem. The output of top is highly customizable. Table 2-4 describes
options that change statistics shown during top's runtime.
Option   Explanation

f or F   This displays a configuration screen that enables you to select which process
         statistics display on the screen.

o or O   This displays a configuration screen that enables you to change the order of
         the displayed statistics.
The options described in Table 2-5 turn on or off the display of various system-wide
information. It can be helpful to turn off unneeded statistics to fit more processes on
the screen.
The toggles include: whether the load average and uptime information is updated and
displayed; how each CPU spends its time, along with how many processes are currently
running; whether all the individual threads of an application are shown instead of just
a total for each application; and whether information about the system memory usage
is shown on the screen.

By default, the highest CPU consumers are displayed first. However, it
might be more useful to sort by other characteristics.
Table 2-6 describes the different sorting modes that top supports. Sorting by memory
consumption is particularly useful to figure out which process consumes the most
amount of memory.
Table 2-6. top Output Sorting/Display Options

The available modes sort the tasks by their CPU usage (the highest CPU user displays
first); by the amount of CPU time they have used so far (the highest amount displays
first); by their PID (the lowest PID displays first); or by their age (the newest PID is
shown first, which is usually the opposite of sorting by PID). Another option hides
tasks that are idle and are not consuming CPU.
The following are the CPU-related statistics that top displays.

Statistic      Explanation

us             The percentage of CPU time spent running user code.

sy             The percentage of CPU time spent running system (kernel) code.

ni             The percentage of CPU time spent running user code at a modified ("nice")
               priority.

id             The percentage of CPU time spent idle.

wa             The percentage of CPU time spent waiting for I/O to complete.

hi             The percentage of CPU time spent servicing hardware interrupts (irq).

si             The percentage of CPU time spent servicing software interrupts (softirq).

load average   The 1-minute, 5-minute, and 15-minute load averages of the system.

%CPU           The percentage of CPU that a particular process is currently consuming.

PRI            The priority value of the process, where a higher value indicates a higher
               priority. RT indicates that the task has real-time priority, a priority
               higher than the standard range.

NI             The nice value of the process. The higher the nice value, the less the
               system has to execute the process. Processes with high nice values tend to
               have very low priorities.

WCHAN          If a process is waiting on an I/O, this shows which kernel function it is
               waiting in.

STAT           This is the current status of a process, where the process is either
               sleeping (S), running (R), zombied (killed but not yet dead) (Z), in an
               uninterruptable sleep (D), or being traced (T).

TIME           The total amount of CPU time (user and system) that this process has used
               since it started executing.

COMMAND        The command that this process is running.

LC             The number of the last CPU that this process was executing on.

FLAGS          The kernel flags currently set for the process.
top provides a large amount of information about the different running processes and
is a great way to figure out which process is a resource hog.
Listing 2.6 is an example run of top. Once it starts, it periodically updates the screen
until you exit it. This demonstrates some of the system-wide statistics that top can
generate. First, we see the load average of the system over the past 1, 5, and 15
minutes. As we can see, the system has started to get busy recently (because the
doom.x86 process was just started). One CPU is busy with user code 90 percent of the time. The other is only
spending ~13 percent of its time in user code. Finally, we can see that 73 of the
processes are sleeping, and only 3 of them are currently running.
Listing 2.6.
catan> top
08:09:16 up 2 days, 18:44, 4 users, load average: 0.95, 0.44, 0.17
76 processes: 73 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 51.5% 0.0% 3.9% 0.0% 0.0% 0.0% 44.6%
cpu00 90.0% 0.0% 1.2% 0.0% 0.0% 0.0% 8.8%
cpu01 13.0% 0.0% 6.6% 0.0% 0.0% 0.0% 80.4%
Mem: 2037140k av, 1132120k used, 905020k free, 0k shrd, 86220k buff
689784k active, 151528k inactive
Swap: 2040244k av, 0k used, 2040244k free 322648k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
7642 root 25 0 647M 379M 7664 R 49.9 19.0 2:58 0 doom.x86
7661 ezolt 15 0 1372 1372 1052 R 0.1 0.0 0:00 1 top
1 root 15 0 528 528 452 S 0.0 0.0 0:05 1 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 migration/0
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 migration/1
4 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 keventd
5 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
6 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
7 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 kswapd
8 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 kscand
10 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 kupdated
11 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
Now pressing F while top is running brings up the configuration screen shown in Listing
2.7. When you press the keys indicated (A for PID, B for PPID, etc.), top toggles
whether these statistics display in the previous screen. When all the desired statistics
are selected, press Enter to return to top's initial screen, which now shows the
current values of selected statistics. When configuring the statistics, all currently
activated fields are capitalized in the Current Field Order line and have an asterisk (*)
next to their name.
Listing 2.7.
To show you how customizable top is, Listing 2.8 shows a highly configured output
screen, which shows only the top options relevant to CPU usage.
Listing 2.8.
Mem: 2037140k av, 1133548k used, 903592k free, 0k shrd, 86232k buff
690812k active, 151536k inactive
Swap: 2040244k av, 0k used, 2040244k free 322656k cached
PID USER PRI NI WCHAN FLAGS LC STAT %CPU TIME CPU COMMAND
7642 root 25 0 100100 1 R 49.6 10:30 1 doom.x86
1 root 15 0 400100 0 S 0.0 0:05 0 init
2 root RT 0 140 0 SW 0.0 0:00 0 migration/0
3 root RT 0 140 1 SW 0.0 0:00 1 migration/1
4 root 15 0 40 0 SW 0.0 0:00 0 keventd
5 root 34 19 40 0 SWN 0.0 0:00 0 ksoftirqd/0
6 root 34 19 40 1 SWN 0.0 0:00 1 ksoftirqd/1
9 root 25 0 40 0 SW 0.0 0:00 0 bdflush
7 root 15 0 840 0 SW 0.0 0:00 0 kswapd
8 root 15 0 40 0 SW 0.0 0:00 0 kscand
10 root 15 0 40 0 SW 0.0 0:00 0 kupdated
11 root 25 0 40 0 SW 0.0 0:00 0 mdrecoveryd
20 root 15 0 400040 0 SW 0.0 0:00 0 katad-1
The version of top provided by the most recent distributions has recently been
completely overhauled, and as a result, many of the command-line and interaction options have
options have changed. Although the basic ideas are similar, it has been streamlined,
and a few different display modes have been added.
Again, top presents a list, in decreasing order, of the top CPU-consuming processes.
top actually takes options in two modes: command-line options and runtime options.
The command-line options determine how top displays its information. Table 2-8
shows the command-line options that influence the type and frequency of the
performance statistics that top will display.
Table 2-8. top Command-Line Options

Option          Explanation

-d delay        The number of seconds between updates of the displayed statistics.

-n iterations   Number of iterations before exiting. top updates the statistics iterations
                times.

-i              Do not display idle processes.

-b              Run in batch mode. Typically, top shows only a single screenful of
                information, and processes that don't fit on the screen never display. This
                option shows all the processes and can be very useful if you are saving
                top's output to a file or piping the output to another command for
                processing.
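Batch mode is the easiest way to capture top's output for later analysis; a sketch (file names arbitrary):

top -b -n 1 > top_snapshot.txt      # one full snapshot of every process, saved to a file
top -b -d 5 -n 12 > top_run.txt     # 12 samples, 5 seconds apart, suitable for later processing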
As you run top, you may want to fine-tune what you are observing to investigate a
particular problem. Like the 2.x version of top, the output of top is highly
customizable. Table 2-9 describes options that change statistics shown during top's
runtime.
The runtime options include: an "alternate" display of process information that shows
the top consumers of various system resources; a toggle that controls whether top
divides the CPU usage by the number of CPUs on the system (for example, if a process
was consuming all of both CPUs on a two-CPU system, this toggles whether top displays
a CPU usage of 100% or 200%); a configuration screen that enables you to select which
process statistics display on the screen; and a configuration screen that enables you to
change the order of the displayed statistics.
The options described in Table 2-10 turn on or off the display of various system-wide
information. It can be helpful to turn off unneeded statistics to fit more processes on
the screen.
Option          Explanation

1 (numeral 1)   This toggles whether the CPU usage will be broken down to the individual
                usage or shown as a total.

Another toggle controls whether the load average and uptime information will be
updated and displayed.
The CPU-related statistics that this version of top displays are similar to those of
top v2.x.

Statistic      Explanation

us             The percentage of CPU time spent running user code.

sy             The percentage of CPU time spent running system (kernel) code.

ni             The percentage of CPU time spent running user code at a modified ("nice")
               priority.

id             The percentage of CPU time spent idle.

wa             The percentage of CPU time spent waiting for I/O to complete.

hi             The percentage of CPU time spent servicing hardware interrupts (irq).

si             The percentage of CPU time spent servicing software interrupts (softirq).

load average   The 1-minute, 5-minute, and 15-minute load averages of the system.

%CPU           The percentage of CPU that a particular process is currently consuming.

PRI            The priority value of the process, where a higher value indicates a higher
               priority. RT indicates that the task has real-time priority, a priority
               higher than the standard range.

NI             The nice value of the process. The higher the nice value, the less the
               system has to execute the process. Processes with high nice values tend to
               have very low priorities.

WCHAN          If a process is waiting on an I/O, this shows which kernel function it is
               waiting in.

TIME           The total amount of CPU time (user and system) that this process has used
               since it started executing.

S              This is the current status of a process, where the process is either
               sleeping (S), running (R), zombied (killed but not yet dead) (Z), in an
               uninterruptable sleep (D), or being traced (T).
top provides a large amount of information about the different running processes and
is a great way to figure out which process is a resource hog. The v.3 version of top has
been trimmed down and some alternative views of similar data have been added.
Listing 2.9 is an example run of top v3.0. Again, it will periodically update the screen
until you exit it. The statistics are similar to those of top v2.x, but are named slightly
differently.
Listing 2.9.
catan> top
top - 08:52:21 up 19 days, 21:38, 17 users, load average: 1.06, 1.13, 1.15
Tasks: 149 total, 1 running, 146 sleeping, 1 stopped, 1 zombie
Cpu(s): 0.8% us, 0.4% sy, 4.2% ni, 94.2% id, 0.1% wa, 0.0% hi, 0.3% si
Mem: 1034320k total, 1023188k used, 11132k free, 39920k buffers
Swap: 2040244k total, 214496k used, 1825748k free, 335488k cached
Now pressing f while top is running brings up the configuration screen shown in Listing
2.10. When you press the keys indicated (A for PID, B for PPID, etc.), top toggles
whether these statistics display in the previous screen.
Listing 2.10.
Listing 2.11 shows the new output mode of top, where many different statistics are
sorted and displayed on the same screen.
Listing 2.11.
30403 24989 0:00.03 0.0 0.1 15 0 S 5808 4356 1452 9336 bash
29510 29505 7:19.59 0.0 5.9 16 0 S 125m 65m 59m 9336 firefox-bin
29505 29488 0:00.00 0.0 0.1 16 0 S 5652 4576 1076 9336 run-mozilla.sh
3 PID %MEM VIRT SWAP RES CODE DATA SHR nFLT nDRT S PR NI %CPU COMMAND
8414 25.0 374m 121m 252m 496 373m 98m 1547 0 S 16 0 0.0 soffice.bin
26364 6.8 400m 331m 68m 1696 398m 321m 2399 0 S 15 0 2.6 X
29510 5.9 125m 65m 59m 64 125m 31m 253 0 S 16 0 0.0 firefox-bin
26429 4.7 59760 10m 47m 404 57m 12m 1247 0 S 15 0 0.0 metacity
4 PID PPID UID USER RUSER TTY TIME+ %CPU %MEM S COMMAND
1371 1 43 xfs xfs ? 0:00.10 0.0 0.1 S xfs
1313 1 51 smmsp smmsp ? 0:00.08 0.0 0.2 S sendmail
982 1 29 rpcuser rpcuser ? 0:00.07 0.0 0.1 S rpc.statd
963 1 32 rpc rpc ? 0:06.23 0.0 0.1 S portmap
top v3.x provides a slightly cleaner interface to top. It simplifies some aspects of it
and provides a nice "summary" information screen that displays many of the resource
consumers in the system.
Table 2-12 describes the different options that change the output and the frequency of
the samples that procinfo displays.
Option
Explanation
-f
-d
-D
-n sec
-Ffile
Statistic   Explanation

user        This is the amount of user time that the CPU has spent in days, hours, and
            minutes.

nice        This is the amount of nice time that the CPU has spent in days, hours, and
            minutes.

system      This is the amount of system time that the CPU has spent in days, hours, and
            minutes.

idle        This is the amount of idle time that the CPU has spent in days, hours, and
            minutes.

irq 0-N     This displays the number of the irq, the amount that has fired, and which
            kernel driver is responsible for it.
Much like vmstat or top, procinfo is a low-overhead command that is good to leave
running in a console or window on the screen. It gives a good indication of a system's
health and performance.
Calling procinfo without any command options yields output similar to Listing 2.12.
Without any options, procinfo displays only one screenful of status and then exits.
procinfo is more useful when it is periodically updated using the -n sec option, which
enables you to see how the system's performance is changing in real time.
Listing 2.12.
As you can see from Listing 2.12, procinfo provides a reasonable overview of the
system. We can see from the user, nice, system, and idle times that, once again, the
system is not very busy. One interesting thing to notice is that procinfo claims that
the system has spent more idle time than the time the system has been up (as indicated
by the uptime). This is because the system actually has four CPUs, so for every day of
wall time, four days of CPU time pass. The load average confirms that the system
has been relatively work-free for the recent past. For the past minute, on average,
the system had less than one process ready to run; a load average of 0.47 indicates that
a single process was ready to run only 47 percent of the time. On a four-CPU system,
much of the available CPU power is going to waste.
procinfo also gives us a good view of what devices on the system are causing
interrupts. We can see that the Nvidia card (nvidia), IDE controller (ide0), Ethernet
device (eth0), and sound card (es1371) have a relatively high number of interrupts.
This is as one would expect for a desktop workstation.
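For example, to keep procinfo running and have it refresh its statistics every five
seconds, you might invoke it as follows (a sketch; the exact option spelling may vary
slightly between procinfo versions):
procinfo -n 5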
2.2.5. gnome-system-monitor
gnome-system-monitor can be invoked from the Gnome menu. (Under Red Hat 9 and
greater, this is under System Tools > System Monitor.) However, it can also be
invoked using the following command:
gnome-system-monitor
Figure 2-1.
Figure 2-2 shows a graphical view of system load and memory usage. This is really
what distinguishes gnome-system-monitor from top. You can easily see the current
state of the system and how it compares to the previous state.
Figure 2-2.
The graphical view of data provided by gnome-system-monitor can make it easier and
faster to determine the state of the system, and how its behavior changes over time. It
also makes it easier to navigate the system-wide process information.
2.2.6. mpstat
mpstat is a fairly simple command that shows you how your processors have been
behaving over time. The biggest benefit of mpstat is that it shows the time next to the
statistics, so you can look for a correlation between CPU usage and time of day.
If you have multiple CPUs or hyperthreading-enabled CPUs, mpstat can also break
down CPU usage based on the processor, so you can see whether a particular
processor is doing more work than the others. You can select which individual
processor you want to monitor or you can ask mpstat to monitor all of them.
Once again, delay specifies how often the samples will be taken, and count
determines how many times it will be run. Table 2-14 describes the command-line
options of mpstat.
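The general form of the invocation is roughly the following (a sketch based on the
options described here; check the mpstat man page on your system for the
authoritative syntax):
mpstat [ -P { cpu | ALL } ] [ delay [ count ] ]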
Option
Explanation
-P { cpu | ALL }
This option tells mpstat which CPUs to monitor. cpu is the number between 0 and the
total CPUs minus 1.
delay
The amount of time, in seconds, that mpstat waits between samples.
count
The total number of samples mpstat takes before exiting.
mpstat provides similar information to the other CPU performance tools, but it allows
the information to be attributed to each of the processors in a particular system. Table
2-15 describes the statistics that it provides.
Option
Explanation
user
This is the percentage of user time that the CPU has spent during the previous
sample.
nice
This is the percentage of time that the CPU has spent during the previous sample
running low-priority (or nice) processes.
system
This is the percentage of system time that the CPU has spent during the previous
sample.
iowait
This is the percentage of time that the CPU has spent during the previous sample
waiting on I/O.
irq
This is the percentage of time that the CPU has spent during the previous sample
handling interrupts.
softirq
This is the percentage of time that the CPU has spent during the previous sample
handling work that needed to be done by the kernel after an interrupt has been
handled.
idle
This is the percentage of time that the CPU has spent idle during the previous sample.
mpstat is a good tool for providing a breakdown of how each of the processors is
performing. Because mpstat provides a per-CPU breakdown, you can identify whether
one of the processors is becoming overloaded.
First, we ask mpstat to show us the CPU statistics for processor number 0. This is
shown in Listing 2.13.
Listing 2.13.
07:12:02 PM CPU %user %nice %sys %iowait %irq %soft %idle intr/s
07:12:03 PM 0 9.80 0.00 1.96 0.98 0.00 0.00 87.25 1217.65
07:12:04 PM 0 1.01 0.00 0.00 0.00 0.00 0.00 98.99 1112.12
07:12:05 PM 0 0.99 0.00 0.00 0.00 0.00 0.00 99.01 1055.45
07:12:06 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1072.00
07:12:07 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1075.76
07:12:08 PM 0 1.00 0.00 0.00 0.00 0.00 0.00 99.00 1067.00
07:12:09 PM 0 4.90 0.00 3.92 0.00 0.00 0.98 90.20 1045.10
07:12:10 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1069.70
07:12:11 PM 0 0.99 0.00 0.99 0.00 0.00 0.00 98.02 1070.30
07:12:12 PM 0 3.00 0.00 4.00 0.00 0.00 0.00 93.00 1067.00
Average: 0 2.19 0.00 1.10 0.10 0.00 0.10 96.51 1085.34
Listing 2.14 shows a similar command run on a mostly idle system whose two
hyperthreaded CPUs appear as four logical processors. You can see how the stats for
all the CPUs are shown. One interesting observation in this output is the fact that one
CPU seems to handle all the interrupts. If the system were heavily loaded with I/O,
and all the interrupts were being handled by a single processor, this could be the cause
of a bottleneck, because one CPU is overwhelmed, and the rest are waiting for work to
do. You would be able to see this with mpstat, if the processor handling all the
interrupts had no idle time, whereas the other processors did.
Listing 2.14.
07:13:21 PM CPU %user %nice %sys %iowait %irq %soft %idle intr/s
07:13:22 PM all 3.98 0.00 1.00 0.25 0.00 0.00 94.78 1322.00
07:13:22 PM 0 2.00 0.00 0.00 1.00 0.00 0.00 97.00 1137.00
07:13:22 PM 1 6.00 0.00 2.00 0.00 0.00 0.00 93.00 185.00
07:13:22 PM 2 1.00 0.00 0.00 0.00 0.00 0.00 99.00 0.00
07:13:22 PM 3 8.00 0.00 1.00 0.00 0.00 0.00 91.00 0.00
07:13:22 PM CPU %user %nice %sys %iowait %irq %soft %idle intr/s
07:13:23 PM all 2.00 0.00 0.50 0.00 0.00 0.00 97.50 1352.53
07:13:23 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1135.35
07:13:23 PM 1 6.06 0.00 2.02 0.00 0.00 0.00 92.93 193.94
07:13:23 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 101.01 16.16
07:13:23 PM 3 1.01 0.00 1.01 0.00 0.00 0.00 100.00 7.07
Average: CPU %user %nice %sys %iowait %irq %soft %idle intr/s
Average: all 2.99 0.00 0.75 0.12 0.00 0.00 96.13 1337.19
Average: 0 1.01 0.00 0.00 0.50 0.00 0.00 98.49 1136.18
Average: 1 6.03 0.00 2.01 0.00 0.00 0.00 92.96 189.45
Average: 2 0.50 0.00 0.00 0.00 0.00 0.00 100.00 8.04
Average: 3 4.52 0.00 1.01 0.00 0.00 0.00 95.48 3.52
mpstat can be used to determine whether the CPUs are fully utilized and relatively
balanced. By observing the number of interrupts each CPU is handling, it is possible to
find an imbalance. Details on how to control where interrupts are routed are provided
in the kernel source under Documentation/IRQ-affinity.txt.
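As a hedged illustration of that kernel documentation (the IRQ number 19 and the CPU
mask are hypothetical, and writing to these files requires root privileges), you can
inspect the per-CPU interrupt counts and steer a particular interrupt to a specific
CPU like this:
cat /proc/interrupts                  # shows how many times each IRQ fired on each CPU
echo 2 > /proc/irq/19/smp_affinity    # hex CPU bitmask; 2 routes IRQ 19 to CPU 1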
2.2.7. sar
sar takes yet another approach to collecting system data. sar can efficiently record
system performance data in binary files that can be replayed at a later date. sar is a
low-overhead way to record information about how the system is performing.
The sar command can be used to record performance information, replay previous
recorded information, and display real-time information about the current system. The
output of the sar command can be formatted to make it easy to pipe to relational
databases or to other Linux commands for processing.
Although sar reports on many different areas of Linux, the statistics are of two
different forms. One set of statistics is the instantaneous value at the time of the
sample. The other is a rate since the last sample. Table 2-16 describes the
command-line options of sar.
Option
Explanation
-c
This reports information about how many processes are being created per second.
-I
This reports the rates at which interrupts have been occurring in the system.
-P {cpu | ALL}
This option specifies which CPU the statistics should be gathered from. If this isn't
specified, the system totals are reported.
-q
This reports information about the run queues and load averages of the machine.
-u
This reports information about CPU utilization of the system. (This is the default
output.)
-w
This reports the number of context switches that occurred in the system.
-o filename
This specifies the name of the binary output file that will store the performance
statistics.
-f filename
This reads the performance statistics from the given binary file (created with the -o
option) rather than from the live system.
delay
The amount of time, in seconds, that sar waits between samples.
count
The total number of samples sar takes before exiting.
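Pulling these options together, a sketch of the general invocation form (see the sar
man page for the full syntax) is:
sar [ options ] [ delay [ count ] ]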
sar offers a set of system-wide CPU performance statistics similar to (although named
differently from) those of the preceding tools. The list is shown in Table 2-17.
Option
Explanation
user
This is the percentage of user time that the CPU has spent during the previous
sample.
nice
This is the percentage of time that the CPU has spent during the previous sample
running low-priority (or nice) processes.
system
This is the percentage of system time that the CPU has spent during the previous
sample.
iowait
This is the percentage of time that the CPU has spent during the previous sample
waiting on I/O.
idle
This is the percentage of time that the CPU was idle during the previous sample.
runq-sz
This is the size of the run queue when the sample was taken.
plist-sz
This is the number of processes present (running, sleeping, or waiting for I/O) when
the sample was taken.
ldavg-1
The load average of the system over the last minute, as of the time of the sample.
ldavg-5
The load average of the system over the last 5 minutes, as of the time of the sample.
ldavg-15
The load average of the system over the last 15 minutes, as of the time of the sample.
proc/s
This is the number of new processes created per second. (This is the same as the
forks statistic from vmstat.)
cswch/s
The number of context switches per second during the previous sample.
intr/s
The number of interrupts received per second during the previous sample.
One of the significant benefits of sar is that it enables you to save many different
types of time-stamped system data to log files for later retrieval and review. This can
prove very handy when trying to figure out why a particular machine is failing at a
particular time.
This first command shown in Listing 2.15 takes three samples of the CPU every
second, and stores the results in the binary file /tmp/apache_test. This command
does not have any visual output and just returns when it has completed.
Listing 2.15.
After the information has been stored in the /tmp/apache_test file, we can display it
in various formats. The default is human readable. This is shown in Listing 2.16. This
shows similar information to the other system monitoring commands, where we can
see how the processor was spending time at a particular time.
Listing 2.16.
However, sar can also output the statistics in a format that can be easily imported
into a relational database, as shown in Listing 2.17. This can be useful for storing a
large amount of performance data. Once it has been imported into a relational
database, the performance data can be analyzed with all of the tools of a relational
database.
Finally, sar can also output the statistics in a format that can be easily parsed by
standard Linux tools such as awk, perl, python, or grep. This output, which is
shown in Listing 2.18, can be fed into a script that will pull out interesting events, and
possibly even analyze different trends in the data.
Listing 2.18.
In addition to recording information in a file, sar can also be used to observe a system
in real time. In the example shown in Listing 2.19, the CPU state is sampled three
times with one second between them.
Listing 2.19.
By default, sar displays information about the CPU, but other
information can also be displayed. For example, sar can show the number of context
switches per second, and the number of memory pages that have been swapped in or
out. In Listing 2.20, sar samples the information two times, with one second between
them. In this case, we ask sar to show us the total number of context switches and
process creations that occur every second. We also ask sar for information about the
load average. We can see in this example that this machine has 163 processes that are in
memory but not running. For the past minute, on average 1.12 processes have been
ready to run.
Listing 2.20.
08:23:29 PM proc/s
08:23:30 PM 0.00
08:23:29 PM cswch/s
08:23:30 PM 594.00
08:23:30 PM proc/s
08:23:31 PM 0.00
08:23:30 PM cswch/s
08:23:31 PM 812.87
Average: proc/s
Average: 0.00
Average: cswch/s
Average: 703.98
As you can see, sar is a powerful tool that can record many different performance
statistics. It provides a Linux-friendly interface that enables you to easily extract and
analyze the performance data.
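As a brief sketch of the record-and-replay workflow described above (the file name
/tmp/sar_cpu is hypothetical), you might record a minute of CPU samples and then
review them later:
sar -o /tmp/sar_cpu 1 60 > /dev/null    # record 60 one-second samples to a binary file
sar -f /tmp/sar_cpu -u                  # replay the recorded CPU utilization statistics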
2.2.8. oprofile
oprofile uses the processor's performance-counter hardware to profile the entire
system. It can track many different types of low-level processor events, such as CPU
cycles, cache misses, and floating-point operations.
oprofile does not record every event that occurs; instead, it works with the
processor's performance hardware to sample every count events, where count is a
value that users specify when they start oprofile. The lower the value of count, the
more accurate the results are, but the higher the overhead of oprofile. By keeping
count to a reasonable value, oprofile can run with a very low overhead but still give
an amazingly accurate account of the performance of the system.
Sampling is very powerful, but watch out for some nonobvious gotchas when using it.
First, sampling may say that you are spending 90 percent of your time in a particular
routine, but it does not say why. There can be two possible causes for a high number
of cycles attributed to a particular routine. First, it is possible that this routine is the
bottleneck and takes a long time to execute. However, it may also be
that the function takes a reasonable amount of time to execute but is called a large
number of times. You can usually figure out which is the case by looking at the
samples around the particularly hot line, or by instrumenting the code to count the
number of calls that are made to it.
The second problem of sampling is that you are never quite sure where a function is
being called from. Even if you figure out that the function is being called many times
and you track down all of the functions that call it, it is not necessarily clear which
function is doing the majority of the calling.
oprofile is actually a suite of pieces that work together to collect CPU performance
statistics. There are three main pieces of oprofile:
• The oprofile kernel module manipulates the processor and turns on and off
sampling.
• The oprofile daemon collects the samples and saves them to disk.
• The oprofile reporting tools take the collected samples and show the user how
they relate to the applications running on the system.
The oprofile suite hides the driver and daemon manipulation in the opcontrol
command. The opcontrol command is used to select which events the processor will
sample and start the sampling.
When controlling the daemon, you can invoke opcontrol using the following command
line:
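Based on the options in Table 2-18, a sketch of the daemon-control form is as follows
(see the opcontrol man page for the authoritative syntax):
opcontrol [ -s/--start ] [ -d/--dump ] [ --stop ]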
These options control the profiling daemon, enabling you to start and stop sampling
and to dump the samples from the daemon's memory to disk. When sampling, the
oprofile daemon stores a large number of samples in internal buffers. However, it is
only possible to analyze the samples that have been written (or dumped) to disk.
Writing to disk can be an expensive operation, so oprofile only does it periodically.
As a result, after running a test and profiling with oprofile, the results may not be
available immediately, and you will have to wait until the daemon flushes the buffers
to disk. This can be very annoying when you want to begin analysis immediately, so
the opcontrol command enables you to force the dump of samples from the oprofile
daemon's internal buffers to disk. This enables you to begin a performance
investigation immediately after a test has completed.
Table 2-18 describes the command-line options for the opcontrol program that enable
you to control the operation of the daemon.
Option
Explanation
-s/--start
Starts profiling. Unless an event has been specified, this uses a default event for the
current processor.
-d/--dump
Dumps the sampling information that is currently in the kernel sample buffers to the
disk.
--stop
Stops the collection of samples.
By default, oprofile picks an event with a given frequency that is reasonable for the
processor and kernel that you are running it on. However, it has many more events
that can be monitored than the default. When you are listing and selecting an event,
opcontrol is invoked using the following command line:
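Based on the options in Table 2-19, a sketch of the event-selection form might be the
following (the event name and count come from the example later in this section; the
trailing 0:1:1 unit-mask/kernel/user values are illustrative assumptions):
opcontrol --list-events
opcontrol --event=CPU_CLK_UNHALTED:233869:0:1:1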
The event specifier enables you to select which event is going to be sampled; how
frequently it will be sampled; and whether that sampling will take place in kernel
space, user space, or both. Table 2-19 describes the command-line option of
opcontrol that enables you to select different events to sample.
Option
Explanation
-l/--list-events
Lists the events that can be sampled on the current processor.
--event=name:count:unitmask:kernel:user
Used to specify what events will be sampled. The event name must be one of the
events that the processor supports. A valid event can be retrieved from the
--list-events option. The count parameter specifies that the processor will be sampled
every count times that event happens. The unitmask modifies what the event is going
to sample. For example, if you are sampling "reads from memory," the unit mask may
allow you to select only those reads that didn't hit in the cache. The kernel parameter
specifies whether oprofile should sample when the processor is running in kernel
space. The user parameter specifies whether oprofile should sample when the
processor is running in user space.
--vmlinux=kernel
Specifies which uncompressed kernel image oprofile will use to attribute samples to
various kernel functions.
After the samples have been collected and saved to disk, oprofile provides a different
tool, opreport, which enables you to view the samples that have been collected.
opreport is invoked using the following command line:
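A sketch of the invocation, using the formatting options from Table 2-20 below
(opreport accepts many more options than shown here), is:
opreport           # show samples for all executables
opreport -t 2      # show only images with at least 2 percent of the samples
opreport -r        # reverse the sort so the biggest consumers print last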
Typically, opreport displays all the samples collected by the system and which
executables (including the kernel) are responsible for them. The executables with the
highest number of samples are shown first, and are followed by all the executables
with samples. In a typical system, most of the samples are in a handful of executables
at the top of the list, with a very large number of executables contributing a very small
number of samples. To deal with this, opreport enables you to set a threshold, and
only executables with that percentage of the total samples or greater will be shown.
Alternatively, opreport can reverse the order of the executables that are shown, so
those with a high contribution are shown last. This way, the most important data is
printed last, and it will not scroll off the screen.
Table 2-20 describes these command-line options of opreport that enable you to
format the output of the sampling.
Option
Explanation
--reverse-sort / -r
Reverses the order of the sort. Typically, the images that caused the most events
display first.
--threshold / -t [percentage]
Causes opreport to show only images that have contributed percentage or more of the
total samples. This can be useful when there are many images with a very small
number of samples and you are only interested in the most significant ones.
Again, oprofile is a complicated tool, and these options show only the basics of what
oprofile can do. You learn more about the capabilities of oprofile in later chapters.
oprofile is a very powerful tool, but it can also be difficult to install. Appendix B,
"Installing oprofile," contains instructions on how to get oprofile installed and
running on a few of the major Linux distributions.
We begin the use of oprofile by setting it up for profiling. This first command, shown
in Listing 2.21, uses the opcontrol command to tell the oprofile suite where an
uncompressed image of the kernel is located. oprofile needs to know the location of
this file so that it can attribute samples to exact functions within the kernel.
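A sketch of such a setup command follows; the path to the uncompressed kernel image
is hypothetical and varies by distribution and by how the kernel was built:
opcontrol --vmlinux=/boot/vmlinux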
Listing 2.21.
After we set up the path to the current kernel, we can begin profiling. The command
in Listing 2.22 tells oprofile to start sampling using the default event. This event
varies depending on the processor, but the default event for this processor is
CPU_CLK_UNHALTED. This event samples all of the CPU cycles where the processor is
not halted. The 233869 means that the processor will sample the instruction the
processor is executing every 233,869 events.
Now that we have started sampling, we want to begin to analyze the sampling results.
In Listing 2.23, we start to use the reporting tools to figure out what is happening in
the system. opreport reports what has been profiled so far.
Listing 2.23.
Uh oh! Even though the profiling has been running for a little while, we are stopped
short because opreport reports that it cannot find any samples. This is because the opreport
command is looking for the samples on disk, but the oprofile daemon stores the
samples in memory and only periodically dumps them to disk. When we ask opreport
for a list of the samples, it does not find any on disk and reports that it cannot find any
samples. To alleviate this problem, we can force the daemon to flush the samples
immediately by issuing a dump option to opcontrol, as shown in Listing 2.24. This
command enables us to view the samples that have been collected.
Listing 2.24.
After we dump the samples to disk, we try again, and ask oprofile for the report, as
shown in Listing 2.25. This time, we have results. The report contains information
about the processor that it was collected on and the types of events that were
monitored. The report then lists in descending order the number of events that
occurred and which executable they occurred in. We can see that the Linux kernel is
taking up 50 percent of the total cycles, emacs is taking 14 percent, and libc is taking
12 percent. It is possible to dig deeper into executable and determine which function
is taking up all the time, but that is covered in Chapter 4, "Performance Tools:
Process-Specific CPU."
Listing 2.25.
When we started the oprofile, we just used the default event that opcontrol chose
for us. Each processor has a very rich set of events that can be monitored. In Listing
2.26, we ask opcontrol to list all the events that are available for this particular CPU.
This list is quite long, but in this case, we can see that in addition to
CPU_CLK_UNHALTED, we can also monitor DATA_MEM_REFS and DCU_LINES_IN. These are
memory events caused by the memory subsystem, and we investigate them in later
chapters.
Listing 2.26.
CPU_CLK_UNHALTED: (counter: 0, 1)
DATA_MEM_REFS: (counter: 0, 1)
DCU_LINES_IN: (counter: 0, 1)
The command needed to specify which events we will monitor can be cumbersome, so
fortunately, we can also use oprofile's graphical oprof_start command to
graphically start and stop sampling. This enables us to select the events that we want
graphically without the need to figure out the exact way to specify on the command
line the events that we want to monitor.
In the example of oprof_start shown in Figure 2-3, we tell oprofile that we want to
monitor DATA_MEM_REFS and L2_LD events at the same time. These events
can tell us which applications use the memory subsystem a lot and which use the level
2 cache. In this particular processor, the processor's hardware has only two counters
that can be used for sampling, so only two events can be used simultaneously.
Figure 2-3.
Now that we have gathered the samples using the graphical interface to oprofile,
we can analyze the data that it has collected. In Listing 2.27, we ask opreport to
display the profile of samples that it has collected in a similar way to how we did when
we were monitoring cycles. In this case, we can see that the libmad library has 31
percent of the data memory references of the whole system and appears to be the
heaviest user of the memory subsystem.
Listing 2.27.
The output provided by opreport displays all the system libraries and executables that
contain any of the events that we were sampling. Note that not all the events have
been recorded; because we are sampling, only a subset of events are actually
recorded. This is usually not a problem, because if a particular library or executable is
a performance problem, it will likely cause high-cost events to happen many times. If
the sampling is random, these high-cost events will eventually be caught by the
sampling code.
This chapter demonstrated how performance tools, such as sar and vmstat, can be
used to extract this system-wide performance information from a running system.
These tools are the first line of defense when diagnosing a system problem. They help
to determine how the system is behaving and which subsystem or application may be
particularly stressed. The next chapter focuses on the system-wide performance tools
that enable you to analyze the memory usage of the entire system.
Any given Linux system has a certain amount of RAM or physical memory. When
addressing this physical memory, Linux breaks it up into chunks or "pages" of
memory. When allocating or moving around memory, Linux operates on page-sized
pieces rather than individual bytes. When reporting some memory statistics, the Linux
kernel reports the number of pages per second, and the size of a page can vary
depending on the architecture it is running on. Listing 3.1 shows a small application
that displays the number of bytes per page for the current architecture.
Listing 3.1.
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    printf("System page size: %d\n", getpagesize());
    return 0;
}
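To try this for yourself, you could compile and run the listing with something like the
following (the file name pagesize.c is, of course, arbitrary):
gcc -o pagesize pagesize.c
./pagesize       # prints "System page size: 4096" on a typical IA32 system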
On the IA32 architecture, the page size is 4KB. In rare cases, these page-sized chunks
of memory can cause too much overhead to track, so the kernel manipulates memory
in much bigger chunks, known as HugePages. These are on the order of 2048KB
rather than 4KB and greatly reduce the overhead for managing very large amounts of
memory. Certain applications, such as Oracle, use these huge pages to load an
enormous amount of data in memory while minimizing the overhead that the Linux
kernel needs to manage it. If HugePages are not completely filled with data, however,
they can waste a significant amount of memory: a half-filled normal page wastes 2KB
of memory, whereas a half-filled HugePage can waste 1,024KB.
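If you want to check whether a system has any HugePages configured, the
/proc/meminfo file (covered later in this chapter) reports them; a quick sketch:
grep -i huge /proc/meminfo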
The Linux kernel can take a scattered collection of these physical pages and present to
applications a well laid-out virtual memory space.
All systems have a fixed amount of physical memory in the form of RAM chips. The
Linux kernel allows applications to run even if they require more memory than is
available in physical memory. The Linux kernel uses part of the hard drive as
temporary memory. This hard drive space is called swap space.
Although swap is an excellent way to allow processes to run, it is terribly slow. It can
be up to 1,000 times slower for an application to use swap rather than physical
memory. If a system is performing poorly, it usually proves helpful to determine how
much swap the system is using.
Alternatively, if your system has much more physical memory than required by your
applications, Linux will cache recently used files in physical memory so that
subsequent accesses to that file do not require an access to the hard drive. This can
greatly speed up applications that access the hard drive frequently, which, obviously,
can prove especially useful for frequently launched applications. The first time the
application is launched, it needs to be read from the disk; if the application remains in
the cache, however, it needs to be read from the much quicker physical memory. This
disk cache differs from the processor cache mentioned in the previous chapter. Other
than oprofile, valgrind, and kcachegrind, most tools that report statistics about
"cache" are actually referring to disk cache.
In addition to cache, Linux also uses extra memory as buffers. To further optimize
applications, Linux sets aside memory to use for data that needs to be written to disk.
These set-asides are called buffers. If an application has to write something to the
disk, which would usually take a long time, Linux lets the application continue
immediately but saves the file data into a memory buffer. At some point in the future,
the buffer is flushed to disk, but the application can continue immediately.
It can be discouraging to see very little free memory in a system because of the cache
and buffer usage, but this is not necessarily a bad thing. By default, Linux tries to use
as much of your memory as possible. This is good. If Linux detects any free memory, it
caches applications and data in the free memory to speed up future accesses. Because
it is usually a few orders of magnitude faster to access things from memory rather
than disk, this can dramatically improve overall performance. When the system needs
the cache memory for more important things, the cache memory is erased and given
to the system. Subsequent access to the object that was previously cached has to go
out to disk to be filled.
3.1.2.3 Active Versus Inactive Memory
Active memory is currently being used by a process. Inactive memory is memory that
is allocated but has not been used for a while. Nothing is essentially different between
the two types of memory. When required, the Linux kernel takes a process's least
recently used memory pages and moves them from the active to the inactive list. When
choosing which memory will be swapped to disk, the kernel chooses from the inactive
memory list.
For 32-bit processors (for example, IA32) with 1GB or more of physical memory,
Linux must manage the physical memory as high and low memory. The high memory
is not directly accessible by the Linux kernel and must be mapped into the
low-memory range before it can be used. This is not a problem with 64-bit processors
(such as AMD64/EM64T, Alpha, or Itanium) because they can directly address the
additional memory that is available in current systems.
In addition to the memory that applications allocate, the Linux kernel consumes a
certain amount for bookkeeping purposes. This bookkeeping includes, for example,
keeping track of data arriving from network and disk I/O devices, as well as keeping
track of which processes are running and which are sleeping. To manage this
bookkeeping, the kernel has a series of caches that contains one or more slabs of
memory. Each slab consists of a set of one or more objects. The amount of slab
memory consumed by the kernel depends on which parts of the Linux kernel are being
used, and can change as the type of load on the machine changes.
3.2.1. vmstat II
As you have seen before, vmstat can provide information about many different
performance aspects of a system, although its primary purpose, as shown next, is to
provide information about virtual memory system performance. In addition to the CPU
performance statistics described in the previous chapter, it can also tell you the
following:
As you can see, vmstat provides (via the statistics it displays) a wealth of information
about the health and performance of the system in a single line of text.
In addition to the CPU statistics vmstat can provide, you can invoke vmstat with the
following command-line options when investigating memory statistics:
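A sketch of the invocation form, using the options from Table 3-1 (the -s and -m
spellings follow the standard vmstat options), is:
vmstat [-a] [-s] [-m] [delay [count]]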
As before, you can run vmstat in two modes: sample mode and average mode. The
added command-line options enable you to get the performance statistics about how
the Linux kernel is using memory. Table 3-1 describes the options that vmstat
accepts.
Option
Explanation
-a
This changes the default output of memory statistics to indicate the active/inactive
amount of memory rather than information about buffer and cache usage.
-s
This prints out the vm table. This is a grab bag of different statistics about the system
since it has booted. It cannot be run in sample mode. It contains both memory and
CPU statistics.
-m
This prints out the kernel's slab info. This is the same information that can be
retrieved by typing cat /proc/slabinfo. This describes in detail how the kernel's
memory is allocated and can be helpful in determining what area of the kernel is
consuming the most memory.
Table 3-2 provides a list of the memory statistics that vmstat can provide. As with the
CPU statistics, when run in normal mode, the first line that vmstat provides is the
average values for all the rate statistics (so and si) and the instantaneous value for all
the numeric statistics (swpd, free, buff, cache, active, and inactive).
Column
Explanation
swpd
The total amount of memory (in KB) currently swapped out to disk.
free
The amount of physical memory not being used by the operating system or
applications.
buff
The size (in KB) of the system buffers, or memory used to store data waiting to be
saved to disk. This memory allows an application to continue execution immediately
after it has issued a write call to the Linux kernel (instead of waiting until the data has
been committed to disk).
cache
The size (in KB) of the system cache or memory used to store data previously read
from disk. If an application needs this data again, it allows the kernel to fetch it from
memory rather than disk, thus increasing performance.
active
The amount of memory actively being used. The active/ inactive statistics are
orthogonal to the buffer/cache; buffer and cache memory can be active and inactive.
inactive
The amount of inactive memory (in KB), or memory that has not been used for a while
and is eligible to be swapped to disk.
si
The rate of memory (in KB/s) that has been swapped in from disk during the last
sample.
so
The rate of memory (in KB/s) that has been swapped out to disk during the last
sample.
pages paged in
The amount of memory (in pages) read from the disk(s) into the system buffers. (On
most IA32 systems, a page is 4KB.)
pages paged out
The amount of memory (in pages) written to the disk(s) from the system cache. (On
most IA32 systems, a page is 4KB.)
pages swapped in
The amount of memory (in pages) read from swap into system memory.
pages swapped out
The amount of memory (in pages) written from system memory to the swap.
used swap
The amount of swap currently being used by the system.
free swap
The amount of swap currently available for the system to use.
total swap
The total amount of swap that the system has; this is also the sum of used swap plus
free swap.
In Listing 3.2, as you saw in the previous chapter, if vmstat is invoked without any
command-line options, it displays average values for performance statistics since
system boot (si and so), and it shows the instantaneous values for the rest of them
(swpd, free, buff, and cache). In this case, we can see that the system has about
500MB of memory that has been swapped to disk. ~14MB of the system memory is
free. ~4MB is used for buffers that contain data that has yet to be flushed to disk.
~627MB is used for the disk cache that contains data that has been read off the disk
in the past.
Listing 3.2.
bash-2.05b$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 511012 14840 4412 642072 33 31 204 247 1110 1548 8 5 73 14
In Listing 3.3, we ask vmstat to display information about the number of active and
inactive pages. The amount of inactive pages indicates how much of the memory could
be swapped to disk and how much is currently being used. In this case, we can see
that 1310MB of memory is active, and only 78MB is considered inactive. This machine
has a large amount of memory, and much of it is being actively used.
Listing 3.3.
bash-2.05b$ vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free inact active si so bi bo in cs us sy id wa
2 1 514004 5640 79816 1341208 33 31 204 247 1111 1548 8 5 73 14
Next, in Listing 3.4, we look at a different system, one that is actively swapping data
in and out of memory. The si column indicates that swap data has been read in at a
rate of 480KB, 832KB, 764KB, 344KB, and 512KB during each of those sample
periods. The so column indicates that memory data has been written to swap at a rate
of 9KB, 0KB, 916KB, 0KB, 1068KB, 444KB, and 792KB during each of the samples. These
results could indicate that the system does not have enough memory to handle all the
running processes. A simultaneously high swap-in and swap-out rate can occur when a
process's memory is being saved to make way for an application that had been
previously swapped to disk. This can be disastrous if two running programs both need
more memory than the system can provide. For example, if two processes each need a
large amount of memory, they can repeatedly force each other's pages out to swap as
they take turns running, and the system ends up spending most of its time swapping
rather than doing useful work.
In this particular case, the swapping eventually stopped, so most likely the memory
that was swapped to disk was not immediately needed by the original process. This
means the swap usage was effective; the contents of the memory that was not being
used were written to disk, and the memory was then given to the process that needed
it.
Listing 3.4.
1 2 132476 3424 8592 53272 832 916 1356 916 1259 692 11 4 0 85
0 1 133544 2656 8624 53392 344 1068 1096 1068 1217 436 8 3 5 84
0 1 133988 2300 8620 54288 512 444 1796 444 1090 230 5 1 2 92
....
As shown in Listing 3.5, as you saw in the previous chapter, vmstat can show a vast
array of different system statistics. Now as we look at it, we can see some of the same
statistics that were present in some of the other output modes, such as active,
inactive, buffer, cache, and used swap. However, it also has a few new statistics, such
as total memory, which indicates that this system has a total of 1516MB of memory,
and total swap, which indicates that this system has a total of 2048MB of swap. It can
be helpful to know the system totals when trying to figure out what percentage of the
swap and memory is currently being used. Another interesting statistic is the pages
paged in, which indicates the total number of pages that were read from the disk. This
statistic includes the pages that are read starting an application and those that the
application itself may be using.
Listing 3.5.
bash-2.05b$ vmstat -s
Finally, in Listing 3.6, we see that vmstat can provide information about how the Linux
kernel allocates its memory. As previously described, the Linux kernel has a series of
"slabs" to hold its dynamic data structures. vmstat displays each of the slabs (Cache),
shows how many of the elements are being used (Num), shows how many are
allocated (Total), shows the size of each element (Size), and shows the amount of
memory in pages (Pages) that the total slab is using. This can be helpful when tracking
down exactly how the kernel is using its memory.
Listing 3.6.
bash-2.05b$ vmstat -m
ndisc_cache 1 24 160 24
raw6_sock 0 0 672 6
udp6_sock 0 0 640 6
tcp6_sock 404 441 1120 7
ip_fib_hash 39 202 16 202
ext3_inode_cache 1714 3632 512 8
...
vmstat provides an easy way to extract large amounts of information about the Linux
memory subsystem. When combined with the other information provided on the
default output screen, it provides a fair picture of the health and resource usage of the
system.
3.2.2. top
top does not have any special command-line options that manipulate its display of
memory statistics. It is invoked with the following command line:
top
However, once running, top enables you to select whether system-wide memory
information displays, and whether processes are sorted by memory usage. Sorting by
memory consumption proves particularly useful to determine which process is
consuming the most memory. Table 3-3 describes the different memory-related output
toggles.
Option
Explanation
m
This toggles whether information about the system memory usage will be shown on
the screen.
M
Sorts the tasks by the amount of memory they are using. Because processes may have
allocated more memory than they are using, this sorts by resident set size. Resident
set size is the amount the processes are actually using rather than what they have
simply asked for.
Table 3-4 describes the memory performance statistics that top can provide for both
the entire system and individual processes. top has two different versions, 2.x and 3.x,
which have slightly different names for output statistics. Table 3-4 describes the
names for both versions.
Option
Explanation
%MEM
This is the percentage of the system's physical memory that this process is using.
SIZE (v 2.x)
VIRT (v 3.x)
This is the total size of the process's virtual memory usage. This includes all the
memory that the application has allocated but is not using.
SWAP
This is the amount of swap (in KB) that the process is using.
RSS (v 2.x)
RES (v 3.x)
This is the amount of physical memory that the application is actually using.
TRS (v 2.x)
CODE (v 3.x)
The total amount of physical memory (in KB) that the process's executable code is
using.
DSIZE (v 2.x)
DATA (v 3.x)
The total amount of memory (in KB) dedicated to a process's data and stack.
SHARE (v 2.x)
SHR (v 3.x)
The total amount of memory (in KB) that can be shared with other processes.
D (v 2.x)
nDRT (v 3.x)
The number of pages that are dirty and need to be flushed to disk.
Mem:
Of the physical memory, this indicates the total amount, the used amount, and the free
amount.
Swap:
Of the swap, this indicates the total amount, the used amount, and the free amount.
active (v 2.x)
The amount of physical memory that is active, or has been used recently.
inactive (v 2.x)
The amount of physical memory that is inactive and hasn't been used in a while.
buffers
The total amount of physical memory (in KB) used to buffer values to be written to
disk.
top provides a large amount of memory information about the different running
processes. As discussed in later chapters, you can use this information to determine
exactly how an application allocates and uses memory.
Listing 3.7 is similar to the example run of top shown in the previous chapter.
However, in this example, notice that in the buffers, we have a total amount of
Listing 3.7.
Again, top can be customized to display only what you are interested in observing.
Listing 3.8 shows a highly configured output screen that shows only memory
performance statistics.
Listing 3.8.
VIRT RES SHR %MEM SWAP CODE DATA nFLT nDRT COMMAND
405m 71m 321m 7.1 333m 1696 403m 4328 0 X
70224 35m 20m 3.5 33m 280 68m 3898 0 gnome-terminal
2756 1104 1784 0.1 1652 52 2704 0 0 top
19728 5660 16m 0.5 13m 44 19m 17 0 clock-applet
2396 448 1316 0.0 1948 36 2360 16 0 init
0 0 0 0.0 0 0 0 0 0 migration/0
0 0 0 0.0 0 0 0 0 0 ksoftirqd/0
0 0 0 0.0 0 0 0 0 0 migration/1
0 0 0 0.0 0 0 0 0 0 ksoftirqd/1
0 0 0 0.0 0 0 0 0 0 migration/2
0 0 0 0.0 0 0 0 0 0 ksoftirqd/2
0 0 0 0.0 0 0 0 0 0 migration/3
3.2.3. procinfo II
procinfo does not have any options that change the output of the memory statistics
displayed and, as a result, is invoked with the following command:
procinfo
procinfo displays the basic memory system memory statistics, similar to top and
vmstat. These are shown in Table 3-5.
Statistic
Explanation
Total
The total amount of physical memory in the system.
Used
The amount of physical memory currently in use.
Free
The amount of physical memory that is not being used.
Shared
The amount of shared memory.
Buffers
This is the amount of physical memory used as buffers for disk writes.
Page in
The number of blocks (usually 1KB) read from disk. (This is broken on 2.6.x kernels.)
Page out
The number of blocks (usually 1KB) written to disk. (This is broken on 2.6.x kernels.)
Swap in
The number of memory pages read in from swap. (This statistic is broken on 2.6.x
kernels.)
Swap out
The number of memory pages written to swap. (This statistic is broken on 2.6.x
kernels.)
Much like vmstat or top, procinfo is a low-overhead command that is good to leave
running in a console or window on the screen. It gives a good indication of a system's
health and performance.
Listing 3.9 is a typical output for procinfo. As you can see, it reports summary
information about how the system is using virtual memory. In this case, the system
has a total of 312MB of memory; 301MB is in use by the kernel and applications,
11MB is used by system buffers, and 11MB is not used at all.
Listing 3.9.
Bootup: Sun Oct 24 10:03:43 2004 Load average: 0.44 0.53 0.51 3/110 32243
3.2.4. gnome-system-monitor II
gnome-system-monitor can be invoked from the Gnome menu. (Under Red Hat 9 and
higher, this is in System Tools > System Monitor.) However, it can also be invoked
using the following command:
gnome-system-monitor
When you launch gnome-system-monitor and select the System Monitor tab, you see
a window similar to Figure 3-1. This window enables you to glance at the graph and
see how much physical memory and swap is currently being used, and how its usage
has changed over time. In this case, we see that 969MB of a total of 1,007MB is
currently being used. The memory usage has been relatively flat for a while.
Figure 3-1.
The graphical view of data provided by gnome-system-monitor can make it easier and
faster to see how memory usage changes over time; however, most of the details, such
as how the memory is being used, are missing.
3.2.5. free
free provides an overall view of how your system is using your memory, including the
amount of free memory. Although the free command may show that a particular
system does not have much free memory, this is not necessarily bad. Instead of letting
memory sit unused, the Linux kernel puts it to work as a cache for disk reads and as a
buffer for disk writes. This can dramatically increase system performance. Because
this cache and these buffers can always be discarded and the memory reused if
applications need it, free shows you the amount of free memory both with and without
the buffers and cache counted.
Table 3-6 describes the parameters that modify the types of statistics that free
displays. Much like vmstat, free can periodically display updated memory statistics.
Option
Explanation
-s delay
This option causes free to print out new memory statistics every delay seconds.
-c count
This option causes free to print out new statistics for count times.
-l
This option shows you how much high memory and how much low memory are being
used.
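For example, to watch the memory statistics, including the high/low split, refresh every
five seconds, a sketch of the invocation is:
free -l -s 5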
free actually displays some of the most complete memory statistics of any of the
memory statistic tools. The statistics that it displays are shown in Table 3-7.
Statistic
Explanation
Total
The total amount of physical memory (in KB) in the system.
Used
The amount of physical memory (in KB) currently in use.
Free
The amount of physical memory (in KB) that is not being used.
Shared
The amount of shared memory (in KB). (This statistic is obsolete and is typically
reported as zero.)
Buffers
This is the amount of physical memory used as buffers for disk writes.
Cached
This is the amount of physical memory used as cache for disk reads.
-/+ buffers/cache
In the Used column, this shows the amount of memory that would be used if
buffers/cache were not counted as used memory. In the Free column, this shows the
amount of memory that would be free if buffers/cache were counted as free memory.
Low
The total amount of low memory or memory directly accessible by the kernel.
High
The total amount of high memory or memory not directly accessible by the kernel.
Totals
This shows the combination of physical memory and swap for the Total, Used, and
Free columns.
free provides information about the system-wide memory usage of Linux, covering a
fairly complete range of memory statistics.
Calling free without any command options gives you an overall view of the memory
subsystem.
As mentioned previously, Linux tries to use all the available memory if possible to
cache data and applications. In Listing 3.10, free tells us that we are currently using
234,720KB of memory; however, if you ignore the buffers and cache, we are only
using 122,772KB of memory. The opposite is true of the free column. We currently
have 150,428KB of memory free; if you also count the buffers and cached memory
(which you can, because Linux throws away those buffers if the memory is needed),
however, we have 262,376KB of memory free.
Listing 3.10.
Although you could just total the columns yourself, the -t flag shown in Listing 3.11
tells you the totals when adding both swap and real memory. In this case, the system
had 376MB of physical memory and 384MB of swap. The total amount of memory
available on the system is 376MB plus 384MB, or ~760MB. The total amount of free
memory was 134MB of physical memory plus 259MB of swap, yielding a total of
393MB of free memory.
Listing 3.11.
Finally, in Listing 3.12, free is invoked with the -l option, which splits the memory
statistics into the amounts of low and high memory used on this system.
Listing 3.12.
fas% free -l
total used free shared buffers cached
Mem: 1552528 1546472 6056 0 7544 701408
Low: 897192 892800 4392
High: 655336 653672 1664
-/+ buffers/cache: 837520 715008
Swap: 2097096 566316 1530780
free gives a good idea of how the system memory is being used. It may take a little
while to get used to the output format, but it contains all the important memory statistics.
3.2.6. slabtop
slabtop is similar to top, but instead of displaying information about the CPU and
memory usage of the processes in the system, slabtop shows in real-time how the
kernel is allocating its various caches and how full they are. Internally, the kernel has
a series of caches that are made up of one or more slabs. Each slab consists of a set of
one or more objects. These objects can be active (or used) or inactive (unused).
slabtop shows you the status of the different slabs. It shows you how full they are and
how much memory they are using.
Option
Explanation
--delay
This specifies how long slabtop waits between updates of the statistics, similar to the
delay used by top.
--sort {order}
This specifies the sort order of the output. order can be one of the following:
o
This sorts by the total number of objects (active and inactive) in each slab for a
particular cache.
slabtop provides a glimpse into the data structures of the Linux kernel. Each of these
slab types is tied closely to the Linux kernel, and a description of each of these slabs is
beyond the scope of this book. If a particular slab is using a large amount of kernel
memory, reading the Linux kernel source code and searching the Web are two great
ways to figure out what that slab is used for.
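For example, to sort the display by total cache size rather than the default ordering, a
sketch of the invocation (the sort letters are listed in the slabtop man page) is:
slabtop --sort=c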
As shown in Listing 3.13, by default, slabtop fills the entire console and continually
updates the statistics every three seconds. In this example, you can see that the
size-64 slab has the most objects, only half of which are active.
Listing 3.13.
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
66124 34395 52% 0.06K 1084 61 4336K size-64
38700 35699 92% 0.05K 516 75 2064K buffer_head
30992 30046 96% 0.15K 1192 26 4768K dentry_cache
21910 21867 99% 0.27K 1565 14 6260K radix_tree_node
20648 20626 99% 0.50K 2581 8 10324K ext3_inode_cache
11781 7430 63% 0.03K 99 119 396K size-32
9675 8356 86% 0.09K 215 45 860K vm_area_struct
6024 2064 34% 0.62K 1004 6 4016K ntfs_big_inode_cache
4520 3633 80% 0.02K 20 226 80K anon_vma
4515 3891 86% 0.25K 301 15 1204K filp
3.2.7. sar II
All the advantages of using sar as a CPU performance tool, such as the easy recording
of samples, extraction to multiple output formats, and time stamping of each sample,
are still present when monitoring memory statistics. sar provides similar information
to the other memory statistics tools, such as the current values for the amount of free
memory, buffers, cache, and swap. However, it also provides the rate at
which these values change and provides information about the percentage of physical
memory and swap that is currently being consumed.
By default, sar displays only CPU performance statistics; so to retrieve any of the
memory subsystem statistics, you must use the options described in Table 3-9.
Option
Explanation
-B
This reports information about the number of blocks that the kernel swapped to and
from disk. In addition, for kernel versions after v2.5, it reports information about the
number of page faults.
-W
This reports the number of pages of swap that are brought in and out of the system.
-r
This reports information about the memory being used in the system. It includes
information about the total free memory, swap, cache, and buffers being used.
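A sketch of how these reports might be requested, taking three one-second samples of
each (the -r option used here is the standard sar flag for the memory-utilization
report):
sar -B 1 3     # paging statistics
sar -W 1 3     # swapping statistics
sar -r 1 3     # memory and swap utilization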
sar provides a fairly complete view of the Linux memory subsystem. One advantage
that sar provides over other tools is that it shows the rate of change for many of the
important values in addition to the absolute values. You can use these values to see
exactly how memory usage is changing over time without having to figure out the
differences between values at each sample. Table 3-10 shows the memory statistics
that sar provides.
Statistic
Explanation
pgpgin/s
The amount of memory (in KB) that the kernel paged in from disk.
pgpgout/s
The amount of memory (in KB) that the kernel paged out to disk.
fault/s
The total number of faults that the memory subsystem needed to fill. These may
or may not have required a disk access.
majflt/s
The total number of faults that the memory subsystem needed to fill and required a
disk access.
pswpin/s
The amount of swap (in pages) that the system brought into memory.
pswpout/s
The amount of memory (in pages) that the system wrote to swap.
kbmemfree
This is the total physical memory (in KB) that is currently free or not being used.
kbmemused
This is the total amount of physical memory (in KB) currently being used.
%memused
This is the percentage of physical memory currently being used.
kbbuffers
This is the amount of physical memory used as buffers for disk writes.
kbcached
This is the amount of physical memory used as cache for disk reads.
kbswpfree
This is the amount of swap (in KB) that is currently free.
kbswpused
This is the amount of swap (in KB) that is currently being used.
%swpused
This is the percentage of the swap space currently being used.
kbswpcad
This is memory that is both swapped to disk and present in memory. If the memory is
needed, it can be immediately reused because the data is already present in the swap
area.
frmpg/s
The rate at which the system is freeing memory pages. A negative number means the
system is allocating them.
bufpg/s
The rate at which the system is using new memory pages as buffers. A negative number
means the number of buffers is shrinking, and the system is using fewer of them.
Although sar misses the high and low memory statistics, it provides nearly every other
memory statistic. The fact that it can also record network, CPU, and disk I/O statistics
makes it very powerful.
Listing 3.14 shows sar providing information about the current state of the memory
subsystem. From the results, we can see that the amount of memory that the system
used varies from 98.87 percent to 99.25 percent of the total memory. Over the course
of the observation, the amount of free memory drops from 11MB to 7MB. The
percentage of swap that is used hovers around ~11 percent. The system has ~266MB
of data cache and about 12MB in buffers that can be written to disk.
Listing 3.14.
Listing 3.15 shows that the system is consuming free memory at a rate of ~82
pages per second during the first sample. Later, it frees ~16 pages, and then
consumes ~20 pages. The number of buffer pages grew only once during the
observation, at a rate of 2.02 pages per second. Finally, the pages of cache shrank by
2.02 pages per second at one point but, by the end, were growing by 64.36 pages per
second.
Listing 3.15.
Listing 3.16 shows that the system wrote ~53 pages of memory to disk during the
third sample. The system has a relatively high fault count, which means that memory
pages are being allocated and used. Fortunately, none of these are major faults,
meaning that the system does not have to go to disk to fulfill the fault.
Listing 3.16.
As you can see, sar is a powerful tool that enhances the functionality of the other
system memory performance tools by adding the ability to archive, time stamp, and
simultaneously collect many different types of statistics.
3.2.8. /proc/meminfo
The Linux kernel provides a user-readable text file called /proc/meminfo that displays
current system-wide memory performance statistics. It provides a superset of the
system-wide memory statistics that can be acquired with vmstat, top, free, and
procinfo, but it is a little more difficult to use. If you want periodic updates, you must
write a script or a small program to provide them. If you want to store the memory
performance information or coordinate it with CPU statistics, you must create a new
tool or write a script. However, /proc/meminfo provides the most complete view of
system memory
usage.
The information in /proc/meminfo can be retrieved with the following command line:
cat /proc/meminfo
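If you want the periodic updates mentioned above without writing a script, one sketch
is to wrap the command with the standard watch utility:
watch -n 5 cat /proc/meminfo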
Statistic
Explanation
MemTotal
The total amount of physical memory (in KB) in the system.
MemFree
The amount of physical memory (in KB) that is currently unused.
Buffers
The amount of physical memory (in KB) used as buffers for data waiting to be written
to disk.
Cached
The amount of physical memory (in KB) used to cache data that has been read from
disk.
SwapCached
The amount of memory that exists in both the swap and the physical memory.
Active
The amount of memory that has been used recently and is not usually reclaimed first.
Inactive
The amount of memory that has not been used recently and is more eligible to be
reclaimed or swapped to disk.
HighTotal
The total amount of high memory, or memory not directly accessible by the kernel.
HighFree
The amount of high memory that is currently unused.
LowTotal
The total amount of low memory, or memory directly accessible by the kernel.
LowFree
The amount of low memory that is currently unused.
SwapTotal
The total amount of swap space (in KB) available to the system.
SwapFree
The amount of swap space (in KB) that is currently unused.
Dirty
The amount of memory holding data that is waiting to be written back to disk.
Writeback
The amount of memory holding data that is actively being written back to disk.
Mapped
The total amount of memory brought into a process's virtual address space using mmap.
Slab
The amount of memory used by the kernel's slab caches for its internal data structures.
Committed_AS
The amount of memory required to almost never run out of memory with the current
workload. Normally, the kernel hands out more memory than it actually has with the
hope that applications overallocate. If all the applications were to use what they
allocate, this is the amount of physical memory you would need.
PageTables
The amount of memory used by the page tables that map virtual addresses to physical
pages.
VmallocTotal
The total size of the address space that the kernel can use for vmalloc allocations.
VmallocUsed
The amount of the vmalloc address space that is currently in use.
VmallocChunk
The size of the largest contiguous free chunk of the vmalloc address space.
HugePages_Total
The total number of HugePages allocated on the system.
HugePages_Free
The number of HugePages that are not currently in use.
/proc/meminfo provides a wealth of information about the current state of the Linux
memory subsystem.
Listing 3.17.
This chapter demonstrated how performance tools such as sar and vmstat can be
used to extract this system-wide memory performance information from a running
system. The output of these tools indicates how the system as a whole is using
available memory. The next chapter describes the tools available to investigate a
single process's CPU usage.
The most basic split of where an application may spend its time is between kernel and
user time. Kernel time is the time spent in the Linux kernel, and user time is the
amount of time spent in application or library code. Linux has tools such as time and ps that can indicate (appropriately enough) whether an application is spending its time in application or kernel code. It also has commands such as oprofile and strace that enable you to trace which kernel calls are made on behalf of the process, as well as how long each of those calls takes to complete.
Any application with even a minor amount of complexity relies on system libraries to
perform complex actions. These libraries may cause performance problems, so it is
important to be able to see how much time an application spends in a particular
library. Although it might not always be practical to modify the source code of the
libraries directly to fix a problem, it may be possible to change the application code to
call different or fewer library functions. The ltrace command and oprofile suite
provide a way to analyze the performance of libraries when they are used by
applications. Tools built into the Linux loader, ld, help you determine whether the use of many libraries slows down an application's start time.
When the application is known to be the bottleneck, Linux provides tools that enable
you to profile an application to figure out where time is spent within an application.
Tools such as gprof and oprofile can generate profiles of an application that pin
down exactly which source line is causing large amounts of time to be spent.
4.2.1. time
The time command performs a basic function when testing a command's performance,
yet it is often the first place to turn. The time command acts as a stopwatch and times
how long a command takes to execute. It measures three types of time. First, it
measures the real or elapsed time, which is the amount of time between when the
program started and finished execution. Next, it measures the user time, which is the
amount of time that the CPU spent executing application code on behalf of the
program. Finally, time measures system time, which is the amount of time the CPU
spent executing system or kernel code on behalf of the application.
The time command (see Table 4-1) is invoked in the following manner:
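The command box itself is not reproduced here; based on Table 4-1, the invocation almost certainly has the form:
time [-v] application
where application is the command (plus its arguments) whose execution you want to measure.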
Option
Explanation
-v
This option presents a verbose display of the program's time and statistics. Some
statistics are zeroed out, but more statistics are valid with Linux kernel v2.6 than with
Linux kernel v2.4.
Most of the valid statistics are present in both the standard and verbose mode, but the
verbose mode provides a better description for each statistic.
The application is timed, and information about its CPU usage is displayed on
standard output after it has completed.
Table 4-2 describes the valid output statistics that the time command provides. The rest are not measured and always display zero.
Column
Explanation
User time (seconds)
This is the number of seconds the CPU spent executing the application's own (user-space) code.
System time (seconds)
This is the number of seconds spent in the Linux kernel on behalf of the application.
Elapsed (wall-clock) time
This is the amount of time elapsed (in wall-clock time) between when the application was launched and when it completed.
Percent of CPU
This is the percentage of the CPU that the process consumed as it was running.
Major page faults
The number of major page faults, or those that required a page of memory to be read from disk.
Minor page faults
The number of minor page faults, or those that could be filled without going to disk.
Swaps
The number of times the process was swapped out of main memory.
Voluntary context switches
The number of times the process yielded the CPU (for example, by going to sleep).
Involuntary context switches
The number of times the CPU was taken from the process.
Exit status
The exit code returned by the application.
This command is a good way to start an investigation. It displays how long the
application is taking to execute and how much of that time is spent in the Linux kernel
versus your application.
The time command included on Linux is a part of the cross-platform GNU tools. The
default command output prints a host of statistics about the commands run, even if
Linux does not support them. If the statistics are not available, time just prints a zero.
The following command is a simple invocation of the time command. You can see in
Listing 4.1 that the elapsed time (~3 seconds) is much greater than the sum of the
user (0.9 seconds) and system (0.13 seconds) time, because the application spends
most of its time waiting for input and little time using the processor.
Listing 4.1.
Listing 4.2 is an example of time displaying verbose output. As you can see, this
output shows much more than the typical output of time. Unfortunately, most of the
statistics are zeros, because they are not supported on Linux. For the most part, the
information provided in verbose mode is identical to the output provided in standard
mode, but the statistics' labels are much more descriptive. In this case, we can see
that this process used 15 percent of the CPU when it was running, and spent 1.15 seconds running user code and 0.12 seconds running kernel code. It accrued 2,087 minor page faults, or memory faults that did not require a trip to disk, and 371 major page faults that did require a trip to disk. A high number of major faults would
indicate that the operating system was constantly going to disk when the application
tried to use memory, which most likely means that the kernel was swapping a
significant amount.
Listing 4.2.
Note that the bash shell has a built-in time command, so if you are running bash and
execute time without specifying the path to the executable, you get the following
output:
real 0m3.409s
user 0m0.960s
sys 0m0.090s
The bash built-in time command can be useful, but it provides a subset of the process
execution information.
4.2.2. strace
strace is a tool that traces the system calls that a program makes while executing.
System calls are function calls made into the Linux kernel by or on behalf of an
application. strace can show the exact system calls that were made and proves
incredibly useful to determine how an application is using the Linux kernel. Tracing
down the frequency and length of system calls can be especially valuable when
analyzing a large program or one you do not understand completely. By looking at the
strace output, you can get a feel for how the application is using the kernel and what
type of functions it depends on.
strace can also be useful even when you completely understand an application, if that application makes calls to system libraries (such as libc or GTK+). In this case, even
though you know where the application makes every system call, the libraries might
be making more system calls on behalf of your application. strace can quickly show
you what calls these libraries are making.
strace [-c] [-p pid] [-o file] [--help] [ command [ arg ... ]]
If strace is run without any options, it displays all the system calls made by the given
command on standard error. This can be helpful when trying to figure out why an
application is spending a large amount of time in the kernel. Table 4-3 describes a few
strace options that are also helpful when tracing a performance problem.
Option
Explanation
-c
This causes strace to print out a summary of statistics rather than an individual list of
all the system calls that are made.
-p pid
This attaches to the process with the given PID and starts tracing.
-o file
This writes all the strace output to the given file rather than to standard error.
--help
This lists complete usage information for strace.
Table 4-4 explains the statistics present in output of the strace summary option. Each
line of output describes a set of statistics for a particular system call.
Column
Explanation
% time
Of the total time spent making system calls, this is the percentage of time spent on
this one.
seconds
This is the total number of seconds spent in this type of system call.
usecs/call
This is the number of microseconds spent per system call of this type.
calls
This is the total number of calls of this type that were made.
errors
This is the number of times that this system call returned an error.
Although the options just described are most relevant to a performance investigation,
strace can also filter the types of system calls that it traces. The options to select the
system calls to trace are described in detail with the --help option and in the strace
man page. For general performance tuning, it is usually not necessary to use them; if
needed, however, they exist.
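As a concrete illustration, a hypothetical invocation that attaches to an already running process (the PID 3283 here is made up), gathers summary statistics, and writes them to a file would look something like this:
strace -c -p 3283 -o oowriter_summary.txt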
Listing 4.3 is an example of using strace to gather statistics about which system calls
an application is making. As you can see, strace provides a nice profile of the
system calls made on behalf of an application, which, in this case, is oowriter. In this example, we look at how oowriter is using the read system call. We can see that read takes 20 percent of the time by consuming a total of 0.44 seconds. It is
called 2,427 times and, on average, each call takes 184 microseconds. Of those calls,
26 return an error.
Listing 4.3.
strace does a good job of tracking a process, but it does introduce some overhead
when it is running on an application. As a result, the number of calls that strace
reports is probably more reliable than the amount of time that it reports for each call.
Use the times provided by strace as a starting point for investigation rather than a
highly accurate measurement of how much time was spent in each call.
4.2.3. ltrace
ltrace is similar in concept to strace, but it traces the calls that an application makes
to libraries rather than to the kernel. Although it is primarily used to provide an exact
trace of the arguments and return values of library calls, you can also use ltrace to
summarize how much time was spent in each call. This enables you to figure out both
what library calls the application is making and how long each is taking.
Be careful when using ltrace because it can generate misleading results. The time
spent may be counted twice if one library function calls another. For example, if
library function foo() calls function bar(), the time reported for function foo() will be all
the time spent running the code in function foo() plus all the time spent in function
bar().
With this caveat in mind, it still is a useful tool to figure out how an application is
behaving.
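The invocation is not reproduced here; judging from the options described below, it mirrors the strace invocation:
ltrace [-c] [-S] [-p pid] [-o file] [--help] command [arg ...]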
In the preceding invocation, command is the command that you want ltrace to trace.
Without any options to ltrace, it displays all the library calls to standard error. Table
4-5 describes the ltrace options that are most relevant to a performance investigation.
Option
Explanation
-c
This option causes ltrace to print a summary of all the calls after the command has
completed.
-S
This causes ltrace to trace system calls in addition to library calls, which is identical to the functionality that strace provides.
-p pid
This attaches to the process with the given PID and starts tracing.
-o file
This writes the ltrace output to the given file rather than to standard error.
--help
This lists complete usage information for ltrace.
Again, the summary mode provides performance statistics about the library calls made
during an application's execution. Table 4-6 describes the meanings of these statistics.
Column
Explanation
% time
Of the total time spent making library calls, this is the percentage of time spent on this
one.
seconds
This is the total number of seconds spent in this library call.
usecs/call
This is the number of microseconds spent per library call of this type.
calls
This is the total number of calls made to this library function.
function
This is the name of the library function being measured.
Much like strace, ltrace has a large number of options that can modify the functions
that it traces. These options are described by the ltrace --help command and in
detail in the ltrace man page.
Listing 4.4 is a simple example of ltrace running on the xeyes command. xeyes is an
X Window application that pops up a pair of eyes that follow your mouse pointer
around the screen.
Listing 4.4.
In Listing 4.4, the library functions XSetWMProtocols, hypot, and XQueryPointer take
18.65 percent, 17.19 percent, and 12.06 percent of the total time spent in libraries.
The call to the second most time-consuming function, hypot, is made 702 times, and the call to the most time-consuming function, XSetWMProtocols, is made only once. Unless
our application can completely remove the call to XSetWMProtocols, we are likely
stuck with whatever time it takes. It is best to turn our attention to hypot. Each call to
this function is relatively lightweight; so if we can reduce the number of times that it
is called, we may be able to speed up the application. hypot would probably be the
first function to be investigated if the xeyes application was a performance problem.
Initially, we would determine what hypot does, but it is unclear where it may be
documented. Possibly, we could figure out which library hypot belongs to and read the
documentation for that library. In this case, we do not have to find the library first,
because a man page exists for the hypot function. Running man hypot tells us that the
hypot function will calculate the distance (hypotenuse) between two points and is part
of the math library, libm. However, functions in libraries may have no man pages, so
we would need to be able to determine what library a function is part of without them.
Unfortunately, ltrace does not make it obvious which library a function is from. To
figure it out, we have to use the Linux tools ldd and objdump. First, ldd is used to
display which libraries are used by a dynamically linked application. Then, objdump is
used to search each of those libraries for the given function. In Listing 4.5, we use ldd
to see which libraries are used by the xeyes application.
Now that the ldd command has shown the libraries that xeyes uses, we can use the
objdump command to figure out which library the function is in. In Listing 4.6, we look
for the hypot symbol in each of the libraries that xeyes is linked to. The -T option of
objdump lists all the symbols (mostly functions) that the library relies on or provides.
By using fgrep to look at output lines that have .text in them, we can see which libraries
export the hypot function. In this case, we can see that the libm library is the only
library that contains the hypot function.
Listing 4.6.
The next step might be to look through the source of xeyes to figure out where hypot
is called and, if possible, reduce the number of calls made to it. An alternative solution
is to look at the source of hypot and try to optimize the source code of the library.
By enabling you to investigate which library calls are taking a long time to complete,
ltrace enables you to determine the cost of each library call that an application
makes.
4.2.4. ps
ps provides detailed static and dynamic statistics about currently running processes: static information, such as the command name and PID, as well as dynamic information, such as current memory and CPU use.
ps has many different options and can retrieve many different statistics about the
state of a running application. The following invocations are those options most
related to CPU performance and will show information about the given PID:
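The exact command line is not reproduced here; a likely reconstruction from the statistics described below is:
ps -o etime,time,pcpu,command -p <PID>
(The -o list comes from the option table that follows; -p is the standard ps option for selecting a single PID, and the precise form used in the original may differ slightly.)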
The ps command is probably one of the oldest and most feature-rich commands for extracting performance information, which can make its use overwhelming. By looking at only a subset of its total functionality, it becomes much more manageable. Table 4-7 contains the
options that are most relevant to CPU performance.
Option
Explanation
-o <statistic>
This option enables you to specify exactly what process statistics you want to track.
The different statistics are specified in a comma-separated list with no spaces.
etime
Statistic: Elapsed time is the amount of time since the program began execution.
time
Statistic: CPU time is the amount of system plus user time the process spent running
on the CPU.
pcpu
Statistic: The percentage of CPU that the process is currently consuming.
command
Statistic: The name of the command that the process is running.
-A
Shows statistics about all processes on the system.
-u user
Shows statistics about all processes with this effective user ID.
-U user
Shows statistics about all processes with this real user ID.
This example shows a test application that is consuming 88 percent of the CPU and
has been running for 6 seconds, but has only consumed 5 seconds of CPU time:
4.2.5. ld.so
When a dynamically linked application is executed, the Linux loader, ld.so, runs first. ld.so loads all the application's libraries and connects the symbols that the application uses with the functions the libraries provide. Because different libraries were originally linked at different and possibly overlapping places in memory, the loader needs to sort through all the symbols and make sure that each lives at a different place in memory. When a symbol is moved from one virtual address to another, this is called a relocation. It takes time for the loader to do this, and it is much better if it does not need to be done at all. The prelink application aims to achieve that by rearranging the system libraries of the entire system so that they do not overlap. An application with a high number of relocations may not have been prelinked.
The Linux loader usually runs without any intervention from the user; it is invoked automatically whenever a dynamically linked program is executed.
loader is hidden from the user, it still takes time to run and can potentially slow down
an application's startup time. When you ask for loader statistics, the loader shows the
amount of work it is doing and enables you to figure out whether it is a bottleneck.
The ld command is invisibly run for every Linux application that uses shared libraries.
By setting the appropriate environment variables, we can ask it to dump information
about its execution. The following invocation influences ld execution:
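The invocation did not survive in this copy; because the loader is driven entirely by environment variables, it amounts to running the program with the variable set, for example:
env LD_DEBUG=statistics ./app
(env and the program name ./app are just placeholders for whatever dynamically linked command you want to examine.)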
The debugging capabilities of the loader are completely controlled with environmental
variables. Table 4-8 describes these variables.
Option
Explanation
LD_DEBUG=statistics
Setting this environment variable causes the loader to print statistics about the relocations it performed and the time it spent before starting the application.
LD_DEBUG=help
Setting this causes the loader to print a description of all the possible LD_DEBUG values and then exit without running the application.
Table 4-9 describes some of the statistics that ld.so can provide. Time is given in
clock cycles. To convert this to wall time, you must divide by the processor's clock
speed. (This information is available from cat /proc/cpuinfo.)
Column
Explanation
total startup time in dynamic loader
The total amount of time (in clock cycles) spent in the loader before the application started to execute.
time needed for relocation
The total amount of time (in clock cycles) spent relocating symbols.
number of relocations
The number of new relocation calculations done before the application's execution began.
number of relocations from cache
The number of relocations that were precalculated and used before the application started to execute.
time needed to load objects
The time needed to load all the libraries that an application is using.
final number of relocations
The total number of relocations made during an application run (including those made by dlopen).
The information provided by ld can prove helpful in determining how much time is
being spent setting up dynamic libraries before an application begins executing.
Listing 4.8.
4.2.6. gprof
A powerful way to profile applications on Linux is to use the gprof profiling command.
gprof can show the call graph of an application and sample where the application's time is
spent. gprof works by first instrumenting your application and then running the
application to generate a sample file. gprof is very powerful, but requires application
source and adds instrumentation overhead. Although it can accurately determine the
number of times a function is called and approximate the amount of time spent in a
function, the gprof instrumentation will likely change the timing characteristics of the
application and slow down its execution.
To profile an application with gprof, you must have access to application source. You
must then compile that application with a gcc command similar to the following:
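The command itself is not shown here; a representative reconstruction (the file names are only placeholders) is:
gcc -pg -g3 -o burn burn.c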
First, you must compile the application with profiling turned on, using gcc's -pg option. You must take care not to strip the executable, and it is even more helpful to
turn on symbols when compiling using the -g3 option. Symbol information is
necessary to use the source annotation feature of gprof. When you run your
instrumented application, an output file is generated. You can then use the gprof
command to display the results. The gprof command is invoked as follows:
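The command box is missing here; based on the options in the table that follows, the general form is:
gprof [--brief] [-p | -q | -A] application gmon.out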
The options described in Table 4-10 specify what information gprof displays.
Option
Explanation
--brief
This option abbreviates the output of gprof. By default, gprof prints out all the
performance information and a legend to describe what each metric means. This
suppresses the legend.
-p or --flat-profile
This option prints out the total amount of time spent and the number of calls to each
function in the application.
-q or --graph
This option prints out a call graph of the profiled application. This shows how the
functions in the program called each other, how much time was spent in each of the
functions, and how much was spent in the functions of the children.
-A or --annotated-source
This shows the profiling information next to the original source code.
Not all the output statistics are available for a particular profile. Which output statistics are available depends on how the application was compiled for profiling.
When profiling an application with gprof, the first step is to compile the application
with profiling information. The compiler (gcc) inserts profiling information into the
application and, when the application is run, it is saved into a file named gmon.out.
The burn test application is fairly simple. It clears a large area of memory and then
calls two functions, a() and b(), which each touch this memory. Function a() touches
the memory 10 times as often as function b().
After we run it, we can analyze the output, which is shown in Listing 4.9.
In Listing 4.9, you can see gprof telling us what we already knew about the
application. It has two functions, a() and b(). Each function is called once, and a() takes roughly ten times as long to complete (91 percent of the time) as b() (8.99 percent). 5.06 seconds is spent in function a(), and 0.5 seconds is spent in function b().
Listing 4.10 shows the call graph for the test application. The <spontaneous>
comment listed in the output means that although gprof did not record any samples in
main(), it deduced that main() must have been run, because functions a() and b()
both had samples, and main was the only function in the code that called them. gprof
most likely did not record any samples in main() because it is a very short function.
Listing 4.10.
<spontaneous>
[1] 100.0 0.00 5.56 main [1]
5.06 0.00 1/1 a [2]
0.50 0.00 1/1 b [3]
-----------------------------------------------
5.06 0.00 1/1 main [1]
[2] 91.0 5.06 0.00 1 a [2]
-----------------------------------------------
0.50 0.00 1/1 main [1]
[3] 9.0 0.50 0.00 1 b [3]
-----------------------------------------------
[2] a [3] b
Finally, gprof can annotate the source code to show how often each function is called. Notice that Listing 4.11 does not show the time spent in each function; instead, it shows the number of times each function was called.
Listing 4.11.
void a(void)
1 -> {
int i=0,j=0;
for (j=0;j<10*ITER ; j++)
for (i=0;i<SIZE;i=i+STRIDE)
{
test[i]++;
}
}
void b(void)
1 -> {
int i=0,j=0;
for (j=0;j<ITER; j++)
for (i=0;i<SIZE;i=i+STRIDE)
{
test[i]++;
}
}
main()
##### -> {
/* Arbitrary value*/
memset(test, 42, SIZE);
a();
b();
}
Top 10 Lines:
Line Count
10 1
20 1
Execution Summary:
gprof provides a good summary of how many times functions or source lines in an
application have been run and how long they took.
4.2.7. oprofile
As discussed in Chapter 2, "Performance Tools: System CPU," you can use oprofile to track down the location of different events in the system or an application. oprofile is a lower-overhead tool than gprof. Unlike gprof, it does not require an application to be recompiled to be used. oprofile can also measure events that gprof cannot. Currently, however, oprofile can generate call graphs like those of gprof only with a patched kernel, whereas gprof can run on any Linux kernel.
oprofile has a series of tools that display samples that have been collected. The first
tool, opreport, displays information about how samples are distributed to the
functions within executables and libraries. It is invoked as follows:
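The command line did not survive here; a typical form, using the options described below, is:
opreport [-d] [-f] [-l] /path/to/application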
Table 4-11 describes a few commands that can modify the level of information that
opreport provides.
Option
Explanation
-d or --details
This shows an instruction-level breakdown of the samples attributed to each symbol.
-f or --long-filenames
This shows the complete path name of the application being analyzed.
-l or --symbols
This shows how an application's samples are distributed to its symbols. This enables
you to see what functions have the most samples attributed to them.
The next command that you can use to extract information about performance samples
is opannotate. opannotate can attribute samples to specific source lines or assembly
instructions. It is invoked as follows:
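Again, the command box is missing; given the options below, it is most likely:
opannotate [-s] [-a] /path/to/application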
The options described in Table 4-12 enable you to specify exactly what information
opannotate will provide. One word of caution: because of limitations in the processor
hardware counters at the source line and instruction level, the sample attributions
may not be on the exact line that caused them; however, they will be near the actual
event.
Option
Explanation
-s or --source
This shows the collected samples next to the application's source code.
-a or --assembly
This shows the samples collected next to the assembly code of the application.
-s and -a
If both -s and -a are specified, opannotate intermingles the source and assembly
code with the samples.
When using opannotate and opreport, it is always best to specify the full path name
to the application. If you do not, you may receive a cryptic error message (if oprofile
cannot find the application's samples). By default, when displaying results, opreport
only shows the executable name, which can be ambiguous in a system with multiple
executables or libraries with identical names. Always specify the -f option so that
opreport shows the complete path to the application.
oprofile also provides a command, opgprof, that can export the samples collected by
oprofile into a form that gprof can digest. It is invoked in the following way:
opgprof application
After the program finishes, we must dump oprofile's buffers to disk, or else the
samples will not be available to opreport. We do that using the following command:
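The command is not reproduced in this copy; with oprofile's standard control tool it is:
opcontrol --dump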
Next, in Listing 4.12 we ask opreport to tell us about the samples relevant to our test
application, /tmp/burn. This gives us an overall view of how many cycles our
application consumed. In this case, we see that 9,939 samples were taken for our
application. As we dig into the oprofile tools, we will see how these samples are
distributed within the burn application.
Listing 4.12.
Listing 4.13.
In Listing 4.14, we ask opreport to show us which virtual addresses have samples
attributed to them. In this case, it appears as if the instruction at address 0x0804838a
has 75 percent of the samples. However, it is currently unclear what this instruction is
doing or why.
Listing 4.14.
Generally, it is more useful to know the source line that is using all the CPU time
rather than the virtual address of the instruction that is using it. It is not always easy
to figure out the correspondence between a source line and a particular instruction;
so, in Listing 4.15, we ask opannotate to do the hard work and show us the samples
relative to the original source code (rather than the instruction's virtual address).
*/
/*
* 9936 100.0000
*/
:#include <string.h>
:
:#define ITER 10000
:#define SIZE 10000000
:#define STRIDE 10000
:
:char test[SIZE];
:
:void a(void)
:{ /* a total: 9033 90.9118 */
: int i=0,j=0;
8 0.0805 : for (j=0;j<10*ITER ; j++)
8603 86.5841 : for (i=0;i<SIZE;i=i+STRIDE)
: {
422 4.2472 : test[i]++;
: }
:}
:
:void b(void)
:{ /* b total: 903 9.0882 */
: int i=0,j=0;
: for (j=0;j<ITER; j++)
853 8.5849 : for (i=0;i<SIZE;i=i+STRIDE)
: {
50 0.5032 : test[i]++;
: }
As you can see in Listing 4.15, opannotate attributes most of the samples (86.58 percent) to the inner for loop in function a(). Unfortunately, this is a portion of the for loop
that should not be expensive. Adding a fixed amount to an integer is very fast on
modern processors, so the samples that oprofile reported were likely attributed to
the wrong source line. The line below, test[i]++;, should be very expensive because
it accesses the memory subsystem. This line is where the samples should have been
attributed.
oprofile can mis-attribute samples for a few reasons beyond its control. First, the
processor does not always interrupt on the exact line that caused the event to occur.
This may cause samples to be attributed to instructions near the cause of the event,
rather than to the exact instruction that caused the event. Second, when source code
is compiled, compilers often rearrange instructions to make the executable more
efficient. After a compiler has finished optimizing, code may not execute in the order
that it was written. Separate source lines may have been rearranged and combined, so a particular instruction may be the result of more than one source line, or may even be a compiler-generated intermediate piece of code that does not exist in the original source. Consequently, there may no longer be a one-to-one mapping between the original lines of source code and the generated machine instructions, which can make it
difficult or impossible for oprofile (and debuggers) to figure out exactly which line of
source code corresponds to each machine instruction. However, oprofile tries to be
as close as possible, so you can usually look a few lines above and below the line with
the high sample count and figure out which code is truly expensive. If necessary, you
can use opannotate to show the exact assembly instructions and virtual addresses
that are receiving all the samples. It may be possible to figure out what the assembly
instructions are doing and then map it back to your original source code by hand.
oprofile's sample attribution is not perfect, but it is usually close enough. Even with
these limitations, the profiles provided by oprofile show the approximate source line
to investigate, which is usually enough to figure out where the application is slowing
down.
4.2.8. Languages: Static (C and C++) Versus Dynamic (Java and Mono)
The majority of the Linux performance tools support analysis of static languages such
as C and C++, and all the tools described in this chapter work with applications
written in these languages. The tools ltrace, strace, and time work with applications
written in dynamic languages such as Java, Mono, Python, or Perl. However, the
profiling tools gprof and oprofile cannot be used with these types of applications.
Fortunately, most dynamic languages provide non-Linux specific profiling
infrastructures that you can use to generate similar types of profiles.
For Java applications, if the java command is run with the -Xrunhprof command-line option, it profiles the application. More details are available at http://antprof.sourceforge.net/hprof.html. For Mono applications, if the mono executable is passed the --profile flag, it profiles the application. More details are available at http://www.go-mono.com/performance.html. Perl and Python have similar profiling functionality, with Perl's Devel::DProf described at http://www.perl.com/pub/a/2004/06/25/profiling.html and Python's profiler described at http://docs.python.org/lib/profile.html.
Subsequent chapters investigate how to find bottlenecks that are not CPU bound. In
particular, you learn about the tools used to find I/O bottlenecks, such as a saturated
disk or an overloaded network.
When an application is using physical memory, it begins to interact with the CPU's
cache subsystem. Modern CPUs have multiple levels of cache. The fastest cache is
closest to the CPU (also called L1 or Level 1 cache) and is the smallest in size.
Suppose, for instance, that the CPU has only two levels of cache: L1 and L2. When the
CPU requests a piece of memory, the processor checks to see whether it is already in
the L1 cache. If it is, the CPU uses it. If it is not in the L1 cache, the processor generates an L1 cache miss. It then checks the L2 cache; if the data is in the L2
cache, it is used. If the data is not in the L2 cache, an L2 cache miss occurs, and the
processor must go to physical memory to retrieve the information. Ultimately, it would
be best if the processor never goes to physical memory (because it finds the data in
the L1 or even L2 cache). Smart cache use (rearranging an application's data structures and reducing code size, for example) may make it possible to reduce the number of cache misses and increase performance. cachegrind and oprofile are
great tools to find information about how an application is using the cache and about
which functions and data structures are causing cache misses.
5.2.1. ps
ps has many different options and can retrieve many different statistics about the
state of a running application. As you saw in the previous chapter, ps can retrieve
information about the CPU that a process is spending, but it also can retrieve
information about the amount and type of memory that a process is using. It can be
invoked with the following command line:
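The command line is missing here; a likely reconstruction from the statistics described below is:
ps -o vsz,rss,tsiz,dsiz,majflt,minflt,pmem,command -p <PID>
(The exact option spellings may vary slightly between ps versions.)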
Table 5-1 describes the different types of memory statistics that ps can display for a
given PID.
Option
Explanation
-o <statistic>
Enables you to specify exactly what process statistics you want to track. The different
statistics are specified in a comma-separated list with no spaces.
vsz
Statistic: The virtual set size is the amount of virtual memory that the application is using. Because Linux only allocates physical memory when an application tries to use it, this value may be much greater than the amount of physical memory the application is actually using.
rss
Statistic: The resident set size is the amount of physical memory the application is
currently using.
tsiz
Statistic: Text size is the virtual size of the program code. Once again, this isn't the
physical size but rather the virtual size; however, it is a good indication of the size of
the program.
dsiz
Statistic: Data size is the virtual size of the program's data usage. This is a good
indication of the size of the data structures and stack of the application.
majflt
Statistic: Major faults are the number of page faults that caused Linux to read a page
from disk on behalf of the process. This may happen if the process accessed a piece of
data or instruction that remained on the disk and Linux loaded it seamlessly for the
application.
minflt
Statistic: Minor faults are the number of faults that Linux could fulfill without
resorting to a disk read. This might happen if the application touches a piece of
memory that has been allocated by the Linux kernel. In this case, it is not necessary to
go to disk, because the kernel can just pick a free piece of memory and assign it to the
application.
pmem
Statistic: The percentage of the system memory that the process is consuming.
command
Statistic: The name of the command that the process is running.
As mentioned in the preceding chapter, ps is flexible in regard to how you select the group of PIDs for which statistics are displayed. ps --help provides information on how to specify different groups of PIDs.
Listing 5.1 shows the burn test application running on the system. We ask ps to tell us
information about the memory statistics of the process.
As Listing 5.1 shows, the burn application has a very small text size (1KB), but a very
large data size (11,122KB). Of the total virtual size (11,124KB), the process has a
slightly smaller resident set size (10,004KB), which represents the total amount of
physical memory that the process is actually using. In addition, most of the faults
generated by burn were minor faults, so most of the memory faults were due to
memory allocation rather than loading in a large amount of text or data from the
program image on the disk.
5.2.2. /proc/<PID>
The Linux kernel provides a virtual file system that enables you to extract information
about the processes running on the system. The information provided by the /proc file
system is usually only used by performance tools such as ps to extract performance
data from the kernel. Although it is not normally necessary to dig through the files in
/proc, it does provide some information that you cannot retrieve with other
performance tools. In addition to many other statistics, /proc provides information
about a process's use of memory and mapping of libraries.
The interface to /proc is straightforward. /proc provides many virtual files that you
can cat to extract their information. Each running PID in the system has a
subdirectory in /proc. Within this subdirectory is a series of files containing
information about that PID. One of these files, status, provides information about the
status of a given process PID. You can retrieve this information by using the following
command:
cat /proc/<PID>/status
Table 5-2 describes the memory statistics displayed in the status file.
Option
Explanation
VmSize
This is the process's virtual set size, which is the amount of virtual memory that the
application is using. Because Linux only allocates physical memory when an
application tries to use it, this value may be much greater than the amount of physical
memory the application is actually using. This is the same as the vsz parameter
provided by ps.
VmLck
This is the amount of memory that has been locked by this process. Locked memory
cannot be swapped to disk.
VmRSS
This is the resident set size or amount of physical memory the application is currently
using. This is the same as the rss statistic provided by ps.
VmData
This is the data size, or the virtual size of the program's data usage. Unlike the dsiz statistic of ps, this does not include stack information.
VmStk
This is the virtual size of the process's stack.
VmExe
This is the virtual size of the executable memory that the program has. It does not include libraries that the process is using.
VmLib
This is the virtual size of the libraries that the process is using.
Another one of the files present in the <PID> directory is the maps file. This provides
information about how the process's virtual address space is used. You can retrieve it
by using the following command:
cat /proc/<PID>/maps
Option
Explanation
Address
This is the address range within the process where the library is mapped.
Permissions
These are the permissions of the memory region, where r = read, w = write, x =
execute, s = shared, and p = private (copy on write).
Offset
This is the offset into the library/application where the memory region mapping
begins.
Device
This is the device (minor and major number) where this particular file exists.
Inode
This is the inode number of the mapped file on that device.
Pathname
This is the path name of the file that is mapped into the process.
The information that /proc provides can help you understand how an application is
allocating memory and which libraries it is using.
Listing 5.2 shows the burn test application running on the system. First, we use ps to
find the PID (4540) of burn. Then we extract the process's memory statistics using the
/proc status file.
Listing 5.2.
As Listing 5.2 shows, once again we see that the burn application has a very small text
size (4KB) and stack size (8KB), but a very large data size (9,776KB) and a reasonably
sized library size (1,312KB). The small text size means that the process does not have
much executable code, whereas the moderate library size means that it is using a
library to support its execution. The small stack size means that the process is not
calling deeply nested functions or is not calling functions that use large or many
temporary variables. The VmLck size of 0KB means that the process has not locked any
pages into memory, making them unswappable. The VmRSS size of 10,004KB means
that the application is currently using 10,004KB of physical memory, although it has either allocated or mapped the VmSize of 11,124KB. If the application begins to use memory that it has allocated but is not currently using, the VmRSS size increases but the VmSize stays unchanged.
Listing 5.3.
As you see in Listing 5.3, the burn application is using two libraries: ld and libc. The
text section (denoted by the permission r-xp) of libc has a range of 0x4002f000
through 0x40162000 or a size of 0x133000 or 1,257,472 bytes.
The data section (denoted by permission rw-p) of libc has a range of 0x40162000 through 0x40166000, or a size of 0x4000 or 16,384 bytes. The text size of libc is bigger than ld's text size of 0x15000 or 86,016 bytes. The data size of libc is also bigger than ld's data size of 0x1000 or 4,096 bytes. libc is the big library that burn is linking in.
/proc proves to be a useful way to extract performance statistics directly from the
kernel. Because the statistics are text based, you can use the standard Linux tools to
access them.
5.2.3. memprof
memprof is a graphical application, but has a few command-line options that modify its
execution. It is invoked with the following command:
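The invocation did not survive in this copy; based on the options described below, it is roughly:
memprof [--follow-fork] [--follow-exec] application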
memprof profiles the given "application" and creates a graphical display of its memory
usage. Although memprof can be run on any application, it can provide more
information if the application and the libraries that it relies on are compiled with
debugging symbols.
Table 5-4 describes the options that manipulate the behavior of memprof if it is
monitoring an application that calls fork or exec. This normally happens when an
application launches a new process or executes a new command.
Option
Explanation
--follow-fork
This option will cause memprof to launch a new window for the newly forked process.
--follow-exec
This option will cause memprof to continue profiling an application after it has called
exec.
Once invoked, memprof creates a window with a series of menus and options that
enable you to select an application that you are going to profile.
Suppose that I have the example code in Listing 5.4 and I want to profile it. In this
application, which I call memory_eater, the function foo() does not allocate any
memory, but it calls the function bar(), which does.
Listing 5.4.
#include <stdlib.h>
void bar(void)
{
malloc(10000);
}
void foo(void)
{
int i;
for (i=0; i<100;i++)
bar();
}
int main()
{
foo();
while(1);
}
After compiling this application with the -g3 flag (so that the application has symbols
included), we use memprof to profile this application:
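The command itself is missing here; it is simply memprof run against the compiled binary (memory_eater is the name used in the text above):
memprof ./memory_eater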
memprof creates the application window shown in Figure 5-1. As you can see, it shows
memory usage information about the memory_eater application, as well as a series of
buttons and menus that enable you to manipulate the profile.
Figure 5-1.
If you click the Profile button, memprof shows the memory profile of the application.
The first information box in Figure 5-2 shows how much memory each function is
consuming (denoted by "self"), as well as the sum of the memory that the function and
its children are consuming (denoted by "total"). As expected, the foo() function does not allocate any memory itself, so its self value is 0, whereas its total value is 1,000,000 (100 calls to bar(), each allocating 10,000 bytes), because it calls a function that does allocate memory.
Figure 5-2.
The children and callers information boxes change as you click different functions in
the top box. This way, you can see which functions of an application are using
memory.
memprof provides a way to graphically traverse a large amount of data about memory allocation. It makes it easy to determine the memory allocation of a given function and of each function that it calls.
5.2.4. valgrind (cachegrind)
Although very useful, valgrind's cache statistics are inexact (because valgrind simulates the processor rather than using the actual hardware). valgrind will
not account for cache misses normally caused by system calls to the Linux kernel or
cache misses that happen because of context switching. In addition, valgrind runs
applications at a much slower speed than a natively executing program. However,
valgrind provides a great first approximation of the cache usage of an application.
valgrind can be run on any executable; if the program has been compiled with
symbols (passed as -g3 to gcc when compiling), however, it will be able to pinpoint
the exact line of code responsible for the cache usage.
When using valgrind to analyze cache usage for a particular application, there are
two phases: collection and annotation. The collection phase starts with the following
command line:
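The command line is not reproduced here; with the valgrind 2.0 series discussed in the text, the cachegrind skin is selected like this (newer valgrind releases renamed the option to --tool=cachegrind):
valgrind --skin=cachegrind application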
valgrind is a flexible tool that has a few different "skins" that allow it to perform
different types of analysis. During the collection phase, valgrind uses the cachegrind
skin to collect information about cache usage. The application in the preceding
command line represents the application to profile. The collection phase prints
summary information to the screen, but it also saves more detailed statistics in a file
named cachegrind.out.pid, where pid is the PID of the profiled application as it was
running. When the collection phase is complete, the command cg_annotate is used to map the cache usage back to the application source code. cg_annotate is invoked in the
following way:
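The command box is missing here; reconstructed from the options described below, it is roughly the following, where the profiled application's PID selects the matching cachegrind.out file (the exact spelling of the PID option varies between cg_annotate versions):
cg_annotate --<pid> --auto=yes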
cg_annotate takes the information generated by valgrind and uses it to annotate the
application that was profiled. The --pid option is required, where pid is the PID of the
profile that you are interested in. By default, cg_annotate just shows cache usage at
the function level. If you set --auto=yes, cache usage displays at the source-line level.
This example shows valgrind (v2.0) running on a simple application. The application
clears a large area of memory and then calls two functions, a() and b(), and each
touches this memory. Function a() touches the memory ten times as often as function
b().
First, as shown in Listing 5.5, we run valgrind on the application using the
cachegrind skin.
Listing 5.5.
In the run of Listing 5.5, the application executed 11,317,111 instructions; this is shown by the I refs statistic. The process had an astonishingly low number of misses in both the L1 (215) and L2 (216) instruction caches, as denoted by an I1 and L2i miss rate of 0.0 percent. The process made a total of 6,908,012 data references, of which 4,405,958 were reads and 2,502,054 were writes. 24.9 percent of the reads and 12.4 percent of the writes could not be satisfied by the L1 cache. Luckily, the reads can almost always be satisfied by the L2 data cache, and they are shown to have a miss rate of 0 percent. The writes are still a problem, with a miss rate of 12.4 percent.
In this application, memory access of the data is the problem to investigate.
The ideal application would have a very low number of instruction cache and data
cache misses. To eliminate instruction cache misses, it may be possible to recompile
the application with different compiler options or trim code, so that the hot code does
not have to share icache space with code that is not used often. To eliminate data cache misses, use arrays rather than linked lists where possible, reduce the size of elements in data structures, and access memory in a cache-friendly way. In any event, valgrind helps to point out which accesses/data structures should
be optimized. This application run summary shows that data accesses are the main
problem.
As shown in Listing 5.5, this command displays cache usage statistics for the overall
run. However, when developing an application, or investigating a performance
problem, it is often more interesting to see where cache misses happen rather than
just the totals during an application's runtime. To determine which functions are
responsible for the cache misses, we run cg_annotate, as shown in Listing 5.6. This
shows us which functions are responsible for which cache misses. As we expect, the
function a() has 10 times (1,000,000) the misses of the function b() (100,000).
Listing 5.6.
----------------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
----------------------------------------------------------------------------------------
8,009,011 2 2 4,003,003 1,000,000 989 1,004 0 0 burn.c:a
2,500,019 3 3 6 1 1 2,500,001 312,500 312,500 ???:__GI_memset
800,911 2 2 400,303 100,000 0 104 0 0 burn.c:b
Listing 5.7.
----------------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
----------------------------------------------------------------------------------------
8,009,011 2 2 4,003,003 1,000,000 989 1,004 0 0 burn.c:a
2,500,019 3 3 6 1 1 2,500,001 312,500 312,500 ???:__GI_memset
800,911 2 2 400,303 100,000 0 104 0 0 burn.c:b
-- line 2 ----------------------------------------
. . . . . . . . .
. . . . . . . . . #define ITER 100
. . . . . . . . . #define SZ 10000000
. . . . . . . . . #define STRI 10000
. . . . . . . . .
. . . . . . . . . char test[SZ];
. . . . . . . . .
. . . . . . . . . void a(void)
3 0 0 . . . 1 0 0 {
2 0 0 . . . 2 0 0 int i=0,j=0;
5,004 1 1 2,001 0 0 1 0 0 for(j=0;j<10*ITER ; j++)
5,004,000 0 0 2,001,000 0 0 1,000 0 0 for(i=0;i<SZ;i=i+STRI)
. . . . . . . . . {
3,000,000 1 1 2,000,000 1,000,000 989 . . . test[i]++;
. . . . . . . . . }
2 0 0 2 0 0 . . . }
. . . . . . . . .
. . . . . . . . . void b(void)
3 1 1 . . . 1 0 0 {
2 0 0 . . . 2 0 0 int i=0,j=0;
504 0 0 201 0 0 1 0 0 for (j=0;j<ITER; j++)
500,400 1 1 200,100 0 0 100 0 0 for (i=0;i<SZ;i=i+STRI
. . . . . . . . . {
300,000 0 0 200,000 100,000 0 . . . test[i]++;
. . . . . . . . . }
2 0 0 2 0 0 . . . }
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . main()
6 2 2 . . . 1 0 0 {
. . . . . . . . .
. . . . . . . . . /* Arbitrary value*/
6 0 0 . . . 4 0 0 memset(test, 42, SZ);
1 0 0 . . . 1 0 0 a();
1 0 0 . . . 1 0 0 b();
2 0 0 2 1 1 . . . }
-----------------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
-----------------------------------------------------------------------------------------
78 3 3 100 100 78 0 0 0 percentage of events annotated
The different level of detail (program level, function level, and line level) that
valgrind/cachegrind provides can give you a good idea of which parts of an
application are accessing memory and effectively using the processor caches.
5.2.5. kcachegrind
Similar to valgrind, when using kcachegrind to analyze cache usage for a particular application, there are two phases: collection and annotation. The collection phase starts with the following command line:
calltree application
The calltree command accepts many different options to manipulate the information
to be collected. Table 5-5 shows some of the more important options.
Option
Explanation
--help
This provides a brief explanation of all the different collection methods that calltree
supports.
--dump-instr=yes|no
This collects information at the level of individual instructions rather than only at the source-line level, which allows the results to be annotated down to the assembly code.
--trace-jump=yes|no
This includes branching information, or information about which path is taken at each
branch.
calltree can record many different statistics. Refer to calltree's help option for
more details.
When the collection phase is complete, the command kcachegrind is used to map the
cache usage back to the application source code. kcachegrind is invoked in the
following way:
kcachegrind cachegrind.out.pid
kcachegrind displays the cache profiles statistics that have been collected and
enables you to navigate through the results.
The first step in using kcachegrind is to compile the application with symbols to allow
sample-to-source line mappings. This is done with the following command:
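The compile line is not shown; it is presumably the same style used earlier, with debugging symbols enabled (the file names are placeholders):
gcc -g3 -o burn burn.c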
Next, run calltree against that application, as shown in Listing 5.8. This provides
output similar to cachegrind, but, most importantly, it generates a cachegrind.out
file, which will be used by kcachegrind.
Listing 5.8.
When we have the cachegrind.out file, we can start kcachegrind (v.0.54) to analyze
the data by using the following command:
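The command is missing in this copy; following the invocation shown above, it is kcachegrind pointed at the file that calltree produced:
kcachegrind cachegrind.out.<pid>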
This brings up the window shown in Figure 5-3. The window shows a flat profile of all
the cache misses in the left pane. By default, data read misses from the L1 cache are
shown.
Figure 5-3.
Next, in Figure 5-4, in the upper-right pane, we can see a visualization of the callee
map, or all the functions (a() and b()) that the function in the left pane (main) calls.
In the lower-right pane, we can see the application's call graph.
Figure 5-4.
Finally, in Figure 5-5, we select a different function to examine in the left pane. We
also select a different event to examine (instruction fetches) using the upper-right
pane. Finally, we can visualize the loops in the assembly code using the lower-right
pane.
Figure 5-5.
These examples barely scratch the surface of what kcachegrind can do, and the best
way to learn about it is to try it. kcachegrind is an extraordinarily useful tool for those
who like to investigate performance issues visually.
5.2.6. oprofile
As you have seen in previous chapters, oprofile is a powerful tool that can help to determine where application time is spent. However, oprofile can also work with the processor's performance counters to provide an accurate view of how the application performs. This discussion of oprofile does not add any new command-line options, because they have already been described in the section outlining CPU performance. However, one
command becomes more important as you start to sample events different from
oprofile's defaults. Different processors and architectures can sample different sets
of events and oprofile's op_help command displays the list of events that your
current processor supports.
As mentioned previously, the events that oprofile can monitor are processor specific,
so these examples are run on my current machine, which is a Pentium-III. On the
Pentium-III, we can use the performance counters exposed by oprofile to gather information similar to that provided by valgrind with the cachegrind skin. This
uses the performance counter hardware rather than software simulation. Using the
performance hardware presents two pitfalls. First, we must deal with the underlying
hardware limitations of the performance counters. On the Pentium-III, oprofile can
only measure two events simultaneously, whereas cachegrind could measure many
types of memory events simultaneously. This means that for oprofile to measure the
same events as cachegrind, we must run the application multiple times and change
the events that oprofile monitors during each run. The second pitfall is that oprofile, unlike cachegrind, does not provide an exact count of the events; it only samples the counters, so we will be able to see where the events most likely occur, but we will not be able to see the exact number. In fact, we will only receive about (1/sample rate) of the events as samples. If an application only causes an event to happen a few times, it might not be recorded at all. Although it can be frustrating not to know the exact number of events when debugging a performance problem, it is usually most important to figure out the relative number of samples between code lines. Even though
you will not be able to directly determine the total number of events that occurred at a
particular source line, you will be able to figure out the line with most events, and that
is usually enough to start debugging a performance problem. This inability to retrieve
exact event counts is present in every form of CPU sampling and will be present on
any processor performance hardware that implements it. However, it is really this
limitation that allows this performance hardware to exist at all; without it, the
performance monitoring would cause too much overhead.
oprofile can monitor many different types of events, which vary on different
processors. This section outlines some of the important events on the Pentium-III as an example. The Pentium-III can monitor the L1 data cache (called the DCU on the
Pentium-III) by using the following events (events and descriptions provided by
op_help):
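The event list itself did not survive in this copy; on P6-family processors such as the Pentium-III, the relevant events are most likely the following (check op_help on your own system for the authoritative names and descriptions):
DATA_MEM_REFS (all loads and stores seen by the L1 data cache)
DCU_LINES_IN (lines brought into the L1 data cache, effectively L1 data misses)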
The Pentium-III actually supports many more than these, but these are the basic load
and store events. (cachegrind calls a load a "read" and a store a "write.") We can use these events to see how an application is using the L1 data cache.
On the Pentium-III, there are a similar series of events to monitor the instruction
cache. For the L1 instruction cache (what the Pentium-III calls the "IFU"), we can
measure the number of reads (or fetches) and misses directly by using the following
events:
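The event names are missing here; they are most likely the following (again, op_help is authoritative):
IFU_IFETCH (instruction fetches from the L1 instruction cache)
IFU_IFETCH_MISS (instruction fetches that miss the L1 instruction cache)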
We can also measure the number of instructions that were fetched from the L2 cache
by using the following event:
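The event name did not survive in this copy; it is most likely:
L2_IFETCH (instruction fetches satisfied by the L2 cache)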
Unfortunately, as mentioned previously, the processor shares the L2 cache between instructions and data, so there is no way to distinguish cache misses caused by data usage from those caused by instruction usage. We can use these events to approximate the same information
that cachegrind provided.
If it is so difficult to extract the same information, why use oprofile at all? First,
oprofile is much lower overhead (< 10 percent) and can be run on production
applications. Second, oprofile records the events that actually happen on the real hardware. This is much better than a possibly inaccurate simulation that does not take into account cache usage by the operating system or other applications.
Although these events are available on the Pentium-III, they are not necessarily
available on any other processors. Each family of Intel and AMD processors has a
different set of events that can be used to reveal different amounts of information.
To understand how you can use oprofile to extract cache information, compare the
cache usage statistics of an application using both the virtual CPU of cachegrind and
the actual CPU with oprofile. Let's run the burn program that we have been using as
an example. Once again, it is an application that executes ten times the number of
instructions in function a() as in function b(). It also accesses ten times the amount of
data in function a() as in function b(). Here is the output of cachegrind for this demo
application:
------------------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
------------------------------------------------------------------------------------------
883,497,211 215 214 440,332,658 110,000,288 1,277 2,610,954 312,534 312,533 PROGRAM TOTALS
------------------------------------------------------------------------------------------
Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
------------------------------------------------------------------------------------------
800,900,011 2 2 400,300,003 100,000,000 989 100,004 0 0 ???:a
80,090,011 2 2 40,030,003 10,000,000 0 10,004 0 0 ???:b
In the next few examples, I run the data collection phase of oprofile multiple times. I
use oprof_start to set up the events for the particular run, and then run the demo
application. Because my CPU has only two counters, I have to do this multiple times.
This means that different sets of events will be monitored during the different
executions of the program. Because my application does not change how it executes
from each run, each run should produce similar results. This is not necessarily true for
a more complicated application, such as a Web server or database, each of which can
dramatically change how it executes based on the requests made to it. However, for
the simple test application, it works just fine.
After I collect this sampling information, we use opreport to extract it. As shown in
Listing 5.9, we query oprofile about the number of data memory references that were
made and how many times there was a miss in the L1 data cache (DCU). As cachegrind
told us, function a() has ten times the number of memory references and ten times the
number of L1 data misses of function b().
Listing 5.9.
Now look at Listing 5.10, which shows opreport examining similar information for the
instruction cache. Notice that the instructions executed in function a() are ten times
the number executed in function b(), as we expect. However, notice that the number of
L1 I misses differs from what cachegrind predicts. Most likely, other applications and
the kernel are polluting the cache and causing burn to miss in the icache. (Remember
that cachegrind does not take kernel or other application cache usage into account.)
This also probably happened with the data cache, but because the number of data
cache misses caused by the application was so high, the extra events
were lost in the noise.
Listing 5.10.
As you can see when comparing the output of cachegrind and oprofile, using oprofile
to gather information about memory usage is powerful because oprofile has low
overhead and uses the processor's performance hardware directly, but it can be
difficult to find events that match those you are interested in.
5.2.7. ipcs
If ipcs is invoked without any parameters, it gives a summary of all the shared
memory on the system. This includes information about the owner and size of the
shared memory segment. Table 5-6 describes options that cause ipcs to display
different types of information about the shared memory in the system.
Option
Explanation
-t
This shows the time when the shared memory was created, when a process last
attached to it, and when a process last detached from it.
-u
This provides a summary about how much shared memory is being used and whether
it has been swapped or is in memory.
-l
This shows the system-wide limits on shared memory usage.
-p
This shows the PIDs of the processes that created and last used the shared memory
segments.
-c
This shows the users who are the creators and owners of the shared memory segments.
First, in Listing 5.11, we ask ipcs how much of the system memory is being used for
shared memory. This is a good overall indication of the state of shared memory in the
system.
In this case, we can see that 21 different segments or pieces of shared memory have
been allocated. All these segments consume a total of 1,585 pages of memory; 720 of
these exist in physical memory and 412 have been swapped to disk.
Next, in Listing 5.12, we ask ipcs for a general overview of all the shared memory
segments in the system. This indicates who is using each memory segment. In this
case, we see a list of all the shared segments. For one in particular, the one with a
shared memory ID of 65538, the user (ezolt) is the owner. It has a permission of 600 (a
typical UNIX permission), which in this case means that only ezolt can read and
write to it. It is 393,216 bytes in size, and 2 processes are attached to it.
Listing 5.12.
Finally, we can figure out exactly which processes created the shared memory
segments and which other processes are using them, as shown in Listing 5.13. For the
segment with shmid 32769, we can see that PID 1229 created it and PID 11954 was
the last to use it.
Listing 5.13.
After we have the PID responsible for the allocation and use, we can use a command
such as ps -o command PID to track the PID back to the process name.
If shared memory usage becomes a significant portion of the system total, ipcs is a
good way to track down the exact programs that are creating and using the shared memory.
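Judging from the options in Table 5-6, the ipcs invocations behind Listings 5.11
through 5.13 would be roughly as follows:

    ipcs -u     (summary of shared memory usage)
    ipcs        (overview of all shared memory segments)
    ipcs -p     (PIDs that created and last used each segment)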
As with the CPU performance tools, most of the tools discussed in this chapter support
analysis of static languages such as C and C++. Of the tools that we investigated, only
ps, /proc, and ipcs work with dynamic languages such as Java, Mono, Python, and
Perl. The cache and memory-profiling tools, such as oprofile, cachegrind, and
memprof, do not. As with CPU profiling, each of these languages provides custom tools
to extract information about memory usage.
For Java applications, if the java command is run with the -Xrunhprof command-line
option, it profiles the application's memory usage. You can find more details at
http://antprof.sourceforge.net/hprof.html or by running the java command with the
-Xrunhprof:help option. For Mono applications, if the mono executable is passed the
--profile flag, it also profiles the memory usage of the application. You can find
more details about this at http://www.go-mono.com/performance.html. Perl and
Python do not appear to have similar functionality.
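As a quick illustration (the class name MyApplication and the assembly myprogram.exe
are hypothetical), the profilers mentioned above are enabled from the command line:

    java -Xrunhprof MyApplication
    mono --profile myprogram.exe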
The next chapter moves away from memory to investigate disk I/O bottlenecks.
• Determine the total amount and type (read/write) of disk I/O on a
system (vmstat).
• Determine which devices are servicing most of the disk I/O (vmstat, iostat,
sar).
• Determine how effectively a particular disk is fielding I/O requests (iostat).
• Determine which processes are using a given set of files (lsof).
When an application does a read or write, the Linux kernel may have a copy of the file
stored in its cache or buffers, in which case it returns the requested information
without ever accessing the disk. If the Linux kernel does not have a copy of the data stored in
memory, however, it adds a request to the disk's I/O queue. If the Linux kernel notices
that multiple requests are asking for contiguous locations on the disk, it merges them
into a single big request. This merging increases overall disk performance by
eliminating the seek time for the second request. When the request has been placed in
the disk queue, if the disk is not currently busy, it starts to service the I/O request. If
the disk is busy, the request waits in the queue until the drive is available, and then it
is serviced.
6.2.1. vmstat
As you saw in Chapter 2, "Performance Tools: System CPU," vmstat is a great tool to
give an overall view of how the system is performing. In addition to CPU and memory
statistics, vmstat can provide a system-wide view of I/O performance.
To use vmstat to retrieve disk I/O statistics from the system, invoke it
as follows:
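Based on the options in Table 6-1, the invocation takes roughly this form (square
brackets denote optional arguments):

    vmstat [-D] [-d] [-p partition] [interval [count]]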
Table 6-1 describes the other command-line parameters that influence the disk I/O
statistics that vmstat will display.
Option
Explanation
-D
This displays Linux I/O subsystem total statistics. This option can give you a good idea
of how your I/O subsystem is being used, but it won't give statistics on individual
disks. The statistics given are the totals since system boot, rather than just those that
occurred between this sample and the previous sample.
-d
This option displays individual disk statistics at a rate of one sample per interval.
The statistics are the totals since system boot, rather than just those that occurred
between this sample and the previous sample.
-p partition
This displays performance statistics about the given partition at a rate of one sample
per interval. The statistics are the totals since system boot, rather than just those
that occurred between this sample and the previous sample.
interval
The length of time (in seconds) between samples.
count
The total number of samples to take.
If you run vmstat without any parameters other than [interval] and [count], it
shows you the default output. This output contains three columns relevant to disk I/O
performance: bo, bi, and wa. These statistics are described in Table 6-2.
Statistic
Explanation
bo
This indicates the number of total blocks written to disk in the previous interval. (In
vmstat, block size for a disk is typically 1,024 bytes.)
bi
This shows the number of blocks read from the disk in the previous interval. (In
vmstat, block size for a disk is typically 1,024 bytes.)
wa
This indicates the amount of CPU time spent waiting for I/O to complete.
When running with the -D mode, vmstat provides statistical information about the
system's disk I/O system as a whole. Information about these statistics is provided in
Table 6-3. (Note that more information about these statistics is available in the Linux
kernel source package, under Documentation/iostats.txt.)
Statistic
Explanation
disks
The total number of disks in the system.
partitions
The total number of partitions in the system.
total reads
The total number of reads that have been made to all disks.
merged reads
The total number of times that different reads to adjacent locations on the disk were
merged to improve performance.
read sectors
The total number of sectors read from disk. (A sector is usually 512 bytes.)
milli reading
The amount of time (in ms) spent reading from the disk.
writes
The total number of writes that have been made to all disks.
merged writes
The total number of times that different writes to adjacent locations on the disk were
merged to improve performance.
written sectors
The total number of sectors written to disk. (A sector is usually 512 bytes.)
milli writing
The amount of time (in ms) spent writing to the disk.
inprogress IO
The total number of I/O operations currently in progress. Note that there is a bug in
recent versions (v3.2) of vmstat in which this value is incorrectly divided by 1,000,
which almost always yields 0.
milli spent IO
This is the number of milliseconds spent waiting for I/O to complete. Note that there is
a bug in recent versions (v3.2) of vmstat in which this is the number of seconds spent
on I/O rather than milliseconds.
The -d option of vmstat displays I/O statistics of each individual disk. These statistics
are similar to those of the -D option and are described in Table 6-4.
Statistic
Explanation
reads: total
The total number of reads that have been requested for this disk.
reads: merged
The total number of times that different reads to adjacent locations on the disk were
merged to improve performance.
reads: sectors
The total number of sectors read from disk. (A sector is usually 512 bytes.)
reads: ms
The amount of time (in ms) spent reading from the disk.
writes: total
The total number of writes that have been requested for this disk.
writes: merged
The total number of times that different writes to adjacent locations on the disk were
merged to improve performance.
writes: sectors
The total number of sectors written to disk. (A sector is usually 512 bytes.)
writes: ms
The amount of time (in ms) spent writing to the disk.
IO: cur
The total number of I/O operations currently in progress. Note that there is a bug in
recent versions of vmstat in which this value is incorrectly divided by 1,000, which
almost always yields 0.
IO: s
The number of seconds spent waiting for I/O to complete.
Finally, when asked to provide partition-specific statistics, vmstat displays those listed
in Table 6-5.
Statistic
Explanation
reads
The total number of reads that have been requested for this partition.
read sectors
The total number of sectors read from this partition.
writes
The total number of writes that resulted in I/O for this partition.
requested writes
The total number of writes that have been requested for this partition.
The default vmstat output provides a coarse but useful indication of overall system
disk I/O. The options provided by vmstat enable you to reveal more details about which
device is responsible for the I/O. The primary advantage of vmstat over other I/O tools
is that it is present on almost every Linux distribution.
The number of I/O statistics that vmstat can present to the Linux user has been
growing with recent releases of vmstat. The examples shown in this section rely on
vmstat version 3.2.0 or greater. In addition, the extended disk statistics provided by
vmstat are only available on Linux systems with a kernel version greater than 2.5.70.
In the first example, shown in Listing 6.1, we are just invoking vmstat for three
samples with an interval of 1 second. vmstat outputs the system-wide performance
overview that we saw in Chapter 2.
Listing 6.1.
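An invocation along the following lines (interval of 1 second, count of 3) produces
this kind of output:

    vmstat 1 3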
Listing 6.1 shows that during one of the samples, the system read 24,448 disk blocks.
As mentioned previously, the block size for a disk is 1,024 bytes, so this means that
the system is reading in data at about 23MB per second. We can also see that during
this sample, the CPU was spending a significant portion of time waiting for I/O to
complete. The CPU waits on I/O 63 percent of the time during the sample in which the
disk was reading at ~23MB per second, and it waits on I/O 49 percent for the next
sample, in which the disk was reading at ~19MB per second.
Next, in Listing 6.2, we ask vmstat to provide information about the I/O subsystem's
performance since system boot.
Listing 6.2.
In Listing 6.2, vmstat provides I/O statistic totals for all the disk devices in the system.
As mentioned previously, when reading and writing to a disk, the Linux kernel tries to
merge requests for contiguous regions of the disk to improve performance.
Although the previous example displayed I/O statistics for the entire system, the
following example in Listing 6.3 shows the statistics broken down for each individual
disk.
Listing 6.3.
Listing 6.4 shows that 60 reads (19,059 - 18,999) and 94 writes
have been issued to partition hde3. This view can prove particularly useful if you are
trying to determine which partition of a disk is seeing the most usage.
Listing 6.4.
6.2.2. iostat
iostat is like vmstat, but it is a tool dedicated to the display of the disk I/O subsystem
statistics. iostat provides a per-device and per-partition breakdown of how many
blocks are written to and from a particular disk. (Blocks in iostat are usually sized at
512 bytes.) In addition, iostat can provide extensive information about how a disk is
being utilized, as well as how long Linux spends waiting to submit requests to the
disk.
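iostat is typically invoked with a command line of roughly the following form (square
brackets denote optional arguments; the exact syntax may vary by version):

    iostat [-d] [-k] [-x] [device] [interval [count]]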
Much like vmstat, iostat can display performance statistics at regular intervals.
Different options modify the statistics that iostat displays. These options are
described in Table 6-6.
Option
Explanation
-d
This displays only information about disk I/O rather than the default display, which
includes information about CPU usage as well.
-k
This displays statistics in kilobytes rather than in blocks.
-x
This displays extended (more detailed) disk statistics.
device
This specifies the device(s) to monitor; by default, iostat reports on all devices.
interval
The length of time (in seconds) between samples.
count
The total number of samples to take.
The default output of iostat displays the performance statistics described in Table
6-7.
Statistic
Explanation
tps
Transfers per second. This is the number of reads and writes to the drive/partition per
second.
Blk_read/s
The number of blocks read from the drive/partition per second.
Blk_wrtn/s
The number of blocks written to the drive/partition per second.
Blk_read
The total number of blocks read from the drive/partition.
Blk_wrtn
The total number of blocks written to the drive/partition.
When you invoke iostat with the -x parameter, it displays extended statistics about
the disk I/O subsystem. These extended statistics are described in Table 6-8.
Statistic
Explanation
rrqm/s
The number of reads merged before they were issued to the disk.
wrqm/s
The number of writes merged before they were issued to the disk.
r/s
The number of read requests issued to the disk per second.
w/s
The number of write requests issued to the disk per second.
rsec/s
The number of sectors read from the disk per second.
wsec/s
The number of sectors written to the disk per second.
rkB/s
The number of kilobytes read from the disk per second.
wkB/s
The number of kilobytes written to the disk per second.
avgrq-sz
The average size (in sectors) of the requests issued to the disk.
avgqu-sz
The average length of the disk's request queue.
await
The average time (in ms) for a request to be completely serviced. This average
includes the time that the request was waiting in the disk's queue plus the amount of
time it was serviced by the disk.
svctm
The average service time (in ms) for requests submitted to the disk. This indicates how
long, on average, the disk took to complete a request. Unlike await, it does not include
the amount of time spent waiting in the queue.
%util
The percentage of time during which I/O requests were being issued to the device (the
device's bandwidth utilization). A value near 100 percent means the disk is saturated.
Listing 6.5 shows an example iostat run while a disk benchmark is writing a test file
to the file system on the /dev/hda2 partition. The first sample iostat displays is the
total system average since system boot time. The second sample (and any that would
follow) is the statistics from each 1-second interval.
Listing 6.5.
One interesting note in the preceding example is that /dev/hda3 had a small amount
of activity. In the system being tested, /dev/hda3 is a swap partition. Any activity
recorded from this partition is caused by the kernel swapping memory to disk. In this
way, iostat provides an indirect method to determine how much disk I/O in the
system is the result of swapping.
Listing 6.6.
In Listing 6.6, you can see that the average queue size is pretty high (~237 to 538)
and, as a result, the amount of time that a request must wait (~422.44ms to
538.60ms) is much greater than the amount of time it takes to service the request
(7.63ms to 11.90ms). These high wait times, along with the fact that the
utilization is 100 percent, show that the disk is completely saturated.
The extended iostat output provides so many statistics that it only fits on a single line
in a very wide terminal. However, this information is nearly all that you need to
identify a particular disk as a bottleneck.
6.2.3. sar
As discussed in Chapter 2, "Performance Tools: System CPU," sar can collect the
performance statistics of many different areas of the Linux system. In addition to CPU
and memory statistics, it can collect information about the disk I/O subsystem.
When using sar to monitor disk I/O statistics, you can invoke it with the following
command line:
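A typical invocation looks roughly like this, with -d selecting the disk statistics:

    sar -d [interval [count]]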
Typically, sar displays information about the CPU usage in a system; to display disk
usage statistics instead, you must use the -d option. sar can only display disk I/O
statistics with a kernel version higher than 2.5.70. The statistics that it displays are
described in Table 6-9.
Statistic
Explanation
tps
Transfers per second. This is the number of reads and writes to the drive/partition per
second.
rd_sec/s
The number of disk sectors read per second.
wr_sec/s
The number of disk sectors written per second.
The sector size is taken directly from the kernel, and although it is possible for
it to vary, it is usually 512 bytes.
In Listing 6.7, sar is used to collect information about the I/O of the devices on the
system. sar lists the devices by their major and minor number rather than their
names.
Listing 6.7.
sar has a limited number of disk I/O statistics when compared to iostat. However,
the capability of sar to simultaneously record many different types of statistics may
make up for these shortcomings.
6.2.4. lsof
lsof provides a way to determine which processes have a particular file open. In
addition to tracking down the user of a single file, lsof can display the processes
using the files in a particular directory. It can also recursively search through an
entire directory tree and list the processes using files in that directory tree. lsof can
prove helpful when narrowing down which applications are generating I/O.
You can invoke lsof with the following command line to investigate which files
processes have open:
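A typical invocation for this purpose looks roughly like the following (square brackets
denote optional arguments):

    lsof [-r delay] [+D directory] [+d directory] [file]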
Typically, lsof displays which processes are using a given file. However, by using the
+d and +D options, it is possible for lsof to display this information for more than one
file. Table 6-10 describes the command-line options of lsof that prove helpful when
tracking down an I/O performance problem.
Option
Explanation
-r delay
This causes lsof to repeat its output every delay seconds until interrupted, rather
than printing the information once and exiting.
+D directory
This causes lsof to recursively search all the files in the given directory and report on
which processes are using them.
+d directory
This causes lsof to report on which processes are using the files in the given
directory.
lsof displays the statistics described in Table 6-11 when showing which processes are
using the specified files.
Statistic
Explanation
COMMAND
The name of the command that has the file open.
PID
The process ID of the process that has the file open.
USER
The user who owns the process.
FD
The file descriptor of the file, or txt for an executable, mem for a memory-mapped file.
TYPE
The type of the file (for example, REG for a regular file or DIR for a directory).
DEVICE
The device numbers of the device that contains the file.
SIZE
The size of the file (in bytes).
NODE
The inode number of the file.
Listing 6.8 shows lsof being run on the /usr/bin directory. This run shows which
processes are accessing all of the files in /usr/bin.
Listing 6.8.
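Given the options in Table 6-10, a plausible command for this run is:

    lsof +D /usr/bin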
Usually a system administrator has a good idea about what application uses the disk,
but not always. Many times, for example, I have been using my Linux system when the
disks started grinding for apparently no reason. I can usually run top and look for a
process that might be causing the problem. By eliminating processes that I believe are
not doing I/O, I can usually find the culprit. However, this requires knowledge of what
the various applications are supposed to do. It is also error prone, because the guess
about which processes are not causing the problem might be wrong. In addition, for a
system with many users or many running applications, it is not always practical or
easy to determine which application might be causing the problem. Other UNIXes
support the inblk and oublk parameters to ps, which show you the amount of disk I/O
issued on behalf of a particular process. Currently, the Linux kernel does not track the
I/O of a process, so the ps tool has no way to gather this information.
You can use lsof to determine which processes are accessing files on a particular
partition. After you list all PIDs accessing the files, you can then attach to each of the
PIDs with strace and figure out which one is doing a significant amount of I/O.
Although this method works, it is really a Band-Aid solution, because the number of
processes accessing a partition could be large and it is time-consuming to attach and
analyze the system calls of each process. This may also miss short-lived processes, and
may unacceptably slow down processes when they are being traced.
This is an area where the Linux kernel could be improved. The ability to quickly track
which processes are generating I/O would allow for much quicker diagnosis of I/O
performance-related problems.
The next chapter examines the tools that enable you to determine the cause of
network bottlenecks.
• Determine the speed and duplex settings of the Ethernet devices in the system
(mii-tool, ethtool).
• Determine the amount of network traffic flowing over each Ethernet interface
(ifconfig, sar, gkrellm, iptraf, netstat, etherape).
• Determine the types of IP traffic flowing in to and out of the system (gkrellm,
iptraf, netstat, etherape).
• Determine the amount of each type of IP traffic flowing in to and out of the
system (gkrellm, iptraf, etherape).
• Determine which applications are generating IP traffic (netstat).
Stacked above the link layer is a network layer. This layer uses the Internet Protocol
(IP) and Internet Control Message Protocol (ICMP) to address and route packets of
data from machine to machine. IP/ICMP make their best-effort attempt to pass the
packets between machines, but they make no guarantees about whether a packet
actually arrives at its destination.
Stacked above the network layer is the transport layer, which defines the Transmission
Control Protocol (TCP) and User Datagram Protocol (UDP). TCP is a reliable protocol
that guarantees that a message is either delivered over the network or generates an
error if the message is not delivered. TCP's sibling protocol, UDP, is an unreliable
protocol that deliberately (to achieve the highest data rates) does not guarantee
message delivery. UDP and TCP add the concept of a "service" to IP. UDP and TCP
receive messages on numbered "ports." By convention, each type of network service is
assigned a different number. For example, Hypertext Transfer Protocol (HTTP) is
typically port 80, Secure Shell (SSH) is typically port 22, and File Transfer Protocol
(FTP) is typically port 21. In a Linux system, the file /etc/services defines all the
ports and the types of service they provide.
The final layer is the application layer. It includes all the different applications that
use the layers below to transmit packets over the network. These include applications
such as Web servers, SSH clients, or even peer-to-peer (P2P) file-sharing clients such as
bittorrent.
The lowest three layers (link, network, and transport) are implemented or controlled
within the Linux kernel. The kernel provides statistics about how each layer is
performing, including information about the bandwidth usage and error count as data
flows through each of the layers. The tools covered in this chapter enable you to
extract and view those statistics.
At the lowest levels of the network stack, Linux can detect the rate at which data
traffic is flowing through the link layer. The link layer, which is typically Ethernet,
sends information into the network as a series of frames. Even though the layers
above may have pieces of information much larger than the frame size, the link layer
breaks everything up into frames to send them over the network. This maximum size
of data in a frame is known as the maximum transfer unit (MTU). You can use network
configuration tools such as ip or ifconfig to set the MTU. For Ethernet, the MTU is
typically 1,500 bytes.
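As a quick sketch (eth0 is an assumed device name), the MTU can be inspected and
changed with either tool:

    ip link show eth0
    ip link set eth0 mtu 1500
    ifconfig eth0 mtu 1500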
At the physical layer, frames flow over the physical network; the Linux kernel collects
a number of different statistics about the number and types of frames, including
error counts, dropped frames, and overruns.
Several of the Linux network performance tools can display the number of frames of
each type that have passed through each network device. These tools often require a
device name, so it is important to understand how Linux names network devices to
understand which name represents which device. Ethernet devices are named ethN,
where eth0 is the first device, eth1 is the second device, and so on. PPP devices are
named pppN in the same manner as Ethernet devices. The loopback device, which is
used to network with the local machine, is named lo.
For TCP or UDP traffic, Linux uses the socket/port abstraction to connect two
machines. When connecting to a remote machine, the local application uses a network
socket to open a port on a remote machine. As mentioned previously, most common
network services have an agreed-upon port number, so a given application will be able
to connect to the correct port on the remote machine. For example, port 80 is
commonly used for HTTP. When loading a Web page, browsers connect to port 80 on
remote machines. The Web server of the remote machine listens for connections on
port 80, and when a connection occurs, the Web server sets up the connection for
transfer of the Web page.
The Linux network performance tools can track the amount of data that flows over a
particular network port. Because port numbers are unique for each service, it is
possible to determine the amount of network traffic flowing to a particular service.
7.2.1. mii-tool
mii-tool requires root access to be used. It is invoked with the following command
line:
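The basic form of the command is roughly:

    mii-tool [-v] [device]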
mii-tool prints the Ethernet settings for the given device. If no devices are specified,
mii-tool displays information about all the available Ethernet devices. If the -v
option is used, mii-tool displays verbose statistics about the offered and negotiated
network capabilities.
Listing 7.1 shows the configuration of eth0 on the system. The first line tells us that
the Ethernet device is currently using a 100BASE-T full-duplex connection. The next
few lines describe the capabilities of the network card in the machine and the
capabilities that the card has detected of the network device on the other end of the
wire.
Listing 7.1.
mii-tool provides low-level information about how the physical level of the Ethernet
device is configured.
7.2.2. ethtool
ethtool requires root access to be used. It is invoked with the following command
line:
ethtool [device]
ethtool prints out configuration information about the given Ethernet device. If no
devices are provided, ethtool prints statistics for all the Ethernet devices in the
system. The options to change the current Ethernet settings are described in detail in
the ethtool man page.
Listing 7.2 shows the configuration of eth0 on the system. Although the device
supports many different speed and link settings, it is currently connected to a
full-duplex 1,000Mbps link.
Listing 7.2.
ethtool is simple to run, and it can quickly provide information about an improperly
configured network device.
7.2.3. ifconfig
The primary job of ifconfig is to set up and configure the network interfaces in a
Linux box. It also provides rudimentary performance statistics about all the network
devices in the system. ifconfig is available on almost every Linux machine that uses
networking.
ifconfig [device]
If no device is specified, ifconfig shows statistics about all the active network
devices. Table 7-1 describes the performance statistics that ifconfig provides.
Column
Explanation
RX packets
The number of packets that this device has received.
TX packets
The number of packets that this device has transmitted.
errors
The number of errors that occurred when transmitting or receiving.
dropped
The number of packets that were dropped when transmitting or receiving.
overruns
The number of times the network device ran out of buffer space when sending or
receiving packets.
frame
The number of low-level Ethernet frame errors.
carrier
The number of packets discarded because of link media failure (such as a faulty
cable).
Listing 7.3 shows the network performance statistics from all the devices in the
system. In this case, we have an Ethernet card (eth0) and the loopback (lo) device. In
this example, the Ethernet card has received ~790Mb of data and has transmitted
~319Mb.
Listing 7.3.
The statistics provided by ifconfig represent the cumulative amount since system
boot. If you bring down a network device and then bring it back up, the statistics do
not reset. If you run ifconfig at regular intervals, you can eyeball the rate of change
in the various statistics. You can automate this by using the watch command or a shell
script, both of which are described in the next chapter.
7.2.4. ip
Some of the network tools, such as ifconfig, are being phased out in favor of the new
command: ip. ip enables you to configure many different aspects of Linux networking,
but it can also display performance statistics about each network device.
When extracting performance statistics, you invoke ip with the following command
line:
ip -s [-s] link
If you call ip with these options, it prints statistics about all the network devices in the
system, including the loopback (lo) and simple Internet transition (sit0) device. The
sit0 device allows IPv6 packets to be encapsulated in IPv4 packets and exists to ease
the transition between IPv4 and IPv6. If the extra -s is provided to ip, it provides a
more detailed list of low-level Ethernet statistics. Table 7-2 describes some of the
performance statistics provided by ip.
Column
Explanation
bytes
The total number of bytes sent or received by the device.
packets
The total number of packets sent or received by the device.
errors
The number of errors that occurred when transmitting or receiving.
dropped
The number of packets that were not sent or received as a result of a lack of resources
on the network card.
overruns
The number of times the network did not have enough buffer space to send or receive
more packets.
mcast
The number of multicast packets that the device has received.
carrier
The number of packets discarded because of link media failure (such as a faulty
cable).
collsns
This is the number of collisions that the device experienced when transmitting. These
occur when two devices are trying to use the network at the exact same time.
ip is a very versatile tool for Linux network configuration, and although its main
function is the configuration of the network, you can use it to extract low-level device
statistics as well.
Listing 7.4 shows the network performance statistics from all the devices in the
system. In this case, we have an Ethernet card, the loopback device, and the sit0
tunnel device. In this example, the Ethernet card has received ~820Mb of data and
has transmitted ~799Mb.
Listing 7.4.
4460 67 0 0 0 0
4460 67 0 0 0 0
799273378 920999 0 0 0 0
820603574 930929 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
Much like ifconfig, ip provides system totals for statistics since the system has
booted. If you use watch (described in the next chapter), you can monitor how these
values change over time.
7.2.5. sar
As discussed in previous chapters, sar is one of the most versatile Linux performance
tools. It can monitor many different things, archive statistics, and even display
information in a format that is usable by other tools. sar does not always provide as
much detail as the area-specific performance tools, but it provides a good overview.
Network performance statistics are no different. sar provides information about the
link-level performance of the network, as do ip and ifconfig; however, it also
provides some rudimentary statistics about the number of sockets opened by the
transport layer.
sar collects many different types of performance statistics. Table 7-3 describes the
command-line options used by sar to display network performance statistics.
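The invocation takes roughly the following form, where the -n mode selects which
group of network statistics is reported:

    sar -n DEV | EDEV | SOCK | FULL [interval [count]]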
Option
Explanation
-n DEV
Shows statistics about the number of packets and bytes sent and received by each
device.
-n EDEV
Shows information about the transmit and receive errors for each device.
-n SOCK
Shows information about the total number of sockets (TCP, UDP, and RAW) in use.
-n FULL
Shows all the network statistics (the equivalent of DEV, EDEV, and SOCK combined).
interval
The length of time (in seconds) between samples.
count
The total number of samples to take.
The network performance statistics that sar provides are described in Table 7-4.
Statistic
Explanation
rxpck/s
The rate of packets received per second.
txpck/s
The rate of packets transmitted per second.
rxbyt/s
The rate of bytes received per second.
txbyt/s
The rate of bytes transmitted per second.
rxcmp/s
The rate of compressed packets received per second.
txcmp/s
The rate of compressed packets transmitted per second.
rxmcst/s
The rate of multicast packets received per second.
rxerr/s
The rate of bad packets received per second.
txerr/s
The rate of errors that occurred per second while transmitting.
coll/s
The rate of Ethernet collisions per second while transmitting.
rxdrop/s
The rate of received frames dropped due to Linux kernel buffer shortages.
txdrop/s
The rate of transmitted frames dropped due to Linux kernel buffer shortages.
txcarr/s
The rate of carrier errors per second while transmitting.
rxfram/s
The rate of frame-alignment errors per second on received packets.
rxfifo/s
The rate of FIFO overrun errors per second on received packets.
txfifo/s
The rate of FIFO overrun errors per second on transmitted packets.
totsck
The total number of sockets currently in use.
tcpsck
The number of TCP sockets currently in use.
udpsck
The number of UDP sockets currently in use.
rawsck
The number of RAW sockets currently in use.
ip-frag
The number of IP fragments currently in use.
Considering all the statistics that sar can gather, it really does provide the broadest
range of system-level performance statistics in a single location.
In Listing 7.5, we examine the transmit and receive statistics of all the network
devices in the system. As you can see, the eth0 device is the most active. In the first
sample, eth0 is receiving ~63,000 bytes per second (rxbyt/s) and transmitting
~45,000 bytes per second (txbyt/s). No compressed packets are sent (txcmp) or
received (rxcmp). (Compressed packets are usually present during SLIP or PPP
connections.)
Listing 7.5.
In Listing 7.6, we examine the number of open sockets in the system. We can see the
total number of open sockets and the TCP, RAW, and UDP sockets. sar also displays
the number of fragmented IP packets.
sar provides a good overview of the system's performance. However, when we are
investigating a performance problem, we really want to understand what processes or
services are consuming a particular resource. sar does not provide this level of detail,
but it does enable us to observe the overall system network I/O statistics.
7.2.6. gkrellm
gkrellm is a graphical monitor that enables you to keep an eye on many different
system performance statistics. It draws charts of different performance statistics,
including CPU usage, disk I/O, and network usage. It can be "themed" to change its
appearance, and even accepts plug-ins to monitor events not included in the default
release.
gkrellm provides similar information to sar, ip, and ifconfig, but unlike the other
tools, it provides a graphical view of the data. In addition, it can provide information
about the traffic flowing through particular UDP and TCP ports. This is the first tool
that we have seen that can show which services are consuming different amounts of
network bandwidth.
gkrellm
None of gkrellm's command-line options configure the statistics that it monitors. You
do all configurations graphically after gkrellm is started. To bring up the
configuration screen, you can either right-click the gkrellm's title bar and select
Configuration, or just press F1 when your cursor is in any area of the window. This
brings up a configuration window (see Figure 7-1).
Figure 7-1.
Figure 7-2 shows the network configuration window. It is used to configure which
statistics and which devices are shown in the final gkrellm output window.
You can configure gkrellm to monitor the activity on a particular range of TCP ports.
Doing so enables you to monitor the exact ports used by services such as HTTP or FTP
and to measure the amount of bandwidth that they are using. In Figure 7-2, we have
configured gkrellm to monitor the ports used by the bittorrent (BT) P2P application
and the Web server (HTTP).
As stated previously, gkrellm can monitor many different types of events. In Figure
7-3, we pruned the output so that only the statistics relevant to network traffic and
usage are displayed.
Figure 7-3.
As you can see in Figure 7-3, the top two graphs are the bandwidth used for the ports
(BT and HTTP) that we set up in the configuration section, and the bottom two
graphs are the statistics for each of the network devices (eth0 and lo). There is a
small amount of bittorrent (BT) traffic, but no Web server traffic (HTTP). The Ethernet
device eth0 had some large activity in the past, but is settling down now. The lighter
shade in the eth0 graph indicates the number of bytes received, and the darker shade
indicates the number of bytes transmitted.
gkrellm is a powerful graphical tool that makes it easy to diagnose the status of the
system at a glance.
7.2.7. iptraf
Like the other tools mentioned previously in this chapter, iptraf can provide information
about the rate at which each network device is sending frames. However, it can also
display information about the type and size of the TCP/IP packet and about which
ports are being used for network traffic.
If iptraf is called with no parameters, it brings up a menu that enables you to select
the interface to monitor and type of information that you want to monitor. Table 7-5
describes the command-line parameters that enable you to see the amount of network
traffic on a particular interface or network service.
Option
Explanation
-d interface
Detailed statistics for an interface including receive, transmit, and error rates
-s interface
Statistics about which IP ports are being used on an interface and how many bytes are
flowing through them
-t <minutes>
The number of minutes that iptraf runs before exiting.
iptraf has many more modes and configuration options. Read its included
documentation for more information.
iptraf creates a display similar to Figure 7-4 when it is invoked with the following
command:
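Judging from the options in Table 7-5, the command would have been along these lines:

    iptraf -d eth0 -t 1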
Figure 7-4.
This command specifies that iptraf should display detailed statistics about Ethernet
device eth0 and exit after it has run for 1 minute. In this case, we can see that
186.8kbps are received and 175.5kbps are transmitted by the eth0 network device.
The next command, whose results are shown in Figure 7-5, asks iptraf to show
information about the amount of network traffic from each UDP or TCP port. iptraf
was invoked with the following command:
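Again judging from the options in Table 7-5, the invocation would have been roughly:

    iptraf -s eth0 -t 1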
Figure 7-5.
Because the TCP or UDP ports of well-known services are fixed, you can use this to
determine how much traffic each service is handling. Figure 7-5 shows that 29kb of
HTTP data has been sent from eth0 and 25kb has been received.
7.2.8. netstat
Option
Explanation
-p
Displays the PID/program name responsible for opening each of the displayed sockets
-c
Continually updates the displayed information every second
--interfaces=<name>
Displays statistics about the given network interface
--statistics|-s
Displays summary IP/UDP/ICMP/TCP statistics
--tcp|-t
Shows only TCP sockets
--udp|-u
Shows only UDP sockets
--raw|-w
Shows only RAW sockets
netstat also accepts some command-line options not described here. See the
netstat man page for more details.
Listing 7.7 asks netstat to show the active TCP connections and to continually update
this information. Every second, netstat displays new TCP network statistics. netstat
does not enable you to set the length of time that it will monitor, so it will only stop if
it is killed or interrupted (Ctrl-C).
Listing 7.7.
Listing 7.8 asks netstat to once again print the TCP socket information, but this time,
we also ask it to display the program that is responsible for this socket. In this case,
we can see that SSH and mozilla-bin are the applications that are initiating the TCP
connections.
Listing 7.8.
Listing 7.9 asks netstat to provide statistics about the UDP traffic that the system has
received since boot.
Listing 7.9.
Listing 7.10 asks netstat to provide information about the amount of network traffic
flowing through the eth0 interface.
Listing 7.10.
netstat provides a great number of network performance statistics about sockets and
interfaces in a running Linux system. It is the only network-performance tool that
maps the sockets used back to the PID of the process that is using it, and is therefore
very useful.
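For reference, the invocations behind Listings 7.7 through 7.10, reconstructed from the
options in Table 7-6 (the interface name eth0 is an assumption), would be roughly:

    netstat -t -c                 (Listing 7.7: active TCP connections, updated continually)
    netstat -t -p                 (Listing 7.8: TCP sockets plus the owning program)
    netstat -s -u                 (Listing 7.9: UDP statistics since boot)
    netstat --interfaces=eth0     (Listing 7.10: traffic through the eth0 interface)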
7.2.9. etherape
etherape is a little rough around the edges (in interface and documentation), but it
provides a unique visual insight into how the network is connected, what types of
services are being requested, and which nodes are requesting them. It creates a graph
whose nodes represent the systems on the network. The nodes that are
communicating have lines connecting them that increase in size as more network
traffic flows between them. As a particular system's network usage increases, the size
of the circle representing that system also increases. The lines connecting the
different systems are colored differently depending on the protocols they are using to
communicate with each other.
etherape uses the libpcap library to capture the network packets and, as a result, it
must be run as root. etherape is invoked using the following command line:
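The basic invocation looks roughly like this (the interface option is one of those
described below):

    etherape [-n] [-i interface]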
Table 7-7 describes some of the command-line options that change the interface that
etherape monitors and whether resolved host names are printed on each node.
Option
Explanation
-n, --numeric
Shows only the IP number of the hosts rather than the resolved names
-i, --interface <name>
Listens for traffic on the given interface rather than the default
All in all, etherape's documentation is rather sparse. The etherape man page
describes a few more command lines that change its appearance and behavior, but the
best way to learn it is to try it. In general, etherape is a great way to visualize the
network.
Figure 7-6 shows etherape monitoring a relatively simple network. If we match up the
color of the protocol to the color of the biggest circle, we see that this node is
generating a high amount of SSH traffic. From the figure, it can be difficult to
determine which node is causing this SSH traffic. Although not pictured, if we
double-click the big circle, etherape creates a window with statistics pertaining to the
node responsible for the traffic. We can use this to investigate each of the generators
of network traffic and investigate their node names.
Figure 7-6.
The next chapter describes some of the common Linux tools that make using
performance tools easier. They are not performance tools themselves, but they make
using the performance tools more palatable. They can also help to visualize and
analyze the results of the tools, as well as automate some of the more repetitive tasks.
• Automate the display and collection of periodic performance data (bash, watch).
• Record all commands and output displayed during a performance investigation
(tee, script).
• Import, analyze, and graph performance data (gnumeric).
• Determine the libraries that an application is using (ldd).
• Determine which functions are part of which libraries (objdump).
• Investigate runtime characteristics of an application (gdb).
• Create performance tool/debugging-friendly applications (gcc).
In addition to the tools for recording and automation, Linux provides powerful analysis
tools that can help you understand the implications of performance statistics. Whereas
most performance tools generate performance statistics as text output, it is not always
easy to see patterns and trends over time. Linux provides the powerful gnumeric
spreadsheet, which can import, analyze, and graph performance data. When you
graph the data, the cause of a performance problem may become apparent, or it may
at least open up new areas of investigation.
Linux also provides tools that enable you to determine which libraries an application
relies on, as well as tools that display all the functions that a given library provides.
The ldd command provides the list of all the shared libraries that a particular
application is using. This can prove helpful if you are trying to track the number and
location of the libraries that an application uses. Linux also provides the objdump
command, which enables you to search through a given library or application to
display all the functions that it provides. By combining the ldd and objdump
commands, you can take the output of ltrace, which only provides the names of the
library functions that an application calls, and determine which shared library each of
those functions resides in.
Finally, Linux also provides tools that enable you to create performance-tool-friendly
applications, in addition to tools that enable you to interactively debug and investigate
the attributes of running applications. The GNU compiler collection (gcc) can insert
debugging information into applications, which aids oprofile in finding the exact
source file and line of a specific performance problem. In addition, the GNU debugger
(gdb) can also be used to find information about running applications not available by
default to various performance tools.
8.2. Tools
Used together, the following tools can greatly enhance the effectiveness and ease of
use of the performance tools described in previous chapters.
8.2.1. bash
bash is the default Linux command-line shell, and you most likely use it every time you
interact with the Linux command line. bash has a powerful scripting language that is
typically used to create shell scripts. However, the scripting language can also be
called from the command line and enables you to easily automate some of the more
tedious tasks during a performance investigation.
bash provides a series of commands that can be used together to periodically run a
particular command. Most Linux users have bash as their default shell, so just logging
in to a machine or opening a terminal brings up a bash prompt. If you are not using
bash, you can invoke it by typing bash.
After you have a bash command prompt, you can enter a series of bash scripting
commands to automate the continuous execution of a particular command. This
feature proves most useful when you need to periodically extract performance
statistics using a particular command. These scripting options are described in Table
8-1.
Option
Explanation
while condition
do
done
Repeatedly executes the commands placed between do and done for as long as
condition evaluates to true.
Although some performance tools, such as vmstat and sar, periodically display
updated performance statistics, other commands, such as ps and ifconfig, do not.
bash can call commands such as ps and ifconfig to periodically display their
statistics. For example, in Listing 8.1, we ask bash to do something in a while loop
based on the condition true. Because the true command is always true, the while
loop will never exit. Next, the commands that will be executed during each iteration
start after the do command. These commands ask bash to sleep for one second and
then run ifconfig to extract performance information about the eth0 controller.
However, because we are only interested in the received packets, we grep the output of
ifconfig for the string "RX packets". Finally, we issue the done command to tell
bash we are done with the loop. Because the true command always returns true, this
entire loop will run forever unless we interrupt it with a <Ctrl-C>.
Listing 8.1.
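A sketch of the loop just described (the exact listing may differ slightly):

    while true; do sleep 1; ifconfig eth0 | grep "RX packets"; done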
With the bash script in Listing 8.1, you see network performance statistics updated
every second. The same loop can be used to monitor other events by changing the
ifconfig command to some other command, and the amount of time between updates
can also be varied by changing the amount of sleep. This simple loop is easy to type
directly into the command line and enables you to automate the display of any
performance statistics that interest you.
8.2.2. tee
tee is a simple command that enables you to simultaneously save the standard output
of a command to a file and display it. tee also proves useful when you want to save a
performance tool's output and view it at the same time, such as when you are
monitoring the performance statistics of a live system, but also storing them for later
analysis.
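tee is used by piping another command's output into it, roughly as follows:

    <command> | tee [-a] <file>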
tee takes the output provided by <command> and saves it to the specified file, but also
prints it to standard output. If the -a option is specified, tee appends the output to the
file instead of overwriting it.
Listing 8.2 shows tee being used to record the output of vmstat. As you can see, tee
displays the output that vmstat has generated, but it also saves it in the file
/tmp/vmstat_out. Saving the output of vmstat enables us to analyze or graph the
performance data at a later date.
Listing 8.2.
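A plausible form of the command described here (the sampling interval is an
assumption) is:

    vmstat 1 | tee /tmp/vmstat_out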
tee is a simple command, but it is powerful because it makes it easy to record the
output of a given performance tool.
8.2.3. script
The script command is used to save all the input and output generated during a shell
session into a text file. This text file can be used later to both replay the executed
commands and review the results. When investigating a performance problem, it is
useful to have a record of the exact command lines executed so that you can later
review the exact tests you performed. Having a record of the executed commands
means that you also can easily cut and paste the command lines when investigating a
different problem. In addition, it is useful to have a record of the performance results
so that you can review them later when looking for new insights.
script is a relatively simple command. When run, it just starts a new shell and
records all the keystrokes and input and the output generated during the life of the
shell into a text file. script is invoked with the following command line:
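The command takes roughly this form:

    script [-a] [-t] [file]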
By default, script places all the output into a file called typescript unless you
specify a different one. Table 8-2 describes some of the command-line options of
script.
Option
Explanation
-a
Appends the recorded output to the given file instead of overwriting it.
-t
Adds timing information about the amount of time between each output/input. This
prints the number of characters displayed and the amount of time elapsed between
the display of each group of characters.
file
The name of the output file; if none is given, script records into a file named
typescript.
One word of warning: script literally captures every type of output that was sent to
the screen. If you have colored or bold output, this shows up as escape characters within
the output file. These characters can significantly clutter the output and are not
usually useful. If you set the TERM environment variable to dumb (using setenv TERM
dumb for csh-based shells and export TERM=dumb for sh-based shells), applications will
not output the escape characters. This provides a more readable output.
In addition, the timing information provided by script clutters the output. Although it
can be useful to have automatically generated timing information, it may be easier to
not use script's timing, and instead just time the important commands with the time
command mentioned in the previous chapter.
As stated previously, we will have more readable script output if we set the terminal
to dumb. We can do that with the following command:
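For a bash (sh-based) shell, that command is:

    export TERM=dumb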
Next, we actually start the script command. Listing 8.3 shows script being started
with an output file of ps_output. script continues to record the session until you exit
the shell with the exit command or a <Ctrl-D>.
Listing 8.3.
Next, in Listing 8.4, we look at the output recorded by script. As you can see, it
contains all the commands and output that we generated.
Listing 8.4.
script is a great command to accurately record all interaction during a session. The
files that script generates are tiny compared to the size of modern hard drives.
Recording a performance investigation session and saving it for later review is always
a good idea. At worst, it is a small amount of wasted effort and disk space to record
the session. At best, the saved sessions can be looked at later and do not require you
to rerun the commands recorded in that session.
8.2.4. watch
By default, the watch command runs a command every two seconds and displays its
output on the screen. watch is useful when working with performance tools that do not
periodically display updated results. For example, some tools, such as ifconfig and
ps, display the current performance statistics and then exit. Because watch
periodically runs these commands and displays their output, it is possible to see by
glancing at the screen which statistics are changing and how fast they are changing.
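watch is invoked roughly as follows:

    watch [-d[=cumulative]] [-n sec] <command>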
If called with no parameters, watch just displays the output of the given command
every two seconds until you interrupt it. In the default output, it can often be difficult to see
what has changed from screen to screen, so watch provides options that highlight the
differences between each output. This can make it easier to spot the differences in
output between each sample. Table 8-3 describes the command-line options that watch
accepts.
Option
Explanation
-d[=cumulative]
This option highlights the output that has changed between each sample. If the
cumulative option is used, an area is highlighted if it has ever changed, not just if it
has changed between samples.
-n sec
This causes watch to rerun the command and update the output every sec seconds
rather than at the default interval.
watch is a great tool to see how a performance statistic changes over time. It is not a
complicated tool, but does its job well. It really fills a void when using performance
tools that cannot periodically display updated output. When using these tools, you can
run watch in a window and glance at it periodically to see how the statistic changes.
The first example, in Listing 8.5, shows watch being run with the ps command. We are
asking ps to show us the number of minor faults that each process is generating.
watch clears the screen and updates this information every second. Note that it
may be necessary to enclose the command that you want to run in quotation marks so
that watch does not confuse the options of the command that you are trying to execute
with its own options.
Listing 8.5.
MINFLT CMD
1467 bash
41 watch -n 1 ps -o minflt,cmd
66 ps -o minflt,cmd
watch is a tool whose basic function could easily be written as a simple shell script.
However, watch is easier than using a shell script because it is almost always available
and just works. Remember that performance tools such as ifconfig or ps display
statistics only once, whereas watch makes it easier to follow (with only a glance) how
the statistics change.
8.2.5. gnumeric
When investigating a performance problem, the performance tools often generate vast
amounts of performance statistics. It can sometimes be problematic to sort through
this data and find the trends and patterns that demonstrate how the system is
behaving. Spreadsheets in general, and gnumeric in particular, provide three features
that make this task easier. First, gnumeric provides built-in functions, such as
max, min, average, and standard deviation, which enable you to numerically analyze
the performance data. Second, gnumeric provides a flexible way to import the tabular
text data commonly output by many performance tools. Finally, gnumeric provides a
powerful graphing utility that can visualize the performance data generated by the
performance tools. This can prove invaluable when searching for data trends over long
periods of time. It is also especially useful when looking for correlations between
different types of data (such as the correlation between disk I/O and CPU usage). It is
often hard to see patterns in text output, but in graphical form, the system's behavior
can be much clearer. Other spreadsheets, such as OpenOffice's oocalc, could also be
used, but gnumeric's powerful text importer and graphing tools make it the easiest to
use.
gnumeric can generate many different types of graphs and has many different
functions to analyze data. The best way to see gnumeric's power and flexibility is to
load some data and experiment with it.
Listing 8.6.
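A plausible sequence behind this step (the vmstat interval and count are assumptions;
the output file name matches the import step described below) is to capture vmstat
output to a file and then launch gnumeric:

    vmstat 1 100 > vmstat_output
    gnumeric &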
This opens a blank spreadsheet where we can import the vmstat data.
Selecting File > Open in gnumeric brings up a dialog (not shown) that enables you to
select both the file to open and the type of file. We select Text Import (Configurable)
for file type, and we are guided through a series of screens to select which columns of
the vmstat_output file map to which columns of the spreadsheet. For vmstat, it is
useful to start importing at the second line of text, because the second line contains
the names and sizing appropriate for each column. It is also useful to select
Fixed-Width for importing the data because that is how vmstat outputs its data. After
successfully importing the data, we see the spreadsheet in Figure 8-1.
Next, we graph the data that we have imported. In Figure 8-2, we create a stacked
graph of the different CPU usages (us, sys, id, wa). Because these statistics should
always total 100 percent (or close to it), we can see which state dominates at each
time. In this case, the system is idle most of the time, but it has a large amount of wait
time in the first quarter of the graph.
Figure 8-2.
Graphs can be a powerful way to see how the performance statistics of a single run of
a test change over time. It can also prove useful to see how different runs compare to
each other. When graphing data from different runs, be sure to use the same scale for
each of the graphs. This allows you to compare and contrast the data more easily.
gnumeric is a lightweight application that enables you to quickly and easily import and
graph/analyze vast amounts of performance data. It is a great tool to play around with
performance data to see whether any interesting characteristics appear.
8.2.6. ldd
ldd can be used to display which libraries a particular binary relies on. ldd helps track
down the location of a library function that an application may be using. By figuring
out all the libraries that an application is using, it is possible to search through each of
them for the library that contains a given function.
ldd <binary>
ldd then displays a list of all the libraries that this binary requires and which files in
the system are fulfilling those requirements.
Listing 8.7 shows ldd being used on the ls binary. In this particular case, we can see
that ls relies on the following libraries: linux-gate.so.1, librt.so.1, libacl.so.1,
libselinux.so.1, libc.so.6, libpthread.so.0, ld-linux.so.2, and libattr.so.1.
ldd is a relatively simple tool, but it can be invaluable when trying to track down
exactly which libraries an application is using and where they are located on the
system.
8.2.7. objdump
objdump is a complicated and powerful tool for analyzing various aspects of binaries or
libraries. Although it has many other capabilities, it can be used to determine which
functions a given library provides.
objdump -T <binary>
When objdump is invoked with the -T option, it displays all the symbols that this
library/binary either relies on or provides. These symbols can be data structures or
functions. Every line of the objdump output that contains .text is a function that this
binary provides.
Listing 8.8 shows objdump used to analyze the gtk library. Because we are only
interested in the symbols that libgtk.so provides, we use fgrep to prune the output
to only those lines that contain .text. In this case, we can see that some of the
functions that libgtk.so provides are gtk_arg_values_equal,
gtk_tooltips_set_colors, and gtk_viewport_set_hadjustment.
....
When using performance tools (such as ltrace), which display the library functions an
application calls (but not the libraries themselves), objdump helps locate the shared
library each function is present in.
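A rough sketch of that workflow follows; the binary name myapp and the function name some_function are placeholders rather than names from the text:

# search every library that myapp loads for some_function
for lib in $(ldd ./myapp | awk '/=>/ { print $3 }')
do
    echo "== $lib"
    objdump -T "$lib" | fgrep .text | fgrep some_function
done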
8.2.8. gdb
gdb is a powerful application debugger that can help investigate many different
aspects of a running application. gdb has three features that make it a valuable tool
when diagnosing performance problems. First, gdb can attach to a currently running
process. Second, gdb can display a backtrace for that process, which shows the
current source line and the call tree. Attaching to a process and extracting a
backtrace can be a quick way to find some of the more obvious performance problems.
However, if the application is not stuck in a single location, it may be hard to diagnose
the problem using gdb, and a system-wide profiler, such as oprofile, is a much better
choice. Finally, gdb can map a virtual address back to a particular function. gdb may
do a better job of figuring out the location of the virtual address than performance
tools. For example, if oprofile gives information about where events occur in relation
to a virtual address rather than a function name, gdb can be used to figure out the
function for that address.
gdb is invoked with the following command line, in which pid is the process that gdb
will attach to:
gdb -p pid
After gdb has attached to the process, it enters an interactive mode in which you can
examine the current execution location and runtime variables for the given process.
Table 8-4 describes one of the commands that you can use to examine the running
process.
Option
Explanation
bt
Displays a backtrace of the process that gdb is attached to, showing the chain of function calls that led to the current point of execution.
gdb has many more command-line options and runtime controls that are more
appropriate for debugging rather than a performance investigation. See the gdb man
page or type help at the gdb prompt for more information.
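A minimal interactive session might look like the following (4321 is a placeholder pid); detach releases the process so that it continues running normally:

gdb -p 4321
(gdb) bt
(gdb) detach
(gdb) quit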
Listing 8.9.
void a(void)
{
    while(1);
}

int main(void)
{
    a();
    return 0;
}
Listing 8.10 launches the application and attaches to its pid with gdb. We ask gdb to
generate a backtrace, which shows us exactly what code is currently executing and
what set of function calls leads to the current location. As expected, gdb shows us that
we were executing the infinite loop in a(), and that this was called from main().
Listing 8.10.
Finally, in Listing 8.11, we ask gdb to show us where the virtual address 0x0804832F
is located, and gdb shows that that address is part of the function main.
Listing 8.11.
(gdb) x 0x0804832f
0x804832f <main+21>: 0x9090c3c9
8.2.9. gcc
gcc is the most popular compiler used by Linux systems. Like all compilers, gcc takes
source code (such as C, C++, or Objective-C) and generates binaries. It provides many
options to optimize the resultant binary, as well as options that make it easier to track
the performance of an application. The details of gcc's performance optimization
options are not covered in this book, but you should investigate them when trying to
increase an application's performance. gcc provides performance optimization options
that enable you to tune the performance of compiled binaries using architecture
generic optimizations (using -O1, -O2, -O3), architecture-specific optimizations
(-march and -mcpu), and feedback-directed optimization (using -fprofile-arcs and
-fbranch-probabilities). More details on each of the optimization options are
provided in the gcc man page.
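As a sketch of how these options fit together (the source file name and the CPU type are assumptions, not taken from the text), a feedback-directed build compiles twice with a training run in between:

# instrumented build, training run, then rebuild using the recorded profile
gcc -O2 -march=pentium4 -fprofile-arcs -o myapp myapp.c
./myapp
gcc -O2 -march=pentium4 -fbranch-probabilities -o myapp myapp.c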
gcc has an enormous number of options that influence how it compiles an application.
If you feel brave, take a look at them in the gcc man page. The particular options that
can help during a performance investigation are shown in Table 8-5.
Option
Explanation
-g[1 | 2 | 3]
The -g option adds debugging information to the binary with a default level of 2. If a
level is specified, gcc adjusts the amount of debugging information stored in the
binary. Level 1 provides only enough information to generate backtraces, but no
information on the source-line mappings of particular lines of code. Level 3 provides
more information than level 2, such as the macro definitions present in the source.
-pg
The -pg option compiles the application with profiling instrumentation so that gprof can later be used to record and analyze where the application spends its time.
Probably the best way to understand the type of debugging information that gcc can
provide is to see a simple example. In Listing 8.12, we have the source for the C
application, deep.c, which just calls a series of functions and then prints out the string
"hi" a number of times depending on what number was passed in. The application's
main function calls function a(), which calls function b() and then prints out "hi".
int main()
{
a(10);
}
First, as shown in Listing 8.13, we compile the application without any debugging
information. We then start the application in the debugger and add a breakpoint to the
b() function. When we run the application, it stops at function b(), and we ask for a
backtrace. gdb can figure out the backtrace, but it does not know what values were
passed between functions or where the function exists in the original source file.
Listing 8.13.
In Listing 8.14, we compile the same application with debugging information turned
on. Now when we run gdb and generate a backtrace, we can see which values were
passed to each function call and the exact line of source where a particular line of
code resides.
Debugging information can significantly add to the size of the final executable that
gcc generates. However, the information that it provides is invaluable when tracking a
performance problem.
In the upcoming chapters, we put together all the tools presented so far and solve
some real-life performance problems.
• Start with a misbehaving system and use the Linux performance tools to track
down the misbehaving kernel functions or applications.
• Start with a misbehaving application and use the Linux performance tools to
track down the misbehaving functions or source lines.
• Track down excess usage of the CPU, memory, disk I/O, and network.
For example, it may be necessary (or even cheaper) to just upgrade the amount of
system memory rather than track down which applications are using system memory,
and then tune them so that they reduce their usage. The decision to just upgrade the
system hardware rather than track down and tune a particular performance problem
depends on the problem and is a value judgment of the individual investigating it. It
really depends on which option is cheaper, either time-wise (to investigate the
problem) or money-wise (to buy new hardware). Ultimately, in some situations, tuning
will be the preferred or only option, so that is what this chapter describes.
As stated in previous chapters, it is a good idea to save the results of each test that
you perform. This enables you to review the results later and even to send the results
to someone else if the investigation is inconclusive.
When investigating a problem, it is best to start with a system that has as little
unrelated programs running as possible, so close or kill any unneeded applications or
processes. A clean system helps eliminate the potentially confusing interference
caused by any extraneous applications.
If you have a specific application or program that is not performing as it should, jump
to Section 9.3. If no particular application is sluggish and, instead, the entire Linux
system is not performing as it should, jump to Section 9.4.
Figure 9-1 shows the steps that we will take to optimize the application.
Figure 9-1.
Use top or ps to determine how much memory the application is using. If the
application is consuming more memory than it should, go to Section 9.6.6; otherwise,
continue to Section 9.3.2.
If the amount of time that the application takes to start up is a problem, go to Section
9.3.3; otherwise, go to Section 9.3.4.
To test whether the loader is a problem, set the ld environmental variables described
in the previous chapters. If the ld statistics show a significant delay when mapping all
the symbols, try to reduce the number and size of libraries that the application is
using, or try to prelink the binaries.
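One way to gather these loader statistics (the binary name is a placeholder) is the glibc dynamic loader's debugging variable, which prints relocation and startup-time statistics:

LD_DEBUG=statistics ./myapp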
If the loader does appear to be the problem, go to Section 9.9. If it does not, continue
on to Section 9.3.4.
Use top or ps to determine the amount of CPU that the application uses. If the
application is a heavy CPU user, or takes a particularly long time to complete, the
application has a CPU usage problem.
Quite often, different parts of an application perform differently. It may be necessary to
isolate the poorly performing parts so that the performance tools measure their statistics
without also measuring the parts that do not have a negative performance impact. To
facilitate this, it may be necessary to change an application's behavior to make it easier
to profile. If a particular part of the application is performance-critical, either measure
the performance statistics only while that critical part is executing, or make the
performance-critical part run for so long that the statistics from the uninteresting parts
of the application become an insignificant fraction of the total. Try to minimize the work
that the application is doing so that it mainly executes the performance-critical functions.
For example, when collecting performance statistics from the entire run of an application,
we would not want the startup and exit procedures to account for a significant portion of
the total runtime. In that case, it would be useful to start the application, run the
time-consuming part many times, and then exit.
If the application's CPU usage is a problem, skip to Section 9.5. If it is not a problem,
go to Section 9.3.5.
Otherwise, you have encountered an application performance issue that is not covered
in this book. Go to Section 9.9.
Because we are investigating a system-wide problem, the cause can be anywhere from
user applications to system libraries to the Linux kernel. Fortunately, with Linux,
unlike many other operating systems, you can get the source for most if not all
applications on the system. If necessary, you can fix the problem and submit the fix to
the maintainers of that particular piece. In the worst case, you can run a fixed version
locally. This is the power of open-source software.
Figure 9-2.
Use top, procinfo, or mpstat and determine where the system is spending its time. If
the entire system is spending less than 5 percent of the total time in idle and wait
modes, your system is CPU-bound. Proceed to Section 9.4.3. Otherwise, proceed to
Section 9.4.2.
Use top or mpstat to determine whether an individual CPU has less than 5 percent in
idle and wait modes. If it does, one or more CPUs are CPU-bound; in this case, go to
Section 9.4.4.
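A per-CPU breakdown can be collected with a command such as the following (five one-second samples is an arbitrary choice):

mpstat -P ALL 1 5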
9.4.3. Are One or More Processes Using Most of the System CPU?
The next step is to figure out whether any particular application or group of
applications is using the CPU. The easiest way to do this is to run top. By default, top
sorts the processes that use the CPU in descending order. If no particular process
dominates the CPU usage, the kernel itself may be responsible; go to Section 9.4.5.
Otherwise, go to Section 9.5.1 once for each process to determine where it is spending
its time.
The next step is to figure out whether any particular application or group of
applications is using the individual CPUs. The easiest way to do this is to run top. By
default, top sorts the processes that use the CPU in descending order. When reporting
CPU usage for a process, top shows the combined user and system time that the
application uses. For example, if an application spends 20 percent of the CPU in user
space code, and 30 percent of the CPU in system code, top will report that the
application has consumed 50 percent of the CPU.
First, run top, and then add the last-used CPU field to the fields that top displays. Turn on Irix
mode so that top shows the amount of CPU time used per processor rather than the
total system. For each processor that has a high utilization, sum up the CPU time of
the application or applications running on it. If the sum of the application time is less
than 75 percent of the sum of the kernel plus user time for that CPU, it appears as if the
kernel is spending a significant amount of time on something other than the
applications; in this case, go to Section 9.4.5. Otherwise, the applications are likely to
be the cause of the CPU usage; for each application, go to Section 9.5.1.
It appears as if the kernel is spending a lot of time doing work not on behalf of an
application. One explanation for this is an I/O card that is raising many interrupts,
such as a busy network card. Run procinfo or cat /proc/interrupts to determine
how many interrupts are being fired, how often they are being fired, and which
devices are causing them. This may provide a hint as to what the system is doing.
Record this information and proceed to Section 9.4.6.
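A convenient way to watch the interrupt counts change (the one-second interval is arbitrary; -d highlights the values that changed between updates) is:

watch -d -n 1 cat /proc/interrupts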
Finally, we will find out exactly what the kernel is doing. Run oprofile on the system
and record which kernel functions consume a significant amount of time (greater than
10 percent of the total time). Try reading the kernel source for those functions or
searching the Web for references to those functions. It might not be immediately clear
what exactly those functions do, but try to figure out what kernel subsystem the
functions are in. Just determining which subsystem is being used (such as memory,
network, scheduling, or disk) might be enough to determine what is going wrong.
It also might be possible to figure out why these functions are called based on what
they are doing. If the functions are device specific, try to figure out why the particular
device is being used (especially if it also has a high number of interrupts). E-mail
others who may have seen similar problems, and possibly contact kernel developers.
Go to Section 9.9.
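A sketch of such a system-wide kernel profile with oprofile follows; the vmlinux path is an assumption and must point at the uncompressed image that matches the running kernel:

opcontrol --vmlinux=/boot/vmlinux-`uname -r`
opcontrol --start
# ... reproduce the slowdown ...
opcontrol --stop
opreport --symbols | head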
The next step is to check whether the amount of swap space being used is increasing.
Many of the system-wide performance tools such as top, vmstat, procinfo, and
gnome-system-info provide this information. If the amount of swap is increasing, you
need to figure out what part of the system is using more memory. To do this, go to
Section 9.6.1.
While running top, check to see whether the system is spending a high percentage of
time in the wait state. If this is greater than 50 percent, the system is spending a large
amount of time waiting for I/O, and we have to determine what type of I/O this is. Go
to Section 9.4.9.
If the system is not spending a large amount of time waiting for I/O, you have reached
a problem not covered in this book. Go to Section 9.9.
Next, run vmstat (or iostat) and see how many blocks are being written to and from
the disk. If a large number of blocks are being written to and read from the disk, this
may be a disk bottleneck. Go to Section 9.7.1. Otherwise, continue to Section 9.4.10.
Next, we see whether the system is using a significant amount of network I/O. It is
easiest to run iptraf, ifconfig, or sar and see how much data is being transferred
on each network device. If the network traffic is near the capacity of the network
device, this may be a network bottleneck. Go to Section 9.8.1. If none of the network
devices seem to be passing network traffic, the kernel is waiting on some other I/O
device that is not covered in this book. It may be useful to see what functions the
kernel is calling and what devices are interrupting the kernel. Go to Section 9.4.5.
Figure 9-3 shows the method for investigating a process's CPU usage.
Figure 9-3.
You can use the time command to determine whether an application is spending its
time in kernel or user mode. oprofile can also be used to determine where time is
spent. By profiling per process, it is possible to see whether a process is spending its
time in the kernel or user space.
9.5.2. Which System Calls Is the Process Making, and How Long Do They Take to Complete?
Next, run strace to see which system calls are made and how long they take to
complete. You can also run oprofile to see which kernel functions are being called.
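For example (4321 is a placeholder pid), strace's summary mode totals the time spent in, and the number of calls to, each system call:

strace -c -p 4321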
Now that the problem has been identified, it is up to you to fix it. Go to Section 9.9.
Next, run oprofile on the application using the cycle event to determine which
functions are using all the CPU cycles (that is, which functions are spending all the
application time).
Keep in mind that although oprofile shows you how much time was spent in a
process, when profiling at the function level, it is not clear whether a particular
function is hot because it is called very often or whether it just takes a long time to
complete.
One way to determine which case is true is to acquire a source-level annotation from
oprofile and look for instructions/source lines that should have little overhead (such
as assignments). The number of samples that they have will approximate the number
of times that the function was called relative to other high-cost source lines. Again,
this is only approximate because oprofile samples only the CPU, and out-of-order
processors can misattribute some cycles.
It is also helpful to get a call graph of the functions to determine how the hot functions
are being called. To do this, go to Section 9.5.4.
Next, you can figure out how and why the time-consuming functions are being called.
Running the application with gprof can show the call tree for each function. If the
time-consuming functions are in a library, you can use ltrace to see which library functions are being called and how often.
Finally, you can use newer versions of oprofile that support call-tree tracing.
Alternatively, you can run the application in gdb and set a breakpoint at the hot
function. You can then run that application, and it will break during every call to the
hot function. At this point, you can generate a backtrace and see exactly which
functions and source lines made the call.
Knowing which functions call the hot functions may enable you to eliminate or reduce
the calls to these functions, and correspondingly speed up the application.
If reducing the calls to the time-consuming functions did not speed up the application,
or it is not possible to eliminate these functions, go to Section 9.5.5.
Next, run oprofile, cachegrind, and kcachegrind against your application to see whether
the time-consuming functions or source lines are those with a high number of cache
misses. If they are, try to rearrange or compress your data structures and accesses to
make them more cache friendly. If the hot lines do not correspond to high cache
misses, try to rearrange your algorithm to reduce the number of times that the
particular line or function is executed.
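A sketch of that workflow (the binary name is a placeholder, and the exact name of the cachegrind output file depends on the valgrind version in use):

valgrind --tool=cachegrind ./myapp
kcachegrind cachegrind.out.4321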
In any event, the tools have told you as much as they can, so go to Section 9.9.
Figure 9-4 shows the flowchart of decisions that we will make to figure out how the
system memory is being used.
Figure 9-4.
To track down what is using the system's memory, you first have to determine whether
the kernel itself is allocating memory. Run slabtop and see whether the total size of
the kernel's memory is increasing. If it is increasing, jump to Section 9.6.2.
If the kernel's memory usage is not increasing, it may be a particular process causing
the increase. To track down which process is responsible for the increase in memory
usage, go to Section 9.6.3.
If the kernel's memory usage is increasing, once again run slabtop to determine what
type of memory the kernel is allocating. The name of the slab can give some indication
about why that memory is being allocated. You can find more details on each slab
name in the kernel source and through Web searches. By just searching the kernel
source for the name of that slab and determining which files it is used in, it may
become clear why it is allocated. After you determine which subsystem is allocating all
that memory, try to tune the amount of maximum memory that the particular
subsystem can consume, or reduce the usage of that subsystem.
Go to Section 9.9.
Next, you can use top or ps to see whether a particular process's resident set size is
increasing. It is easiest to add the rss field to the output of top and sort by memory
usage. If a particular process is increasingly using more memory, we need to figure
out what type of memory it is using. To figure out what type of memory the application is using, go to Section 9.6.6.
Use ipcs to determine whether the amount of shared memory being used is
increasing. If it is, go to Section 9.6.5 to determine which processes are using the
memory. Otherwise, you have a system memory leak not covered in this book. Go to
Section 9.9.
Use ipcs to determine which processes are using and allocating the shared memory.
After the processes that use the shared memory have been identified, investigate the
individual processes to determine why the memory is being used by each. For
example, look in the application's source code for calls to shmget (to allocate shared
memory) or shmat (to attach to it). Read the application's documentation and look for
options that explain and can reduce the application's use of shared memory.
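For example, the following two views are usually enough to identify the segments and their owners:

ipcs -m        # list shared memory segments and their sizes
ipcs -m -p     # also show the creator and last-attaching pids for each segment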
The easiest way to see what types of memory the process is using is to look at its
status in the /proc file system. This file, viewed with cat /proc/<pid>/status, gives a
breakdown of the process's memory usage.
If the process has a large and increasing VmStk, this means that the process's stack
size is increasing. To analyze why, go to Section 9.6.7.
If the process has a large VmExe, that means that the executable size is big. To figure
out which functions in the executable contribute to this size, go to Section 9.6.8. If the
process has a large VmLib, that means that the process is using either a large number
of shared libraries or a few large-sized shared libraries. To figure out which libraries
contribute to this size, go to Section 9.6.9. If the process has a large and increasing
VmData, this means that the process's data area, or heap, is increasing. To analyze why,
go to Section 9.6.10.
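For example (4321 is a placeholder pid), the memory-related lines can be pulled out directly:

grep Vm /proc/4321/status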
To figure out which functions are allocating large amounts of stack, we have to use
gdb and a little bit of trickery. First, attach to the running process using gdb. Then,
ask gdb for a backtrace using bt. Next, print out the stack pointer using info
registers esp (on i386). This prints out the current value of the stack pointer. Now
type up and print out the stack pointer. The difference (in hex) between the previous
stack pointer and the current stack pointer is the amount of stack that the previous
function is using. Continue this up the backtrace, and you will be able to see which functions are consuming the most stack space.
When you figure out which function is consuming most of the stack, or whether it is a
combination of functions, you can modify the application to reduce the size and
number of calls to this function (or these functions). Go to Section 9.9.
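A sketch of that gdb session on an i386 system follows (4321 is a placeholder pid); the difference between the two printed stack-pointer values approximates the stack used by the frame you stepped out of:

gdb -p 4321
(gdb) bt
(gdb) info registers esp
(gdb) up
(gdb) info registers esp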
If the executable has a sizable amount of memory being used, it may be useful to
determine which functions are taking up the greatest amount of space and prune
unnecessary functionality. For an executable or library compiled with symbols, it is
possible to ask nm to show the size of all the symbols and sort them with the following
command:
nm -S --size-sort <binary>
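For example, the ten largest symbols in a (hypothetical) binary could be listed with:

nm -S --size-sort ./myapp | tail -10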
With the knowledge of the size of each function, it may be possible to reduce their size
or remove unnecessary code from the application.
Go to Section 9.9.
9.6.9. How Big Are the Libraries That the Process Uses?
The easiest way to see which libraries a process is using and their individual sizes is to
look at the process's map in the /proc file system. This file, viewed with cat /proc/<pid>/maps,
shows each of the libraries and the size of their code and data. When you know which
libraries a process is using, it may be possible to eliminate the usage of large libraries
or use alternative and smaller libraries. However, you must be careful, because
removing large libraries may not reduce overall system memory usage.
If any other applications are using the library, which you can determine by running
lsof on the library, the library will already be loaded into memory. Any new
applications that use it do not require an additional copy of the library to be loaded
into memory. Switching your application to use a different library (even if it is smaller)
can actually increase total memory usage. This new library will not be used by any other
processes and will require new memory to be allocated. The best solution may be to
shrink the size of the libraries themselves or modify them so that they use less
memory to store library-specific data. If this is possible, all applications will benefit.
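As a quick check (the library path is a placeholder), lsof lists every process that currently has a given library mapped:

lsof /usr/lib/libexample.so.1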
To find the size of the functions in a particular library, go to Section 9.6.8; otherwise,
go to Section 9.9.
If your application is written in C or C++, you can figure out which functions are
allocating heap memory by using the memory profiler memprof. memprof can show, as the application runs, how much memory each function is responsible for allocating.
After you know which functions allocate the largest amounts of memory, it may be
possible to reduce the size of memory that is allocated. Programmers often
overallocate memory just to be on the safe side because memory is cheap and
out-of-bounds errors are hard to detect. However, if a particular allocation is causing
memory problems, careful analysis of the minimum allocation makes it possible to
significantly reduce memory usage and still be safe. Go to Section 9.9.
Figure 9-5 shows the steps we take to determine the cause of disk I/O usage.
Figure 9-5.
Run iostat in the extended statistic mode and look for partitions that have an average
wait (await) greater than zero. await is the average number of milliseconds that
requests are waiting to be filled. The higher this number, the more the disk is
overloaded. You can confirm this overload by looking at the amount of read and write
traffic on a disk and determining whether it is close to the maximum amount that the
drive can handle.
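For example, extended statistics sampled every five seconds (the interval and count are arbitrary choices):

iostat -x 5 3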
As mentioned in the chapter on disk I/O, this is where it can be difficult to determine
which process is causing a large amount of I/O, so we must try to work around the
lack of tools to do this directly. By running top, you first look for processes that are
nonidle. For each of these processes, proceed to Section 9.7.3.
First, use strace to trace all the system calls that an application is making that have
to do with file I/O, using strace -e trace=file. We can then run strace with summary
information to see how long each call is taking. If certain read and write calls are
taking a long time to complete, this process may be the cause of the I/O slowdown. By
running strace in normal mode, it is possible to see which file descriptors it is reading
and writing from. To map these file descriptors back to files on a file system, we can
look in the proc file system. The files in /proc/<pid>/fd/ are symbolic links from the
file descriptor number to the actual files. An ls -la of this directory shows which files
this process is using. By knowing which files the process is accessing, it might be
possible to reduce the amount of I/O the process is doing, spread it more evenly
between multiple disks, or even move it to a faster disk.
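A sketch of those two steps (4321 is a placeholder pid):

strace -e trace=file -p 4321      # watch the file-related system calls
ls -la /proc/4321/fd/             # map the file descriptors back to file names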
After you determine which files the process is accessing, go to Section 9.9.
Figure 9-6 shows the steps that we take to investigate a network performance
problem.
Figure 9-6.
The first thing to do is to use ethtool to determine what hardware speed each
Ethernet device is set to. If you record this information, you then investigate whether
any of the network devices are saturated. Ethernet devices and/or switches can be
easily misconfigured, and ethtool shows what speed each device believes that it is
operating at. After you determine the theoretical limit of each of the Ethernet devices,
use iptraf (or even ifconfig) to determine the amount of traffic that is flowing over
each interface. If any of the network devices appear to be saturated, go to Section
9.8.3; otherwise, go to Section 9.8.2.
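For example (eth0 is an assumption about the interface name):

ethtool eth0      # negotiated link speed and duplex
ifconfig eth0     # byte and error counts for the interface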
Network traffic can also appear to be slow because of a high number of network
errors. Use ifconfig to determine whether any of the interfaces are generating a
large number of errors. A large number of errors can be the result of a mismatched
Ethernet card / Ethernet switch setting. Contact your network administrator, search
the Web for people with similar problems, or e-mail questions to one of the Linux
networking newsgroups.
Go to Section 9.9.
If a particular device is servicing a large amount of data, use iptraf to track down
what types of traffic that device is sending and receiving. When you know the type of
traffic that the device is handling, advance to Section 9.8.4.
If no application is responsible for this traffic, some system on the network may be
bombarding your system with unwanted traffic. To determine which system is sending
all this traffic, use iptraf or etherape.
If it is possible, contact the owner of this system and try to figure out why this is
happening. If the owner is unreachable, it might be possible to set up IP filters
within the Linux kernel to always drop this particular traffic, or to set up a firewall
between the remote machine and the local machine to intercept the traffic.
Go to Section 9.9.
Determining which socket is being used is a two-step process. First, we can use
strace to trace all the I/O system calls that an application is making by using strace
-e trace=file. This shows which file descriptors the process is reading and writing
from. Second, we map these file descriptors back to a socket by looking in the proc
file system. The files in /proc/<pid>/fd/ are symbolic links from the file descriptor
number to the actual files or sockets. An ls -la of this directory shows all the file
descriptors of this particular process. Those with socket in the name are network
sockets. You can then use this information to determine inside the program which
socket is causing all the communication.
Go to Section 9.9.
The next few chapters show this method being used to find performance problems on
a Linux system.
• Figure out which source lines are using all the CPU in a CPU-bound application.
• Use ltrace and oprofile to figure out how often an application is calling
various internal and external functions.
• Look for patterns in the application's source, and search online for information
about how an application behaves and possible solutions.
• Use this chapter as a template for tracking down a CPU-related performance
problem.
As the disk and network bottlenecks are removed, the application becomes
CPU-bound. In addition, it is often easier to buy faster disks or more memory than to
upgrade a CPU, so if a process is CPU-bound, it is an important skill to be able to hunt
down and fix a CPU performance problem rather than just buy a new system.
Listing 10.1.
top - 08:24:48 up 7 days, 9:08, 6 users, load average: 1.04, 0.64, 0.76
From this, we can deduce that GIMP actually spawns a separate process to run the
filter. So while the filter is running, we can use ps to track how much CPU time the
process is using and to tell when it has finished. After finding the PID of the filter
with top, we can run the loop in Listing 10.2, which asks ps to periodically report
how much CPU time the filter is using.
Listing 10.2.
Note
When running the lic filter on the reference image (which is a fetching picture of my
basement) and using the ps method just mentioned to time the filter, we can see from
Listing 10.2 that it takes 2 minutes and 46 seconds to run on the entire image. This
time is our baseline time. Now that we know the amount of time that the filter takes to
run out of the box, we can set our goal for the performance hunt. It is not always clear
how to set a reasonable goal for a performance investigation. A reasonable value for a
goal can depend on several factors, including the amount of tuning that has already
been done on the particular problem and the requirements of the user. It is often best
to set the goal based on another, faster application that does a similar
thing. Unfortunately, we do not know of any GIMP filters that do similar work, so we
have to make a guess. Because a 5 or 10 percent gain in performance is usually a
reasonable goal for a relatively untuned piece of code, we'll set a goal of a 10 percent
speedup, or a runtime of 2 minutes and 30 seconds.
Now that we have picked our goal, we need a way to guarantee that our performance
optimizations are not unacceptably changing the results of the filter. In this case, we
will run the original filter on the reference image and save the result into another file.
We can then compare the output of our optimized filter to the output of the original
filter and see whether the optimizations have changed the output.
For the GIMP, we download the latest GIMP tarball from its Web site, and then
recompile it. In the case of GIMP and much open-source software, the first step in
recompilation is running the configure command, which generates the makefiles that
will be used to build the application. The configure command passes any flags
present in the CFLAGS environmental variable into the makefile. In this case, because
we want the GIMP to be built with symbols, we set the CFLAGS variable to contain -g3.
This causes symbols to be included in the binaries that are built. This command is
shown in Listing 10.3 and overrides the current value of the CFLAGS environmental
variable and sets it to -g3.
Listing 10.3.
We then make and install the version of GIMP with all the symbols included, and when
we run this version, the performance tools will tell us where time is being spent.
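The build might look something like the following sketch (the install prefix is an arbitrary choice and is not taken from the text):

CFLAGS=-g3 ./configure --prefix=/opt/gimp-test
make
make install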
In the case of oprofile, we can start oprofile, run the filter, and then stop oprofile
after the filter has been completed. Because the lic filter takes up approximately 90
percent of the CPU when running, the system-wide samples that oprofile collects will
be mainly relevant for the lic filter. When lic starts to run, we start oprofile in
another window; when lic finishes in that other window, we stop oprofile. The
starting and stopping of oprofile is shown in Listing 10.4.
Listing 10.4.
ltrace must be run a little differently. After the filter has been started, ltrace can be
attached to the running process. Unlike oprofile, attaching ltrace to a process
brings the entire process to a crawl. This can inaccurately inflate the amount of time
taken for each library call; however, it provides information about the number of times
each call is made. Listing 10.5 shows a listing from ltrace.
Listing 10.5.
To get the full number of library calls, it is possible to let ltrace run until completion;
however, it takes a really long time, so in this case, we pressed <Ctrl-C> after a long
period of time had elapsed. This will not always work, because an application may go
through different stages of execution, and if you stop it early, you may not have a
complete picture of what functions the application is calling. However, this short
sample will at least give us a starting point for analysis.
First, we use oprofile to look at how the entire system was spending time. This is
shown in Listing 10.6.
Listing 10.6.
As Listing 10.6 shows, 75 percent of the CPU time was spent in the lic process or
GIMP-related libraries. Most likely, these libraries are called by the lic process, a fact
that we can confirm by combining the information that ltrace gives us with the
information from oprofile. Listing 10.7 shows the library calls made for a small
portion of the run of the filter.
Listing 10.7.
Next, we investigate the information that oprofile gives us about where CPU time is
being spent in each of the libraries, and see whether the hot functions in the libraries
are the same as those that the filter calls. For each of the three top CPU-using images,
we ask opreport to give us more details about which functions in the library are
spending all the time. The results are shown in Listing 10.8 for the libgimp,
libgimp-color libraries, and the lic process.
Listing 10.8.
...
....
....
As you can see by comparing the output of ltrace in Listing 10.7 and the oprofile
output in Listing 10.8, the lic filter is repeatedly calling the library functions that are
spending all the time.
Next, we investigate the source code of the lic filter to determine how it is
structured, what exactly its hot functions are doing, and how the filter calls the GIMP
library functions. The lic function that generated the most samples is the getpixel
function, shown by the opannotate output in Listing 10.9. opannotate shows the
number of samples, followed by the total percentage of samples in a column to the left
of the source. This enables you to look through the source and see which exact source
lines are hot.
Listing 10.9.
Listing 10.10.
:static void
:peek (GimpPixelRgn *src_rgn,
: gint x,
: gint y,
: GimpRGB *color)
481 1.7937 :{ /* peek total: 4485 16.7251 */
: static guchar data[4] = { 0, };
:
1373 5.1201 : gimp_pixel_rgn_get_pixel (src_rgn, data, x, y);
2458 9.1662 : gimp_rgba_set_uchar (color, data[0], data[1], data[2],
data[3]);
173 0.6451 :}
Although it is not quite clear exactly what the filter is doing or what the library calls
are used for, there are a few curious points. First, peek sounds like a function that
would retrieve pixels from the image so that the filter can process them. We can check
this hunch shortly. Second, most of the time spent in the filter does not appear to be
spent running a mathematical algorithm on the image data. Instead of spending all the
CPU time running calculations based on the values of the pixels, this filter appears to
spend most of the time retrieving pixels to be manipulated. If this is really the case,
perhaps it can be fixed.
First, we search the Web for pixel_rgn_get_pixel and try to determine what it does. After
a few false starts, the following link and information revealed in Listing 10.11 confirm our
suspicions about what pixel_rgn_get_pixel does.
Listing 10.11.
"There are calls for pixel_rgn_get_ pixel, row, col, and rect, which grab
data from the image and dump it into a buffer that you've pre-allocated.
And there are set calls to match. Look for "Pixel Regions" in gimp.h."
(from http://gimp-plug-ins.sourceforge.net/doc/Writing/html/sect-image.html)
In addition, the information in Listing 10.12 suggests that it is a good idea to avoid using
pixel_rgn_get_ calls.
Listing 10.12.
"Note that these calls are relatively slow, they can easily be the
slowest thing in your plug-in. Do not get (or set) pixels one at a time
using pixel_rgn_[get|set]_pixel if there is any other way. " (from
http://www.home.unix-ag.org/simon/gimp/guadec2002/gimp-plugin/html/imagedata.html)
In addition, the Web search yields information about the gimp_rgb_set_uchar function by
simply turning up the source for the function. As shown in Listing 10.13, this call just packs
the red, green, and blue values into a GimpRGB structure that represents a single color.
Listing 10.13.
void
gimp_rgb_set_uchar (GimpRGB *rgb,
guchar r,
guchar g,
guchar b)
{
g_return_if_fail (rgb != NULL);
Information gleaned from the Web confirms our suspicion: The pixel_rgn_get_ pixel
function is a way to extract image data from the image, and gimp_rgba_set_uchar is just a
way to take the color data returned by pixel_rgn_get_pixel and put it into the GimpRGB
data structure.
Not only do we see how these functions are used, other pages also hint that they may not be
the best functions to use if we want the filter to perform at its peak. One Web page
(http://www.home.unix-ag.org/simon/gimp/guadec2002/gimp-plugin/html/efficientaccess.html)
suggests that it may be possible to increase performance by using the GIMP image cache.
Another Web site (http://gimp-plug-ins.sourceforge.net/doc/Writing/html/sect-tiles.html)
suggests that it might be possible to increase performance by rewriting the filter to access
the image data more efficiently.
Even though this might seem like enough cache, the GIMP might possibly still need
more. The simple way to test this is to increase the cache to a very large value and see
whether that improves performance. So, in this case, we increase the amount of cache
to 10 times the amount that is normally used. After increasing this value and
rerunning the filter, we receive a time of 2 minutes and 40 seconds. This is an
increase of 6 seconds, but we have not reached our goal of 2 minutes and 30 seconds.
This says that we must look in other areas to increase the performance.
GIMP can provide a way for the filter programmer to directly access the tiles of an
image. The filter can then access the image data as if it were accessing a data array,
instead of requiring a call into a GIMP library. However, there is a catch. When you
have direct access to the pixel data, it is only for the current tile. GIMP will then
iterate over all the tiles in the image, allowing you to ultimately have access to all the
pixels in the image, but you cannot access them all simultaneously. It is only possible
to look at the pixels from a single tile, and this is incompatible with how lic accesses
data. When the lic filter is generating a new pixel at a particular location, it
calculates its new value based on the values of the pixels that surround it. Therefore,
when generating new pixels on the edge of a tile, the lic filter requires pixel data
from all the pixels around it. Unfortunately, these pixels may be on the previous tile or
the next tile in the image. Because this pixel information is not available, the image
filter will not work with this optimized access method.
Because the calls to the GIMP library are expensive, we would only like to do them
once for each pixel rather than nine times. It is possible to optimize access to the
image by reading the entire image into a local array when the filter starts up, and then
accessing this local array as the filter runs, rather than calling the GIMP library
routines each time we want to access the data. This method should significantly
reduce the overhead for looking up the pixel data. Instead of a couple of function calls
for each data access, we just access our local array. On filter initialization, the array is
allocated with malloc and filled with the pixel data. This is shown in Listing 10.14.
Listing 10.14.
g_image_width = width;
g_image_height = height;
g_cached_image = malloc(sizeof(GimpRGB)*width*height);
current_pixel = g_cached_image;
/* Malloc */
for (y = 0; y < height; y++)
{
for (x = 0; x < width; x++)
{
gimp_pixel_rgn_get_pixel (src_rgn, data, x, y);
gimp_rgba_set_uchar (current_pixel, data[0], data[1], data[2],
data[3]);
current_pixel++;
}
}
In addition, the peek routine has been rewritten just to access this local array rather
than call into the GIMP library functions. This is shown in Listing 10.15.
Listing 10.15.
So, does it work? When we run the filter using the new method, runtime has
decreased to 56 seconds! This is well within our goal of 2 minutes and 30 seconds, and
it is a significant boost in performance.
The performance, though impressive, did not come for free. We made one of the
classic trade-offs in performance engineering: We increased performance at the
expense of memory usage. For example, when a 1280x1024 image is used with this
filter, we require 5 additional megabytes of memory. For very big images, it may not
be practical to cache this data; for reasonably sized images, however, a 5MB increase
in memory usage seems like a good sacrifice for a filter that is more than two times as
fast.
This would normally be a cause for concern, because this might indicate that
optimization changed the behavior of the filter. However, a closer examination of the
filter's source code showed several places where random noise was used to slightly
jitter the image before the filter was run. Any two runs of the filter would be different,
so the optimization was likely not to blame. Because the differences between the two
images were so visually small, we can assume that the optimization did not introduce
any problems.
We beat our optimization goal, and then verified that our optimization did not change
the output of the application.
Whereas this chapter focused on optimizing a single application's runtime, the next
chapter presents a performance hunt that concentrates on reducing the amount of
latency when interacting with X Windows. Reducing latency can be tricky, because a
single event often sets off a nonobvious set of other events. The hard part is figuring
out what events are being called and how long each of them are taking.
• Use ltrace and oprofile to figure out where latency is being generated in a
latency-sensitive application.
• Use gdb to generate a stack trace for each call to a "hot" function.
• Use performance tools to determine where time is spent for an application that
uses many different shared libraries.
• Use this chapter as a template to find the cause of high latency in a
latency-sensitive application.
Why should we optimize this? Even though the amount of time to open a pop-up may
be less than a second, it is still slow enough that users can perceive the lag between
when they right-click the mouse and when the menu shows up. This sluggish pop-up
gives the GNOME user the impression that the computer is running slowly. People
notice a slight delay, and it can make interaction with nautilus annoying or give the
impression that the desktop is slow.
This particular performance problem is different from the GIMP problem of the
preceding chapter. First, the core components of the desktop (in this case, GNOME)
are typically more complicated and interlocked than a typical desktop application. The
components typically rely on a variety of subsystems and shared libraries to do their
work. Whereas the GIMP was a relatively self-contained application, making it easier
to profile and recompile when necessary, the GNOME desktop is made up of many
different interlocking components. The components may require multiple processes
and shared libraries, each performing a different task on behalf of the desktop.
nautilus, in particular, is linked to 72 different shared libraries. Tracking down exactly
which piece of code is spending time, how much it is spending, and why it is spending
it, can be a daunting task.
The significant second difference of this performance investigation from the GIMP
investigation is that the times we are trying to reduce are on the order of milliseconds
rather than seconds or minutes. When the times are so small, it can be difficult to
make sure that the profiling data that you are capturing is actually the result of the
event that you are trying to measure rather than just the noise around trying to stop
and start the profiling tools. However, this short time period also makes it practical to
trace all aspects of what the application does for the interesting period of time.
Because right-clicking 100 times would be tedious, and a human (unless very well
trained) could not reliably open up a pop-up menu 100 times in a repeatable manner,
we must automate it. To reliably open up the pop-up menu 100 times, we rely on the
xautomation package. The xautomation package is available at
http://hoopajoo.net/projects/xautomation.html. It can send arbitrary X Window events
to the X server, mimicking a user. After downloading the xautomation tar file,
unzipping and compiling it, we can use it to automate the right mouse click.
Unlike with the GIMP, we cannot simply measure the amount of CPU time used by
nautilus to evaluate the time needed to create 100 pop-up menus. This is mainly
because nautilus does not start immediately before a menu is opened and end
immediately after. We are going to use wall-clock time to see how much time it takes
to complete this task. This requires that nothing else be running on the system
while we run the test.
Listing 11.1 shows the shell script of xautomation commands that are used to open
100 pop-up menus in the nautilus file browser. When we run the test, we have to make
sure that we have oriented the nautilus window so that none of the clicks actually
opens a pop-up menu on a folder, and that instead all the pop-ups occur on the
background. This is important because the code paths for the different pop-up menus
could be radically different.
Listing 11.1.
#!/bin/bash
for i in `seq 1 100`;
do
echo $i
./xte 'mousemove 100 100' 'mouseclick 3' 'mouseclick 3'
./xte 'mousemove 200 100' 'mouseclick 3' 'mouseclick 3'
done
The commands in Listing 11.1 move the cursor to position (100,100) on the X screen,
and click the right mouse button (button 3). This brings up a menu. Then they click the
right mouse button again, and this closes the menu. They then move to X position
(200,100), and repeat the process.
Next, we use time to see how much the script of these 100 iterations takes to
complete. This is our baseline time. When we do our optimizations, we will check them
against this time to see whether they have improved. This baseline time for the stock
Fedora 2 version of nautilus on my laptop is 26.5 seconds.
Finally, we have to pick a goal for our optimization path. One easy way to do this is to
find an application that already has fast pop-up menus and see how long it takes for it
to bring up a pop-up menu 100 times. A perfect example of this is xterm, which has
nice snappy menus. Although the menus are not as complicated as those in nautilus,
they should at least be considered an upper bound on how fast menus can be.
The pop-up menus on xterm work a little bit differently, so we have to slightly change
the script to create 100 pop-ups. When xterm creates a pop-up, it requires that the left
control key is depressed, so we have to slightly modify our automation script. This
script is shown in Listing 11.2.
Listing 11.2.
#!/bin/bash
for i in `seq 1 100`;
do
echo $i
./xte 'keydown Control_L' 'mousemove 100 100' 'mouseclick 3' 'mouseclick 3'
./xte 'keydown Control_L' 'mousemove 200 100' 'mouseclick 3' 'mouseclick 3'
done
When running xterm and timing the pop-up menu creation, xterm takes ~9.2 seconds
to complete the script. nautilus has significant (almost 17 seconds of) room for
improvement. It is probably unreasonable to expect the creation of nautilus's complex
pop-up menus to be the same speed as those of xterm, so let's be conservative and set
a goal of 10 percent, or 3 seconds. Hopefully, we will be able to do much better than
this, or at least figure out why it is not possible to speed it up any more.
Because we only want oprofile to measure events that occur while we are opening
the pop-up menus, we are going to use the command line shown in Listing 11.3 to
start and stop the profiling immediately before and immediately after we run our
script (named script.sh) that opens and closes 100 pop-up menus.
Listing 11.3.
Running opreport after that profiling information has been collected gives us the
information shown in Listing 11.4.
Listing 11.4.
As you can see, time is spent in many different libraries. Unfortunately, it is not at all
clear which application is responsible for making those calls. In particular, we have no
idea which processes have called the libgobject library. Fortunately, oprofile
provides a way to record the shared libraries' functions that an application uses
during a run. Listing 11.5 shows how to configure oprofile's sample collection to
separate the samples by library, which means that oprofile will attribute the samples taken in each shared library to the application that was executing when they were collected.
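The option that does this is sketched below; consult opcontrol --help for the exact value accepted by your oprofile version:

opcontrol --separate=library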
Listing 11.5.
After we rerun our test (using the commands in Listing 11.3), opreport splits up the
library samples per application, as shown in Listing 11.6.
Listing 11.6.
If we drill down into the libgobject and libglib libraries, we can see exactly which
functions are being called, as shown in Listing 11.7.
Listing 11.7.
...
From the oprofile output, we can see that nautilus spends a significant amount of
time in the libgobject library and, in particular, in the g_type_check_instance_is_a
function. However, it is unclear what function within the nautilus file manager called
these functions. In fact, the functions may not even be called directly from nautilus,
instead being made by other shared library calls that nautilus is making.
We next use ltrace, the shared library tracer, to try to figure out which library calls
are the most expensive and ultimately what is calling the
g_type_check_instance_is_a function. Because we are concerned primarily about
which functions nautilus is calling, rather than the exact timing information, it is only
necessary to open a pop-up menu once rather than 100 times. Because ltrace will
catch every single shared library call for a single run, if we create 100 pop-up menus,
ltrace would just show the same profile information 100 times.
This procedure for capturing shared library usage information is similar to how we did
it for the GIMP. We first start nautilus as normal. Then before we open up a pop-up
menu, we attach to the nautilus process using the following ltrace command:
ltrace -c -p <pid_of_nautilus>.
We right-click in the nautilus background to bring up the menu, and then immediately
kill the ltrace process with a <Ctrl-C>. After tracing the pop-up, we get the summary
table shown in Listing 11.8.
Listing 11.8.
We can see something interesting in this table. ltrace shows a completely different
function at the top of the list than oprofile did. This is mainly because oprofile and
ltrace measure slightly different things. oprofile shows how much time is spent in
actual functions, but not in the children that those functions call. ltrace just shows how much time it takes
for an external library call to complete. If that library function in turn calls other
functions, ltrace does not record their individual timings. In fact, it currently does
not even detect or display that these other library calls happened.
In this particular case, the function that oprofile says is the hottest (g_type_check_instance_is_a) does not appear at the top of the ltrace output; it is most likely being called from inside the library calls that ltrace does report.
The most significant information that ltrace gives us is a set of a few library calls that
our application makes that we can investigate. We can figure out where the library is
being called, and possibly why all the time is being spent in that library call.
In this case, we downloaded and installed Red Hat's source rpm for nautilus, which
places the source of nautilus in /usr/src/redhat/SOURCES/. By using Red Hat's
source package, we have the exact source and patches that Red Hat used to create the
binary in the package. It is important to investigate the source that was used to create
the binaries that we have been investigating, because another version may have
different performance characteristics. After we extract the source, we can begin to
figure out where the bonobo_window_add_popup call is made. We can search all the
source files in the nautilus directory using the commands in Listing 11.9.
Listing 11.9.
./src/file-manager/fm-directory-view.c: bonobo_window_add_popup\
(get_bonobo_window (view), menu, popup_path);
Listing 11.10.
{
GtkMenu *menu;
return menu;
Listing 11.11.
Now that we have narrowed down exactly where the menu pop-ups are created and
displayed, we can begin to figure out exactly which pieces are taking all the time and
which pieces are ultimately calling the g_type_check_instance_is_a function that
oprofile says is the hot function.
No standard Linux performance tool shows us exactly which functions are calling a particular function. gprof can present this caller information, but to be effective it requires recompiling the application and all the libraries that it relies on with the -pg flag. For nautilus, which relies on 72 shared libraries, this is a daunting and infeasible task, so we have to look for another solution. Newer versions of oprofile can also provide this type of information, but because oprofile only samples periodically, it still cannot account for every call to a given function.
Fortunately, we can creatively use gdb to extract that information. Using gdb to trace
the application greatly slows down the run; however, we do not really care whether
the trace takes a long time. We are interested in finding the number of times that a
particular function is called rather than how long each call takes, so it is
acceptable for the run to take a long time. Luckily, the creation of the pop-up menu is
in the millisecond range; even if it is 1,000 times slower with gdb, it still only takes
about 15 minutes to extract the full trace. The value of the information outweighs our
wait to retrieve it.
To solve this, we use another of gdb's features: gdb can execute a given set of commands when it hits a breakpoint. By using the commands command, we can tell gdb to execute bt; cont every time it hits the breakpoint in our function. So now the
backtrace displays automatically, and the application continues running every time it
hits g_type_check_instance_is_a.
Now we have to isolate when the trace actually runs. We could just set up the
breakpoint in g_type_check_instance_is_a at the start of the nautilus execution, and
gdb would show tracing information when it is called by any function. Because we only
care about those functions that are called when we are creating a pop-up menu, we
want to limit that tracing to only when pop-ups are being created. To do this, we set
another breakpoint at the beginning and end of the routine that creates the pop-up menu, fm_directory_view_pop_up_background_context_menu, so that the backtracing breakpoint is active only while the pop-up is being built. The gdb commands that do this are shown in Listing 11.12.
Listing 11.12.
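As a rough sketch of this kind of gdb automation (the two function names come from the text; the command-file approach, the file names, and the exact way the tracing is scoped are assumptions rather than the listing's exact contents):
cat > popup-trace.gdb <<'EOF'
set pagination off
set logging file gdb.txt
set logging on
# Wait until the pop-up creation function is reached before arming the trace.
break fm_directory_view_pop_up_background_context_menu
continue
# Now set the breakpoint we actually want to log; "commands" applies to it.
break g_type_check_instance_is_a
commands
bt
cont
end
continue
EOF
gdb -x popup-trace.gdb -p $(pidof nautilus)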
When running these gdb commands and opening a pop-up menu, gdb churns away for
several minutes and creates a 33MB file containing all the backtrace information for
functions that called the g_type_check_instance_is_a function. A sample of one is
shown in Listing 11.13.
Listing 11.13.
To make this mass of backtraces easier to digest, we write the small Python script shown in Listing 11.14, which collapses each backtrace into a single line.
Listing 11.14.
#!/usr/bin/python
# Collapse each gdb backtrace into a single "caller->...->callee" line.
import sys
import string

funcs = ""
stop_at = "fm_directory_view_pop_up_background_context_menu"
for line in sys.stdin:
    parsed = string.split(line)
    if (line[:1] == "#"):
        # Frame "#0" holds the function name in field 1; deeper frames
        # ("#1  0xaddress in function (...)") hold it in field 3.
        if (parsed[0] == "#0"):
            funcs = parsed[1]
        elif (parsed[3] == stop_at):
            # Reached the pop-up function: print the chain and start over.
            print funcs
            funcs = ""
        else:
            funcs = parsed[3] + "->" + funcs
When we feed the gdb.txt file into this Python program using the command line shown
in Listing 11.15, we have a more consolidated output, an example of which is shown in
Listing 11.16.
Listing 11.15.
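As a sketch of the invocation (assuming the script from Listing 11.14 was saved as parse_backtrace.py, a name chosen here purely for illustration):
python parse_backtrace.py < gdb.txt > backtrace.txt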
....
create_popup_menu->gtk_widget_show->g_object_notify->g_type_check_
instance_is_a
create_popup_menu->gtk_widget_show->g_object_notify->g_object_ref->g_
type_check_instance_is_a
create_popup_menu->gtk_widget_show->g_object_notify->g_object_
notify_queue_add->g_param_spec_get_redirect_target->g_type_check_
instance_is_a
create_popup_menu->gtk_widget_show->g_object_notify->g_object_notify_
queue_add->g_param_spec_get_redirect_target->g_type_check_instance_is_a
create_popup_menu->gtk_widget_show->g_object_notify->g_object_unref->g_
type_check_instance_is_a
create_popup_menu->gtk_widget_show->g_object_unref->g_type_check_
instance_is_a
...
Because the output lines are long, they have been wrapped when displayed in this
book; in the text file, however, there is one backtrace per line. Each line ends with the
g_type_check_instance_is_a function. Because each backtrace spans only one line,
we can extract information about the backtraces using some common Linux tools,
such as wc, which we can use to count the number of lines in a particular file.
First, let's look at how many calls have been made to the
g_type_check_instance_is_a function. This is the same as the number of backtraces
and, hence, the number of lines in the backtrace.txt file. Listing 11.17 shows the wc
command being called on our pruned backtrace file. The first number indicates the
number of lines in the file.
Listing 11.17.
As you can see, the function has been called 6,848 times just to create the pop-up
menu. Next, let's see how many of those calls are made on behalf of
bonobo_window_add_popup. This is shown in Listing 11.18.
Listing 11.18.
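Neither counting step needs anything more exotic than the standard text tools; a sketch (the 6,848 figure is the one quoted above, and the grep pattern is simply the function name):
wc -l backtrace.txt                                  # reports 6848 lines, one per backtrace
grep bonobo_window_add_popup backtrace.txt | wc -l   # backtraces that pass through bonobo_window_add_popup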
We first have to take a baseline, because the binaries we are testing have
been compiled with different flags than those provided by Red Hat. We time the
scripts as we did before. In this case, a run of 100 iterations takes 30.5 seconds on the
version that we have compiled ourselves. Next, we comment out the
eel_pop_up_context_menu call. This shows us how much time it took nautilus to
detect the mouse click and decide that a context menu needed to be created. Even if
we completely optimize away all the commands in these functions, we will not be able
to run any faster than this. In this case, it takes 7.6 seconds to run all 100 iterations.
Next, we comment out bonobo_window_add_popup to see how much time it costs us to
actually call the function that ltrace says is taking the most amount of time. If we
comment out bonobo_window_add_popup, the 100 iterations take 21.9 seconds to
complete. This says that if we could optimize away bonobo_window_add_popup entirely, it would shave roughly 8.6 seconds off the 30.5-second run, an improvement of more than 25 percent.
As a possible solution, we add a primitive cache to the function that creates the pop-up menu, so that nautilus can reuse the menu it built last time; the modified function is shown in Listing 11.19.
Listing 11.19.
void
fm_directory_view_pop_up_background_context_menu (FMDirectoryView *view,
                                                  GdkEventButton *event)
{
        /* Primitive Cache */
        static FMDirectoryView *old_view = NULL;
        static GtkMenu *old_menu = NULL;

        /* Make the context menu items not flash as they update to proper disabled,
         * etc. states by forcing menus to update now.
         */
        if ((old_view != view) || view->details->menu_states_untrustworthy)
        {
                update_menus_if_pending (view);
                old_view = view;
                old_menu = create_popup_menu (view, FM_DIRECTORY_VIEW_POPUP_PATH_BACKGROUND);
        }

        eel_pop_up_context_menu (old_menu,
                                 EEL_DEFAULT_POPUP_MENU_DISPLACEMENT,
                                 EEL_DEFAULT_POPUP_MENU_DISPLACEMENT,
                                 event);
}
In this case, we remember the menu that was generated last time. If the pop-up is requested in the same view, and we do not believe that the menu for that view has changed, we simply reuse the menu from last time instead of creating a new one. This is not a sophisticated technique, and it breaks down if the user does not open a pop-up menu in the same directory repeatedly. For example, if the user opens a pop-up in directory 1, then one in directory 2, and then another in directory 1, nautilus will still create a new menu for that last request. It would be possible to create a simple cache that stores menus as they are created: when opening a menu, first check whether the view already has a menu in the cache; if it does, use the cached menu; otherwise, create a new one. Such a cache would be especially useful for some special directories, such as the desktop, computer, or home directory.
Note, however, that right now, this is just a test solution. It would have to be
presented to the nautilus developers to confirm that it did not break any functionality
and is suitable for inclusion. However, through the course of the hunt, we have
determined what functions are slow, tracked down where they are called, and created
a possible solution that objectively improves performance. It is also important to note
that the improvement was objective; that is, we have hard data to prove that the new
method is faster, rather than simply subjective (i.e., just saying that it feels snappier).
Most developers would love to have this kind of performance bug report.
• Track down which individual process is causing the system to slow down.
• Use strace to investigate the performance behavior of a process that is not
CPU-bound.
• Use strace to investigate how an application is interacting with the Linux
kernel.
• Submit bug reports that describe a performance problem so that an author or
maintainer has enough information to fix the problem.
This type of problem is different from the problems in the two previous chapters,
because we initially have absolutely no idea what part of the system is causing the
problem. When investigating the GIMP's and nautilus's performance, we knew which
application was responsible for the problem. In this case, we just have a misbehaving
system, and the performance problem could theoretically be in any part of the system.
This type of situation is common. When confronted with it, it is important to use the
performance tools to actually track down the cause of the problem rather than just
guess the cause and try a solution.
However, in this case, it is not so easy to do. We do not know when the problem will
begin or how long it will last, so we cannot really set a baseline without more
investigation. As far as a goal, ideally we would like the problem to disappear
completely, but the problem might be caused by essential OS functions, so eliminating
it entirely might not be possible.
First, we need to do a little more investigation into why this problem is happening to
figure out a reasonable baseline. The initial step is to run top as the slowdown is
happening. This gives us a list of processes that may be causing the problem, or it may
even point at the kernel itself.
In this case, as shown in Listing 12.1, we run top and ask it to show only nonidle
processes (by pressing <I> as top runs).
Listing 12.1.
The top output in Listing 12.1 has several interesting properties. First, no process is hogging the CPU; both nonidle tasks are using less than 2 percent of the total CPU time. Second, the system is spending 91 percent of its time waiting for I/O to happen. Third, the system is not using any swap space, so the grinding disk is not caused by swapping. Finally, an unknown process, prelink, is running when the problem happens. It is unclear what this prelink command is, so we will remember the application name and investigate it later.
Our next step is to run vmstat to see what the system is doing. Listing 12.2 shows the
result of vmstat and confirms what we saw with top. That is, ~90 percent of the time
the system is waiting for I/O. It also tells us that the disk subsystem is reading in about 1,000 blocks of data a second, which is a significant amount of disk I/O.
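A sketch of the vmstat run (any short sampling interval works; the interesting columns here are bi, blocks read in per second, and wa, the percentage of CPU time spent waiting for I/O):
vmstat 1 10    # ten one-second samples; watch the bi and wa columns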
Now that we know that the disk is being heavily used, the kernel is spending a
significant amount of time waiting for I/O, and an unknown application, prelink, is
running, we can begin to figure out exactly what the system is doing.
We do not know for certain that prelink is causing the problem, but we suspect that it
is. The easiest way to determine whether prelink is causing the disk I/O is to "kill" the
prelink process and see whether the disk usage goes away. (This might not be
possible on a production machine, but since we are working on a personal desktop we
can afford to be a little fast and loose.) Listing 12.3 shows the output of vmstat, where halfway
through this output, we killed the prelink process. As you can see, the blocks read in
drop to zero after prelink is killed.
Listing 12.3.
Because prelink looks like the guilty application, we can start investigating exactly
what it is and why it is run. In Listing 12.4, we ask rpm which package provides prelink and which files that package contains.
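A sketch of the rpm queries involved (the /usr/sbin path is the usual Fedora location and is an assumption here):
rpm -qf /usr/sbin/prelink    # which package owns the prelink binary
rpm -ql prelink              # list the package's files: the cron job, man page, and documentation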
First, we note that the prelink package has a cron job that runs daily. This explains
why the performance problem occurs periodically. Second, we note that prelink has a
man page and documentation that describe its function. The man page describes
prelink as an application that can prelink executables and libraries so that their
startup times decrease. (It is just a little ironic that an application that is meant to
boost performance is slowing down our system.) The prelink application can be run
in two modes. The first mode causes all of the specified executables and libraries to be
prelinked even if it has already been done. (This is specified by the --force or -f
option). The second mode is a quick mode, where prelink just checks the mtimes and
ctimes of the libraries and executables to see whether anything has changed since the
last prelinking. (This is specified by the --quick or -q option.) Normally, prelink
writes the mtimes and ctimes of all the prelinked executables to its own cache. It then
uses that information in quick mode to avoid prelinking those executables that have
already been linked.
Examining the cron entry from the prelink package shows that, by default, the
Fedora system uses prelink in both modes: it calls prelink in full mode every 14 days, and on each day in between, prelink runs in quick mode.
Timing prelink in both full and quick mode tells us how slow the worst case is (full
prelinking) and how much performance increases when using the quick mode. We
have to be careful when timing prelink, because different runs may yield radically
different times. When running an application that uses a significant amount of disk
I/O, it is necessary to run it several times to get an accurate indication of its baseline
performance. The first time a disk-intensive application is run, much of the data from
its I/O is loaded into the cache. The second time the application is run, performance is
much greater, because the data it is using is in the disk caches, and it does not need to
read from the disk. If you use the first run as the baseline, you can be misled into
believing that performance has increased after a performance tweak when the real
cause of the performance boost was the warm caches. By just running the application
several times, you can warm up the caches and get an accurate baseline. Listing 12.5
shows the results of prelink in both modes after it has been run several times.
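A sketch of the timing runs (the text names only the --force/-f and --quick/-q switches; the -a flag, which tells prelink to act on everything listed in /etc/prelink.conf, is an assumption based on how the packaged cron job drives it):
# Repeat each command a few times so the page cache is warm before trusting the numbers.
time prelink -a -f    # full prelink of all configured executables and libraries
time prelink -a -q    # quick mode: mtime/ctime checks against prelink's cache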
The first fact to note from Listing 12.5 is that the quick mode is not all that much quicker
than the full mode. This is suspicious and needs more investigation. The second fact
reinforces what top reported. prelink spends only a small amount of CPU time; the
rest is spent waiting for disk I/O.
Now we have to pick a reasonable goal. The PDF file that was installed in the prelink
package describes the process of prelinking. It also says that the full mode should take
several minutes, and the quick mode should take several seconds. As a goal, let's try
to reduce the quick mode's time to under a minute. Even if we could optimize the
quick mode, we would still have significant disk grinding every 14 days, but the daily
runs would be much more tolerable.
Listing 12.6.
After prelink is configured and compiled, we can use the binary we compiled to
investigate the performance problems.
First, we use strace to trace the slower full run of prelink. This is the run that creates the initial cache that is used when prelink later runs in quick mode. We ask strace for a summary of the system calls that prelink made and how long each took to complete. The command to do this is shown in Listing 12.7.
Listing 12.7.
Listing 12.7 also shows a sample of prelink's own output: prelink struggles when trying to prelink some of the system executables and libraries. This information becomes valuable later, so keep it in mind.
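A sketch of the strace invocation just described (the flags are standard strace usage; the output file name is illustrative):
strace -c -o prelink_full_summary prelink -a -f    # -c: per-system-call time summary, -o: write it to a file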
Listing 12.8 shows the summary output file that the strace command in Listing 12.7
generated.
Listing 12.8.
As you can see in Listing 12.8, a significant amount of time is spent in the read system
call. This is expected. prelink needs to figure out what shared libraries are linked into the
application, and this requires that part of the executable be read in to be analyzed. The
prelink documentation mentions that when generating the list of libraries that an
application requires, that application is actually started by the dynamic loader in a special
mode, and the information is then read back through a pipe. This is why pread also ranks high in the profile. In contrast, we would expect the quick version to have
very few of these calls.
To see how the profile of the quick version is different, we run the same strace command
on the quick version of prelink. We can do that with the strace command shown in
Listing 12.9.
Listing 12.9.
Listing 12.10 shows the strace profile of prelink running in quick mode.
Listing 12.10.
As expected, Listing 12.10 shows that the quick mode executes a significant number of
lstat64 system calls. These are the system calls that return the mtime and ctime for each
executable. prelink looks in its cache and compares the saved mtime and ctime with the values that lstat64 returns; if they match, the file should not need any further work. To see exactly which files are still being read, we rerun strace without the summary option and save the full trace, using the command shown in Listing 12.11.
Listing 12.11.
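The text tells us this run produced a 14MB trace file named aq_run, so it was presumably a full, call-by-call trace of the quick run rather than a -c summary; a sketch:
strace -o aq_run prelink -a -q    # record every system call of the quick run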
The output of strace is a 14MB text file, aq_run. Browsing through it shows that prelink uses lstat64 to check many of the libraries and executables. However, it also reveals a few different types of cases where read() is still used. The first, shown in Listing 12.12, is where prelink reads a file that is a shell script. Because a shell script is not a binary ELF file, it cannot be prelinked.
These shell scripts have not changed since the original full-system prelink was run, so it would be nice if prelink's cache recorded the fact that these files cannot be prelinked. If their ctime and mtime have not changed, prelink should not even try to read them. (If a file was a shell script during the last full prelink and we have not touched it since, it still cannot be prelinked.)
Listing 12.12.
...
open("/bin/unicode_stop", O_RDONLY|O_LARGEFILE) = 5
....
open("/bin/unicode_start", O_RDONLY|O_LARGEFILE) = 5
close(5) = 0
....
The second case, shown in Listing 12.13, is a statically linked binary. A static binary cannot be prelinked either, yet prelink still reads its ELF headers during the quick run.
Listing 12.13.
...
pread(5, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0\0\201\4"...,
52, 0) = 52
pread(5, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0\0\201\4"...,
52, 0) = 52
pread(5, "\1\0\0\0\0\0\0\0\0\200\4\10\0\200\4\10\320d\7\0\320d\7"...,
128, 52) = 128
pread(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
920, 488632) = 920
close(5)
...
Finally, in Listing 12.14, we see prelink reading a binary that it had trouble prelinking in
the original full system run. We saw an error regarding this binary in the original prelink
output. When it starts to read this file, it pulls in other libraries and begins to operate on
each of those and their dependencies. This triggers an enormous amount of reading.
Listing 12.14.
...
open("/usr/lib/mozilla-1.6/regchrome", O_RDONLY|O_LARGEFILE) = 6
...
open("/usr/lib/mozilla-1.6/libldap50.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
open("/usr/lib/mozilla-1.6/libgtkxtbin.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
...}) = 0
open("/usr/lib/mozilla-1.6/libjsj.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
lstat64("/usr/lib/mozilla-1.6/mozilla-xremote-client", {st_mode=S_IFREG|0755,
st_size=12896, ...}) = 0
open("/usr/lib/mozilla-1.6/libgkgfx.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
...
Optimizing this case is trickier. Because the binary itself was not actually the problem (the library it links against, libxpcom.so, was), we cannot simply mark the executable as bad in the cache. However, if we store the name of the errant library, libxpcom.so, alongside the failing executable, it may be possible to check the times of both the binary and the library and only try to prelink again if one of them has changed.
To start the experiment, we copy all the files in /usr/bin/ to a sandbox directory. This directory includes normal binaries, shell scripts, and other files that cannot be prelinked. We then run prelink on the sandbox directory and tell it to create a new cache rather than rely on the system cache. This is shown in Listing 12.15.
Listing 12.15.
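A sketch of the sandbox setup (prelink's --cache-file option is used here to keep the experiment away from the system cache; treat the exact flags as assumptions):
mkdir sandbox
cp /usr/bin/* sandbox/
prelink -f --cache-file=./sandbox.cache ./sandbox/*    # full prelink of the copies, using a private cache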
Next, in Listing 12.16, we time how long it takes the quick mode of prelink to run.
We had to run this multiple times until it gave a consistent result. (The first run
warmed the cache for each of the succeeding runs.) The baseline time in Listing 12.16
is .983 seconds. We have to beat this time for our optimization (improving the cache)
to be worth investigating.
Listing 12.16.
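And the timed quick run against the same private cache (a sketch; repeat it until the times stabilize, because the first run warms the page cache):
time prelink -q --cache-file=./sandbox.cache ./sandbox/*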
Next, in Listing 12.17, we run strace on this prelink command. This is to record
which files prelink opens in the sandbox directory.
Next, we create a new directory, sandbox2, into which we once again copy all the binaries in the /usr/bin directory. However, we overwrite every file that prelink opened in the preceding strace output with a known good binary, less, which can be prelinked. We copy less over the problem binaries rather than just deleting them, so that both sandboxes contain the same number of files. After we set up the second sandbox, we run the full version of prelink on this new directory using the command in Listing 12.18.
Listing 12.18.
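A sketch of the second sandbox setup; the list of problem files is pulled out of the earlier strace output by extracting the open() paths (the trace file name and the exact text munging are illustrative):
mkdir sandbox2
cp /usr/bin/* sandbox2/
# Overwrite every file the quick run had to open() with a known prelinkable binary.
grep '^open("' sandbox_quick_strace | sed 's/^open("\([^"]*\)".*/\1/' | grep sandbox | sort -u |
while read f; do
    cp /usr/bin/less sandbox2/"$(basename "$f")"
done
prelink -f --cache-file=./sandbox2.cache ./sandbox2/*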
Finally, we time the quick-mode run and compare it to our baseline. Again, we had to run it several times, because the first run warmed the cache. Listing 12.19 shows that we did indeed get a performance increase: the time to execute the prelink dropped from ~.98 seconds to ~.29 seconds.
Listing 12.19.
Next, we compare the strace output of the two different runs to verify that the
number of reads did, in fact, decrease. Listing 12.20 shows the strace summary
information from sandbox, which contained binaries that prelink could not link.
Listing 12.20.
Listing 12.21 shows the strace summary from sandbox2, where prelink could prelink all the binaries.
Listing 12.21.
As you can see from the differences in Listing 12.20 and Listing 12.21, we have
dramatically reduced the number of reads done in the directory. In addition, we have
significantly reduced the amount of time required to prelink the directory. Caching
and avoiding unprelinkable executables looks like a promising optimization.
When we arrive at Bugzilla, we first search for existing bug reports against prelink to see whether anyone else has reported this problem. In this case, no one has, so we enter the bug
report in Listing 12.22 and wait for the author or maintainer to respond and possibly
fix the bug.
Listing 12.22.
Description of problem:
When running in quick mode, prelink does not cache the fact that some
binaries cannot be prelinked. As a result, it rescans them every time,
even when running in quick mode. This causes the disk to grind
and dramatically slows down the whole system.
1) Static Binaries
2) Shell Scripts
For 1 & 2, it would be nice if prelink cached the fact that these
executables cannot be prelinked, and then in quick mode checked their
ctime & mtime and did not try to read them at all if it already knows that
they cannot be prelinked.
How reproducible:
Always
Steps to Reproduce:
3. Examine strace's output, and you'll see all of the reads that take
place.
open("/bin/unicode_stop", O_RDONLY|O_LARGEFILE) = 5
read(5, "#!/bin/sh\n# stop u", 18) = 18
close(5) = 0
....
Static Binary:
open("/bin/ash.static", O_RDONLY|O_LARGEFILE) = 5
pread(5, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0\0\201\4"...,
52, 0) = 52
pread(5, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0\0\201\4"...,
52, 0) = 52
pread(5, "\1\0\0\0\0\0\0\0\0\200\4\10\0\200\4\10\320d\7\0\320d\7"...,
128, 52) = 128
pread(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
920, 488632) = 920
close(5)
Un-prelinkable executable:
lstat64("/usr/lib/mozilla-1.6/regchrome", {st_mode=S_IFREG|0755,
...}) = 0
open("/usr/lib/mozilla-1.6/regchrome", O_RDONLY|O_LARGEFILE) = 6
...
open("/usr/lib/mozilla-1.6/libldap50.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
lstat64("/usr/lib/mozilla-1.6/libgtkxtbin.so", {st_mode=S_IFREG|0755,
st_size=14268, ...}) = 0
open("/usr/lib/mozilla-1.6/libgtkxtbin.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
lstat64("/usr/lib/mozilla-1.6/libjsj.so", {st_mode=S_IFREG|0755,
st_size=96752,
...}) = 0
open("/usr/lib/mozilla-1.6/libjsj.so", O_RDONLY|O_LARGEFILE) = 6
close(6) = 0
lstat64("/usr/lib/mozilla-1.6/mozilla-xremote-client",
{st_mode=S_IFREG|0755, st_size=12896, ...}) = 0
lstat64("/usr/lib/mozilla-1.6/regxpcom", {st_mode=S_IFREG|0755,
st_size=55144, ...}) = 0
lstat64("/usr/lib/mozilla-1.6/libgkgfx.so", {st_mode=S_IFREG|0755,
st_size=143012, ...}) = 0
open("/usr/lib/mozilla-1.6/libgkgfx.so", O_RDONLY|O_LARGEFILE) = 6
read(6, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0", 18) = 18
close(6) = 0
...
Expected Results: All of these should have been simple lstat checks
rather than actual reads of the executables.
Additional info:
Even if the author or maintainer never replies, it is still a good idea to enter the
problem in the bug-tracking database. The problem and possible solution will be
recorded, and some enthusiastic programmer may come along and fix the problem.
In the next chapter, the final chapter, we look at the higher-level picture of Linux
performance and performance tools. We review methodologies and tools covered in
this book and look at some of the areas of Linux performance tools that are ripe for
improvement.
• Understand the holes in the Linux performance toolbox, and understand some
of the ideal solutions
• Understand the benefits of Linux as a platform for performance investigation
One glaring hole is that Linux has no single tool that provides all relevant performance statistics for a particular process. ps was meant to fill this role in the original UNIX, and the Linux version is pretty good, but it does not cover all the statistics that other commercial UNIX implementations provide. Some of the missing statistics are invaluable when tracking down performance problems: for example, inblk (I/O blocks read in) and oublk (I/O blocks written out), which indicate the amount of disk I/O a process is using; vcsw (voluntary context switches) and invcsw (involuntary context switches), which indicate how often (and why) a process was context-switched off the CPU; and msgrcv (messages received on pipes and sockets) and msgsnd (messages sent on pipes and sockets), which show the amount of network and pipe I/O an application is using. An ideal tool would add all these statistics and combine the functionality of many of the performance tools presented so far (including oprofile, top, ps, strace, ltrace, and the /proc file system) into a single application. A user should be able to point this single application at a process and extract all the important performance statistics. Each statistic would be updated in real time, enabling a user to debug an application as it runs, and statistics for a single area of investigation would be grouped in the same location.
For example, if I were investigating memory usage, it would show exactly how
memory was being used in the heap, in the stack, by libraries, shared memory, and in
mmap. If a particular memory area was much higher than I expected, I could drill down,
and this performance tool would show me exactly which functions allocated the
memory. If I were investigating CPU usage, I would start with overall statistics, such
as how much time is spent in system time versus user time, and how many system
calls a particular process is making, but then I would be able to drill down into either
the system or user time and see exactly which functions are spending all the time and
how often they are being called. A smart shell script that used the appropriate
preexisting tools to gather and combine this information would go a long way to
achieving some of this functionality, but fundamental changes in the behavior of some
of the tools would be necessary to completely realize this vision.
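As a very rough sketch of what such a shell-script wrapper could look like using only statistics that existing tools already expose (the script name is hypothetical; every field shown is standard ps or /proc output):
#!/bin/bash
# procview.sh <pid>: one combined snapshot of a process
pid=$1
ps -o pid,comm,stat,pcpu,pmem,vsz,rss,minflt,majflt -p "$pid"        # CPU and memory summary
grep -E 'VmSize|VmRSS|VmData|VmStk|VmExe|VmLib' /proc/"$pid"/status  # per-region memory breakdown
cat /proc/"$pid"/wchan; echo                                         # kernel function it is sleeping in, if any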
The next performance tool hole is that there is currently no way to get a complete call tree of a program's execution. Linux has several incomplete implementations; oprofile, for example, provides call-tree generation, but it is based on sampling, so it cannot account for every call that is made. What is missing is a tool that records every call exactly.
This call-tree tool would be useful even if it dramatically slowed down application
performance as it runs. A common way of using this would be to run oprofile to
figure out which functions in an application are "hot," and then run the call-tree
program to figure out why the application called them. The oprofile step would
provide an accurate view of the application's bottlenecks when it runs at full speed,
and the call tree, even if it runs slowly, would show how and why the application
called those functions. The only problem would be a program whose behavior is timing sensitive and changes when it is run slowly (for example, something that relies on network or disk I/O). However, many problems exist that are not timing
sensitive, and an accurate call-tree mechanism would go a long way to fixing these.
The final and biggest hole in Linux is I/O attribution. Right now, Linux does not provide a good way to track down which applications are generating the most disk or network I/O.
An ideal tool would show, in real time, the amount of input and output bytes of disk
and network I/O that a particular process is using. The tool would show the statistics
as raw bandwidth, as well as a percentage of the raw I/O that the subsystem is
capable of. In addition, users would also be able to split up the statistics, so that they
could see the same statistics for each individual network and disk device.
First, a developer has access to most (if not all) source code for the entire system. This
is invaluable when tracking down a problem that appears to exist outside of your code.
On a commercial UNIX or another operating system where source is not available, you might have to wait for the vendor to investigate the problem, and you have no guarantee that it will be fixed even if the problem is theirs. However, on Linux, you can investigate the problem yourself and figure out exactly why the performance problem is happening. If
the problem is outside your application, you can fix it and submit a patch, or just run
with a fixed version. If, by reading the source of the Linux code, you realize that the
problem is in your code, you can then fix the problem. In either case, you can fix it
immediately and are not gated by waiting for someone else.
The second advantage of Linux is that it is relatively easy to find and contact the
developers of a particular application or library. In contrast to most other proprietary
operating systems, where it is difficult to figure out which engineer is responsible for
a given piece of code, Linux is much more open. Usually, the names or contact
information of the developers for a particular piece of software are with the software
package. Access to the developers allows you to ask questions about how a particular
piece of code behaves, what slow-running code intends to do, and whether a given
optimization is safe to perform. The developers are usually more than happy to help
with this.
The final reason that Linux is a great platform on which to optimize performance is
that it is still young. Features are still being developed, and Linux has many
opportunities to find and fix straightforward performance bugs. Because most
developers focus on adding functionality, performance issues can be left unresolved.
An ambitious performance investigator can find and fix many of the small performance
problems in the ever-developing Linux. These small fixes go beyond a single individual
and benefit the entire Linux community.
It is up to you, the reader, to change Linux performance for the better. The
opportunities for improvement of Linux performance and Linux performance tools
abound. If you find a performance problem that annoys you, fix it or report it to the
developers and work with them to fix it. Either way, once the problem is fixed, no one else will be hit by it, and the entire Linux community benefits.
Tool                   Distro      Source Location
bash                               http://cnswww.cns.cwru.edu/~chet/bash/bashtop.html
etherape               None        http://etherape.sourceforge.net/
ethtool                            http://sourceforge.net/projects/gkernel/
free
gcc                                http://gcc.gnu.org/
gdb                                http://sources.redhat.com/gdb/
gkrellm                FC2, S9.1   http://web.wt.net/~billw/gkrellm/gkrellm.html
gnome-system-monitor               ftp://ftp.gnome.org/pub/gnome/sources/gnome-system-monitor/
gnumeric                           http://www.gnome.org/projects/gnumeric/
gprof                              http://sources.redhat.com/binutils
ifconfig                           http://www.tazenda.demon.co.uk/phil/net-tools/
iostat                 FC2, S9.1   http://perso.wanadoo.fr/sebastien.godard/
ip
ipcs
iptraf                 FC2, S9.1   http://cebu.mozcom.com/riker/iptraf
kcachegrind            FC2, S9.1   http://kcachegrind.sourceforge.net/cgi-bin/show.cgi
ldd                                http://www.gnu.org/software/libc/libc.html
ld                                 Part of binutils: http://sources.redhat.com/binutils
lsof                               ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof
ltrace                             http://packages.debian.org/unstable/utils/ltrace.html
memprof                            http://www.gnome.org/projects/memprof
mii-tool                           http://www.tazenda.demon.co.uk/phil/net-tools/
mpstat                 FC2, S9.1   http://perso.wanadoo.fr/sebastien.godard/
netstat                            http://www.tazenda.demon.co.uk/phil/net-tools/
objdump                            Part of binutils: http://sources.redhat.com/binutils
oprofile                           http://oprofile.sourceforge.net/
proc filesystem                    The proc file system is part of the Linux kernel and is enabled in almost every distribution.
procinfo               FC2, S9.1   ftp://ftp.cistron.nl/pub/people/svm
ps                                 http://procps.sourceforge.net/
sar                                http://perso.wanadoo.fr/sebastien.godard/
script                             http://www.kernel.org/pub/linux/utils/util-linux/
slabtop                            http://procps.sourceforge.net/
strace                             http://sourceforge.net/projects/strace/
tee                                ftp://alpha.gnu.org/gnu/coreutils/
time                   FC2, EL3    http://www.gnu.org/directory/GNU/time.html
top                                http://procps.sourceforge.net/
valgrind               S9.1        http://valgrind.kde.org/
vmstat                             http://procps.sourceforge.net/
Although not denoted in the table, Debian (testing) contains all the tools listed except
procinfo.
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
% time option
ltrace tool
strace tool
%MEM option, top (v. 2.x and 3.x) tool
%memused option
sar (II) tool
%swpused option
sar (II) tool
+D directory option
lsof (List Open Files) tool
disk I/O subsystem usage
+d directory option
lsof (List Open Files) tool
disk I/O subsystem usage
--annotated-source option
gprof command
--assembly option
opreport tool
--brief option
gprof command
--delay option
slabtop tool
--details option
opreport tool
--flat-profile option
gprof command
--follow-exec option
memprof tool
application use of memory
--follow-fork option
memprof tool
application use of memory
--graph option
gprof command
--help option
kcachegrind tool
application use of memory
ltrace tool
strace tool
--interfaces=name option
netstat tool
network I/O
--long-filenames option
opreport tool
--raw|-w option
netstat tool
network I/O
--sort option
slabtop tool
--source -- option
opreport tool
--statistics|-s option
netstat tool
network I/O
--symbols option
opreport tool
--tcp|-t option
netstat tool
network I/O
--trace-jump=yes|no option
kcachegrind tool
application use of memory
--udp|-u option
netstat tool
network I/O
-/+ buffers/cache option
free tool
-A option
gprof command
-a option
opreport tool 2nd
-A option
ps command
-a option
script command
vmstat II tool
-B option
sar (II) tool
-c
-c delay option
free tool
-c option
ltrace tool
strace tool
-d interface option
iptraf tool
network I/O
-d option
iostat tool
disk I/O subsystem usage
opreport tool
-D option
vmstat tool
disk I/O subsystem usage
-d option
vmstat tool
disk I/O subsystem usage
-d[=cumulative] option
watch command
-f option
opreport tool
-g[1 | 2 | 3] option
GNU compiler collection 2nd
-i, --interface=interface name option
etherape tool
network I/O
-k option
iostat tool
disk I/O subsystem usage
-l delay option
free tool
-l option
ipcs tool
application use of shared memory
opreport tool
-m option
vmstat II tool
-n DEV option
sar tool
network I/O
-n EDEV option
sar tool
network I/O
-n FULL option
sar tool
network I/O
-n sec option
watch command
-n SOCK option
sar tool
network I/O
-n, --numeric option
etherape tool
network I/O
-o statistic
-o statistic option
ps command
-o file option
ltrace tool
strace tool
-p option
gprof command
ipcs tool
application use of shared memory
netstat tool
network I/O
-p partition option
vmstat tool
disk I/O subsystem usage
-p pid option
ltrace tool
strace tool
-pg option
GNU compiler collection
-q option
gprof command
-r delay option
lsof (List Open Files) tool
disk I/O subsystem usage
-r option
sar (II) tool
-s delay option
free tool
-s interface option
iptraf tool
network I/O
-s option
opreport tool 2nd
vmstat II tool
-t minutes option
iptraf tool
network I/O
-t option
ipcs tool
application use of shared memory
script command
-u option
ipcs tool
application use of shared memory
-u user option
ps command
-U user option
ps command
-v option
time command
-W option
sar (II) tool
-x option
iostat tool
disk I/O subsystem usage
-Xrunhprof command-line option, java command
/proc file system display info. [See procinfo tool]
/proc tool
application use of memory
supported for Java, Mono, Python, and Perl 2nd
/proc/<PID> tool
processes
maps file 2nd 3rd 4th
status file 2nd 3rd 4th 5th
/proc/<PID>/ tool
processes
status file
/proc/interrupt file
/proc/meminfo file
memory performance
example 2nd
options 2nd
statistics 2nd 3rd
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
a option
slabtop tool
active
vs. inactive memory 2nd 3rd
Active option
/proc/meminfo file
active option
vmstat II tool
active option, top (v. 2.x and 3.x) tool
application optimization
CPU usage 2nd
disk I/O usage
loaders
memory
network I/O usage
startup time
application performance investigation
analyzing tool results 2nd 3rd 4th 5th 6th 7th
configuring applications 2nd
identifying problems 2nd
installing/configuring performance tools
latency problems 2nd
analyzing time use 2nd
analyzing tool results 2nd 3rd 4th
configuring applications 2nd
identifying problems 2nd
installing/configuring tools
running applications and tools 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
setting baseline/goals 2nd 3rd 4th
solutions 2nd 3rd
tracing function calls 2nd 3rd 4th 5th 6th
running applications and performance tools 2nd 3rd
setting baseline/goals 2nd 3rd
solutions
accessing image tiles 2nd
accessing image tiles, with local arrays 2nd 3rd 4th
increasing image cache 2nd
searching Web for functions 2nd
verifying
system-wide problems 2nd 3rd 4th 5th 6th
configuring application 2nd
configuring/installing performance tools
running applications/tools 2nd 3rd 4th 5th 6th 7th 8th 9th
simulating solution 2nd 3rd 4th 5th 6th
submitting bug report 2nd 3rd
testing solution
application tests
automating 2nd
applications
CPU cache
kernel mode
subdividing time use
time use
gprof command 2nd 3rd 4th 5th 6th 7th 8th 9th
oprofile (II) tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th
time use versus library time use 2nd
ltrace tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
use of CPU cache
cachegrind tool
oprofile tool
use of memory
/proc/<PID> tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
kcachegrind tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
memprof tool 2nd 3rd 4th 5th 6th 7th
oprofile (III) tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th
ps tool 2nd 3rd 4th 5th 6th
tools supported for Java, Mono, Python, and Perl 2nd
valgrind tool 2nd 3rd 4th 5th 6th 7th 8th
use of shared memory
ipcs tool 2nd 3rd 4th 5th 6th 7th
user mode
automation of tasks
application tests 2nd
performance tool invocations 2nd
average mode, vmstat II tool
memory performance
average mode, vmstat tool 2nd 3rd
avgqu-sz statistic
iostat tool
disk I/O subsystem usage
avgrq-sz statistic
iostat tool
disk I/O subsystem usage
await statistic
iostat tool
disk I/O subsystem usage
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
b option
slabtop tool
vmstat tool
bar() function
memprof tool
application use of memory 2nd 3rd 4th 5th 6th
baseline of system performance
bash shell
automating/executing long commands
example 2nd 3rd
options 2nd 3rd
time command 2nd
bash tool
source location
bi statistic
vmstat tool
disk I/O subsystem usage
Blk_read statistic
iostat tool
disk I/O subsystem usage
Blk_read/s statistic
iostat tool
disk I/O subsystem usage
Blk_wrtn statistic
iostat tool
disk I/O subsystem usage
Blk_wrtn/s statistic
iostat tool
disk I/O subsystem usage
blocked processes
queue statistics 2nd
bo statistic
vmstat tool
disk I/O subsystem usage
bt option
GNU debugger
buff option
vmstat II tool
buffers
memory 2nd 3rd 4th
Buffers option
/proc/meminfo file
free tool
procinfo II tool
buffers option, top (v. 2.x and 3.x) tool
bufpg/s option
sar (II) tool
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
c option
slabtop tool
C/C++
static versus dynamic languages 2nd
cache option
vmstat II tool
cache subsystem, CPUs
application use of memory use
oprofile
applications use
cachegrind
Levels 1 and 2 caches
Cached option
/proc/meminfo file
free tool
cachegrind tool
application use of CPU cache
oprofile
caches
hot functions
caches, memory 2nd 3rd 4th
call trees
process time use
unreliable/incomplete 2nd
calls option
ltrace tool
strace tool
carrier statistic
ip tool
network I/O
ipconfig tool
network I/O
CODE option, top (v. 2.x and 3.x) tool
coll/s statistic
sar tool
network I/O
collsns statistic
ip tool
network I/O
command option
ps command
ps tool
application use of memory
COMMAND statistic
lsof (List Open Files) tool
disk I/O subsystem usage
command-line mode, top (v. 2.0.x) tool
command-line mode, top (v. 3.x.x) tool 2nd
command-line options
memory performance
free tool 2nd
sar (II) tool
slabtop tool 2nd
mpstat
procinfo tool 2nd
sar tool 2nd
top (v. 2.0.x) tool
vmstat II tool
memory performance
vmstat tool 2nd
Committed_AS option
/proc/meminfo file
context switches 2nd 3rd 4th
count option
iostat tool
disk I/O subsystem usage
sar tool
network I/O
vmstat tool
disk I/O subsystem usage
CPU cache
application use of memory
applications use
cachegrind
oprofile
CPU performance investigation
analyzing tool results 2nd 3rd 4th 5th 6th 7th
configuring applications 2nd
identifying problems 2nd
installing/configuring performance tools
running applications and performance tools 2nd 3rd
setting baseline/goals 2nd 3rd
solutions
accessing image tiles 2nd
accessing image tiles, with local arrays 2nd 3rd 4th
increasing image cache 2nd
searching Web for functions 2nd
verifying
CPU performance tools
gnome-system-monitor
example 2nd 3rd
options 2nd
mpstat
example 2nd 3rd
options 2nd
statistics 2nd
oprofile 2nd
example 2nd 3rd 4th 5th 6th
opcontrol program, event handling 2nd
opcontrol program, options 2nd
opreport program 2nd 3rd
options 2nd 3rd
procinfo
command-line options 2nd
example 2nd 3rd
statistics 2nd
sar
example 2nd 3rd 4th 5th
options 2nd 3rd 4th
statistics 2nd
top
system-wide slowdown 2nd
top (v. 2.0.x) 2nd 3rd 4th
command-line mode
command-line options
example 2nd 3rd 4th
runtime mode 2nd 3rd
sorting/display options
system-wide statistics 2nd
top (v. 3.x.x) 2nd
command-line mode
command-line options
example 2nd 3rd
runtime mode
runtime options 2nd 3rd
system-wide options
vmstat 2nd
average mode 2nd 3rd
command-line options 2nd
CPU-specific statistics 2nd 3rd 4th 5th 6th 7th
sample mode 2nd 3rd
system-wide slowdown 2nd 3rd
vmstat II
sample mode
CPU usage
application problems 2nd
kernel space
process function time use 2nd
call trees
hot functions 2nd
process system calls
process time use
system-wide performance
to multiple processors 2nd
to single processors
uneven usage by processes 2nd
user space
CPU utilization 2nd
cs option
vmstat tool
cyclic redundancy check (CRC) errors 2nd
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
disk I/O subsystem performance tools
vmstat (ii)
options
statistics 2nd 3rd 4th 5th 6th 7th
disk I/O subsystem usage
inadequate performance tools 2nd
system-wide performance 2nd
disk I/O usage
application problems
disks statistic
vmstat tool
disk I/O subsystem usage
do option
bash shell
documentation
performance investigation 2nd 3rd 4th 5th 6th 7th 8th
done option
bash shell
dropped statistic
ip tool
network I/O
ipconfig tool
network I/O
dsiz option
ps tool
application use of memory
dynamic languages
versus static languages 2nd
dynamic loader
ld.so tool 2nd
environmental variables 2nd
example
options
statistics 2nd
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
Elapsed time option
time command
environmental variables
ld.so tool 2nd
errors option
strace tool
errors statistic
ip tool
network I/O
ipconfig tool
network I/O
etherape tool
network I/O
example 2nd
options 2nd 3rd
source location
Ethernet network I/O
ethtool performance tool
options
ip performance tool
example 2nd 3rd 4th
options 2nd 3rd
statistics 2nd
ipconfig performance tool
example 2nd
layers
link layer 2nd
physical layer 2nd
mii-tool performance tool
example 2nd
options 2nd
netstat performance tool
example 2nd 3rd 4th 5th
sar performance tool
example 2nd 3rd
ethtool tool
source location
ethtool tool
network I/O
options
etime option
ps command
Exit status option
time command
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
fault/s option
sar (II) tool
FD statistic
lsof (List Open Files) tool
disk I/O subsystem usage
Fedora Core 2 (FC2) distribution
installing
oprofile tool
performance tools included 2nd
file option
script command
File Transport Protocol (FTP)
foo() function
memprof tool
application use of memory 2nd 3rd 4th 5th 6th
forks option
vmstat tool
frame statistic
ipconfig tool
network I/O
frames
network statistics 2nd
Free option
free tool
procinfo II tool
free option
vmstat II tool
free swap option
vmstat II tool
free tool
memory performance
example 2nd 3rd 4th
options 2nd 3rd 4th
statistics 2nd
source location
frmpg/s option
sar (II) tool
FTP (File Transport Protocol)
function option
ltrace tool
functions
memory subsystem use
function library size 2nd
function text size 2nd
heap sizes 2nd
process time usage
call trees
hot functions 2nd
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
statistics 2nd
source location
gnome-system-monitor
CPU-related options 2nd
example 2nd 3rd
gnome-system-monitor (II) tool
memory performance
example 2nd
options 2nd
GNU compiler collection (gcc)
example 2nd 3rd 4th
options 2nd 3rd
GNU compiler collection. [See gcc tool]
GNU debugger (gdb)
example 2nd 3rd 4th
options 2nd 3rd
GNU debugger. [See gdb tool]
gnumeric spreadsheet
example 2nd 3rd 4th
options 2nd 3rd
gnumeric tool
source location
gprof command
example 2nd 3rd 4th 5th
options 2nd 3rd
gprof tool 2nd
source location
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
hardware
interrupts 2nd
performance investigation
hardware and software layers 2nd
link layer
network I/O
physical layer
network I/O 2nd
heap memory subsystem use 2nd
High option
free tool
HighFree option
/proc/meminfo file
HighTotal option
/proc/meminfo file
hot functions
process time use
cache misses
HTTP (Hypertext Transfer Protocol)
HugePages 2nd
Hypertext Transfer Protocol (HTTP)
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
vmstat tool
disk I/O subsystem usage
iostat tool
disk I/O subsystem usage
example 2nd 3rd 4th
options 2nd 3rd
statistics 2nd 3rd
IP (Internet Protocol)
ip tool
network I/O
example 2nd 3rd 4th
options 2nd 3rd
statistics 2nd
source location
ip-frag statistic
sar tool
network I/O
ipconfig tool
network I/O
example 2nd
options
statistics 2nd
ipcs tool
application use of memory
supported for Java, Mono, Python, and Perl 2nd
application use of shared memory
example 2nd 3rd 4th
options 2nd 3rd
iptraf tool
network I/O
example 2nd 3rd
options 2nd 3rd
source location
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
Java
memory performance tools
application use 2nd
static versus dynamic languages 2nd
java command
-Xrunhprof command-line option
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
kbbuffers option
sar (II) tool
kbcached option
sar (II) tool
kbmemfree option
sar (II) tool
kbmemused option
sar (II) tool
kbswpcad option
sar (II) tool
kbswpfree option
sar (II) tool
kbswpused option
sar (II) tool
kcachegrind tool
application use of memory
example 2nd 3rd 4th 5th 6th 7th
options 2nd 3rd
source location
kernel mode
applications
kernel scheduling
context switches
kernel space
CPU usage
kernel usage
system-wide performance
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
l option
slabtop tool
latency performance investigation
analyzing time use 2nd
analyzing tool results 2nd 3rd 4th
configuring applications 2nd
identifying problems 2nd
installing/configuring tools
running applications and tools 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
setting baseline/goals 2nd 3rd 4th
solutions 2nd 3rd
tracing function calls 2nd 3rd 4th 5th 6th
latency performance problems
investigating 2nd
ld (The linux loader) tool
source location
ld.so tool 2nd
environmental variables 2nd
example
options
statistics 2nd
ldd command
example 2nd
options 2nd
ldd tool
source location
Level 1 and 2 CPU caches
libraries
memory subsystem use
function library size 2nd
time use versus application time use 2nd
ltrace tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
utility performance helpers
example 2nd
options 2nd
link layer
link layer, network I/O
Linux kernel
memory usage (slabs) 2nd 3rd 4th 5th
time use versus user time use
load average
queue statistics 2nd
loaders
application problems
Low option
free tool
LowFree option
/proc/meminfo file
LowTotal option
/proc/meminfo file
lsof (List Open Files) tool
disk I/O subsystem usage
example 2nd
options 2nd 3rd
statistics 2nd
lsof tool
source location
ltrace command
ltrace tool 2nd
analyzing results 2nd
latency-sensitive applications 2nd 3rd 4th
example 2nd 3rd 4th 5th
installing/configuring 2nd
options 2nd
running 2nd
latency-sensitive applications 2nd 3rd 4th
source location
statistics 2nd
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
statistics 2nd
gnome-system-monitor (II)
example 2nd
options 2nd
processes, maps file
/proc/<PID> 2nd 3rd
processes, status
/proc/<PID>
processes, status file
/proc/<PID> 2nd 3rd 4th 5th
procinfo II
CPU statistics 2nd
example 2nd
options
sar (II)
example 2nd 3rd
options 2nd 3rd
statistics 2nd
slabtop
example 2nd
options 2nd 3rd
top (v. 2.x and 3.x)
example 2nd 3rd
runtime mode 2nd
statistics 2nd
vmstat II 2nd
average mode
command-line options
example 2nd 3rd 4th 5th
output statistics 2nd
memory subsystem
active vs. inactive memory 2nd 3rd
kernal usage (slabs)
type of memory used
kernel usage (slabs) 2nd 3rd 4th 5th
performance relationship
physical memory
buffers 2nd 3rd 4th
caches 2nd 3rd 4th
HugePages 2nd
swap space 2nd 3rd 4th 5th
processes:function library size 2nd
processes:function text size 2nd
processes:function use of heap memory 2nd
processes:functions used
processes:resident set size
processes:shared memory
processes:type of memory used 2nd
thrashing
virtual memory
memprof tool
application use of memory
example 2nd 3rd 4th 5th 6th
options 2nd 3rd
source location
MemTotal option
/proc/meminfo file 2nd 3rd 4th 5th 6th 7th
merged reads statistic
vmstat tool
disk I/O subsystem usage
merged writes statistic
vmstat tool
disk I/O subsystem usage
metric of system performance
mii-tool
network I/O
example 2nd
options 2nd
mii-tool tool
source location
milli reading statistic
vmstat tool
disk I/O subsystem usage
milli spent IO statistic
vmstat tool
disk I/O subsystem usage
milli writing statistic
vmstat tool
disk I/O subsystem usage
minflt option
ps tool
application use of memory
Minor page faults option
time command
Mono
memory performance tools
application use 2nd
static versus dynamic languages 2nd
mpstat tool
CPU-related options 2nd
CPU-related statistics 2nd
example 2nd 3rd
msgrcv (messages received on pipes and sockets) tool
msgsnd (messages sent on pipes and sockets) tool
MTUs (maximum transfer unit) 2nd
multiprocessor stat. [See mpstat tool]
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
n option
slabtop tool
nautilus file manager
latency performance investigation 2nd
analyzing time use 2nd
analyzing tool results 2nd 3rd 4th
configuring applications 2nd
identifying problems 2nd
installing/configuring tools
running applications and tools 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
setting baseline/goals 2nd 3rd 4th
solutions 2nd 3rd
tracing function calls 2nd 3rd 4th 5th 6th
versus GIMP application 2nd
nDRT option, top (v. 2.x and 3.x) tool
netstat tool
network I/O
example 2nd 3rd 4th 5th
options 2nd 3rd 4th
source location
network configuration tools
MTU settings 2nd
network I/O
layers 2nd
link layer
physical layer 2nd
protocol-level network I/O
network I/O performance
error-prone devices
limits
traffic
application sockets
process time use
remote processes 2nd
network I/O performance tools
etherape
example 2nd
options 2nd 3rd
ethtool
options
gkrellm
example 2nd
options 2nd
statistics 2nd
ip
example 2nd 3rd 4th
options 2nd 3rd
statistics 2nd
ipconfig
example 2nd
options
statistics 2nd
iptraf
example 2nd 3rd
options 2nd 3rd
mii-tool
example 2nd
options 2nd
netstat
example 2nd 3rd 4th 5th
options 2nd 3rd 4th
sar
example 2nd 3rd
options 2nd 3rd
statistics 2nd
network I/O usage
application problems
system-wide performance
network layer
network performance tools
inadequate tools 2nd
NODE statistic
lsof (List Open Files) tool
disk I/O subsystem usage
Index
[SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [R] [S] [T] [U] [V]
[W] [X]
o option
slabtop tool
objdump command
example 2nd
options 2nd
objdump tool
source location
Offset field, maps file
/proc/<PID> tool
processes, maps file
opannotate tool
analyzing results 2nd
example
options 2nd 3rd
opcontrol program 2nd
event handling 2nd
options 2nd
opreport program 2nd 3rd 4th 5th 6th 7th 8th
opreport tool
analyzing results
example 2nd
options 2nd 3rd 4th
running
latency-sensitive applications
oprofile (II) tool
example 2nd 3rd 4th 5th 6th
opannotate options 2nd 3rd
opreport options 2nd 3rd 4th
options
oprofile (III) tool
application use of memory
example 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
options 2nd
oprofile tool 2nd 3rd 4th
analyzing results 2nd 3rd 4th
latency-sensitive applications 2nd 3rd 4th
application use of CPU cache
oprofile
CPU-related options 2nd 3rd
example 2nd 3rd 4th 5th 6th
installing
on Fedora Core 2 (FC2)
on Red Hat Enterprise Linux (EL3) 2nd
on SUSE 9.1 (S9.1)
installing/configuring 2nd
opcontrol program
event handling 2nd
options 2nd
opreport program 2nd 3rd
running
latency-sensitive applications 2nd 3rd 4th 5th 6th
source location
option
ps tool
application use of memory
oublk (I/O blocks written out) tool
overruns statistic
ip tool
network I/O
ifconfig tool
network I/O
P
P option
slabtop tool
packets statistic
ip tool
network I/O
Page in option
procinfo II tool
Page out option
procinfo II tool
Page size option
time command
pages paged in option
vmstat II tool
pages paged out option
vmstat II tool
pages swapped in option
vmstat II tool
pages swapped in/out option
vmstat II tool
partitions statistic
vmstat tool
disk I/O subsystem usage
Pathname field, maps file
/proc/<PID> tool
processes, maps file
pcpu option
ps command
peek function
Percent of CPU this job got option
time command
performance investigation
applications
analyzing tool results 2nd 3rd 4th 5th 6th 7th
configuring applications 2nd
identifying problems 2nd
installing/configuring performance tools
running applications and performance tools 2nd 3rd
searching Web for functions 2nd
setting baseline/goals 2nd 3rd
solutions, accessing image tiles 2nd
solutions, accessing image tiles, with local arrays 2nd 3rd 4th
solutions, increasing image cache 2nd
solutions, verifying
automating tasks 2nd 3rd
documentation 2nd 3rd 4th
guidelines 2nd
hardware/software configuration
performance results
research information/URLs
establishing baseline
establishing metric
establishing target 2nd
general guidelines 2nd
initial use of performance tools
latency-sensitive applications 2nd
analyzing time use 2nd
analyzing tool results 2nd 3rd 4th
configuring applications 2nd
identifying problems 2nd
installing/configuring tools
running 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
setting baseline/goals 2nd 3rd 4th
solutions 2nd 3rd
tracing function calls 2nd 3rd 4th 5th 6th
low-overhead tools 2nd
multiple tool use 2nd
solutions earlier by others 2nd 3rd
system-wide slowdown
configuring application 2nd
configuring/installing performance tools
identifying problems 2nd
running applications/tools 2nd 3rd 4th 5th 6th 7th 8th 9th
setting baseline/goals 2nd 3rd 4th 5th 6th 7th
simulating solution 2nd 3rd 4th 5th 6th
submitting bug report 2nd 3rd
testing solution
trusting tools 2nd
using others' experience 2nd
performance tools
latency-sensitive applications, installing/configuring tools
analyzing results 2nd 3rd 4th 5th 6th 7th
application calls to libraries
ltrace 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
application use of CPU cache
cachegrind
oprofile
application use of memory
kcachegrind 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
memprof 2nd 3rd 4th 5th 6th 7th
oprofile (III) 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th
ps 2nd 3rd 4th 5th 6th
tools supported for Java, Mono, Python, and Perl 2nd
valgrind 2nd 3rd 4th 5th 6th 7th 8th
application use of shared memory
ipcs 2nd 3rd 4th 5th 6th 7th
applications
network I/O usage
CPU
gnome-system-monitor 2nd 3rd 4th 5th
mpstat 2nd 3rd 4th 5th 6th 7th 8th
oprofile 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th 13th 14th 15th 16th 17th 18th
procinfo 2nd 3rd 4th 5th 6th 7th 8th
sar 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
top 2nd
top (v. 2.0.x) 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th 13th 14th 15th
top (v. 3.x.x) 2nd
top (v. 3.x.x) 2nd 3rd 4th 5th 6th 7th 8th
vmstat 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
CPU usage
process function time use 2nd 3rd 4th 5th
process system calls
process time use
time
user or kernel space
disk I/O subsystem usage
inadequacies 2nd
iostat 2nd 3rd 4th 5th 6th 7th
iostat: statistics 2nd 3rd
lsof (List Open Files) 2nd 3rd 4th 5th 6th 7th
sar 2nd 3rd 4th 5th
sar: statistics 2nd
vmstat (ii) 2nd 3rd 4th 5th 6th
vmstat (ii): statistics 2nd 3rd 4th 5th 6th 7th
dynamic loader
ld.so tool 2nd 3rd 4th 5th 6th 7th 8th
inadequacies
disk and network I/O subsystem usage 2nd
scattered performance statistics 2nd
unreliable/incomplete call trees 2nd
installing/configuring
invocations
automating 2nd
latency-sensitive applications
analyzing time use 2nd
analyzing tool results 2nd 3rd 4th
running applications and tools 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
solutions 2nd 3rd
tracing function calls 2nd 3rd 4th 5th 6th
Linux platform advantages
accessibility of developers
available source code
newness
memory
/proc/meminfo file 2nd 3rd 4th 5th 6th 7th
free 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
gnome-system-monitor (II) 2nd 3rd 4th
procinfo II 2nd 3rd 4th 5th
sar (II) 2nd 3rd 4th 5th 6th 7th 8th
slabtop 2nd 3rd 4th 5th
top (v. 2.x and 3.x) 2nd 3rd 4th 5th 6th 7th 8th
vmstat II 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
network I/O
etherape 2nd 3rd 4th 5th
ethtool
gkrellm 2nd 3rd 4th 5th 6th
ip 2nd 3rd 4th 5th 6th 7th 8th 9th
ifconfig 2nd 3rd 4th 5th
iptraf 2nd 3rd 4th 5th 6th
mii-tool 2nd 3rd 4th
netstat 2nd 3rd 4th 5th 6th 7th 8th 9th
sar 2nd 3rd 4th 5th 6th 7th 8th
process status
ps command 2nd 3rd 4th 5th 6th
processes, maps file
/proc/<PID> 2nd 3rd 4th
processes, status file
/proc/<PID> 2nd 3rd 4th 5th 6th
reviewing release notes and documentation
running 2nd 3rd
system-wide performance
CPU usage by processes uneven 2nd
CPU-bound 2nd 3rd 4th
diagnosis flowchart
disk I/O subsystem usage 2nd
kernel function usage
network I/O usage
numerous interrupts
swap memory
time use
gprof 2nd 3rd 4th 5th 6th 7th 8th 9th
oprofile (II) 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th
time command 2nd 3rd 4th 5th 6th 7th 8th 9th
tracing system calls
strace 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th 13th 14th 15th
utility helper capabilities
automating/executing long commands 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
debugging applications 2nd 3rd 4th 5th 6th 7th 8th
graphing/analyzing statistics 2nd 3rd 4th 5th
inserting debugging information 2nd 3rd 4th 5th 6th 7th
listing all of a library's functions 2nd 3rd 4th
listing libraries used by applications 2nd 3rd 4th
recording displayed output and key presses 2nd 3rd 4th 5th 6th
saving output to files 2nd 3rd 4th
utility helper capabilities
graphing/analyzing statistics
utility helpers
bash shell 2nd 3rd 4th 5th 6th
GNU compiler collection 2nd 3rd 4th 5th 6th 7th
GNU debugger 2nd 3rd 4th 5th 6th 7th 8th
gnumeric spreadsheet 2nd 3rd 4th 5th 6th
ldd command 2nd 3rd 4th
ltrace command
objdump command 2nd 3rd 4th
script command 2nd 3rd 4th 5th 6th
tee command 2nd 3rd 4th
watch command 2nd 3rd 4th 5th
Perl
memory performance tools
application use 2nd
static versus dynamic languages
pgpgin/s option
sar (II) tool
pgpgout/s option
sar (II) tool
physical layer network I/O 2nd
physical memory
active vs. inactive memory 2nd 3rd
application use of memory
CPU cache
buffers 2nd 3rd 4th
caches 2nd 3rd 4th
HugePages 2nd
kernel usage (slabs) 2nd 3rd 4th 5th
swap space 2nd 3rd 4th 5th
PID statistic
lsof (List Open Files) tool
disk I/O subsystem usage
pmem option
ps tool
application use of memory
Point-to-Point Protocol (PPP)
PPP (Point-to-Point Protocol)
prelink application
performance investigation 2nd 3rd 4th 5th 6th
configuring application 2nd
configuring/installing performance tools
running applications/tools 2nd 3rd 4th 5th 6th 7th 8th 9th
simulating solution 2nd 3rd 4th 5th 6th
submitting bug report 2nd 3rd
testing solution
proc filesystem tool
source location
process status
ps command
example 2nd 3rd
options 2nd 3rd
process time usage
functions 2nd
call trees
hot functions 2nd
system calls
user or kernel space
procinfo II tool
memory performance
CPU statistics 2nd
example 2nd
options
procinfo tool
command-line options 2nd
CPU-related statistics 2nd
example 2nd 3rd 4th
source location
ps command
example 2nd 3rd
options 2nd 3rd
ps tool
application use of memory
example 2nd
options 2nd 3rd 4th 5th
supported for Java, Mono, Python, and Perl 2nd
disk I/O subsystem usage
source location
pswpin/s option
sar (II) tool
pswpout/s option
sar (II) tool
Python
memory performance tools
application use 2nd
static versus dynamic languages
R
r/s statistic
iostat tool
disk I/O subsystem usage
RAM
rawsck statistic
sar tool
network I/O
rd_sec/s statistic
sar tool
disk I/O subsystem usage
read sectors statistic
vmstat tool
disk I/O subsystem usage 2nd
reads statistic
vmstat tool
disk I/O subsystem usage
reads: merged statistic
vmstat tool
disk I/O subsystem usage
reads: ms statistic
vmstat tool
disk I/O subsystem usage
reads: sectors statistic
vmstat tool
disk I/O subsystem usage
reads: total statistic
vmstat tool
disk I/O subsystem usage
Red Hat Enterprise Linux (EL3)
installing
oprofile tool 2nd
performance tools included 2nd
requested writes statistic
vmstat tool
disk I/O subsystem usage
RES option, top (v. 2.x and 3.x) tool
rkB/s statistic
iostat tool
disk I/O subsystem usage
rrqm/s statistic
iostat tool
disk I/O subsystem usage
rsec/s statistic
iostat tool
disk I/O subsystem usage
rss option
ps tool
application use of memory
runnable processes
queue statistics 2nd
runtime mode, top (v. 2.0.x) tool 2nd 3rd
runtime mode, top (v. 2.x and 3.x) tool 2nd
runtime mode, top (v. 3.x.x) tool 2nd 3rd 4th
RX packets statistic
ifconfig tool
network I/O
rxbyt/s statistic
sar tool
network I/O
rxcmp/s statistic
sar tool
network I/O
rxdrop/s statistic
sar tool
network I/O
rxerr/s statistic
sar tool
network I/O
rxfifo/s statistic
sar tool
network I/O
rxfram/s statistic
sar tool
network I/O
rxmcst/s statistic
sar tool
network I/O
rxpck/s statistic
sar tool
network I/O
S
S option
ltrace tool
s option
slabtop tool
sample mode, vmstat II tool
memory performance
sample mode, vmstat tool 2nd 3rd
sar (II) tool
memory performance
example 2nd 3rd
options 2nd 3rd
statistics 2nd
sar tool
CPU-related options 2nd 3rd 4th
CPU-related statistics 2nd
disk I/O subsystem usage
example 2nd
options 2nd
statistics 2nd 3rd
example 2nd 3rd 4th 5th
network I/O
example 2nd 3rd
options 2nd 3rd
statistics 2nd
script command
example 2nd 3rd
options 2nd 3rd
script tool
source location
seconds option
ltrace tool
strace tool
Secure Shell (SSH) service
Serial Line Internet Protocol (SLIP)
Shared option
free tool
procinfo II tool
SHR option, top (v. 2.x and 3.x) tool
si option
vmstat II tool
SIZE statistic
lsof (List Open Files) tool
disk I/O subsystem usage
Slab option
/proc/meminfo file
slabs, memory 2nd 3rd 4th 5th
slabtop tool
memory performance
example 2nd
options 2nd 3rd
source location
SLIP (Serial Line Internet Protocol)
so option
vmstat II tool
software
performance investigation
SSH (Secure Shell) service
startup time of applications
static languages
versus dynamic languages 2nd
strace tool 2nd
disk I/O subsystem usage
example 2nd
options 2nd
source location
statistics 2nd
system-wide slowdown
configuring/installing tool
running applications/tools 2nd 3rd 4th 5th
simulating solution 2nd 3rd 4th 5th
SUSE 9.1 (S9.1) distribution
installing
oprofile tool
performance tools included 2nd
svctm statistic
iostat tool
disk I/O subsystem usage
swap memory
application use of memory
swap memory usage
system-wide performance
Swap in option
procinfo II tool
SWAP option, top (v. 2.x and 3.x) tool
Swap out option
procinfo II tool
swap space 2nd 3rd 4th 5th
swap, total, free option, top (v. 2.x and 3.x) tool
SwapCached option
/proc/meminfo file
SwapFree option
/proc/meminfo file
Swaps option
time command
SwapTotal option
/proc/meminfo file
swpd option
vmstat II tool
sy option
vmstat tool 2nd
system activity reporter. [See sar tool]
system calls
strace tool 2nd
example 2nd
options 2nd
statistics 2nd
System time option
time command
system-wide performance
CPU usage by processes uneven 2nd
CPU-bound
to multiple processors 2nd
to single processors
diagnosis flowchart
disk I/O subsystem usage 2nd
kernel function usage
network I/O
numerous interrupts
swap memory
system-wide slowdown investigation
configuring application 2nd
configuring/installing performance tools
identifying problems 2nd
running applications/tools 2nd 3rd 4th 5th 6th 7th 8th 9th
setting baseline/goals 2nd 3rd 4th 5th 6th 7th
simulating solution 2nd 3rd 4th 5th 6th
submitting bug report 2nd 3rd
testing solution
T
T option
top (v. 2.0.x) tool
target for system performance 2nd
TCP (Transmission Control Protocol)
network I/O
tcpsck statistic
sar tool
network I/O
tee command
example 2nd
options 2nd
tee tool
source location
thrashing
time command
example 2nd 3rd 4th
options 2nd 3rd
statistics 2nd
time option
ps command
time tool
application use of time
source location
time use
applications
gprof command 2nd 3rd 4th 5th 6th 7th 8th 9th
oprofile (II) tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th 12th
libraries versus applications 2nd
ltrace tool 2nd 3rd 4th 5th 6th 7th 8th 9th 10th 11th
Linux kernel versus users
subdividing application use
time use performance tools
time command
example 2nd 3rd 4th
options 2nd 3rd
statistics 2nd
top (v. 2.0.x) tool 2nd
command-line mode
command-line options
CPU-related options 2nd
example 2nd 3rd 4th
runtime mode 2nd 3rd
sorting/display options
system-wide statistics 2nd
top (v. 2.x and 3.x)
memory performance
top (v. 2.x and 3.x) tool
memory performance
example 2nd 3rd
statistics 2nd
runtime mode 2nd
top (v. 3.x.x) tool
command-line mode
command-line options
CPU-related options
example 2nd 3rd
runtime mode
runtime options 2nd 3rd
system-wide options
top tool
source location
system-wide slowdown 2nd
Total option
free tool
procinfo II tool
total reads statistic
vmstat tool
disk I/O subsystem usage
total swap option
vmstat II tool
total, used, free option, top (v. 2.x and 3.x) tool
Totals option
free tool
totsck statistic
sar tool
network I/O
tps statistic
iostat tool
disk I/O subsystem usage
sar tool
disk I/O subsystem usage
Transmission Control Protocol (TCP)
network I/O
transport layer
tsiz option
ps tool
application use of memory
TX packets statistic
ifconfig tool
network I/O
txbyt/s statistic
sar tool
network I/O
txcarr/s statistic
sar tool
network I/O
txcmp/s statistic
sar tool
network I/O
txdrop/s statistic
sar tool
network I/O
txerr/s statistic
sar tool
network I/O
txfifo/s statistic
sar tool
network I/O
txpck/s statistic
sar tool
network I/O
TYPE statistic
lsof (List Open Files) tool
disk I/O subsystem usage
U
options 2nd 3rd
ldd command
example 2nd
options 2nd
ltrace command
objdump command
example 2nd
options 2nd
script command
example 2nd 3rd
options 2nd 3rd
tee command
example 2nd
options 2nd
watch command 2nd
options 2nd 3rd
V
v option
slabtop tool
valgrind tool
application use of memory
example 2nd 3rd 4th 5th 6th
options 2nd 3rd
source location
vcsw (voluntary context switches) tool
VIRT option, top (v. 2.x and 3.x) tool
virtual memory
Virtual Memory Statistics. [See vmstat tool]
VmData statistic
/proc/<PID> tool
processes: status file
VmExe statistic
/proc/<PID> tool
processes: status file
VmLck statistic
/proc/<PID> tool
processes: status file
VmRSS statistic
/proc/<PID> tool
processes: status file
vmstat tool
source location
vmstat II tool
average mode
memory performance
memory performance 2nd
command-line options
example 2nd 3rd 4th 5th
output statistics 2nd
sample mode
memory performance
vmstat tool 2nd
average mode 2nd 3rd
command-line options 2nd
CPU-specific statistics 2nd 3rd 4th 5th 6th 7th
disk I/O subsystem usage
example 2nd 3rd 4th
options 2nd
statistics 2nd 3rd 4th 5th 6th 7th
sample mode 2nd 3rd
system-wide slowdown 2nd 3rd
VmStk statistic
/proc/<PID> tool
processes: status file
Voluntary context switches option
time command
vsz option
ps tool
application use of memory
W
w/s statistic
iostat tool
disk I/O subsystem usage
wa option
vmstat tool 2nd
wa statistic
vmstat tool
disk I/O subsystem usage
watch command
automating/executing long commands
options 2nd 3rd 4th
Web searches
source for performance investigation
while condition option
bash shell
wkB/s statistic
iostat tool
disk I/O subsystem usage
wr_sec/s statistic
sar tool
disk I/O subsystem usage
Writeback option
/proc/meminfo file
writes statistic
vmstat tool
disk I/O subsystem usage 2nd
writes: merged statistic
vmstat tool
disk I/O subsystem usage
writes: ms statistic
vmstat tool
disk I/O subsystem usage
writes: sectors statistic
vmstat tool
disk I/O subsystem usage
writes: total statistic
vmstat tool
disk I/O subsystem usage
written sectors statistic
vmstat tool
disk I/O subsystem usage
wrqm/s statistic
iostat tool
disk I/O subsystem usage
wsec/s statistic
iostat tool
disk I/O subsystem usage
X
xautomation package
xeyes command
ltrace tool 2nd 3rd 4th 5th