Red Hat Enterprise Linux 8
Monitoring and managing system status and performance
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons
Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is
available at
http://creativecommons.org/licenses/by-sa/3.0/
. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must
provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert,
Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift,
Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States
and other countries.
Linux ® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States
and/or other countries.
MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and
other countries.
Node.js ® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the
official Joyent Node.js open source or commercial project.
The OpenStack ® Word Mark and OpenStack logo are either registered trademarks/service marks
or trademarks/service marks of the OpenStack Foundation, in the United States and other
countries and are used with the OpenStack Foundation's permission. We are not affiliated with,
endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
Abstract
Monitor and optimize the throughput, latency, and power consumption of Red Hat Enterprise Linux
8 in different scenarios.
Table of Contents
MAKING OPEN SOURCE MORE INCLUSIVE
PROVIDING FEEDBACK ON RED HAT DOCUMENTATION
CHAPTER 1. OVERVIEW OF PERFORMANCE MONITORING OPTIONS
CHAPTER 2. GETTING STARTED WITH TUNED
    2.1. THE PURPOSE OF TUNED
    2.2. TUNED PROFILES
        Syntax of profile configuration
    2.3. THE DEFAULT TUNED PROFILE
    2.4. MERGED TUNED PROFILES
    2.5. THE LOCATION OF TUNED PROFILES
    2.6. TUNED PROFILES DISTRIBUTED WITH RHEL
    2.7. TUNED CPU-PARTITIONING PROFILE
    2.8. USING THE TUNED CPU-PARTITIONING PROFILE FOR LOW-LATENCY TUNING
    2.9. CUSTOMIZING THE CPU-PARTITIONING TUNED PROFILE
    2.10. REAL-TIME TUNED PROFILES DISTRIBUTED WITH RHEL
    2.11. STATIC AND DYNAMIC TUNING IN TUNED
    2.12. TUNED NO-DAEMON MODE
    2.13. INSTALLING AND ENABLING TUNED
    2.14. LISTING AVAILABLE TUNED PROFILES
    2.15. SETTING A TUNED PROFILE
    2.16. DISABLING TUNED
CHAPTER 3. CUSTOMIZING TUNED PROFILES
    3.1. TUNED PROFILES
        Syntax of profile configuration
    3.2. THE DEFAULT TUNED PROFILE
    3.3. MERGED TUNED PROFILES
    3.4. THE LOCATION OF TUNED PROFILES
    3.5. INHERITANCE BETWEEN TUNED PROFILES
    3.6. STATIC AND DYNAMIC TUNING IN TUNED
    3.7. TUNED PLUG-INS
        Syntax for plug-ins in TuneD profiles
        Short plug-in syntax
        Conflicting plug-in definitions in a profile
    3.8. AVAILABLE TUNED PLUG-INS
        Monitoring plug-ins
        Tuning plug-ins
    3.9. FUNCTIONALITIES OF THE SCHEDULER TUNED PLUGIN
    3.10. VARIABLES IN TUNED PROFILES
    3.11. BUILT-IN FUNCTIONS IN TUNED PROFILES
    3.12. BUILT-IN FUNCTIONS AVAILABLE IN TUNED PROFILES
    3.13. CREATING NEW TUNED PROFILES
    3.14. MODIFYING EXISTING TUNED PROFILES
    3.15. SETTING THE DISK SCHEDULER USING TUNED
CHAPTER 4. REVIEWING A SYSTEM USING TUNA INTERFACE
    4.1. INSTALLING THE TUNA TOOL
    4.2. VIEWING THE SYSTEM STATUS USING TUNA TOOL
    4.3. TUNING CPUS USING TUNA TOOL
CHAPTER 5. MONITORING PERFORMANCE USING RHEL SYSTEM ROLES
    5.1. PREPARING A CONTROL NODE AND MANAGED NODES TO USE RHEL SYSTEM ROLES
        5.1.1. Preparing a control node on RHEL 8
        5.1.2. Preparing a managed node
    5.2. INTRODUCTION TO THE METRICS SYSTEM ROLE
    5.3. USING THE METRICS SYSTEM ROLE TO MONITOR YOUR LOCAL SYSTEM WITH VISUALIZATION
    5.4. USING THE METRICS SYSTEM ROLE TO SET UP A FLEET OF INDIVIDUAL SYSTEMS TO MONITOR THEMSELVES
    5.5. USING THE METRICS SYSTEM ROLE TO MONITOR A FLEET OF MACHINES CENTRALLY VIA YOUR LOCAL MACHINE
    5.6. SETTING UP AUTHENTICATION WHILE MONITORING A SYSTEM USING THE METRICS SYSTEM ROLE
    5.7. USING THE METRICS SYSTEM ROLE TO CONFIGURE AND ENABLE METRICS COLLECTION FOR SQL SERVER
CHAPTER 6. SETTING UP PCP
    6.1. OVERVIEW OF PCP
    6.2. INSTALLING AND ENABLING PCP
    6.3. DEPLOYING A MINIMAL PCP SETUP
    6.4. SYSTEM SERVICES AND TOOLS DISTRIBUTED WITH PCP
    6.5. PCP DEPLOYMENT ARCHITECTURES
    6.6. RECOMMENDED DEPLOYMENT ARCHITECTURE
    6.7. SIZING FACTORS
    6.8. CONFIGURATION OPTIONS FOR PCP SCALING
    6.9. EXAMPLE: ANALYZING THE CENTRALIZED LOGGING DEPLOYMENT
    6.10. EXAMPLE: ANALYZING THE FEDERATED SETUP DEPLOYMENT
    6.11. TROUBLESHOOTING HIGH MEMORY USAGE
CHAPTER 7. LOGGING PERFORMANCE DATA WITH PMLOGGER
    7.1. MODIFYING THE PMLOGGER CONFIGURATION FILE WITH PMLOGCONF
    7.2. EDITING THE PMLOGGER CONFIGURATION FILE MANUALLY
    7.3. ENABLING THE PMLOGGER SERVICE
    7.4. SETTING UP A CLIENT SYSTEM FOR METRICS COLLECTION
    7.5. SETTING UP A CENTRAL SERVER TO COLLECT DATA
    7.6. REPLAYING THE PCP LOG ARCHIVES WITH PMREP
CHAPTER 8. MONITORING PERFORMANCE WITH PERFORMANCE CO-PILOT
    8.1. MONITORING POSTFIX WITH PMDA-POSTFIX
    8.2. VISUALLY TRACING PCP LOG ARCHIVES WITH THE PCP CHARTS APPLICATION
    8.3. COLLECTING DATA FROM SQL SERVER USING PCP
CHAPTER 9. PERFORMANCE ANALYSIS OF XFS WITH PCP
    9.1. INSTALLING XFS PMDA MANUALLY
    9.2. EXAMINING XFS PERFORMANCE METRICS WITH PMINFO
    9.3. RESETTING XFS PERFORMANCE METRICS WITH PMSTORE
    9.4. PCP METRIC GROUPS FOR XFS
    9.5. PER-DEVICE PCP METRIC GROUPS FOR XFS
CHAPTER 10. SETTING UP GRAPHICAL REPRESENTATION OF PCP METRICS
    10.1. SETTING UP PCP WITH PCP-ZEROCONF
    10.2. SETTING UP A GRAFANA-SERVER
    10.3. ACCESSING THE GRAFANA WEB UI
    10.4. CONFIGURING PCP REDIS
    10.5. CREATING PANELS AND ALERT IN PCP REDIS DATA SOURCE
    10.6. ADDING NOTIFICATION CHANNELS FOR ALERTS
    10.7. SETTING UP AUTHENTICATION BETWEEN PCP COMPONENTS
    10.8. INSTALLING PCP BPFTRACE
    10.9. VIEWING THE PCP BPFTRACE SYSTEM ANALYSIS DASHBOARD
    10.10. INSTALLING PCP VECTOR
    10.11. VIEWING THE PCP VECTOR CHECKLIST
    10.12. TROUBLESHOOTING GRAFANA ISSUES
CHAPTER 11. OPTIMIZING THE SYSTEM PERFORMANCE USING THE WEB CONSOLE
    11.1. PERFORMANCE TUNING OPTIONS IN THE WEB CONSOLE
    11.2. SETTING A PERFORMANCE PROFILE IN THE WEB CONSOLE
    11.3. MONITORING PERFORMANCE ON THE LOCAL SYSTEM USING THE WEB CONSOLE
    11.4. MONITORING PERFORMANCE ON SEVERAL SYSTEMS USING THE WEB CONSOLE AND GRAFANA
CHAPTER 12. SETTING THE DISK SCHEDULER
    12.1. AVAILABLE DISK SCHEDULERS
    12.2. DIFFERENT DISK SCHEDULERS FOR DIFFERENT USE CASES
    12.3. THE DEFAULT DISK SCHEDULER
    12.4. DETERMINING THE ACTIVE DISK SCHEDULER
    12.5. SETTING THE DISK SCHEDULER USING TUNED
    12.6. SETTING THE DISK SCHEDULER USING UDEV RULES
    12.7. TEMPORARILY SETTING A SCHEDULER FOR A SPECIFIC DISK
CHAPTER 13. TUNING THE PERFORMANCE OF A SAMBA SERVER
    13.1. SETTING THE SMB PROTOCOL VERSION
    13.2. TUNING SHARES WITH DIRECTORIES THAT CONTAIN A LARGE NUMBER OF FILES
    13.3. SETTINGS THAT CAN HAVE A NEGATIVE PERFORMANCE IMPACT
CHAPTER 14. OPTIMIZING VIRTUAL MACHINE PERFORMANCE
    14.1. WHAT INFLUENCES VIRTUAL MACHINE PERFORMANCE
        The impact of virtualization on system performance
        Reducing VM performance loss
    14.2. OPTIMIZING VIRTUAL MACHINE PERFORMANCE BY USING TUNED
    14.3. CONFIGURING VIRTUAL MACHINE MEMORY
        14.3.1. Adding and removing virtual machine memory by using the web console
        14.3.2. Adding and removing virtual machine memory by using the command-line interface
        14.3.3. Additional resources
    14.4. OPTIMIZING VIRTUAL MACHINE I/O PERFORMANCE
        14.4.1. Tuning block I/O in virtual machines
        14.4.2. Disk I/O throttling in virtual machines
        14.4.3. Enabling multi-queue virtio-scsi
    14.5. OPTIMIZING VIRTUAL MACHINE CPU PERFORMANCE
        14.5.1. Adding and removing virtual CPUs by using the command-line interface
        14.5.2. Managing virtual CPUs by using the web console
        14.5.3. Configuring NUMA in a virtual machine
        14.5.4. Sample vCPU performance tuning scenario
        14.5.5. Deactivating kernel same-page merging
    14.6. OPTIMIZING VIRTUAL MACHINE NETWORK PERFORMANCE
    14.7. VIRTUAL MACHINE PERFORMANCE MONITORING TOOLS
    14.8. ADDITIONAL RESOURCES
CHAPTER 15. IMPORTANCE OF POWER MANAGEMENT
    15.1. POWER MANAGEMENT BASICS
CHAPTER 16. MANAGING POWER CONSUMPTION WITH POWERTOP
    16.1. THE PURPOSE OF POWERTOP
    16.2. USING POWERTOP
        16.2.1. Starting PowerTOP
        16.2.2. Calibrating PowerTOP
        16.2.3. Setting the measuring interval
        16.2.4. Additional resources
    16.3. POWERTOP STATISTICS
        16.3.1. The Overview tab
        16.3.2. The Idle stats tab
        16.3.3. The Device stats tab
        16.3.4. The Tunables tab
        16.3.5. The WakeUp tab
    16.4. WHY POWERTOP DOES NOT DISPLAY FREQUENCY STATS VALUES IN SOME INSTANCES
    16.5. GENERATING AN HTML OUTPUT
    16.6. OPTIMIZING POWER CONSUMPTION
        16.6.1. Optimizing power consumption using the powertop service
        16.6.2. The powertop2tuned utility
        16.6.3. Optimizing power consumption using the powertop2tuned utility
        16.6.4. Comparison of powertop.service and powertop2tuned
CHAPTER 17. TUNING CPU FREQUENCY TO OPTIMIZE ENERGY CONSUMPTION
    17.1. SUPPORTED CPUPOWER TOOL COMMANDS
    17.2. CPU IDLE STATES
    17.3. OVERVIEW OF CPUFREQ
        17.3.1. CPUfreq drivers
        17.3.2. Core CPUfreq governors
        17.3.3. Intel P-state CPUfreq governors
        17.3.4. Setting up CPUfreq governor
CHAPTER 18. GETTING STARTED WITH PERF
    18.1. INTRODUCTION TO PERF
    18.2. INSTALLING PERF
    18.3. COMMON PERF COMMANDS
CHAPTER 19. PROFILING CPU USAGE IN REAL TIME WITH PERF TOP
    19.1. THE PURPOSE OF PERF TOP
    19.2. PROFILING CPU USAGE WITH PERF TOP
    19.3. INTERPRETATION OF PERF TOP OUTPUT
    19.4. WHY PERF DISPLAYS SOME FUNCTION NAMES AS RAW FUNCTION ADDRESSES
    19.5. ENABLING DEBUG AND SOURCE REPOSITORIES
    19.6. GETTING DEBUGINFO PACKAGES FOR AN APPLICATION OR LIBRARY USING GDB
CHAPTER 20. COUNTING EVENTS DURING PROCESS EXECUTION WITH PERF STAT
    20.1. THE PURPOSE OF PERF STAT
    20.2. COUNTING EVENTS WITH PERF STAT
    20.3. INTERPRETATION OF PERF STAT OUTPUT
    20.4. ATTACHING PERF STAT TO A RUNNING PROCESS
CHAPTER 21. RECORDING AND ANALYZING PERFORMANCE PROFILES WITH PERF
    21.1. THE PURPOSE OF PERF RECORD
    21.2. RECORDING A PERFORMANCE PROFILE WITHOUT ROOT ACCESS
CHAPTER 22. INVESTIGATING BUSY CPUS WITH PERF
    22.1. DISPLAYING WHICH CPU EVENTS WERE COUNTED ON WITH PERF STAT
    22.2. DISPLAYING WHICH CPU SAMPLES WERE TAKEN ON WITH PERF REPORT
    22.3. DISPLAYING SPECIFIC CPUS DURING PROFILING WITH PERF TOP
    22.4. MONITORING SPECIFIC CPUS WITH PERF RECORD AND PERF REPORT
CHAPTER 23. MONITORING APPLICATION PERFORMANCE WITH PERF
    23.1. ATTACHING PERF RECORD TO A RUNNING PROCESS
    23.2. CAPTURING CALL GRAPH DATA WITH PERF RECORD
    23.3. ANALYZING PERF.DATA WITH PERF REPORT
CHAPTER 24. CREATING UPROBES WITH PERF
    24.1. CREATING UPROBES AT THE FUNCTION LEVEL WITH PERF
    24.2. CREATING UPROBES ON LINES WITHIN A FUNCTION WITH PERF
    24.3. PERF SCRIPT OUTPUT OF DATA RECORDED OVER UPROBES
CHAPTER 25. PROFILING MEMORY ACCESSES WITH PERF MEM
    25.1. THE PURPOSE OF PERF MEM
    25.2. SAMPLING MEMORY ACCESS WITH PERF MEM
    25.3. INTERPRETATION OF PERF MEM REPORT OUTPUT
CHAPTER 26. DETECTING FALSE SHARING
    26.1. THE PURPOSE OF PERF C2C
    26.2. DETECTING CACHE-LINE CONTENTION WITH PERF C2C
    26.3. VISUALIZING A PERF.DATA FILE RECORDED WITH PERF C2C RECORD
    26.4. INTERPRETATION OF PERF C2C REPORT OUTPUT
    26.5. DETECTING FALSE SHARING WITH PERF C2C
CHAPTER 27. GETTING STARTED WITH FLAMEGRAPHS
    27.1. INSTALLING FLAMEGRAPHS
    27.2. CREATING FLAMEGRAPHS OVER THE ENTIRE SYSTEM
    27.3. CREATING FLAMEGRAPHS OVER SPECIFIC PROCESSES
    27.4. INTERPRETING FLAMEGRAPHS
CHAPTER 28. MONITORING PROCESSES FOR PERFORMANCE BOTTLENECKS USING PERF CIRCULAR BUFFERS
    28.1. CIRCULAR BUFFERS AND EVENT-SPECIFIC SNAPSHOTS WITH PERF
    28.2. COLLECTING SPECIFIC DATA TO MONITOR FOR PERFORMANCE BOTTLENECKS USING PERF CIRCULAR BUFFERS
CHAPTER 29. ADDING AND REMOVING TRACEPOINTS FROM A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
    29.1. ADDING TRACEPOINTS TO A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
    29.2. REMOVING TRACEPOINTS FROM A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
CHAPTER 30. PROFILING MEMORY ALLOCATION WITH NUMASTAT
    30.1. DEFAULT NUMASTAT STATISTICS
    30.2. VIEWING MEMORY ALLOCATION WITH NUMASTAT
CHAPTER 31. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE CPU UTILIZATION
    31.1. TOOLS FOR MONITORING AND DIAGNOSING PROCESSOR ISSUES
    31.2. TYPES OF SYSTEM TOPOLOGY
        31.2.1. Displaying system topologies
    31.3. CONFIGURING KERNEL TICK TIME
    31.4. OVERVIEW OF AN INTERRUPT REQUEST
        31.4.1. Balancing interrupts manually
        31.4.2. Setting the smp_affinity mask
CHAPTER 32. TUNING SCHEDULING POLICY
    32.1. CATEGORIES OF SCHEDULING POLICIES
    32.2. STATIC PRIORITY SCHEDULING WITH SCHED_FIFO
    32.3. ROUND ROBIN PRIORITY SCHEDULING WITH SCHED_RR
    32.4. NORMAL SCHEDULING WITH SCHED_OTHER
    32.5. SETTING SCHEDULER POLICIES
    32.6. POLICY OPTIONS FOR THE CHRT COMMAND
    32.7. CHANGING THE PRIORITY OF SERVICES DURING THE BOOT PROCESS
    32.8. PRIORITY MAP
    32.9. TUNED CPU-PARTITIONING PROFILE
    32.10. USING THE TUNED CPU-PARTITIONING PROFILE FOR LOW-LATENCY TUNING
    32.11. CUSTOMIZING THE CPU-PARTITIONING TUNED PROFILE
CHAPTER 33. FACTORS AFFECTING I/O AND FILE SYSTEM PERFORMANCE
    33.1. TOOLS FOR MONITORING AND DIAGNOSING I/O AND FILE SYSTEM ISSUES
    33.2. AVAILABLE TUNING OPTIONS FOR FORMATTING A FILE SYSTEM
    33.3. AVAILABLE TUNING OPTIONS FOR MOUNTING A FILE SYSTEM
    33.4. TYPES OF DISCARDING UNUSED BLOCKS
    33.5. SOLID-STATE DISKS TUNING CONSIDERATIONS
    33.6. GENERIC BLOCK DEVICE TUNING PARAMETERS
CHAPTER 34. TUNING THE NETWORK PERFORMANCE
    34.1. TUNING NETWORK ADAPTER SETTINGS
        34.1.1. Increasing the ring buffer size to reduce a high packet drop rate by using nmcli
        34.1.2. Tuning the network device backlog queue to avoid packet drops
        34.1.3. Increasing the transmit queue length of a NIC to reduce the number of transmit errors
    34.2. TUNING IRQ BALANCING
        34.2.1. Interrupts and interrupt handlers
        34.2.2. Software interrupt requests
        34.2.3. NAPI Polling
        34.2.4. The irqbalance service
        34.2.5. Increasing the time SoftIRQs can run on the CPU
    34.3. IMPROVING THE NETWORK LATENCY
        34.3.1. How the CPU power states influence the network latency
        34.3.2. C-state settings in the EFI firmware
        34.3.3. Disabling C-states by using a custom TuneD profile
        34.3.4. Disabling C-states by using a kernel command line option
    34.4. IMPROVING THE THROUGHPUT OF LARGE AMOUNTS OF CONTIGUOUS DATA STREAMS
        34.4.1. Considerations before configuring jumbo frames
CHAPTER 35. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE MEMORY ACCESS
    35.1. TOOLS FOR MONITORING AND DIAGNOSING SYSTEM MEMORY ISSUES
    35.2. OVERVIEW OF A SYSTEM’S MEMORY
    35.3. VIRTUAL MEMORY PARAMETERS
    35.4. FILE SYSTEM PARAMETERS
    35.5. KERNEL PARAMETERS
    35.6. SETTING MEMORY-RELATED KERNEL PARAMETERS
CHAPTER 36. CONFIGURING HUGE PAGES
    36.1. AVAILABLE HUGE PAGE FEATURES
    36.2. PARAMETERS FOR RESERVING HUGETLB PAGES AT BOOT TIME
    36.3. CONFIGURING HUGETLB AT BOOT TIME
    36.4. PARAMETERS FOR RESERVING HUGETLB PAGES AT RUN TIME
    36.5. CONFIGURING HUGETLB AT RUN TIME
    36.6. ENABLING TRANSPARENT HUGEPAGES
    36.7. DISABLING TRANSPARENT HUGEPAGES
    36.8. IMPACT OF PAGE SIZE ON TRANSLATION LOOKASIDE BUFFER SIZE
CHAPTER 37. GETTING STARTED WITH SYSTEMTAP
    37.1. THE PURPOSE OF SYSTEMTAP
    37.2. INSTALLING SYSTEMTAP
    37.3. PRIVILEGES TO RUN SYSTEMTAP
CHAPTER 38. CROSS-INSTRUMENTATION OF SYSTEMTAP
    38.1. SYSTEMTAP CROSS-INSTRUMENTATION
    38.2. INITIALIZING CROSS-INSTRUMENTATION OF SYSTEMTAP
CHAPTER 39. MONITORING NETWORK ACTIVITY WITH SYSTEMTAP
    39.1. PROFILING NETWORK ACTIVITY WITH SYSTEMTAP
    39.2. TRACING FUNCTIONS CALLED IN NETWORK SOCKET CODE WITH SYSTEMTAP
    39.3. MONITORING NETWORK PACKET DROPS WITH SYSTEMTAP
CHAPTER 40. PROFILING KERNEL ACTIVITY WITH SYSTEMTAP
    40.1. COUNTING FUNCTION CALLS WITH SYSTEMTAP
    40.2. TRACING FUNCTION CALLS WITH SYSTEMTAP
    40.3. DETERMINING TIME SPENT IN KERNEL AND USER SPACE WITH SYSTEMTAP
    40.4. MONITORING POLLING APPLICATIONS WITH SYSTEMTAP
    40.5. TRACKING MOST FREQUENTLY USED SYSTEM CALLS WITH SYSTEMTAP
    40.6. TRACKING SYSTEM CALL VOLUME PER PROCESS WITH SYSTEMTAP
CHAPTER 41. MONITORING DISK AND I/O ACTIVITY WITH SYSTEMTAP
    41.1. SUMMARIZING DISK READ/WRITE TRAFFIC WITH SYSTEMTAP
    41.2. TRACKING I/O TIME FOR EACH FILE READ OR WRITE WITH SYSTEMTAP
    41.3. TRACKING CUMULATIVE I/O WITH SYSTEMTAP
    41.4. MONITORING I/O ACTIVITY ON A SPECIFIC DEVICE WITH SYSTEMTAP
    41.5. MONITORING READS AND WRITES TO A FILE WITH SYSTEMTAP
CHAPTER 42. ANALYZING SYSTEM PERFORMANCE WITH BPF COMPILER COLLECTION
    42.1. INSTALLING THE BCC-TOOLS PACKAGE
    42.2. USING SELECTED BCC-TOOLS FOR PERFORMANCE ANALYSES
        Using execsnoop to examine the system processes
        Using opensnoop to track what files a command opens
        Using biotop to examine the I/O operations on the disk
        Using xfsslower to expose unexpectedly slow file system operations
PROVIDING FEEDBACK ON RED HAT DOCUMENTATION
4. Enter your suggestion for improvement in the Description field. Include links to the relevant
parts of the documentation.
CHAPTER 1. OVERVIEW OF PERFORMANCE MONITORING OPTIONS
Performance Co-Pilot (pcp) is used for monitoring, visualizing, storing, and analyzing system-
level performance measurements. It allows the monitoring and management of real-time data,
and logging and retrieval of historical data.
Red Hat Enterprise Linux 8 provides several tools that can be used from the command line to
monitor a system outside run level 5. The following are the built-in command line tools:
top is provided by the procps-ng package. It gives a dynamic view of the processes in a
running system. It displays a variety of information, including a system summary and a list of
tasks currently being managed by the Linux kernel.
Virtual memory statistics (vmstat) is provided by the procps-ng package. It provides instant
reports of your system’s processes, memory, paging, block input/output, interrupts, and
CPU activity.
System activity reporter (sar) is provided by the sysstat package. It collects and reports
information about system activity that has occurred so far on the current day.
perf uses hardware performance counters and kernel trace-points to track the impact of other
commands and applications on a system.
bcc-tools is used for BPF Compiler Collection (BCC). It provides over 100 eBPF scripts that
monitor kernel activities. For more information about each of these tools, see the man page
describing how to use it and what functions it performs.
iostat is provided by the sysstat package. It monitors and reports on system IO device loading
to help administrators make decisions about how to balance IO load between physical disks.
numastat is provided by the numactl package. By default, numastat displays per-node NUMA
hit and miss system statistics from the kernel memory allocator. Optimal performance is
indicated by high numa_hit values and low numa_miss values.
numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and
resource usage within a system in order to dynamically improve NUMA resource allocation and
management, and therefore system performance.
SystemTap monitors and analyzes operating system activities, especially the kernel activities.
pqos is provided by the intel-cmt-cat package. It monitors and controls CPU cache and memory
bandwidth on recent Intel processors.
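As a quick, hedged illustration of these built-in tools, the following commands take a few short samples; the intervals and counts are arbitrary examples, not recommendations from this guide:
$ vmstat 1 5
$ sar -u 1 3
$ iostat -x 2 3
Each command prints periodic reports: vmstat shows memory, paging, and CPU activity once per second for five samples, sar -u reports CPU utilization three times at one-second intervals, and iostat -x prints extended per-device I/O statistics three times at two-second intervals.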
Additional resources
pcp, top, ps, vmstat, sar, perf, iostat, irqbalance, ss, numastat, numad, valgrind, and pqos
man pages
/usr/share/doc/ directory
What exactly is the meaning of value "await" reported by iostat? Red Hat Knowledgebase article
CHAPTER 2. GETTING STARTED WITH TUNED
TuneD is distributed with a number of predefined profiles for use cases such as:
High throughput
Low latency
Saving power
It is possible to modify the rules defined for each profile and customize how to tune a particular device.
When you switch to another profile or deactivate TuneD, all changes made to the system settings by the
previous profile revert back to their original state.
You can also configure TuneD to react to changes in device usage and adjust settings to improve
the performance of active devices and reduce the power consumption of inactive devices.
The profiles provided with TuneD are divided into the following categories:
Power-saving profiles
Performance-boosting profiles
The performance-boosting profiles include profiles that focus on the following aspects:
Additional resources
If there are conflicts, the settings from the last specified profile take precedence.
The following example optimizes the system to run in a virtual machine for the best performance and
concurrently tunes it for low power consumption, while the low power consumption is the priority:
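In practice, this means passing both profiles to a single tuned-adm call, along these lines (a sketch based on the profiles just described):
# tuned-adm profile virtual-guest powersave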
WARNING
Additional resources
/usr/lib/tuned/
Distribution-specific profiles are stored in this directory. Each profile has its own directory. The profile
consists of the main configuration file called tuned.conf, and optionally other files, for example
helper scripts.
/etc/tuned/
If you need to customize a profile, copy the profile directory into this directory, which is used for
custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/
is used.
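For example, to base a custom profile on a distributed one, you can copy its directory into /etc/tuned/ and edit the copy; the profile name my-throughput below is only an illustration:
# cp -r /usr/lib/tuned/throughput-performance /etc/tuned/my-throughput
# vi /etc/tuned/my-throughput/tuned.conf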
Additional resources
NOTE
balanced
The default power-saving profile. It is intended to be a compromise between performance and power
consumption. It uses auto-scaling and auto-tuning whenever possible. The only drawback is the
increased latency. In the current TuneD release, it enables the CPU, disk, audio, and video plugins,
and activates the conservative CPU governor. The radeon_powersave option uses the dpm-
balanced value if it is supported, otherwise it is set to auto.
It changes the energy_performance_preference attribute to the normal energy setting. It also
changes the scaling_governor policy attribute to either the conservative or powersave CPU
governor.
powersave
A profile for maximum power saving performance. It can throttle the performance in order to
minimize the actual power consumption. In the current TuneD release it enables USB autosuspend,
WiFi power saving, and Aggressive Link Power Management (ALPM) power savings for SATA host
adapters. It also schedules multi-core power savings for systems with a low wakeup rate and
activates the ondemand governor. It enables AC97 audio power saving or, depending on your
system, HDA-Intel power savings with a 10-second timeout. If your system contains a supported
Radeon graphics card with enabled KMS, the profile configures it to automatic power saving. On
ASUS Eee PCs, a dynamic Super Hybrid Engine is enabled.
It changes the energy_performance_preference attribute to the powersave or power energy
setting. It also changes the scaling_governor policy attribute to either the ondemand or
powersave CPU governor.
NOTE
In certain cases, the balanced profile is more efficient compared to the powersave
profile.
Consider a defined amount of work that needs to be done, for example a video
file that needs to be transcoded. Your machine might consume less energy if the
transcoding is done at full power, because the task is finished quickly, the
machine starts to idle, and it can automatically step down to very efficient power-save
modes. On the other hand, if you transcode the file with a throttled machine, the
machine consumes less power during the transcoding, but the process takes longer
and the overall consumed energy can be higher.
throughput-performance
A server profile optimized for high throughput. It disables power-saving mechanisms and enables
sysctl settings that improve the throughput performance of disk and network IO. The CPU governor
is set to performance.
It changes the energy_performance_preference and scaling_governor attribute to the
performance profile.
accelerator-performance
The accelerator-performance profile contains the same tuning as the throughput-performance
profile. Additionally, it locks the CPU to low C states so that the latency is less than 100us. This
improves the performance of certain accelerators, such as GPUs.
latency-performance
A server profile optimized for low latency. It disables power-saving mechanisms and enables sysctl
settings that improve latency. The CPU governor is set to performance and the CPU is locked to low
C states (by PM QoS).
It changes the energy_performance_preference and scaling_governor attribute to the
performance profile.
network-latency
A profile for low latency network tuning. It is based on the latency-performance profile. It
additionally disables transparent huge pages and NUMA balancing, and tunes several other network-
related sysctl parameters.
It inherits the latency-performance profile which changes the energy_performance_preference
and scaling_governor attribute to the performance profile.
hpc-compute
A profile optimized for high-performance computing. It is based on the latency-performance
profile.
network-throughput
A profile for throughput network tuning. It is based on the throughput-performance profile. It
additionally increases kernel network buffers.
It inherits either the latency-performance or throughput-performance profile, and changes the
energy_performance_preference and scaling_governor attribute to the performance profile.
virtual-guest
A profile designed for Red Hat Enterprise Linux 8 virtual machines and VMWare guests based on the
throughput-performance profile that, among other tasks, decreases virtual memory swappiness and
increases disk readahead values. It does not disable disk barriers.
It inherits the throughput-performance profile and changes the energy_performance_preference
and scaling_governor attribute to the performance profile.
virtual-host
A profile designed for virtual hosts based on the throughput-performance profile that, among other
tasks, decreases virtual memory swappiness, increases disk readahead values, and enables a more
aggressive value of dirty pages writeback.
It inherits the throughput-performance profile and changes the energy_performance_preference
and scaling_governor attribute to the performance profile.
oracle
A profile optimized for Oracle database loads, based on the throughput-performance profile. It
additionally disables transparent huge pages and modifies other performance-related kernel
parameters. This profile is provided by the tuned-profiles-oracle package.
desktop
A profile optimized for desktops, based on the balanced profile. It additionally enables scheduler
autogroups for better response of interactive applications.
optimize-serial-console
A profile that tunes down I/O activity to the serial console by reducing the printk value. This should
make the serial console more responsive. This profile is intended to be used as an overlay on other
profiles. For example:
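The overlay is applied together with a base profile; the base profile shown here is only an illustration:
# tuned-adm profile throughput-performance optimize-serial-console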
mssql
A profile provided for Microsoft SQL Server. It is based on the throughput-performance profile.
intel-sst
A profile optimized for systems with user-defined Intel Speed Select Technology configurations. This
profile is intended to be used as an overlay on other profiles. For example:
Prior to Red Hat Enterprise Linux 8, the low-latency Red Hat documentation described the numerous
low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 8, you can perform
low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily
customizable according to the requirements for individual low-latency applications.
The following figure is an example to demonstrate how to use the cpu-partitioning profile. This
example uses the CPU and node layout.
The list of isolated CPUs is comma-separated or you can specify a range using a dash, such as 3-5.
This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping
CPU.
Specifying the no_balance_cores option is optional; however, any CPUs in this list must be a subset
of the CPUs listed in the isolated_cores list.
Application threads using these CPUs need to be pinned individually to each CPU.
Housekeeping CPUs
Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a
housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable
kernel threads, interrupt handlers, and kernel timers are permitted to execute.
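For the layout used in this example, the isolation could be expressed in the /etc/tuned/cpu-partitioning-variables.conf file along these lines (a sketch; adjust the CPU numbers to your own topology):
isolated_cores=2-23
no_balance_cores=2,3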
Additional resources
This procedure describes how to tune a system for low latency by using TuneD's cpu-partitioning
profile. It uses the example of a low-latency application that can use cpu-partitioning and the CPU
layout as mentioned in the cpu-partitioning figure.
One dedicated reader thread that reads data from the network will be pinned to CPU 2.
A large number of threads that process this network data will be pinned to CPUs 4-23.
A dedicated writer thread that writes the processed data to the network will be pinned to CPU
3.
Prerequisites
You have installed the cpu-partitioning TuneD profile by using the yum install tuned-profiles-
cpu-partitioning command as root.
Procedure
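The first two steps are sketched here from the variables file and profile described above; the values follow the example layout:
1. Define the isolated CPUs in the /etc/tuned/cpu-partitioning-variables.conf file, for example isolated_cores=2-23.
2. Activate the profile:
# tuned-adm profile cpu-partitioning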
3. Reboot
After rebooting, the system is tuned for low latency, according to the isolation in the cpu-
partitioning figure. The application can use taskset to pin the reader and writer threads to CPUs
2 and 3, and the remaining application threads to CPUs 4-23.
Additional resources
For example, the cpu-partitioning profile sets the CPUs to use cstate=1. In order to use the cpu-
partitioning profile but to additionally change the CPU cstate from cstate1 to cstate0, the following
procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile
and then sets C state 0.
Procedure
1. Create the /etc/tuned/my_profile/ directory:
# mkdir /etc/tuned/my_profile
2. Create a tuned.conf file in this directory, and add the following content:
# vi /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized tuning on top of cpu-partitioning
include=cpu-partitioning
[cpu]
force_latency=cstate.id:0|1
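As with any custom profile, the new profile is then activated with tuned-adm (shown as a sketch):
# tuned-adm profile my_profile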
NOTE
In the shared example, a reboot is not required. However, if the changes in the my_profile
profile require a reboot to take effect, then reboot your machine.
Additional resources
realtime
Use on bare-metal real-time systems.
Provided by the tuned-profiles-realtime package, which is available from the RT or NFV repositories.
realtime-virtual-host
Use in a virtualization host configured for real-time.
Provided by the tuned-profiles-nfv-host package, which is available from the NFV repository.
realtime-virtual-guest
Use in a virtualization guest configured for real-time.
Provided by the tuned-profiles-nfv-guest package, which is available from the NFV repository.
Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of
several configuration tools such as ethtool.
Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD
adjusts system settings dynamically based on that monitoring information.
For example, the hard drive is used heavily during startup and login, but is barely used later when the
user might mainly work with applications such as web browsers or email clients. Similarly, the CPU
and network devices are used differently at different times. TuneD monitors the activity of these
components and reacts to the changes in their use.
By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and
change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and
uses them to update your system tuning settings. To configure the time interval in seconds between
these updates, use the update_interval option.
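A minimal sketch of the relevant lines in /etc/tuned/tuned-main.conf, assuming a 10-second update interval:
dynamic_tuning = 1
update_interval = 10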
Currently implemented dynamic tuning algorithms try to balance performance and power saving,
and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be
enabled or disabled in the TuneD profiles.
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a
few emails go in and out or some web pages might be loaded.
For those kinds of loads, the network interface does not have to run at full speed all the time, as it
does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this
low activity and then automatically lower the speed of that interface, typically resulting in a lower
power usage.
If the activity on the interface increases for a longer period of time, for example because a DVD
image is being downloaded or an email with a large attachment is opened, TuneD detects this and
sets the interface speed to maximum to offer the best performance while the activity level is high.
This principle is used for other plug-ins for CPU and disks as well.
By default, no-daemon mode is disabled because a lot of TuneD functionality is missing in this mode,
including:
D-Bus support
Hot-plug support
To enable no-daemon mode, include the following line in the /etc/tuned/tuned-main.conf file:
daemon = 0
This procedure installs and enables the TuneD application, installs TuneD profiles, and presets a default
TuneD profile for your system.
Procedure
Install the tuned package:
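The commands below are a sketch using the standard package and service names:
# yum install tuned
# systemctl enable --now tuned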
$ tuned-adm active
NOTE
The active profile TuneD automatically presets differs based on your machine
type and system settings.
$ tuned-adm verify
Procedure
$ tuned-adm list
Available profiles:
- accelerator-performance - Throughput performance based tuning with disabled higher
latency STOP states
$ tuned-adm active
Additional resources
Prerequisites
The TuneD service is running. See Installing and Enabling TuneD for details.
Procedure
1. Optionally, you can let TuneD recommend the most suitable profile for your system:
# tuned-adm recommend
throughput-performance
2. Activate a profile:
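For example, replacing selected-profile with the name of the profile you want:
# tuned-adm profile selected-profile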
The following example optimizes the system to run in a virtual machine with the best
performance and concurrently tunes it for low power consumption, while the low power
consumption is the priority:
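Based on the profiles named in that description, the command takes a form like the following:
# tuned-adm profile virtual-guest powersave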
# tuned-adm active
# reboot
Verification steps
$ tuned-adm verify
Additional resources
Procedure
# tuned-adm off
The tunings are applied again after the TuneD service restarts.
Additional resources
CHAPTER 3. CUSTOMIZING TUNED PROFILES
Prerequisites
Install and enable TuneD as described in Installing and Enabling TuneD.
The profiles provided with TuneD are divided into the following categories:
Power-saving profiles
Performance-boosting profiles
The performance-boosting profiles include profiles that focus on the following aspects:
Additional resources
Additional resources
If there are conflicts, the settings from the last specified profile take precedence.
The following example optimizes the system to run in a virtual machine for the best performance and
concurrently tunes it for low power consumption, while the low power consumption is the priority:
WARNING
Additional resources
/usr/lib/tuned/
Distribution-specific profiles are stored in this directory. Each profile has its own directory. The profile
consists of the main configuration file called tuned.conf, and optionally other files, for example
helper scripts.
/etc/tuned/
If you need to customize a profile, copy the profile directory into this directory, which is used for
custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/
is used.
Additional resources
[main]
include=parent
All settings from the parent profile are loaded in this child profile. In the following sections, the child
profile can override certain settings inherited from the parent profile or add new settings not present in
the parent profile.
You can create your own child profile in the /etc/tuned/ directory based on a pre-installed profile in
/usr/lib/tuned/ with only some parameters adjusted.
If the parent profile is updated, such as after a TuneD upgrade, the changes are reflected in the child
profile.
The following is an example of a custom profile that extends the balanced profile and sets
Aggressive Link Power Management (ALPM) for all devices to the maximum powersaving.
[main]
include=balanced
[scsi_host]
alpm=min_power
Additional resources
Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of
several configuration tools such as ethtool.
Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD
adjusts system settings dynamically based on that monitoring information.
For example, the hard drive is used heavily during startup and login, but is barely used later when the
user might mainly work with applications such as web browsers or email clients. Similarly, the CPU
and network devices are used differently at different times. TuneD monitors the activity of these
components and reacts to the changes in their use.
By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and
change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and
uses them to update your system tuning settings. To configure the time interval in seconds between
these updates, use the update_interval option.
Currently implemented dynamic tuning algorithms try to balance performance and power saving,
and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be
enabled or disabled in the TuneD profiles.
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a
few emails go in and out or some web pages might be loaded.
For those kinds of loads, the network interface does not have to run at full speed all the time, as it
does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this
low activity and then automatically lower the speed of that interface, typically resulting in a lower
power usage.
If the activity on the interface increases for a longer period of time, for example because a DVD
image is being downloaded or an email with a large attachment is opened, TuneD detects this and
sets the interface speed to maximum to offer the best performance while the activity level is high.
This principle is used for other plug-ins for CPU and disks as well.
Monitoring plug-ins
Monitoring plug-ins are used to get information from a running system. The output of the monitoring
plug-ins can be used by tuning plug-ins for dynamic tuning.
Monitoring plug-ins are automatically instantiated whenever their metrics are needed by any of the
enabled tuning plug-ins. If two tuning plug-ins require the same data, only one instance of the
monitoring plug-in is created and the data is shared.
Tuning plug-ins
Each tuning plug-in tunes an individual subsystem and takes several parameters that are populated
from the TuneD profiles. Each subsystem can have multiple devices, such as multiple CPUs or
network cards, that are handled by individual instances of the tuning plug-ins. Specific settings for
individual devices are also supported.
[NAME]
type=TYPE
devices=DEVICES
NAME
is the name of the plug-in instance as it is used in the logs. It can be an arbitrary string.
TYPE
is the type of the tuning plug-in.
DEVICES
is the list of devices that this plug-in instance handles.
The devices line can contain a list, a wildcard (*), and negation (!). If there is no devices line, all
devices of the TYPE that are present or later attached to the system are handled by the plug-in
instance. This is the same as using the devices=* option.
The following example matches all block devices starting with sd, such as sda or sdb, and does
not disable barriers on them:
[data_disk]
type=disk
devices=sd*
disable_barriers=false
The following example matches all block devices except sda1 and sda2:
[data_disk]
type=disk
devices=!sda1, !sda2
disable_barriers=false
If the plug-in supports more options, they can also be specified in the plug-in section. If the option is not
specified and it was not previously specified in the included plug-in, the default value is used.
[TYPE]
devices=DEVICES
In this case, it is possible to omit the type line. The instance is then referred to by a name that is the
same as the type. The previous example could then be rewritten as:
[disk]
devices=sd*
disable_barriers=false
You can also disable the plug-in by specifying the enabled=false option. This has the same effect as if
the instance was never defined. Disabling the plug-in is useful if you are redefining the previous
definition from the include option and do not want the plug-in to be active in your custom profile.
NOTE
TuneD includes the ability to run any shell command as part of enabling or disabling a tuning profile.
This enables you to extend TuneD profiles with functionality that has not been integrated into TuneD
yet.
You can specify arbitrary shell commands using the script plug-in.
Additional resources
disk
Gets disk load (number of IO operations) per device and measurement interval.
net
Gets network load (number of transferred packets) per network card and measurement interval.
load
Gets CPU load per CPU and measurement interval.
Tuning plug-ins
Currently, the following tuning plug-ins are implemented. Only some of these plug-ins implement
dynamic tuning. Options supported by plug-ins are also listed:
cpu
Sets the CPU governor to the value specified by the governor option and dynamically changes the
Power Management Quality of Service (PM QoS) CPU Direct Memory Access (DMA) latency
according to the CPU load.
If the CPU load is lower than the value specified by the load_threshold option, the latency is set to
the value specified by the latency_high option, otherwise it is set to the value specified by
latency_low.
You can also force the latency to a specific value and prevent it from dynamically changing further.
To do so, set the force_latency option to the required latency value.
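A sketch of a [cpu] plug-in section using these options; the values shown are illustrative only:
[cpu]
governor=performance
force_latency=1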
eeepc_she
Dynamically sets the front-side bus (FSB) speed according to the CPU load.
This feature can be found on some netbooks and is also known as the ASUS Super Hybrid Engine
(SHE).
If the CPU load is lower or equal to the value specified by the load_threshold_powersave option,
the plug-in sets the FSB speed to the value specified by the she_powersave option. If the CPU load
is higher or equal to the value specified by the load_threshold_normal option, it sets the FSB speed
to the value specified by the she_normal option.
Static tuning is not supported and the plug-in is transparently disabled if TuneD does not detect the
hardware support for this feature.
net
Configures the Wake-on-LAN functionality to the values specified by the wake_on_lan option. It
uses the same syntax as the ethtool utility. It also dynamically changes the interface speed according
to the interface utilization.
sysctl
Sets various sysctl settings specified by the plug-in options.
The syntax is name=value, where name is the same as the name provided by the sysctl utility.
Use the sysctl plug-in if you need to change system settings that are not covered by other plug-ins
available in TuneD. If the settings are covered by some specific plug-ins, prefer these plug-ins.
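For instance, a profile section that sets a kernel parameter through this plug-in might look like the following; the parameter and value are only an illustration:
[sysctl]
net.core.somaxconn=2048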
usb
Sets autosuspend timeout of USB devices to the value specified by the autosuspend parameter.
The value 0 means that autosuspend is disabled.
vm
Enables or disables transparent huge pages depending on the value of the transparent_hugepages
option.
Valid values of the transparent_hugepages option are:
"always"
"never"
"madvise"
audio
Sets the autosuspend timeout for audio codecs to the value specified by the timeout option.
Currently, the snd_hda_intel and snd_ac97_codec codecs are supported. The value 0 means that
the autosuspend is disabled. You can also enforce the controller reset by setting the Boolean option
reset_controller to true.
disk
Sets the disk elevator to the value specified by the elevator option.
It also sets:
The current disk readahead to a value multiplied by the constant specified by the
readahead_multiply option
In addition, this plug-in dynamically changes the advanced power management and spindown
timeout setting for the drive according to the current drive utilization. The dynamic tuning can be
controlled by the Boolean option dynamic and is enabled by default.
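As an illustration, a [disk] section might set the elevator and the readahead multiplier as follows; the scheduler name and multiplier are example values only:
[disk]
elevator=mq-deadline
readahead_multiply=2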
scsi_host
Tunes options for SCSI hosts.
It sets Aggressive Link Power Management (ALPM) to the value specified by the alpm option.
mounts
Enables or disables barriers for mounts according to the Boolean value of the disable_barriers
option.
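For example, a possible [mounts] section (the Boolean value shown is illustrative):
[mounts]
disable_barriers=true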
script
Executes an external script or binary when the profile is loaded or unloaded. You can choose an
arbitrary executable.
IMPORTANT
The script plug-in is provided mainly for compatibility with earlier releases. Prefer
other TuneD plug-ins if they cover the required functionality.
You need to correctly implement the stop action in your executable and revert all settings that you
changed during the start action. Otherwise, the roll-back step after changing your TuneD profile will
not work.
Bash scripts can import the /usr/lib/tuned/functions Bash library and use the functions defined
there. Use these functions only for functionality that is not natively provided by TuneD. If a function
name starts with an underscore, such as _wifi_set_power_level, consider the function private and do
not use it in your scripts, because it might change in the future.
Specify the path to the executable using the script parameter in the plug-in configuration.
To run a Bash script named script.sh that is located in the profile directory, use:
[script]
script=${i:PROFILE_DIR}/script.sh
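The following is a minimal sketch of such a script. The path stored in SETTING_FILE is a hypothetical placeholder; the point is only that the stop action reverts whatever the start action changed, so that profile rollback works:
#!/bin/bash
# Minimal sketch only: SETTING_FILE is a hypothetical placeholder path.
SETTING_FILE=/sys/devices/hypothetical/setting
SAVED_FILE=/run/tuned-my-profile.saved
case "$1" in
    start)
        # Remember the original value, then apply the custom one.
        cat "$SETTING_FILE" > "$SAVED_FILE"
        echo 1 > "$SETTING_FILE"
        ;;
    stop)
        # Revert everything the start action changed.
        if [ -f "$SAVED_FILE" ]; then
            cat "$SAVED_FILE" > "$SETTING_FILE"
            rm -f "$SAVED_FILE"
        fi
        ;;
esac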
sysfs
Sets various sysfs settings specified by the plug-in options.
The syntax is name=value, where name is the sysfs path to use.
Use this plug-in if you need to change settings that are not covered by other plug-ins.
Prefer specific plug-ins if they cover the required settings.
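For example, a [sysfs] section that writes a value to a sysfs path could look as follows; the path and value are illustrative only:
[sysfs]
/sys/kernel/mm/ksm/run=0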
video
Sets various powersave levels on video cards. Currently, only the Radeon cards are supported.
The powersave level can be specified by using the radeon_powersave option. Supported values are:
default
auto
low
mid
high
dynpm
dpm-battery
dpm-balanced
dpm-performance
For details, see www.x.org. Note that this plug-in is experimental and the option might change in
future releases.
bootloader
Adds options to the kernel command line. This plug-in supports only the GRUB 2 boot loader.
Customized non-standard location of the GRUB 2 configuration file can be specified by the
grub2_cfg_file option.
The kernel options are added to the current GRUB configuration and its templates. The system
needs to be rebooted for the kernel options to take effect.
Switching to another profile or manually stopping the TuneD service removes the additional options.
If you shut down or reboot the system, the kernel options persist in the grub.cfg file.
For example, to add the quiet kernel option to a TuneD profile, include the following lines in the
tuned.conf file:
[bootloader]
cmdline=quiet
The following is an example of a custom profile that adds the isolcpus=2 option to the kernel
command line:
[bootloader]
cmdline=isolcpus=2
service
Handles various sysvinit, sysv-rc, openrc, and systemd services specified by the plug-in options.
The syntax is service.service_name=command[,file:file]. The supported service-handling commands are:
start
stop
enable
disable
Separate multiple commands using either a comma (,) or a semicolon (;). If the directives conflict, the
service plugin uses the last listed one.
Use the optional file:file directive to install an overlay configuration file, file, for systemd only. Other
init systems ignore this directive. The service plugin copies overlay configuration files to
/etc/systemd/system/service_name.service.d/ directories. Once profiles are unloaded, the service
plugin removes these directories if they are empty.
NOTE
The service plugin only operates on the current runlevel with non-systemd init
systems.
Example 3.8. Starting and enabling the sendmail service with an overlay file
[service]
service.sendmail=start,enable,file:${i:PROFILE_DIR}/tuned-sendmail.conf
The internal variable ${i:PROFILE_DIR} points to the directory the plugin loads the profile from.
scheduler
Offers a variety of options for the tuning of scheduling priorities, CPU core isolation, and process,
thread, and IRQ affinities.
For specifics of the different options available, see Functionalities of the scheduler TuneD plug-in.
CPU isolation
To prevent processes, threads, and IRQs from using certain CPUs, use the isolated_cores option. It
changes process and thread affinities, IRQ affinities, and sets the default_smp_affinity parameter for
IRQs.
The CPU affinity mask is adjusted for all processes and threads matching the ps_whitelist option,
subject to success of the sched_setaffinity() system call. The default setting of the ps_whitelist
regular expression is .* to match all processes and thread names. To exclude certain processes and
threads, use the ps_blacklist option. The value of this option is also interpreted as a regular expression.
Process and thread names are matched against that expression. Profile rollback enables all matching
processes and threads to run on all CPUs, and restores the IRQ settings prior to the profile application.
Multiple regular expressions separated by ; for the ps_whitelist and ps_blacklist options are
supported. Escaped semicolon \; is taken literally.
The following configuration isolates CPUs 2-4. Processes and threads that match the ps_blacklist
regular expression can use any CPUs regardless of the isolation:
[scheduler]
isolated_cores=2-4
ps_blacklist=.*pmd.*;.*PMD.*;^DPDK;.*qemu-kvm.*
The default_irq_smp_affinity option controls the value that TuneD writes to the
/proc/irq/default_smp_affinity file. The option supports the following values:
calc
Calculates the content of the /proc/irq/default_smp_affinity file from the isolated_cores
parameter. An inversion of the isolated_cores parameter calculates the non-isolated cores.
The intersection of the non-isolated cores and the previous content of the
/proc/irq/default_smp_affinity file is then written to the /proc/irq/default_smp_affinity file.
ignore
TuneD does not modify the /proc/irq/default_smp_affinity file.
A CPU list
Takes the form of a single number such as 1, a comma-separated list such as 1,3, or a range such as
3-5.
Unpacks the CPU list and writes it directly to the /proc/irq/default_smp_affinity file.
Example 3.10. Setting the default IRQ smp affinity using an explicit CPU list
The following example uses an explicit CPU list to set the default IRQ SMP affinity to CPUs 0 and 2:
[scheduler]
isolated_cores=1,3
default_irq_smp_affinity=0,2
Scheduling policy
To adjust scheduling policy, priority and affinity for a group of processes or threads, use the following
syntax:
group.groupname=rule_prio:sched:prio:affinity:regex
where rule_prio defines internal TuneD priority of the rule. Rules are sorted based on priority. This is
needed for inheritance to be able to reorder previously defined rules. Equal rule_prio rules should be
processed in the order they were defined. However, this is Python interpreter dependent. To disable an
inherited rule for groupname, use:
group.groupname=
The sched field must be one of the following:
f
for first in, first out (FIFO)
b
for batch
r
for round robin
o
for other
*
for do not change
regex is a Python regular expression. It is matched against the output of the ps -eo cmd command.
Any given process name can match more than one group. In such cases, the last matching regex
determines the priority and scheduling policy.
The following example sets the scheduling policy and priorities for kernel threads and the watchdog:
[scheduler]
group.kthreads=0:*:1:*:\[.*\]$
group.watchdog=0:f:99:*:\[watchdog.*\]
The scheduler plugin uses a perf event loop to identify newly created processes. By default, it listens to
perf.RECORD_COMM and perf.RECORD_EXIT events.
Setting the perf_process_fork parameter to true tells the plug-in to also listen to
perf.RECORD_FORK events, meaning that child processes created by the fork() system call are
processed.
NOTE
The CPU overhead of the scheduler plugin can be mitigated by using the scheduler runtime option and
setting it to 0. This completely disables the dynamic scheduler functionality and the perf events are not
monitored and acted upon. The disadvantage of this is that the process and thread tuning will be done
only at profile application.
The following example disables the dynamic scheduler functionality while also isolating CPUs 1 and 3:
[scheduler]
runtime=0
isolated_cores=1,3
The mmapped buffer is used for perf events. Under heavy loads, this buffer might overflow and as a
result the plugin might start missing events and not processing some newly created processes. In such
cases, use the perf_mmap_pages parameter to increase the buffer size. The value of the
perf_mmap_pages parameter must be a power of 2. If the perf_mmap_pages parameter is not
manually set, a default value of 128 is used.
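For illustration, the two perf-related options described above could be combined in a profile as follows; the buffer size is just an example power of two, not a recommendation:
[scheduler]
perf_process_fork=true
perf_mmap_pages=256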
The cgroup_mount_point option specifies the path to mount the cgroup file system, or, where TuneD
expects it to be mounted. If it is unset, /sys/fs/cgroup/cpuset is expected.
If the cgroup_groups_init option is set to 1, TuneD creates and removes all cgroups defined with the
cgroup* options. This is the default behavior. If the cgroup_groups_init option is set to 0, the
cgroups must be preset by other means.
If the cgroup_mount_point_init option is set to 1, TuneD creates and removes the cgroup mount
point. It implies cgroup_groups_init = 1. If the cgroup_mount_point_init option is set to 0, you must
preset the cgroups mount point by other means. This is the default behavior.
The cgroup_for_isolated_cores option is the cgroup name for the isolated_cores option
functionality. For example, if a system has 4 CPUs, isolated_cores=1 means that TuneD moves all
processes and threads to CPUs 0, 2, and 3. The scheduler plug-in isolates the specified core by writing
the calculated CPU affinity to the cpuset.cpus control file of the specified cgroup and moves all the
matching processes and threads to this group. If this option is unset, classic cpuset affinity using
sched_setaffinity() sets the CPU affinity.
The cgroup.cgroup_name option defines affinities for arbitrary cgroups. You can even use hierarchic
cgroups, but you must specify the hierarchy in the correct order. TuneD does not do any sanity checks
here, with the exception that it forces the cgroup to be in the location specified by the
cgroup_mount_point option.
The syntax of the scheduler option starting with group. has been augmented to use
cgroup.cgroup_name instead of the hexadecimal affinity. The matching processes are moved to the
cgroup cgroup_name. You can also use cgroups not defined by the cgroup. option as described above.
For example, cgroups not managed by TuneD.
All cgroup names are sanitized by replacing all periods (.) with slashes (/). This prevents the plugin from
writing outside the location specified by the cgroup_mount_point option.
The following example creates 2 cgroups, group1 and group2. It sets the cgroup group1 affinity to
CPU 2 and the cgroup group2 to CPUs 0 and 2. Given a 4 CPU setup, the isolated_cores=1 option
moves all processes and threads to CPU cores 0, 2, and 3. Processes and threads specified by the
ps_blacklist regular expression are not moved.
[scheduler]
cgroup_mount_point=/sys/fs/cgroup/cpuset
cgroup_mount_point_init=1
cgroup_groups_init=1
cgroup_for_isolated_cores=group
cgroup.group1=2
cgroup.group2=0,2
group.ksoftirqd=0:f:2:cgroup.group1:ksoftirqd.*
ps_blacklist=ksoftirqd.*;rcuc.*;rcub.*;ktimersoftd.*
isolated_cores=1
The cgroup_ps_blacklist option excludes processes belonging to the specified cgroups. The regular
expression specified by this option is matched against cgroup hierarchies from /proc/PID/cgroups.
Commas (,) separate cgroups v1 hierarchies from /proc/PID/cgroups before regular expression
matching. The following is an example of content the regular expression is matched against:
10:hugetlb:/,9:perf_event:/,8:blkio:/
Multiple regular expressions can be separated by semicolons (;). The semicolon represents a logical 'or'
operator.
In the following example, the scheduler plug-in moves all processes away from core 1, except for
processes which belong to cgroup /daemons. The \b string is a regular expression metacharacter
that matches a word boundary.
[scheduler]
isolated_cores=1
cgroup_ps_blacklist=:/daemons\b
In the following example, the scheduler plugin excludes all processes which belong to a cgroup with a
hierarchy-ID of 8 and controller-list blkio.
[scheduler]
isolated_cores=1
cgroup_ps_blacklist=\b8:blkio:
Recent kernels moved some sched_ and numa_balancing_ kernel run-time parameters from the
/proc/sys/kernel directory managed by the sysctl utility, to debugfs, typically mounted under the
/sys/kernel/debug directory. TuneD provides an abstraction mechanism for the following parameters
via the scheduler plugin where, based on the kernel used, TuneD writes the specified value to the
correct location:
sched_min_granularity_ns
sched_latency_ns
sched_wakeup_granularity_ns
sched_tunable_scaling
sched_migration_cost_ns
sched_nr_migrate
numa_balancing_scan_delay_ms
numa_balancing_scan_period_min_ms
numa_balancing_scan_period_max_ms
numa_balancing_scan_size_mb
Example 3.15. Set tasks' "cache hot" value for migration decisions.
On older kernels, setting the following parameter meant that sysctl wrote a value of
500000 to the /proc/sys/kernel/sched_migration_cost_ns file:
[sysctl]
kernel.sched_migration_cost_ns=500000
On more recent kernels, this is equivalent to setting the following parameter via the
scheduler plugin:
[scheduler]
sched_migration_cost_ns=500000
Using TuneD variables reduces the amount of necessary typing in TuneD profiles.
There are no predefined variables in TuneD profiles. You can define your own variables by creating the
[variables] section in a profile and using the following syntax:
[variables]
variable_name=value
To reference the variable later in the profile, use the following syntax:
${variable_name}
In the following example, the ${isolated_cores} variable expands to 1,2; hence the kernel boots with
the isolcpus=1,2 option:
[variables]
isolated_cores=1,2
[bootloader]
cmdline=isolcpus=${isolated_cores}
The variables can be specified in a separate file. For example, you can add the following lines to
tuned.conf:
[variables]
include=/etc/tuned/my-variables.conf
[bootloader]
cmdline=isolcpus=${isolated_cores}
If you add the isolated_cores=1,2 option to the /etc/tuned/my-variables.conf file, the kernel boots
with the isolcpus=1,2 option.
Additional resources
You can:
Create custom functions in Python and add them to TuneD in the form of plug-ins.
Functions are called from TuneD profiles using the following syntax:
${f:function_name:argument_1:argument_2}
To expand the directory path where the profile and the tuned.conf file are located, use the
PROFILE_DIR function, which requires special syntax:
${i:PROFILE_DIR}
Example 3.17. Isolating CPU cores using variables and built-in functions
In the following example, the ${non_isolated_cores} variable expands to 0,3-5, and the
cpulist_invert built-in function is called with the 0,3-5 argument:
[variables]
non_isolated_cores=0,3-5
[bootloader]
cmdline=isolcpus=${f:cpulist_invert:${non_isolated_cores}}
The cpulist_invert function inverts the list of CPUs. For a 6-CPU machine, the inversion is 1,2, and
the kernel boots with the isolcpus=1,2 command-line option.
Additional resources
PROFILE_DIR
Returns the directory path where the profile and the tuned.conf file are located.
exec
Executes a process and returns its output.
assertion
Compares two arguments. If they do not match, the function logs text from the first argument and
aborts profile loading.
assertion_non_equal
Compares two arguments. If they match, the function logs text from the first argument and aborts
profile loading.
kb2s
Converts kilobytes to disk sectors.
s2kb
Converts disk sectors to kilobytes.
strip
Creates a string from all passed arguments and deletes both leading and trailing white space.
virt_check
Checks whether TuneD is running inside a virtual machine (VM) or on bare metal:
Inside a VM, the function returns the first argument.
On bare metal, the function returns the second argument, even in case of an error.
cpulist_invert
Inverts a list of CPUs to make its complement. For example, on a system with 4 CPUs, numbered
from 0 to 3, the inversion of the list 0,2,3 is 1.
cpulist2hex
Converts a CPU list to a hexadecimal CPU mask.
cpulist2hex_invert
Converts a CPU list to a hexadecimal CPU mask and inverts it.
hex2cpulist
Converts a hexadecimal CPU mask to a CPU list.
cpulist_online
Checks whether the CPUs from the list are online. Returns the list containing only online CPUs.
cpulist_present
Checks whether the CPUs from the list are present. Returns the list containing only present CPUs.
cpulist_unpack
Unpacks a CPU list in the form of 1-3,4 to 1,2,3,4.
cpulist_pack
Packs a CPU list in the form of 1,2,3,5 to 1-3,5.
Prerequisites
The TuneD service is running. See Installing and Enabling TuneD for details.
Procedure
1. In the /etc/tuned/ directory, create a new directory named the same as the profile that you want
to create:
# mkdir /etc/tuned/my-profile
2. In the new directory, create a file named tuned.conf. Add a [main] section and plug-in
definitions in it, according to your requirements.
For example, see the configuration of the balanced profile:
[main]
summary=General non-specialized TuneD profile
[cpu]
governor=conservative
energy_perf_bias=normal
[audio]
timeout=10
[video]
radeon_powersave=dpm-balanced, auto
[scsi_host]
alpm=medium_power
3. To activate the new profile, use:
# tuned-adm profile my-profile
4. Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
$ tuned-adm verify
Additional resources
Prerequisites
The TuneD service is running. See Installing and Enabling TuneD for details.
Procedure
1. In the /etc/tuned/ directory, create a new directory named the same as the profile that you want
to create:
# mkdir /etc/tuned/modified-profile
2. In the new directory, create a file named tuned.conf, and set the [main] section as follows:
[main]
include=parent-profile
Replace parent-profile with the name of the profile you are modifying.
3. Include your profile modifications. For example, to use the settings from the throughput-performance
profile and change the value of vm.swappiness to 5, instead of the default 10, use:
[main]
include=throughput-performance
[sysctl]
vm.swappiness=5
4. To activate the modified profile, use:
# tuned-adm profile modified-profile
5. Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
$ tuned-adm verify
Additional resources
In the commands and configuration in this procedure, replace:
device with the name of the block device, for example sdf
selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Prerequisites
The TuneD service is installed and enabled. For details, see Installing and enabling TuneD .
Procedure
1. Optional: Select an existing TuneD profile on which your profile will be based. For a list of
available profiles, see TuneD profiles distributed with RHEL .
To see which profile is currently active, use:
$ tuned-adm active
2. Create a new directory to hold your profile:
# mkdir /etc/tuned/my-profile
3. Find the system unique identifier of the specified block device. The udev properties of the device
include values similar to the following:
ID_WWN=0x5002538d00000000_
ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
ID_SERIAL_SHORT=20120501030900000
NOTE
The command in this example will return all values identified as a World Wide
Name (WWN) or serial number associated with the specified block device.
Although it is preferred to use a WWN, the WWN is not always available for a
given device and any values returned by the example command are acceptable to
use as the device system unique ID.
4. Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following
options:
a. Optional: Include an existing TuneD profile:
[main]
include=existing-profile
b. Set the selected disk scheduler for the device that matches the WWN identifier:
[disk]
devices_udev_regex=IDNAME=device system unique id
elevator=selected-scheduler
Here:
Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
To match multiple devices in the devices_udev_regex option, enclose the identifiers in
parentheses and separate them with vertical bars:
devices_udev_regex=(ID_WWN=0x5002538d00000000)|
(ID_WWN=0x1234567800000000)
Verification steps
1. Verify that the TuneD profile is active and the system settings are applied:
$ tuned-adm active
$ tuned-adm verify
2. Verify that the requested scheduler is set for the device:
# cat /sys/block/device/queue/scheduler
In the file name, replace device with the block device name, for example sdc.
Additional resources
Procedure
Install the tuna tool:
# yum install tuna
Verification steps
Display the tuna help output to verify the installation:
# tuna -h
Additional resources
Prerequisites
The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
To view the current threads, use:
# tuna --show_threads
thread
pid SCHED_ rtpri affinity cmd
1 OTHER 0 0,1 init
2 FIFO 99 0 migration/0
3 OTHER 0 0 ksoftirqd/0
4 FIFO 99 0 watchdog/0
To tune CPUs using the tuna CLI, see Tuning CPUs using tuna tool .
To tune the IRQs using the tuna tool, see Tuning IRQs using tuna tool .
To save the current configuration, use:
# tuna --save=filename
This command saves only currently running kernel threads. Processes that are not running are
not saved.
Additional resources
Using the tuna tool, you can:
Isolate CPUs
All tasks running on the specified CPU move to the next available CPU. Isolating a CPU makes it
unavailable by removing it from the affinity mask of all threads.
Include CPUs
Allows tasks to run on the specified CPU
Restore CPUs
Restores the specified CPU to its previous configuration.
This procedure describes how to tune CPUs using the tuna CLI.
Prerequisites
The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
The cpu_list argument is a list of comma-separated CPU numbers. For example, --cpus=0,2.
CPU lists can also be specified in a range, for example --cpus="1-3", which would select CPUs 1,
2, and 3.
To add a specific CPU to the current cpu_list, for example, use --cpus=+0.
To isolate a CPU:
# tuna --cpus=cpu_list --isolate
To include a CPU:
# tuna --cpus=cpu_list --include
On a system with four or more processors, to make all ssh threads run on CPUs 0 and 1 and all
http threads on CPUs 2 and 3:
3. Moves the selected threads to the selected CPUs. Tuna sets the affinity mask of threads
starting with ssh to the appropriate CPUs. The CPUs can be expressed numerically as 0
and 1, in hex mask as 0x3, or in binary as 11.
6. Moves the selected threads to the specified CPUs. Tuna sets the affinity mask of threads
starting with http to the specified CPUs. The CPUs can be expressed numerically as 2 and
3, in hex mask as 0xC, or in binary as 1100.
Verification steps
Display the current configuration and verify that the changes were performed as expected:
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
3861 OTHER 0 0,1 33997 58 gnome-screensav
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
3861 OTHER 0 0 33997 58 gnome-screensav
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
2. Displays the selected threads to enable the user to verify their affinity mask and RT priority.
3. Selects CPU 0.
10. Moves the gnome-sc threads to the specified CPUs, CPUs 0 and 1.
Additional resources
/proc/cpuinfo file
This procedure describes how to tune the IRQs using the tuna tool.
Prerequisites
The tuna tool is installed. For more information, see Installing tuna tool.
Procedure
# tuna --show_irqs
# users affinity
0 timer 0
1 i8042 0
7 parport0 0
To move an interrupt to a specified CPU, use:
# tuna --irqs=128 --cpus=3 --move
Replace 128 with the irq_list argument and 3 with the cpu_list argument.
The cpu_list argument is a list of comma-separated CPU numbers, for example, --cpus=0,2. For
more information, see Tuning CPUs using tuna tool .
Verification steps
Compare the state of the selected IRQs before and after moving any interrupt to a specified
CPU:
Additional resources
/proc/interrupts file
CHAPTER 5. MONITORING PERFORMANCE USING RHEL SYSTEM ROLES
Prerequisites
RHEL 8.6 or later is installed. For more information about installing RHEL, see Performing a
standard RHEL 8 installation.
NOTE
In RHEL 8.5 and earlier versions, Ansible packages were provided through Ansible
Engine instead of Ansible Core, and with a different level of support. Do not use
Ansible Engine because the packages might not be compatible with Ansible
automation content in RHEL 8.6 and later. For more information, see Scope of
support for the Ansible Core package included in the RHEL 9 and RHEL 8.6 and
later AppStream repositories.
Procedure
Switch to the ansible user account:
[root@control-node]# su - ansible
Create an SSH key pair for this user:
[ansible@control-node]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ansible/.ssh/id_rsa):
Enter passphrase (empty for no passphrase): <password>
Enter same passphrase again: <password>
...
5. Optional: To prevent Ansible from prompting you for the SSH key password each time you
establish a connection, configure an SSH agent.
6. Create the ~/.ansible.cfg file in the ansible user's home directory with the following content:
[defaults]
inventory = /home/ansible/inventory
remote_user = ansible
[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = True
NOTE
Settings in the ~/.ansible.cfg file have a higher priority and override settings
from the global /etc/ansible/ansible.cfg file.
With these settings, Ansible:
Uses the account set in the remote_user parameter when it establishes SSH connections to
managed nodes.
Uses the sudo utility to execute tasks on managed nodes as the root user.
Prompts for the root password of the remote user every time you apply a playbook. This is
recommended for security reasons.
7. Create an ~/inventory file in INI or YAML format that lists the hostnames of managed hosts.
You can also define groups of hosts in the inventory file. For example, the following is an
inventory file in the INI format with three hosts and one host group named US:
managed-node-01.example.com
[US]
managed-node-02.example.com ansible_host=192.0.2.100
managed-node-03.example.com
Note that the control node must be able to resolve the hostnames. If the DNS server cannot
resolve certain hostnames, add the ansible_host parameter next to the host entry to specify its
IP address.
Next steps
Prepare the managed nodes. For more information, see Preparing a managed node .
Additional resources
Scope of support for the Ansible Core package included in the RHEL 9 and RHEL 8.6 and later
AppStream repositories
How to register and subscribe a system to the Red Hat Customer Portal using subscription-
manager
Prerequisites
You prepared the control node. For more information, see Preparing a control node on RHEL 8 .
IMPORTANT
Direct SSH access as the root user is a security risk. To reduce this risk, you will
create a local user on this node and configure a sudo policy when preparing a
managed node. Ansible on the control node can then use the local user account
to log in to the managed node and run playbooks as different users, such as root.
Procedure
1. Create a user named ansible:
# useradd ansible
The control node later uses this user to establish an SSH connection to this host.
2. Set a password for the ansible user:
# passwd ansible
You must enter this password when Ansible uses sudo to perform tasks as the root user.
3. Install the ansible user’s SSH public key on the managed node:
a. Log in to the control node as the ansible user, and copy the SSH public key to the managed
node:
d. Verify the SSH connection by remotely executing a command on the control node:
a. Create and edit the /etc/sudoers.d/ansible file by using the visudo command:
The benefit of using visudo over a normal editor is that this utility provides basic sanity
checks and checks for parse errors before installing the file.
To grant permissions to the ansible user to run all commands as any user and group on
this host after entering the ansible user's password, use:
ansible ALL=(ALL) ALL
To grant permissions to the ansible user to run all commands as any user and group on
this host without entering the ansible user's password, use:
ansible ALL=(ALL) NOPASSWD: ALL
Alternatively, configure a more fine-grained policy that matches your security requirements.
For further details on sudoers policies, see the sudoers(5) man page.
Verification
1. Verify that you can execute commands from the control node on all managed nodes:
$ ansible all -m ping
The hard-coded all group dynamically contains all hosts listed in the inventory file.
2. Verify that privilege escalation works correctly by running the whoami utility on a managed host
by using the Ansible command module:
If the command returns root, you configured sudo on the managed nodes correctly.
Additional resources
be monitored by the local system. The metrics System Role enables you to use pcp to monitor your
system's performance without having to configure pcp separately, as the set-up and deployment of pcp
is handled by the playbook.
Prerequisites
You have the rhel-system-roles package installed on the machine you want to monitor.
Procedure
1. Configure localhost in the /etc/ansible/hosts Ansible inventory by adding the following content
to the inventory:
localhost ansible_connection=local
---
- name: Manage metrics
hosts: localhost
vars:
metrics_graph_service: yes
metrics_manage_firewall: true
metrics_manage_selinux: true
roles:
- rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml
4. To view visualization of the metrics being collected on your machine, access the grafana web
interface as described in Accessing the Grafana web UI .
Prerequisites
You have the rhel-system-roles package installed on the machine you want to use to run the
playbook.
Procedure
1. Add the name or IP address of the machines you want to monitor via the playbook to the
/etc/ansible/hosts Ansible inventory file under an identifying group name enclosed in brackets:
[remotes]
webserver.example.com
database.example.com
---
- hosts: remotes
vars:
metrics_retention_days: 0
metrics_manage_firewall: true
metrics_manage_selinux: true
roles:
- rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml -k
Prerequisites
You have the rhel-system-roles package installed on the machine you want to use to run the
playbook.
Procedure
---
- hosts: localhost
vars:
metrics_graph_service: yes
metrics_query_service: yes
metrics_retention_days: 10
metrics_monitored_hosts: ["database.example.com", "webserver.example.com"]
metrics_manage_firewall: yes
metrics_manage_selinux: yes
roles:
- rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml
3. To view a graphical representation of the metrics being collected centrally by your machine and
to query the data, access the grafana web interface as described in Accessing the Grafana web
UI.
Prerequisites
You have the rhel-system-roles package installed on the machine you want to use to run the
playbook.
Procedure
1. Include the following variables in the Ansible playbook you want to setup authentication for:
---
vars:
metrics_username: your_username
metrics_password: your_password
metrics_manage_firewall: true
metrics_manage_selinux: true
# ansible-playbook name_of_your_playbook.yml
Verification steps
Prerequisites
You have the rhel-system-roles package installed on the machine you want to monitor.
You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a
'trusted' connection to an SQL server. See Install SQL Server and create a database on Red Hat .
You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux. See
Red Hat Enterprise Server and Oracle Linux .
Procedure
1. Configure localhost in the /etc/ansible/hosts Ansible inventory by adding the following content
to the inventory:
localhost ansible_connection=local
---
- hosts: localhost
vars:
metrics_from_mssql: true
metrics_manage_firewall: true
metrics_manage_selinux: true
roles:
- role: rhel-system-roles.metrics
# ansible-playbook name_of_your_playbook.yml
Verification steps
Use the pcp command to verify that the SQL Server PMDA agent (mssql) is loaded and running:
# pcp
platform: Linux rhel82-2.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC
2019 x86_64
hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
timezone: PDT+7
services: pmcd pmproxy
pmcd: Version 5.0.2-1, 12 agents, 4 clients
pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql
jbd2 dm
pmlogger: primary logger: /var/log/pcp/pmlogger/rhel82-2.local/20200326.16.31
pmie: primary engine: /var/log/pcp/pmie/rhel82-2.local/pmie.log
Additional resources
For more information about using Performance Co-Pilot for Microsoft SQL Server, see this Red
Hat Developers Blog post.
You can analyze data patterns by comparing live results with archived data.
Features of PCP:
Light-weight distributed architecture, which is useful during the centralized analysis of complex
systems.
The Performance Metric Collector Daemon (pmcd) collects performance data from the
installed Performance Metric Domain Agents (pmda). PMDAs can be individually loaded or
unloaded on the system and are controlled by the PMCD on the same host.
Various client tools, such as pminfo or pmstat, can retrieve, display, archive, and process this
data on the same host or over the network.
The pcp package provides the command-line tools and underlying functionality.
The pcp-gui package provides the graphical application. Install the pcp-gui package by
executing the yum install pcp-gui command. For more information, see Visually tracing PCP
log archives with the PCP Charts application.
Additional resources
/usr/share/doc/pcp-doc/ directory
Index of Performance Co-Pilot (PCP) articles, solutions, tutorials, and white papers on the
Red Hat Customer Portal
Side-by-side comparison of PCP tools with legacy tools Red Hat Knowledgebase article
This procedure describes how to install PCP using the pcp package. If you want to automate the PCP
installation, install it using the pcp-zeroconf package. For more information about installing PCP by
using pcp-zeroconf, see Setting up PCP with pcp-zeroconf.
Procedure
1. Install the pcp package:
# yum install pcp
2. Enable and start the pmcd service:
# systemctl enable pmcd
# systemctl start pmcd
Verification steps
Verify that the pmcd process is running:
# pcp
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019
x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents
pmda: root pmcd proc xfs linux mmv kvm jbd2
Additional resources
You can analyze the resulting tar.gz file and the archive of the pmlogger output using various PCP
tools and compare them with other sources of performance information.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
5. Save the output to a tar.gz file named after the host name and the current date and time:
# cd /var/log/pcp/pmlogger/
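One possible tar invocation is sketched below; the file name pattern, and the assumption that the pmlogger archives are stored in a subdirectory named after the host, are illustrative:
# tar -czf $(hostname).$(date +%F-%Hh%M).pcp.tar.gz $(hostname)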
Extract this file and analyze the data using PCP tools.
Additional resources
pmcd
The Performance Metric Collector Daemon (PMCD).
pmie
The Performance Metrics Inference Engine.
pmlogger
The performance metrics logger.
pmproxy
The realtime and historical performance metrics proxy, time series query and REST API service.
pcp
Displays the current status of a Performance Co-Pilot installation.
pcp-vmstat
Provides a high-level system performance overview every 5 seconds. Displays information about
processes, memory, paging, block IO, traps, and CPU activity.
pmconfig
Displays the values of configuration parameters.
pmdiff
Compares the average values for every metric in either one or two archives, in a given time window,
for changes that are likely to be of interest when searching for performance regressions.
pmdumplog
Displays control, metadata, index, and state information from a Performance Co-Pilot archive file.
pmfind
Finds PCP services on the network.
pmie
An inference engine that periodically evaluates a set of arithmetic, logical, and rule expressions. The
metrics are collected either from a live system, or from a Performance Co-Pilot archive file.
pmieconf
Displays or sets configurable pmie variables.
pmiectl
Manages non-primary instances of pmie.
pminfo
Displays information about performance metrics. The metrics are collected either from a live system,
or from a Performance Co-Pilot archive file.
pmlc
Interactively configures active pmlogger instances.
pmlogcheck
Identifies invalid data in a Performance Co-Pilot archive file.
pmlogconf
Creates and modifies a pmlogger configuration file.
pmlogctl
Manages non-primary instances of pmlogger.
pmloglabel
Verifies, modifies, or repairs the label of a Performance Co-Pilot archive file.
pmlogsummary
Calculates statistical information about performance metrics stored in a Performance Co-Pilot
archive file.
pmprobe
Determines the availability of performance metrics.
pmsocks
Allows access to Performance Co-Pilot hosts through a firewall.
pmstat
Periodically displays a brief summary of system performance.
pmstore
Modifies the values of performance metrics.
pmtrace
Provides a command line interface to the trace PMDA.
pmval
Displays the current value of a performance metric.
pcp-atop
Shows the system-level occupation of the most critical hardware resources from the performance
point of view: CPU, memory, disk, and network.
pcp-atopsar
Generates a system-level activity report over a variety of system resource utilization. The report is
generated from a raw logfile previously recorded using pmlogger or the -w option of pcp-atop.
pcp-dmcache
Displays information about configured Device Mapper Cache targets, such as: device IOPs, cache
and metadata device utilization, as well as hit and miss rates and ratios for both reads and writes for
each cache device.
pcp-dstat
Displays metrics of one system at a time. To display metrics of multiple systems, use the --host option.
pcp-free
Reports on free and used memory in a system.
pcp-htop
Displays all processes running on a system along with their command line arguments in a manner
similar to the top command, but allows you to scroll vertically and horizontally as well as interact using
a mouse. You can also view processes in a tree format and select and act on multiple processes at
once.
pcp-ipcs
Displays information about the inter-process communication (IPC) facilities that the calling process
has read access for.
pcp-mpstat
Reports CPU and interrupt-related statistics.
pcp-numastat
Displays NUMA allocation statistics from the kernel memory allocator.
pcp-pidstat
Displays information about individual tasks or processes running on the system, such as CPU
percentage, memory and stack usage, scheduling, and priority. Reports live data for the local host by
default.
pcp-shping
Samples and reports on the shell-ping service metrics exported by the pmdashping Performance
Metrics Domain Agent (PMDA).
pcp-ss
Displays socket statistics collected by the pmdasockets PMDA.
pcp-tapestat
pmchart
Plots performance metrics values available through the facilities of the Performance Co-Pilot.
pmdumptext
Outputs the values of performance metrics collected live or from a Performance Co-Pilot archive.
pmclient
Displays high-level system performance metrics by using the Performance Metrics Application
Programming Interface (PMAPI).
pmdbg
Displays available Performance Co-Pilot debug control flags and their values.
pmerr
Displays available Performance Co-Pilot error codes and their corresponding error messages.
Based on the recommended deployment set up by Red Hat, sizing factors, and configuration options, the
available scaling deployment setup variants include:
NOTE
Since PCP version 5.3.0 is unavailable in Red Hat Enterprise Linux 8.4 and earlier
minor versions of Red Hat Enterprise Linux 8, Red Hat recommends localhost and
pmlogger farm architectures.
For more information about known memory leaks in pmproxy in PCP versions before
5.3.0, see Memory leaks in pmproxy in PCP .
Localhost
Each service runs locally on the monitored machine. When you start a service without any
configuration changes, this is the default deployment. Scaling beyond the individual node is not
possible in this case.
By default, the deployment setup for Redis is standalone, localhost. However, Redis can optionally
perform in a highly-available and highly scalable clustered fashion, where data is shared across
multiple hosts. Another viable option is to deploy a Redis cluster in the cloud, or to utilize a managed
Redis cluster from a cloud vendor.
Decentralized
The only difference between the localhost and decentralized setups is the centralized Redis service. In
this model, the pmlogger service runs on each monitored host and retrieves metrics from a
local pmcd instance. A local pmproxy service then exports the performance metrics to a central
Redis instance.
Redis instance.
NOTE
By default, the deployment setup for Redis is standalone, localhost. However, Redis can
optionally perform in a highly-available and highly scalable clustered fashion, where data
is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the
cloud, or to utilize a managed Redis cluster from a cloud vendor.
Additional resources
After every PCP upgrade, the pmlogrewrite tool is executed and rewrites old archives if there were
changes in the metric metadata between the previous version and the new version of PCP. The duration
of this process scales linearly with the number of archives stored.
Additional resources
stream.expire specifies the duration when stale metrics should be removed, that is metrics
which were not updated in a specified amount of time in seconds.
stream.maxlen specifies the maximum number of metric values for one metric per host. This
setting should be the retention time divided by the logging interval, for example 20160 for 14
days of retention and a 60s logging interval (60*60*24*14/60).
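As an orientation only, these parameters are written in the pmproxy configuration file. The sketch below assumes the /etc/pcp/pmproxy/pmproxy.conf file and a [pmseries] section; verify the section name against the comments in your installed file, and treat both values as illustrative:
[pmseries]
stream.expire = 86400
stream.maxlen = 20160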
Additional resources
The following results were gathered on a centralized logging setup, also known as pmlogger farm
deployment, with a default pcp-zeroconf 5.3.0 installation, where each remote host is an identical
container instance running pmcd on a server with 64 CPU cores, 376 GB RAM, and one disk attached.
The logging interval is 10s, proc metrics of remote nodes are not included, and the memory values refer
to the Resident Set Size (RSS) value.
Table 6.3. Used resources depending on monitored hosts for 60s logging interval
NOTE
The pmproxy queues Redis requests and employs Redis pipelining to speed up Redis
queries. This can result in high memory usage. For troubleshooting this issue, see
Troubleshooting high memory usage.
This setup of the pmlogger farms is identical to the configuration mentioned in the
Example: Analyzing the centralized logging deployment for 60s logging interval, except that the Redis
servers were operating in cluster mode.
Table 6.4. Used resources depending on federated hosts for 60s logging interval
Table columns: PCP Archives Storage per Day, pmlogger Memory, Network per Day (In/Out), pmproxy
Memory, Redis Memory per Day.
Here, all values are per host. The network bandwidth is higher due to the inter-node communication of
the Redis cluster.
The pmproxy process is busy processing new PCP archives and does not have spare CPU
cycles to process Redis requests and responses.
The Redis node or cluster is overloaded and cannot process incoming requests on time.
The pmproxy service daemon uses Redis streams and supports configuration parameters, which are
PCP tuning parameters that affect Redis memory usage and key retention. The
/etc/pcp/pmproxy/pmproxy.conf file lists the available configuration options for pmproxy and the
associated APIs.
The following procedure describes how to troubleshoot the high memory usage issue.
Prerequisites
Procedure
To troubleshoot high memory usage, execute the following command and observe the inflight
column:
$ pmrep :pmproxy
backlog inflight reqs/s resp/s wait req err resp err changed throttled
byte count count/s count/s s/s count/s count/s count/s count/s
14:59:08 0 0 N/A N/A N/A N/A N/A N/A N/A
14:59:09 0 0 2268.9 2268.9 28 0 0 2.0 4.0
14:59:10 0 0 0.0 0.0 0 0 0 0.0 0.0
14:59:11 0 0 0.0 0.0 0 0 0 0.0 0.0
This column shows how many Redis requests are in-flight, which means they are queued or sent,
and no reply was received so far.
The pmproxy process is busy processing new PCP archives and does not have spare CPU
cycles to process Redis requests and responses.
The Redis node or cluster is overloaded and cannot process incoming requests on time.
To troubleshoot the high memory usage issue, reduce the number of pmlogger processes for
this farm, and add another pmlogger farm. Use the federated - multiple pmlogger farms setup.
If the Redis node is using 100% CPU for an extended amount of time, move it to a host with
better performance or use a clustered Redis setup instead.
To view how many Redis requests are inflight, see the pmproxy.redis.requests.inflight.total
metric and pmproxy.redis.requests.inflight.bytes metric to view how many bytes are
occupied by all current inflight Redis requests.
In general, the Redis request queue is zero, but it can build up based on the usage of large
pmlogger farms, which limits scalability and can cause high latency for pmproxy clients.
Use the pminfo command to view information about performance metrics. For example, to view
the redis.* metrics, use the following command:
# pminfo -f redis
Additional resources
CHAPTER 7. LOGGING PERFORMANCE DATA WITH PMLOGGER
Specify which metrics are recorded on the system and how often
Use the pmlogconf utility to check the default configuration. If the pmlogger configuration file does
not exist, pmlogconf creates it with default metric values.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default
2. Follow pmlogconf prompts to enable or disable groups of related performance metrics and to
control the logging interval for each enabled group.
Additional resources
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
[access]
disallow * : all;
allow localhost : enquire;
Additional resources
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
Verification steps
# pcp
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019
x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents, 1 client
pmda: root pmcd proc xfs linux mmv kvm jbd2
pmlogger: primary logger: /var/log/pcp/pmlogger/workstation/20190827.15.54
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
Replace 192.168.4.62 with the IP address the client should listen on.
# firewall-cmd --reload
success
# setsebool -P pcp_bind_all_unreserved_ports on
Verification steps
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Client is configured for metrics collection. For more information, see Setting up a client system
for metrics collection.
Procedure
NOTE
In Red Hat Enterprise Linux 8.0, 8.1, and 8.2, use the following format for remote
hosts in the control file: PCP_LOG_DIR/pmlogger/host_name.
Verification steps
Ensure that you can access the latest archive file from each directory:
The archive files from the /var/log/pcp/pmlogger/ directory can be used for further analysis and
graphing.
Additional resources
/var/lib/pcp/config/pmlogger/config.default file
Parse the selected PCP log archive and export the values into an ASCII table
Extract the entire archive log or only select metric values from the log by specifying individual
metrics on the command line
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
The pmlogger service is enabled. For more information, see Enabling the pmlogger service.
Procedure
$ pmrep --start @3:00am --archive 20211128 --interval 5seconds --samples 10 --output csv
disk.dev.write
Time,"disk.dev.write-sda","disk.dev.write-sdb"
2021-11-28 03:00:00,,
2021-11-28 03:00:05,4.000,5.200
2021-11-28 03:00:10,1.600,7.600
2021-11-28 03:00:15,0.800,7.100
2021-11-28 03:00:20,16.600,8.400
2021-11-28 03:00:25,21.400,7.200
2021-11-28 03:00:30,21.200,6.800
2021-11-28 03:00:35,21.000,27.600
2021-11-28 03:00:40,12.400,33.800
2021-11-28 03:00:45,9.800,20.600
The example above displays the data for the disk.dev.write metric collected in an archive
at a 5-second interval, in comma-separated-value format.
Additional resources
CHAPTER 8. MONITORING PERFORMANCE WITH PERFORMANCE CO-PILOT
As a system administrator, you can monitor the system’s performance using the PCP application in
Red Hat Enterprise Linux 8.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
The pmlogger service is enabled. For more information, see Enabling the pmlogger service.
Procedure
3. Enable the SELinux boolean, so that pmda-postfix can access the required log files:
# setsebool -P pcp_read_generic_logs=on
# cd /var/lib/pcp/pmdas/postfix/
# ./Install
Verification steps
# pminfo postfix
postfix.received
postfix.sent
postfix.queues.incoming
postfix.queues.maildrop
postfix.queues.hold
postfix.queues.deferred
postfix.queues.active
Additional resources
8.2. VISUALLY TRACING PCP LOG ARCHIVES WITH THE PCP CHARTS
APPLICATION
After recording metric data, you can replay the PCP log archives as graphs. The metrics are sourced
from one or more live hosts with alternative options to use metric data from PCP log archives as a
source of historical data. To customize the PCP Charts application interface to display the data from
the performance metrics, you can use line plot, bar graphs, or utilization graphs.
Replay the data in the PCP Charts application and use graphs to visualize the
retrospective data alongside live data of the system.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Logged performance data with the pmlogger. For more information, see Logging performance
data with pmlogger.
Procedure
# pmchart
The pmtime server settings are located at the bottom. The start and pause button allows you
to control:
The interval in which PCP polls the metric data
The date and time for the metrics of historical data
2. Click File and then New Chart to select metric from both the local machine and remote
machines by specifying their host name or address. Advanced configuration options include the
ability to manually set the axis values for the chart, and to manually choose the color of the
plots.
Click File and then Export to save an image of the current view.
Click Record and then Start to start a recording. Click Record and then Stop to stop the
recording. After stopping the recording, the recorded metrics are archived to be viewed
later.
4. Optional: In the PCP Charts application, the main configuration file, known as the view, allows
the metadata associated with one or more charts to be saved. This metadata describes all chart
aspects, including the metrics used and the chart columns. Save the custom view configuration
by clicking File and then Save View, and load the view configuration later.
The following example of the PCP Charts application view configuration file describes a
stacking chart graph showing the total number of bytes read and written to the given XFS file
system loop1:
#kmchart
version 1
Additional resources
This procedure describes how to collect data for Microsoft SQL Server via pcp on your system.
Prerequisites
You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a
'trusted' connection to an SQL server.
You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux.
Procedure
1. Install PCP:
b. Edit the /etc/pcp/mssql/mssql.conf file to configure the SQL server account’s username
and password for the mssql agent. Ensure that the account you configure has access rights
to performance data.
username: user_name
password: user_password
Replace user_name with the SQL Server account and user_password with the SQL Server
user password for this account.
# cd /var/lib/pcp/pmdas/mssql
# ./Install
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check mssql metrics have appeared ... 168 metrics and 598 values
[...]
Verification steps
Using the pcp command, verify if the SQL Server PMDA (mssql) is loaded and running:
$ pcp
Performance Co-Pilot configuration on rhel.local:
platform: Linux rhel.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019
x86_64
hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM
timezone: PDT+7
services: pmcd pmproxy
pmcd: Version 5.0.2-1, 12 agents, 4 clients
pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql
jbd2 dm
pmlogger: primary logger: /var/log/pcp/pmlogger/rhel.local/20200326.16.31
pmie: primary engine: /var/log/pcp/pmie/rhel.local/pmie.log
View the complete list of metrics that PCP can collect from the SQL Server:
# pminfo mssql
After viewing the list of metrics, you can report the rate of transactions. For example, to report
on the overall transaction count per second, over a five second time window:
# pmval -t 1 -T 5 mssql.databases.transactions
View the graphical chart of these metrics on your system by using the pmchart command. For
more information, see Visually tracing PCP log archives with the PCP Charts application .
Additional resources
Performance Co-Pilot for Microsoft SQL Server with RHEL 8.2 Red Hat Developers Blog post
CHAPTER 9. PERFORMANCE ANALYSIS OF XFS WITH PCP
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
# cd /var/lib/pcp/pmdas/xfs/
xfs]# ./Install
3. Select the intended PMDA role by entering c for collector, m for monitor, or b for both. The
PMDA installation script prompts you to specify one of the following PMDA roles:
The collector role allows the collection of performance metrics on the current system
The monitor role allows the system to monitor local systems, remote systems, or both
The default option is both collector and monitor, which allows the XFS PMDA to operate
correctly in most scenarios.
Verification steps
Verify that the pmcd process is running on the host and the XFS PMDA is listed as enabled in
the configuration:
# pcp
platform: Linux workstation 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019
x86_64
hardware: 12 cpus, 2 disks, 1 node, 36023MB RAM
timezone: CEST-2
services: pmcd
pmcd: Version 4.3.0-1, 8 agents
pmda: root pmcd proc xfs linux mmv kvm jbd2
Additional resources
The pminfo command provides per-device XFS metrics for each mounted XFS file system.
This procedure displays a list of all available metrics provided by the XFS PMDA.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
Display the list of all available metrics provided by the XFS PMDA:
# pminfo xfs
Display information for the individual metrics. The following examples examine specific XFS
read and write metrics using the pminfo tool:
# pminfo -T xfs.read_bytes
Help:
This is the number of bytes read via read(2) system calls to files in
XFS file systems. It can be used in conjunction with the read_calls
count to calculate the average size of the read operations to file in
XFS file systems.
# pminfo -f xfs.read_bytes
xfs.read_bytes
value 4891346238
Additional resources
This procedure describes how to reset XFS metrics using the pmstore tool.
Prerequisites
PCP is installed. For more information, see Installing and enabling PCP .
Procedure
$ pminfo -f xfs.write
xfs.write
value 325262
# pmstore xfs.control.reset 1
Verification steps
Verify that the metric counter was reset by viewing the metric value again:
$ pminfo -f xfs.write
xfs.write
value 0
Additional resources
CHAPTER 10. SETTING UP GRAPHICAL REPRESENTATION OF PCP METRICS
Procedure
Verification steps
Ensure that the pmlogger service is active, and starts archiving the metrics:
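One way to check this, for example, is through its systemd status:
# systemctl status pmlogger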
Additional resources
Prerequisites
PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
3. Open the server’s firewall for network traffic to the Grafana service.
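For example, because the Grafana service listens on port 3000, one way to open it is:
# firewall-cmd --permanent --add-port=3000/tcp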
# firewall-cmd --reload
success
Verification steps
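For example, one way to confirm that the plugin is installed, assuming the grafana-cli tool is available, is to list the installed Grafana plugins:
$ grafana-cli plugins ls | grep performancecopilot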
performancecopilot-pcp-app @ 3.1.0
Additional resources
From the Grafana web UI, you can:
add PCP Redis, PCP bpftrace, and PCP Vector data sources
create dashboards
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
1. On the client system, open a browser and access the grafana-server on port 3000, using the
http://192.0.2.0:3000 link.
Replace 192.0.2.0 with the IP address of your machine.
2. For the first login, enter admin in both the Email or username and Password fields.
Grafana prompts you to set a New password to create a secure account. If you want to set it later,
click Skip.
3. From the menu, hover over the Configuration icon and then click Plugins.
4. In the Plugins tab, type performance co-pilot in the Search by name or type text box and then
click Performance Co-Pilot (PCP) plugin.
NOTE
The top corner of the screen has a similar icon, but it controls the
general Dashboard settings.
7. In the Grafana Home page, click Add your first data source to add the PCP Redis, PCP bpftrace,
and PCP Vector data sources. For more information about adding data sources, see:
To add the PCP Redis data source, view the default dashboard, and create a panel and an alert rule,
see Creating panels and alert in PCP Redis data source.
To add the PCP bpftrace data source and view the default dashboard, see Viewing the PCP
bpftrace System Analysis dashboard.
To add the PCP Vector data source, view the default dashboard, and view the vector checklist,
see Viewing the PCP Vector Checklist.
8. Optional: From the menu, hover over the admin profile icon to change the
Preferences, including Edit Profile and Change Password, or to Sign out.
Additional resources
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
3. A mail transfer agent, for example sendmail or postfix, is installed and configured.
Procedure
Verification steps
# pmseries disk.dev.read
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
This command does not return any data if the redis package is not installed.
Additional resources
After adding the PCP Redis data source, you can view the dashboard with an overview of useful metrics,
add a query to visualize the load graph, and create alerts that help you to view the system issues after
they occur.
Prerequisites
1. The PCP Redis is configured. For more information, see Configuring PCP Redis.
2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
Procedure
2. In the Grafana Home page, click Add your first data source.
3. In the Add data source pane, type redis in the Filter by name or type text box and then click
PCP Redis.
a. Add http://localhost:44322 in the URL field and then click Save & Test.
b. Click Dashboards tab → Import → PCP Redis: Host Overview to see a dashboard with an
overview of any useful metrics.
a. From the menu, hover over the Create icon → Dashboard → Add new panel
icon to add a panel.
b. In the Query tab, select PCP Redis from the query list instead of the selected default
option, and in the text field of A, enter a metric, for example kernel.all.load, to visualize the
kernel load graph.
c. Optional: Add Panel title and Description, and update other options from the Settings.
d. Add a Dashboard name and click Save to apply the changes and save the dashboard.
a. In the PCP Redis query panel, click Alert and then click Create
Alert.
b. Edit the Name, Evaluate query, and For fields from the Rule, and specify the Conditions
for your alert.
c. Click Save to apply changes and save the dashboard. Click Apply to apply changes and go
back to the dashboard.
d. Optional: In the same panel, scroll down and click the Delete icon to delete the created rule.
e. Optional: From the menu, click the Alerting icon to view the created alert rules with
different alert statuses, to edit an alert rule, or to pause an existing rule from the Alert
Rules tab.
To add a notification channel for the created alert rule to receive an alert notification from
Grafana, see Adding notification channels for alerts .
You can receive these alerts after selecting any one type from the supported list of notifiers, which
includes DingDing, Discord, Email, Google Hangouts Chat, HipChat, Kafka REST Proxy, LINE,
Microsoft Teams, OpsGenie, PagerDuty, Prometheus Alertmanager, Pushover, Sensu, Slack,
Telegram, Threema Gateway, VictorOps, and webhook.
Prerequisites
1. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
2. An alert rule is created. For more information, see Creating panels and alert in PCP Redis data
source.
3. Configure SMTP and add a valid sender’s email address in the /etc/grafana/grafana.ini file:
# vi /etc/grafana/grafana.ini
[smtp]
enabled = true
from_address = [email protected]
4. Restart the grafana-server service.
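For example, with systemd:
# systemctl restart grafana-server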
Procedure
1. From the menu, hover over the Alerting icon → click Notification channels → Add
channel.
b. Select the communication Type, for example Email, and enter the email address. You can
add multiple email addresses by using the ; separator.
3. Click Save.
a. From the menu, hover over the Alerting icon and then click Alert rules.
b. From the Alert Rules tab, click the created alert rule.
c. On the Notifications tab, select your notification channel name from the Send to option,
and then add an alert message.
d. Click Apply.
Additional resources
NOTE
From Red Hat Enterprise Linux 8.3, PCP supports the scram-sha-256 authentication
mechanism.
Procedure
2. Specify the supported authentication mechanism and the user database path in the pmcd.conf
file:
# vi /etc/sasl2/pmcd.conf
mech_list: scram-sha-256
sasldb_path: /etc/pcp/passwd.db
3. Create a new user and add it to the SASL user database; the password prompts shown below come
from adding the user to the database, for example with the saslpasswd2 utility from the cyrus-sasl
tools:
# useradd -r metrics
# saslpasswd2 -a pmcd metrics
Password:
Again (for verification):
To add the created user, you are required to enter the metrics account password.
Verification steps
Additional resources
How can I setup authentication between PCP components, like PMDAs and pmcd in RHEL 8.2?
The bpftrace agent uses bpftrace scripts to gather the metrics. The bpftrace scripts use the enhanced
Berkeley Packet Filter (eBPF).
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
3. The scram-sha-256 authentication mechanism is configured. For more information, see Setting
up authentication between PCP components.
Procedure
2. Edit the bpftrace.conf file and add the user that you have created in Setting up
authentication between PCP components:
# vi /var/lib/pcp/pmdas/bpftrace/bpftrace.conf
[dynamic_scripts]
enabled = true
auth_enabled = true
allowed_users = root,metrics
# cd /var/lib/pcp/pmdas/bpftrace/
# ./Install
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check bpftrace metrics have appeared ... 7 metrics and 6 values
The pmda-bpftrace is now installed, and can only be used after authenticating your user. For
more information, see Viewing the PCP bpftrace System Analysis dashboard.
Additional resources
In the PCP bpftrace data source, you can view the dashboard with an overview of useful metrics.
Prerequisites
1. The PCP bpftrace is installed. For more information, see Installing PCP bpftrace.
2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
Procedure
2. In the Grafana Home page, click Add your first data source.
3. In the Add data source pane, type bpftrace in the Filter by name or type text box and then
click PCP bpftrace.
b. Toggle the Basic Auth option and add the created user credentials in the User and
Password fields.
d. Click Dashboards tab → Import → PCP bpftrace: System Analysis to see a dashboard
with an overview of any useful metrics.
Prerequisites
1. PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
# cd /var/lib/pcp/pmdas/bcc
# ./Install
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Initializing, currently in 'notready' state.
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Enabled modules:
[Wed Apr 1 00:27:48] pmdabcc(22341) Info: ['biolatency', 'sysfork',
[...]
Updating the Performance Metrics Name Space (PMNS) ...
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check bcc metrics have appeared ... 1 warnings, 1 metrics and 0 values
Additional resources
After adding the PCP Vector data source, you can view the dashboard with an overview of useful metrics
and view the related troubleshooting or reference links in the checklist.
Prerequisites
1. The PCP Vector is installed. For more information, see Installing PCP Vector.
2. The grafana-server is accessible. For more information, see Accessing the Grafana web UI .
Procedure
2. In the Grafana Home page, click Add your first data source.
3. In the Add data source pane, type vector in the Filter by name or type text box and then click
PCP Vector.
a. Add http://localhost:44322 in the URL field and then click Save & Test.
b. Click Dashboards tab → Import → PCP Vector: Host Overview to see a dashboard with an
overview of any useful metrics.
5. From the menu, hover over the Performance Co-Pilot plugin and then click PCP
Vector Checklist.
In the PCP checklist, click the help or warning icon to view the related
troubleshooting or reference links.
Procedure
Verify that the pmlogger service is up and running by executing the following command:
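For example, by checking its systemd status:
# systemctl status pmlogger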
Verify whether files were created or modified on the disk by executing the following command:
$ ls /var/log/pcp/pmlogger/$(hostname)/ -rlt
total 4024
-rw-r--r--. 1 pcp pcp 45996 Oct 13 2019 20191013.20.07.meta.xz
-rw-r--r--. 1 pcp pcp 412 Oct 13 2019 20191013.20.07.index
-rw-r--r--. 1 pcp pcp 32188 Oct 13 2019 20191013.20.07.0.xz
-rw-r--r--. 1 pcp pcp 44756 Oct 13 2019 20191013.20.30-00.meta.xz
[..]
Verify that the pmproxy service is running by executing the following command:
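For example, by checking its systemd status:
# systemctl status pmproxy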
Verify that pmproxy is running, time series support is enabled, and a connection to Redis is
established by viewing the /var/log/pcp/pmproxy/pmproxy.log file and ensuring that it contains
the following text:
Here, 1716 is the PID of pmproxy, which will be different for every invocation of pmproxy.
Verify if the Redis database contains any keys by executing the following command:
$ redis-cli dbsize
(integer) 34837
Verify if any PCP metrics are in the Redis database and pmproxy is able to access them by
executing the following commands:
$ pmseries disk.dev.read
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
$ pmseries "disk.dev.read[count:10]"
2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
[Mon Jul 26 12:21:10.085468000 2021] 117971
70e83e88d4e1857a3a31605c6d1333755f2dd17c
[Mon Jul 26 12:21:00.087401000 2021] 117758
70e83e88d4e1857a3a31605c6d1333755f2dd17c
[Mon Jul 26 12:20:50.085738000 2021] 116688
70e83e88d4e1857a3a31605c6d1333755f2dd17c
[...]
pcp:metric.name:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:values:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:desc:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:labelvalue:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:instances:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
pcp:labelflags:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df
Verify if there are any errors in the Grafana logs by executing the following command:
$ journalctl -e -u grafana-server
-- Logs begin at Mon 2021-07-26 11:55:10 IST, end at Mon 2021-07-26 12:30:15 IST. --
Jul 26 11:55:17 localhost.localdomain systemd[1]: Starting Grafana instance...
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530
lvl=info msg="Starting Grafana" logger=server version=7.3.6 c>
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530
lvl=info msg="Config loaded from" logger=settings file=/usr/s>
Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530
lvl=info msg="Config loaded from" logger=settings file=/etc/g>
[...]
CHAPTER 11. OPTIMIZING THE SYSTEM PERFORMANCE USING THE WEB CONSOLE
Throughput performance
Latency performance
Network performance
Virtual machines
The TuneD service optimizes system options to match the selected profile.
In the web console, you can set which performance profile your system uses.
Additional resources
Prerequisites
Make sure the web console is installed and accessible. For details, see Installing the web
console.
Procedure
1. Log in to the RHEL 8 web console. For details, see Logging in to the web console.
2. Click Overview.
4. In the Change Performance Profile dialog box, set the required profile.
Verification steps
The Overview tab now shows the selected performance profile in the Configuration section.
In the Metrics and history page, you can view events, errors, and a graphical representation of resource
utilization and saturation.
Prerequisites
The web console is installed and accessible. For details, see Installing the web console .
The cockpit-pcp package, which enables collecting the performance metrics, is installed:
i. Log in to the web console with administrative privileges. For details, see Logging in to
the web console.
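Alternatively, the package can be installed from a terminal, for example:
# yum install cockpit-pcp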
Procedure
1. Log in to the RHEL 8 web console. For details, see Logging in to the web console.
2. Click Overview.
Prerequisites
The web console must be installed and accessible. For details, see Installing the web
console.
a. Log in to the web console with administrative privileges. For details, see Logging in to
the web console.
Alternatively, you can install the package from the web console interface later in the procedure.
Procedure
1. In the Overview page, click View metrics and history in the Usage table.
If you do not have the redis package installed, the web console prompts you to install it.
4. To open the pmproxy service in the firewall, select a zone from the drop-down list and click the
Add pmproxy button.
5. Click Save.
Verification
1. Click Networking.
2. In the Firewall table, click the Edit rules and zones button.
IMPORTANT
Additional resources
CHAPTER 12. SETTING THE DISK SCHEDULER
Set the scheduler using TuneD, as described in Setting the disk scheduler using TuneD
Set the scheduler using udev, as described in Setting the disk scheduler using udev rules
NOTE
In Red Hat Enterprise Linux 8, block devices support only multi-queue scheduling. This
enables the block layer performance to scale well with fast solid-state drives (SSDs) and
multi-core systems.
none
Implements a first-in first-out (FIFO) scheduling algorithm. It merges requests at the generic block
layer through a simple last-hit cache.
mq-deadline
Attempts to provide a guaranteed latency for requests from the point at which requests reach the
scheduler.
The mq-deadline scheduler sorts queued I/O requests into a read or write batch and then schedules
them for execution in increasing logical block addressing (LBA) order. By default, read batches take
precedence over write batches, because applications are more likely to block on read I/O operations.
After mq-deadline processes a batch, it checks how long write operations have been starved of
processor time and schedules the next read or write batch as appropriate.
This scheduler is suitable for most use cases, but particularly those in which the write operations are
mostly asynchronous.
bfq
Targets desktop systems and interactive tasks.
The bfq scheduler ensures that a single application is never using all of the bandwidth. In effect, the
storage device is always as responsive as if it was idle. In its default configuration, bfq focuses on
delivering the lowest latency rather than achieving the maximum throughput.
bfq is based on cfq code. It does not grant the disk to each process for a fixed time slice but assigns a
budget measured in number of sectors to the process.
This scheduler is suitable when copying large files, because the system does not become unresponsive
in this case.
kyber
The scheduler tunes itself to achieve a latency goal by calculating the latencies of every I/O request
submitted to the block I/O layer. You can configure the target latencies for read, in the case of
cache-misses, and synchronous write requests.
This scheduler is suitable for fast devices, for example NVMe, SSD, or other low latency devices.
For a high-performance SSD or a CPU-bound system with fast storage, use none, especially when
running enterprise applications. Alternatively, use kyber.
NOTE
For non-volatile Memory Express (NVMe) block devices specifically, the default
scheduler is none and Red Hat recommends not changing this.
The kernel selects a default disk scheduler based on the type of device. The automatically selected
scheduler is typically the optimal setting. If you require a different scheduler, Red Hat recommends
using udev rules or the TuneD application to configure it. Match the selected devices and switch the
scheduler only for those devices.
Procedure
# cat /sys/block/device/queue/scheduler
In the file name, replace device with the block device name, for example sdc.
In the following procedure, replace:
device with the name of the block device, for example sdf
selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Prerequisites
The TuneD service is installed and enabled. For details, see Installing and enabling TuneD .
Procedure
1. Optional: Select an existing TuneD profile on which your profile will be based. For a list of
available profiles, see TuneD profiles distributed with RHEL .
To see which profile is currently active, use:
$ tuned-adm active
2. Create a new TuneD profile directory:
# mkdir /etc/tuned/my-profile
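3. Find the system unique identifier of the block device. For example, a query of the udev database like the following (the /dev/sdb device name here is only illustrative) returns identifiers such as the ones shown below:
# udevadm info --query=property --name=/dev/sdb | grep -E 'WWN|SERIAL'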
ID_WWN=0x5002538d00000000_
ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0
ID_SERIAL_SHORT=20120501030900000
NOTE
The command in this example will return all values identified as a World Wide
Name (WWN) or serial number associated with the specified block device.
Although it is preferred to use a WWN, the WWN is not always available for a
given device and any values returned by the example command are acceptable to
use as the device system unique ID.
4. Create the /etc/tuned/my-profile/tuned.conf configuration file. In the file, set the following
options:
[main]
include=existing-profile
b. Set the selected disk scheduler for the device that matches the WWN identifier:
[disk]
devices_udev_regex=IDNAME=device system unique id
elevator=selected-scheduler
Here:
Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
To match multiple devices in the devices_udev_regex option, enclose the identifiers in
parentheses and separate them with vertical bars:
devices_udev_regex=(ID_WWN=0x5002538d00000000)|
(ID_WWN=0x1234567800000000)
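5. Activate the new profile so that the scheduler setting takes effect; for example, assuming the profile directory created above is named my-profile:
# tuned-adm profile my-profile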
Verification steps
$ tuned-adm active
$ tuned-adm verify
# cat /sys/block/device/queue/scheduler
In the file name, replace device with the block device name, for example sdc.
Additional resources
In the following procedure, replace:
device with the name of the block device, for example sdf
selected-scheduler with the disk scheduler that you want to set for the device, for example bfq
Procedure
NOTE
The command in this example will return all values identified as a World Wide
Name (WWN) or serial number associated with the specified block device.
Although it is preferred to use a WWN, the WWN is not always available for a
given device and any values returned by the example command are acceptable to
use as the device system unique ID.
2. Configure the udev rule. Create the /etc/udev/rules.d/99-scheduler.rules file with the
following content:
Here:
Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
Verification steps
# cat /sys/block/device/queue/scheduler
Procedure
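To temporarily change the scheduler of a device at runtime, write the scheduler name to its queue/scheduler attribute; for example, to switch to bfq:
# echo bfq > /sys/block/device/queue/scheduler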
In the file name, replace device with the block device name, for example sdc.
Verification steps
# cat /sys/block/device/queue/scheduler
CHAPTER 13. TUNING THE PERFORMANCE OF A SAMBA SERVER
Parts of this section were adapted from the Performance Tuning documentation published in the Samba
Wiki. License: CC BY 4.0. Authors and contributors: See the history tab on the Wiki page.
Prerequisites
NOTE
To always have the latest stable SMB protocol version enabled, do not set the server
max protocol parameter. If you set the parameter manually, you will need to modify the
setting with each new version of the SMB protocol, to have the latest protocol version
enabled.
The following procedure explains how to use the default value in the server max protocol parameter.
Procedure
1. Remove the server max protocol parameter from the [global] section in the
/etc/samba/smb.conf file.
Prerequisites
Procedure
NOTE
With the settings in this procedure applied, files with names that are not entirely in
lowercase are no longer displayed.
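For example, a minimal sketch of the case-handling parameters this procedure refers to, set in the share or [global] section of /etc/samba/smb.conf, could look like:
case sensitive = true
default case = lower
preserve case = no
short preserve case = no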
For details about the parameters, see their descriptions in the smb.conf(5) man page.
# testparm
After you apply these settings, the names of all newly created files on this share use lowercase.
Because of these settings, Samba no longer needs to scan the directory for uppercase and lowercase
versions of file names, which improves performance.
To use the optimized settings from the Kernel, remove the socket options parameter from the [global]
section in the /etc/samba/smb.conf.
CHAPTER 14. OPTIMIZING VIRTUAL MACHINE PERFORMANCE
Virtual CPUs (vCPUs) are implemented as threads on the host, handled by the Linux scheduler.
VMs do not automatically inherit optimization features, such as NUMA or huge pages, from the
host kernel.
Disk and network I/O settings of the host might have a significant performance impact on the
VM.
Depending on the host devices and their models, there might be significant overhead due to
emulation of particular hardware.
The severity of the virtualization impact on the VM performance is influenced by a variety of factors,
which include:
The TuneD service can automatically optimize the resource distribution and performance of
your VMs.
Block I/O tuning can improve the performance of the VM’s block devices, such as disks.
IMPORTANT
Tuning VM performance can have adverse effects on other virtualization functions. For
example, it can make migrating the modified VM more difficult.
For RHEL 8 virtual machines, use the virtual-guest profile. It is based on the generally
applicable throughput-performance profile, but also decreases the swappiness of virtual
memory.
For RHEL 8 virtualization hosts, use the virtual-host profile. This enables more aggressive
writeback of dirty memory pages, which benefits the host performance.
Prerequisites
Procedure
To enable a specific TuneD profile:
# tuned-adm list
Available profiles:
- balanced - General non-specialized TuneD profile
- desktop - Optimize for the desktop use-case
[...]
- virtual-guest - Optimize for running inside a virtual guest
- virtual-host - Optimize for running KVM guests
Current active profile: balanced
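The activation step itself is then, for example, on a virtualization host:
# tuned-adm profile virtual-host
On a RHEL 8 guest, use the virtual-guest profile instead.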
Additional resources
To perform these actions, you can use the web console or the command-line interface.
14.3.1. Adding and removing virtual machine memory by using the web console
To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you
can use the web console to adjust the amount of memory allocated to the VM.
Prerequisites
The guest OS is running the memory balloon drivers. To verify this is the case:
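One way to check, for example for a VM named testguest:
# virsh dumpxml testguest | grep memballoon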
If this command displays any output and the model is not set to none, the memballoon
device is present.
In Windows guests, the drivers are installed as a part of the virtio-win driver package.
For instructions, see Installing paravirtualized KVM drivers for Windows virtual
machines.
In Linux guests, the drivers are generally included by default and activate when the
memballoon device is present.
Procedure
1. Optional: Obtain the information about the maximum memory and currently used memory for a
VM. This will serve as a baseline for your changes, and also for verification.
2. In the Virtual Machines interface, click the VM whose information you want to see.
A new page opens with an Overview section with basic information about the selected VM and a
Console section to access the VM’s graphical interface.
Maximum allocation - Sets the maximum amount of host memory that the VM can use for
its processes. You can specify the maximum memory when creating the VM or increase it
later. You can specify memory as multiples of MiB or GiB.
Adjusting maximum memory allocation is only possible on a shut-off VM.
Current allocation - Sets the actual amount of memory allocated to the VM. This value can
be less than the Maximum allocation but cannot exceed it. You can adjust the value to
regulate the memory available to the VM for its processes. You can specify memory as
multiples of MiB or GiB.
If you do not specify this value, the default allocation is the Maximum allocation value.
5. Click Save.
The memory allocation of the VM is adjusted.
Additional resources
Adding and removing virtual machine memory by using the command-line interface
14.3.2. Adding and removing virtual machine memory by using the command-line
interface
To improve the performance of a virtual machine (VM) or to free up the host resources it is using, you
can use the CLI to adjust the amount of memory allocated to the VM.
Prerequisites
The guest OS is running the memory balloon drivers. To verify this is the case:
If this command displays any output and the model is not set to none, the memballoon
device is present.
In Windows guests, the drivers are installed as a part of the virtio-win driver package.
For instructions, see Installing paravirtualized KVM drivers for Windows virtual
machines.
In Linux guests, the drivers are generally included by default and activate when the
memballoon device is present.
Procedure
1. Optional: Obtain the information about the maximum memory and currently used memory for a
VM. This will serve as a baseline for your changes, and also for verification.
2. Adjust the maximum memory allocated to a VM. Increasing this value improves the performance
potential of the VM, and reducing the value lowers the performance footprint the VM has on
your host. Note that this change can only be performed on a shut-off VM, so adjusting a running
VM requires a reboot to take effect.
For example, to change the maximum memory that the testguest VM can use to 4096 MiB:
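One way to do this with virsh, as a sketch:
# virsh setmaxmem testguest 4096M --config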
To increase the maximum memory of a running VM, you can attach a memory device to the VM.
This is also referred to as memory hot plug. For details, see Attaching devices to virtual
machines.
WARNING
3. Optional: You can also adjust the memory currently used by the VM, up to the maximum
allocation. This regulates the memory load that the VM has on the host until the next reboot,
without changing the maximum VM allocation.
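For example, a sketch that lowers the current allocation of the testguest VM to 2048 MiB (an illustrative value) without touching the maximum:
# virsh setmem testguest 2048M --current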
Verification
2. Optional: If you adjusted the current VM memory, you can obtain the memory balloon statistics
of the VM to evaluate how effectively it regulates its memory use.
Additional resources
Adding and removing virtual machine memory by using the web console
Increasing the I/O weight of a device increases its priority for I/O bandwidth, and therefore provides it
with more host resources. Similarly, reducing a device’s weight makes it consume less host resources.
NOTE
Each device’s weight value must be within the 100 to 1000 range. Alternatively, the value
can be 0, which removes that device from per-device listings.
Procedure
To display and set a VM’s block I/O parameters:
<domain>
[...]
<blkiotune>
<weight>800</weight>
<device>
<path>/dev/sda</path>
<weight>1000</weight>
</device>
<device>
<path>/dev/sdb</path>
<weight>500</weight>
</device>
</blkiotune>
[...]
</domain>
For example, the following changes the weight of the /dev/sda device in the liftrul VM to 500.
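A sketch of how this can be done with virsh:
# virsh blkiotune liftrul --device-weights /dev/sda,500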
To enable disk I/O throttling, set a limit on disk I/O requests sent from each block device attached to
VMs to the host machine.
Procedure
1. Use the virsh domblklist command to list the names of all the disk devices on a specified VM.
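For example, assuming the rollin-coal VM shown in the output below:
# virsh domblklist rollin-coal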
vda /var/lib/libvirt/images/rollin-coal.qcow2
sda -
sdb /home/horridly-demanding-processes.iso
2. Find the host block device where the virtual disk that you want to throttle is mounted.
For example, if you want to throttle the sdb virtual disk from the previous step, the following
output shows that the disk is mounted on the /dev/nvme0n1p3 partition.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
zram0 252:0 0 4G 0 disk [SWAP]
nvme0n1 259:0 0 238.5G 0 disk
├─nvme0n1p1 259:1 0 600M 0 part /boot/efi
├─nvme0n1p2 259:2 0 1G 0 part /boot
└─nvme0n1p3 259:3 0 236.9G 0 part
└─luks-a1123911-6f37-463c-b4eb-fxzy1ac12fea 253:0 0 236.9G 0 crypt /home
3. Set I/O limits for the block device by using the virsh blkiotune command.
The following example throttles the sdb disk on the rollin-coal VM to 1000 read and write I/O
operations per second and to 50 MB per second read and write throughput.
Additional information
Disk I/O throttling can be useful in various situations, for example when VMs belonging to
different customers are running on the same host, or when quality of service guarantees are
given for different VMs. Disk I/O throttling can also be used to simulate slower disks.
I/O throttling can be applied independently to each block device attached to a VM and
supports limits on throughput and I/O operations.
Red Hat does not support using the virsh blkdeviotune command to configure I/O throttling in
VMs. For more information about unsupported features when using RHEL 8 as a VM host, see
Unsupported features in RHEL 8 virtualization .
Procedure
To enable multi-queue virtio-scsi support for a specific VM, add the following to the VM’s XML
configuration, where N is the total number of vCPU queues:
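A minimal sketch of such a virtio-scsi controller definition:
<controller type='scsi' index='0' model='virtio-scsi'>
  <driver queues='N'/>
</controller>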
1. Adjust how many host CPUs are assigned to the VM. You can do this using the CLI or the web
console.
2. Ensure that the vCPU model is aligned with the CPU model of the host. For example, to set the
testguest1 VM to use the CPU model of the host:
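One way, as a sketch, is to edit the VM configuration with virsh edit testguest1 and set the CPU element to mirror the host model:
<cpu mode='host-model'/>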
4. If your host machine uses Non-Uniform Memory Access (NUMA), you can also configure NUMA
for its VMs. This maps the host’s CPU and memory processes onto the CPU and memory
processes of the VM as closely as possible. In effect, NUMA tuning provides the vCPU with a
more streamlined access to the system memory allocated to the VM, which can improve the
vCPU processing effectiveness.
For details, see Configuring NUMA in a virtual machine and Sample vCPU performance tuning
scenario.
14.5.1. Adding and removing virtual CPUs by using the command-line interface
To increase or optimize the CPU performance of a virtual machine (VM), you can add or remove virtual
CPUs (vCPUs) assigned to the VM.
When performed on a running VM, this is also referred to as vCPU hot plugging and hot unplugging.
However, note that vCPU hot unplug is not supported in RHEL 8, and Red Hat highly discourages its use.
Prerequisites
Optional: View the current state of the vCPUs in the targeted VM. For example, to display the
number of vCPUs on the testguest VM:
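For example, with virsh vcpucount; the output shown here is illustrative and matches the description that follows:
# virsh vcpucount testguest
maximum      config         4
maximum      live           2
current      config         2
current      live           1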
This output indicates that testguest is currently using 1 vCPU, and 1 more vCPU can be hot
plugged to it to increase the VM’s performance. However, after reboot, the number of vCPUs
testguest uses will change to 2, and it will be possible to hot plug 2 more vCPUs.
Procedure
1. Adjust the maximum number of vCPUs that can be attached to a VM, which takes effect on the
VM’s next boot.
For example, to increase the maximum vCPU count for the testguest VM to 8:
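A sketch of how this can be done with virsh:
# virsh setvcpus testguest 8 --maximum --config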
Note that the maximum may be limited by the CPU topology, host hardware, the hypervisor,
and other factors.
2. Adjust the current number of vCPUs attached to a VM, up to the maximum configured in the
previous step. For example:
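As a sketch, --live changes the count of the running VM immediately, while --config changes the persistent configuration for the next boot (the counts are illustrative):
# virsh setvcpus testguest 4 --live
# virsh setvcpus testguest 1 --config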
This increases the VM’s performance and host load footprint of testguest until the VM’s
next boot.
This decreases the VM’s performance and host load footprint of testguest after the VM’s
next boot. However, if needed, additional vCPUs can be hot plugged to the VM to
temporarily increase its performance.
Verification
Confirm that the current state of vCPU for the VM reflects your changes.
Additional resources
Prerequisites
Procedure
1. In the Virtual Machines interface, click the VM whose information you want to see.
A new page opens with an Overview section with basic information about the selected VM and a
Console section to access the VM’s graphical interface.
NOTE
vCPU Maximum - The maximum number of virtual CPUs that can be configured for the
VM. If this value is higher than the vCPU Count, additional vCPUs can be attached to the
VM.
Cores per socket - The number of cores for each socket to expose to the VM.
Threads per core - The number of threads for each core to expose to the VM.
Note that the Sockets, Cores per socket, and Threads per core options adjust the CPU
topology of the VM. This may be beneficial for vCPU performance and may impact the
functionality of certain software in the guest OS. If a different setting is not required by your
deployment, keep the default values.
2. Click Apply.
The virtual CPUs for the VM are configured.
NOTE
Changes to virtual CPU settings only take effect after the VM is restarted.
Additional resources
The following methods can be used to configure Non-Uniform Memory Access (NUMA) settings of a
virtual machine (VM) on a RHEL 8 host.
Prerequisites
The host is a NUMA-compatible machine. To detect whether this is the case, use the virsh
nodeinfo command and see the NUMA cell(s) line:
# virsh nodeinfo
CPU model: x86_64
CPU(s): 48
CPU frequency: 1200 MHz
CPU socket(s): 1
Core(s) per socket: 12
Thread(s) per core: 2
NUMA cell(s): 2
Memory size: 67012964 KiB
Procedure
For ease of use, you can set up a VM’s NUMA configuration by using automated utilities and services.
However, manual NUMA setup is more likely to yield a significant performance improvement.
Automatic methods
Set the VM’s NUMA policy to Preferred. For example, to do so for the testguest5 VM:
Start the numad service to automatically align the VM CPU with memory resources.
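For example:
# systemctl start numad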
Manual methods
1. Pin specific vCPU threads to a specific host CPU or range of CPUs. This is also possible on non-
NUMA hosts and VMs, and is recommended as a safe method of vCPU performance
improvement.
For example, the following commands pin vCPU threads 0 to 5 of the testguest6 VM to host
CPUs 1, 3, 5, 7, 9, and 11, respectively:
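A sketch of these commands using virsh vcpupin:
# virsh vcpupin testguest6 0 1
# virsh vcpupin testguest6 1 3
# virsh vcpupin testguest6 2 5
# virsh vcpupin testguest6 3 7
# virsh vcpupin testguest6 4 9
# virsh vcpupin testguest6 5 11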
2. After pinning vCPU threads, you can also pin QEMU process threads associated with a specified
VM to a specific host CPU or range of CPUs. For example, the following commands pin the
QEMU process thread of testguest6 to CPUs 13 and 15, and verify this was successful:
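A sketch of these commands using virsh emulatorpin, where the second invocation queries the current setting:
# virsh emulatorpin testguest6 13,15
# virsh emulatorpin testguest6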
3. Finally, you can also specify which host NUMA nodes will be assigned specifically to a certain
VM. This can improve the host memory usage by the VM’s vCPU. For example, the following
commands set testguest6 to use host NUMA nodes 3 to 5, and verify this was successful:
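A sketch of these commands using virsh numatune, where the second invocation queries the current setting:
# virsh numatune testguest6 --nodeset 3-5
# virsh numatune testguest6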
NOTE
For best performance results, it is recommended to use all of the manual tuning methods
listed above.
Known issues
Additional resources
View the current NUMA configuration of your system using the numastat utility
Starting scenario
2 NUMA nodes
The output of virsh nodeinfo of such a machine would look similar to:
# virsh nodeinfo
CPU model: x86_64
CPU(s): 12
CPU frequency: 3661 MHz
CPU socket(s): 2
Core(s) per socket: 3
Thread(s) per core: 2
NUMA cell(s): 2
Memory size: 31248692 KiB
You intend to modify an existing VM to have 8 vCPUs, which means that it will not fit in a single
NUMA node.
Therefore, you should distribute 4 vCPUs on each NUMA node and make the vCPU topology
resemble the host topology as closely as possible. This means that vCPUs that run as sibling
threads of a given physical CPU should be pinned to host threads on the same core. For details,
see the Solution below:
Solution
# virsh capabilities
The output should include a section that looks similar to the following:
<topology>
<cells num="2">
<cell id="0">
<memory unit="KiB">15624346</memory>
<pages unit="KiB" size="4">3906086</pages>
<pages unit="KiB" size="2048">0</pages>
<pages unit="KiB" size="1048576">0</pages>
<distances>
<sibling id="0" value="10" />
<sibling id="1" value="21" />
</distances>
<cpus num="6">
<cpu id="0" socket_id="0" core_id="0" siblings="0,3" />
<cpu id="1" socket_id="0" core_id="1" siblings="1,4" />
<cpu id="2" socket_id="0" core_id="2" siblings="2,5" />
<cpu id="3" socket_id="0" core_id="0" siblings="0,3" />
<cpu id="4" socket_id="0" core_id="1" siblings="1,4" />
<cpu id="5" socket_id="0" core_id="2" siblings="2,5" />
</cpus>
</cell>
<cell id="1">
<memory unit="KiB">15624346</memory>
<pages unit="KiB" size="4">3906086</pages>
<pages unit="KiB" size="2048">0</pages>
2. Optional: Test the performance of the VM by using the applicable tools and utilities.
3. Set up 1 GiB huge pages on the host:
a. Add the following parameters to the host’s kernel command line:
default_hugepagesz=1G hugepagesz=1G
b. Create a systemd unit file, for example /usr/lib/systemd/system/hugetlb-gigantic-pages.service,
with the following content so that the huge pages are reserved at boot time:
[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/systemd/hugetlb-reserve-pages.sh
[Install]
WantedBy=sysinit.target
c. Create the /etc/systemd/hugetlb-reserve-pages.sh script referenced by the unit file, with the
following content:
#!/bin/sh
nodes_path=/sys/devices/system/node/
if [ ! -d $nodes_path ]; then
echo "ERROR: $nodes_path does not exist"
exit 1
fi
reserve_pages()
{
	# Write the requested number of 1 GiB huge pages for the given NUMA node
	echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}
reserve_pages 4 node1
reserve_pages 4 node2
This reserves four 1GiB huge pages from node1 and four 1GiB huge pages from node2.
# chmod +x /etc/systemd/hugetlb-reserve-pages.sh
4. Use the virsh edit command to edit the XML configuration of the VM you wish to optimize, in
this example super-VM:
a. Set the VM to use 8 static vCPUs. Use the <vcpu/> element to do this.
b. Pin each of the vCPU threads to the corresponding host CPU threads that it mirrors in the
topology. To do so, use the <vcpupin/> elements in the <cputune> section.
Note that, as shown by the virsh capabilities utility above, host CPU threads are not
ordered sequentially in their respective cores. In addition, the vCPU threads should be
pinned to the highest available set of host cores on the same NUMA node. For a table
illustration, see the Sample topology section below.
The XML configuration for steps a. and b. can look similar to:
<cputune>
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='5'/>
<vcpupin vcpu='4' cpuset='7'/>
<vcpupin vcpu='5' cpuset='10'/>
<vcpupin vcpu='6' cpuset='8'/>
<vcpupin vcpu='7' cpuset='11'/>
<emulatorpin cpuset='6,9'/>
</cputune>
<memoryBacking>
<hugepages>
<page size='1' unit='GiB'/>
</hugepages>
</memoryBacking>
d. Configure the VM’s NUMA nodes to use memory from the corresponding NUMA nodes on
the host. To do so, use the <memnode/> elements in the <numatune/> section:
<numatune>
<memory mode="preferred" nodeset="1"/>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
</numatune>
e. Ensure the CPU mode is set to host-passthrough, and that the CPU uses cache in
passthrough mode:
<cpu mode="host-passthrough">
<topology sockets="2" cores="2" threads="2"/>
<cache mode="passthrough"/>
Verification
1. Confirm that the resulting XML configuration of the VM includes a section similar to the
following:
[...]
<memoryBacking>
<hugepages>
<page size='1' unit='GiB'/>
</hugepages>
</memoryBacking>
<vcpu placement='static'>8</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='1'/>
<vcpupin vcpu='1' cpuset='4'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='5'/>
<vcpupin vcpu='4' cpuset='7'/>
<vcpupin vcpu='5' cpuset='10'/>
<vcpupin vcpu='6' cpuset='8'/>
<vcpupin vcpu='7' cpuset='11'/>
<emulatorpin cpuset='6,9'/>
</cputune>
<numatune>
<memory mode="preferred" nodeset="1"/>
<memnode cellid="0" mode="strict" nodeset="0"/>
<memnode cellid="1" mode="strict" nodeset="1"/>
</numatune>
<cpu mode="host-passthrough">
<topology sockets="2" cores="2" threads="2"/>
<cache mode="passthrough"/>
<numa>
<cell id="0" cpus="0-3" memory="2" unit="GiB">
<distances>
<sibling id="0" value="10"/>
<sibling id="1" value="21"/>
</distances>
</cell>
<cell id="1" cpus="4-7" memory="2" unit="GiB">
<distances>
<sibling id="0" value="21"/>
<sibling id="1" value="10"/>
</distances>
</cell>
</numa>
</cpu>
</domain>
2. Optional: Test the performance of the VM by using the applicable tools and utilities to evaluate
the impact of the VM’s optimization.
Sample topology
The following tables illustrate the connections between the vCPUs and the host CPUs they
should be pinned to:
Host CPU topology
NUMA node 0, socket 0: core 0 (CPU threads 0, 3), core 1 (threads 1, 4), core 2 (threads 2, 5)
NUMA node 1, socket 1: core 3 (CPU threads 6, 9), core 4 (threads 7, 10), core 5 (threads 8, 11)
VM vCPU topology
NUMA node 0, socket 0: core 0 (vCPU threads 0, 1), core 1 (vCPU threads 2, 3)
NUMA node 1, socket 1: core 2 (vCPU threads 4, 5), core 3 (vCPU threads 6, 7)
Combined vCPU and host CPU topology
vCPU threads 0 and 1 run on host CPU threads 1 and 4 (core 1, NUMA node 0)
vCPU threads 2 and 3 run on host CPU threads 2 and 5 (core 2, NUMA node 0)
vCPU threads 4 and 5 run on host CPU threads 7 and 10 (core 4, NUMA node 1)
vCPU threads 6 and 7 run on host CPU threads 8 and 11 (core 5, NUMA node 1)
Host CPU threads 0 and 3 (core 0) and 6 and 9 (core 3) are not pinned to any vCPU and remain
available to the host.
In this scenario, there are 2 NUMA nodes and 8 vCPUs. Therefore, 4 vCPU threads should be
pinned to each node.
In addition, Red Hat recommends leaving at least a single CPU thread available on each node
for host system operations.
Because in this example, each NUMA node houses 3 cores, each with 2 host CPU threads, the
set for node 0 translates to the <vcpupin> configuration shown in the XML above.
Depending on your requirements, you can either deactivate KSM for a single session or persistently.
Procedure
To deactivate KSM for a single session, use the systemctl utility to stop ksm and ksmtuned
services.
To deactivate KSM persistently, use the systemctl utility to disable ksm and ksmtuned
services.
NOTE
Memory pages shared between VMs before deactivating KSM will remain shared. To stop
sharing, delete all the PageKSM pages in the system by using the following command:
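For example, writing 2 to the KSM control file unmerges all shared pages:
# echo 2 > /sys/kernel/mm/ksm/run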
After anonymous pages replace the KSM pages, the khugepaged kernel service will
rebuild transparent hugepages on the VM’s physical memory.
Due to the virtual nature of a VM’s network interface card (NIC), the VM loses a portion of its allocated
host network bandwidth, which can reduce the overall workload efficiency of the VM. The following tips
can minimize the negative impact of virtualization on the virtual NIC (vNIC) throughput.
Procedure
Use any of the following methods and observe if it has a beneficial effect on your VM network
performance:
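Enable the vhost_net kernel module. One way to check whether the module is already loaded on the host, for example:
# lsmod | grep vhost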
If the output of this command is blank, enable the vhost_net kernel module:
# modprobe vhost_net
<interface type='network'>
<source network='default'/>
<model type='virtio'/>
<driver name='vhost' queues='N'/>
</interface>
SR-IOV
If your host NIC supports SR-IOV, use SR-IOV device assignment for your vNICs. For more
information, see Managing SR-IOV devices.
Additional resources
To identify what consumes the most VM resources and which aspect of VM performance needs
optimization, performance diagnostic tools, both general and VM-specific, can be used.
On your RHEL 8 host, as root, use the top utility or the system monitor application, and look for
qemu and virt in the output. This shows how much host system resources your VMs are
consuming.
If the monitoring tool displays that any of the qemu or virt processes consume a large
portion of the host CPU or memory capacity, use the perf utility to investigate. For details,
see below.
On the guest operating system, use performance utilities and applications available on the
system to evaluate which processes consume the most system resources.
perf kvm
You can use the perf utility to collect and analyze virtualization-specific statistics about the
performance of your RHEL 8 host. To do so:
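1. Install the perf package, for example:
# yum install perf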
2. Use one of the perf kvm stat commands to display perf statistics for your virtualization host:
For real-time monitoring of your hypervisor, use the perf kvm stat live command.
To log the perf data of your hypervisor over a period of time, activate the logging by using
the perf kvm stat record command. After the command is canceled or interrupted, the
data is saved in the perf.data.guest file, which can be analyzed by using the perf kvm stat
report command.
3. Analyze the perf output for types of VM-EXIT events and their distribution. For example, the
PAUSE_INSTRUCTION events should be infrequent, but in the following output, the high
occurrence of this event suggests that the host CPUs are not handling the running vCPUs well.
In such a scenario, consider shutting down some of your active VMs, removing vCPUs from
these VMs, or tuning the performance of the vCPUs.
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
Other event types that can signal problems in the output of perf kvm stat include:
For more information about using perf to monitor virtualization performance, see the perf-kvm man
page.
numastat
To see the current NUMA configuration of your system, you can use the numastat utility, which is
provided by installing the numactl package.
The following shows a host with 4 running VMs, each obtaining memory from multiple NUMA nodes. This
is not optimal for vCPU performance, and warrants adjusting:
# numastat -c qemu-kvm
In contrast, the following shows memory being provided to each VM by a single node, which is
significantly more efficient.
# numastat -c qemu-kvm
CHAPTER 15. IMPORTANCE OF POWER MANAGEMENT
reduced secondary costs, including cooling, space, cables, generators, and uninterruptible
power supplies (UPS)
meeting government regulations or legal requirements regarding Green IT, for example, Energy
Star
This section provides information about power management of your Red Hat Enterprise Linux
systems.
SpeedStep
PowerNow!
Cool’n’Quiet
ACPI (C-state)
Smart
If your hardware has support for these features and they are enabled in the BIOS, Red Hat
Enterprise Linux uses them by default.
Sleep (C-states)
However, performing these tasks once for a large number of nearly identical systems where you can
reuse the same settings for all systems can be very useful. For example, consider the deployment of
thousands of desktop systems, or an HPC cluster where the machines are nearly identical. Another
reason to do auditing and analysis is to provide a basis for comparison against which you can identify
regressions or changes in system behavior in the future. The results of this analysis can be very helpful in
cases where hardware, BIOS, or software updates happen regularly and you want to avoid any surprises
with regard to power consumption. Generally, a thorough audit and analysis gives you a much better idea
of what is really happening on a particular system.
Auditing and analyzing a system with regard to power consumption is relatively hard, even with the most
modern systems available. Most systems do not provide the necessary means to measure power use via
software. Exceptions exist though:
iLO management console of Hewlett Packard server systems has a power management module
that you can access through the web.
On some Dell systems, the IT Assistant offers power monitoring capabilities as well.
Other vendors are likely to offer similar capabilities for their server platforms, but as can be seen there is
no single solution available that is supported by all vendors. Direct measurements of power consumption
are often only necessary to maximize savings as far as possible.
Many of these tools are used for performance tuning as well, which include:
PowerTOP
It identifies specific components of kernel and user-space applications that frequently wake up the
CPU. Use the powertop command as root to start the PowerTOP tool and powertop --calibrate to
calibrate the power estimation engine. For more information about PowerTOP, see Managing power
consumption with PowerTOP.
Diskdevstat and netdevstat
They are SystemTap tools that collect detailed information about the disk activity and network
activity of all applications running on a system. Using the statistics collected by these tools, you can
identify applications that waste power with many small I/O operations rather than fewer, larger
operations. Using the yum install tuned-utils-systemtap kernel-debuginfo command as root,
install the diskdevstat and netdevstat tools.
To view the detailed information about the disk and network activity, use:
# diskdevstat
# netdevstat
With these commands, you can specify three parameters: update_interval, total_duration, and
display_histogram.
TuneD
It is a profile-based system tuning tool that uses the udev device manager to monitor connected
devices, and enables both static and dynamic tuning of system settings. You can use the tuned-adm
recommend command to determine which profile Red Hat recommends as the most suitable for a
particular product. For more information about TuneD, see Getting started with TuneD and
Customizing TuneD profiles. Using the powertop2tuned utility, you can create custom TuneD
profiles from PowerTOP suggestions. For information about the powertop2tuned utility, see
Optimizing power consumption.
Virtual memory statistics (vmstat)
It is provided by the procps-ng package. Using this tool, you can view the detailed information about
processes, memory, paging, block I/O, traps, and CPU activity.
To view this information, use:
$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 5805576 380856 4852848 0 0 119 73 814 640 2 2 96 0 0
Using the vmstat -a command, you can display active and inactive memory. For more information
about other vmstat options, see the vmstat man page.
iostat
It is provided by the sysstat package. This tool is similar to vmstat, but only for monitoring I/O on
block devices. It also provides more verbose output and statistics.
To monitor the system I/O, use:
$ iostat
avg-cpu: %user %nice %system %iowait %steal %idle
2.05 0.46 1.55 0.26 0.00 95.67
blktrace
It provides detailed information about how time is spent in the I/O subsystem.
To view this information in human readable format, use:
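A representative invocation pipes the blktrace output through blkparse (the device name /dev/dm-0 is a placeholder for the block device you want to trace):
# blktrace -d /dev/dm-0 -o - | blkparse -i -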
Here, the first column, 253,0, is the device major and minor tuple. The second column, 1, gives information about the CPU, followed by columns for the timestamp and the PID of the process issuing the I/O request.
The sixth column, Q, shows the event type, the seventh column, W, indicates a write operation, the eighth column, 76423384, is the block number, and + 8 is the number of requested blocks.
By default, the blktrace command runs forever until the process is explicitly killed. Use the -w option
to specify the run-time duration.
turbostat
It is provided by the kernel-tools package. It reports on processor topology, frequency, idle power-
state statistics, temperature, and power usage on x86-64 processors.
To view this summary, use:
# turbostat
By default, turbostat prints a summary of counter results for the entire system, followed by counter results every 5 seconds. Specify a different period between counter results with the -i option, for example, execute turbostat -i 10 to print results every 10 seconds instead.
Turbostat is also useful for identifying servers that are inefficient in terms of power usage or idle
time. It also helps to identify the rate of system management interrupts (SMIs) occurring on the
system. It can also be used to verify the effects of power management tuning.
cpupower
It is a collection of tools to examine and tune power saving related features of processors. Use the
cpupower command with the frequency-info, frequency-set, idle-info, idle-set, set, info, and
monitor options to display and set processor related values.
For example, to view available cpufreq governors, use:
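One possible command (a sketch, assuming the kernel-tools package is installed):
$ cpupower frequency-info --governors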
For more information about cpupower, see Viewing CPU related information.
Additional resources
CHAPTER 16. MANAGING POWER CONSUMPTION WITH POWERTOP
The PowerTOP tool can provide an estimate of the total power usage of the system and also individual
power usage for each process, device, kernel worker, timer, and interrupt handler. The tool can also
identify specific components of kernel and user-space applications that frequently wake up the CPU.
Prerequisites
To be able to use PowerTOP, make sure that the powertop package has been installed on your
system:
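If the package is not present yet, it can typically be installed with:
# yum install powertop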
Procedure
# powertop
IMPORTANT
Laptops should run on battery power when running the powertop command.
Procedure
1. On a laptop, you can calibrate the power estimation engine by running the following command:
# powertop --calibrate
2. Let the calibration finish without interacting with the machine during the process.
Calibration takes time because the process performs various tests, cycles through brightness
levels and switches devices on and off.
3. When the calibration process is completed, PowerTOP starts as normal. Let it run for
approximately an hour to collect data.
When enough data is collected, power estimation figures will be displayed in the first column of
the output table.
NOTE
If you want to change this measuring frequency, use the following procedure:
Procedure
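A sketch of such a run (the value after --time is the measuring interval in seconds and is illustrative):
# powertop --time=20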
The PowerTOP output is organized into the following tabs:
Overview
Idle stats
Frequency stats
Device stats
Tunables
WakeUp
You can use the Tab and Shift+Tab keys to cycle through these tabs.
The adjacent columns within the Overview tab provide the following pieces of information:
Usage
If properly calibrated, a power consumption estimation for every listed item in the first column is shown
as well.
Apart from this, the Overview tab includes a line with summary statistics, such as:
Summary of total wakeups per second, GPU operations per second, and virtual file system
operations per second
In the Tunables tab, use the up and down keys to move through suggestions, and the enter key to toggle a suggestion on or off.
In the WakeUp tab, use the up and down keys to move through the available device settings, and the enter key to enable or disable a setting.
Additional resources
In total, there are three possible modes of the Intel P-State driver:
Passive mode
Switching to the ACPI CPUfreq driver results in complete information being displayed by PowerTOP.
However, it is recommended to keep your system on the default settings.
To check which scaling driver is currently in use, read the scaling_driver file:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate is returned if the Intel P-State driver is loaded and in active mode.
intel_cpufreq is returned if the Intel P-State driver is loaded and in passive mode.
While using the Intel P-State driver, add the following argument to the kernel boot command line to
force the driver to run in passive mode:
intel_pstate=passive
To disable the Intel P-State driver and use, instead, the ACPI CPUfreq driver, add the following
argument to the kernel boot command line:
intel_pstate=disable
Procedure
Run the powertop command with the --html option to export an HTML report:
# powertop --html=htmlfile.html
Replace the htmlfile.html parameter with the required name for the output file.
Procedure
By default, powertop2tuned creates profiles in the /etc/tuned/ directory, and bases the custom profile
on the currently selected TuneD profile. For safety reasons, all PowerTOP tunings are initially disabled
in the new profile.
Use the --enable or -e option to generate a new profile that enables most of the tunings
suggested by PowerTOP.
Certain potentially problematic tunings, such as the USB autosuspend, are disabled by default
and need to be uncommented manually.
Prerequisites
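The powertop2tuned utility is provided by the tuned-utils package; if it is missing, it can typically be installed with:
# yum install tuned-utils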
Procedure
# powertop2tuned new_profile_name
Additional information
$ powertop2tuned --help
The powertop2tuned utility integrates PowerTOP into TuneD, which enables you to benefit from the advantages of both tools.
CHAPTER 17. TUNING CPU FREQUENCY TO OPTIMIZE ENERGY CONSUMPTION
idle-info
Displays the available idle states and other statistics for the CPU idle driver using the cpupower
idle-info command. For more information, see CPU Idle States .
idle-set
Enables or disables specific CPU idle state using the cpupower idle-set command as root. Use -d to
disable and -e to enable a specific CPU idle state.
frequency-info
Displays the current cpufreq driver and available cpufreq governors using the cpupower frequency-
info command. For more information, see CPUfreq drivers, Core CPUfreq Governors, and Intel P-
state CPUfreq governors.
frequency-set
Sets the CPU clock frequency and governor using the cpupower frequency-set command as root. For more information, see Setting up CPUfreq governor.
set
Sets processor power saving policies using the cpupower set command as root.
Using the --perf-bias option, you can enable software on supported Intel processors to determine
the balance between optimum performance and saving power. Assigned values range from 0 to 15,
where 0 is optimum performance and 15 is optimum power efficiency. By default, the --perf-bias
option applies to all cores. To apply it only to individual cores, add the --cpu cpulist option.
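For example, a minimal sketch (the value 5 and the CPU list are illustrative):
# cpupower set --perf-bias 5
# cpupower --cpu 0-3 set --perf-bias 5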
info
Displays processor power related and hardware configurations, which you have enabled using the
cpupower set command. For example, if you assign the --perf-bias value as 5:
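an illustrative sketch; the exact output format may differ between versions:
# cpupower info
analyzing CPU 0:
perf-bias: 5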
monitor
Displays the idle statistics and CPU demands using the cpupower monitor command.
# cpupower monitor
| Nehalem || Mperf ||Idle_Stats
CPU| C3 | C6 | PC3 | PC6 || C0 | Cx | Freq || POLL | C1 | C1E | C3 | C6 | C7s | C8 |
C9 | C10
0| 1.95| 55.12| 0.00| 0.00|| 4.21| 95.79| 3875|| 0.00| 0.68| 2.07| 3.39| 88.77| 0.00| 0.00|
0.00| 0.00
[...]
Using the -l option, you can list all available monitors on your system, and using the -m option, you can display information related to specific monitors. For example, to monitor information related to the Mperf monitor, use the cpupower monitor -m Mperf command as root.
Additional resources
CPU idle states (C-states) allow you to save power by partially deactivating CPUs that are not in use. There is no need to configure a C-state, unlike P-states that require a governor and potentially some set up to avoid
undesirable power or performance issues. C-states are numbered from C0 upwards, with higher
numbers representing decreased CPU functionality and greater power saving. C-states of a given
number are broadly similar across processors, although the exact details of the specific feature sets of
the state may vary between processor families. C-states 0–3 are defined as follows:
C0
In this state, the CPU is working and not idle at all.
C1, Halt
In this state, the processor is not executing any instructions but is typically not in a lower power state.
The CPU can continue processing with practically no delay. All processors offering C-states need to
support this state. Pentium 4 processors support an enhanced C1 state called C1E that actually is a
state for lower power consumption.
C2, Stop-Clock
In this state, the clock is frozen for this processor but it keeps the complete state for its registers and
caches, so after starting the clock again it can immediately start processing again. This is an optional
state.
C3, Sleep
In this state, the processor goes to sleep and does not need to keep its cache up to date. Due to this
reason, waking up from this state needs considerably more time than from the C2 state. This is an
optional state.
You can view the available idle states and other statistics for the CPUidle driver using the following
command:
$ cpupower idle-info
CPUidle governor: menu
analyzing CPU 0:
Intel CPUs with the "Nehalem" microarchitecture feature a C6 state, which can reduce the voltage
supply of a CPU to zero, but typically reduces power consumption by between 80% and 90%. The kernel
in Red Hat Enterprise Linux 8 includes optimizations for this new C-state.
Additional resources
CPU scaling can be done automatically depending on the system load, in response to Advanced
Configuration and Power Interface (ACPI) events, or manually by user-space programs, and it allows the
clock speed of the processor to be adjusted on the fly. This enables the system to run at a reduced clock
speed to save power. The rules for shifting frequencies, whether to a faster or slower clock speed and
when to shift frequencies, are defined by the CPUfreq governor.
You can view the cpufreq information using the cpupower frequency-info command as root.
The following two CPUfreq drivers are available:
ACPI CPUfreq
Advanced Configuration and Power Interface (ACPI) CPUfreq driver is a kernel driver that controls
the frequency of a particular CPU through ACPI, which ensures the communication between the
kernel and the hardware.
Intel P-state
In Red Hat Enterprise Linux 8, Intel P-state driver is supported. The driver provides an interface for
controlling the P-state selection on processors based on the Intel Xeon E series architecture or
newer architectures.
Currently, Intel P-state is used by default for supported CPUs. You can switch to using ACPI CPUfreq
by adding the intel_pstate=disable command to the kernel command line.
Intel P-state implements the setpolicy() callback. The driver decides what P-state to use based on
the policy requested from the cpufreq core. If the processor is capable of selecting its next P-state
internally, the driver offloads this responsibility to the processor. If not, the driver implements
algorithms to select the next P-state.
Intel P-state provides its own sysfs files to control the P-state selection. These files are located in
the /sys/devices/system/cpu/intel_pstate/ directory. Any changes made to the files are applicable
to all CPUs.
This directory contains the following files that are used for setting P-state parameters:
min_perf_pct limits the minimum P-state requested by the driver, expressed in a percentage
of the maximum no-turbo performance level.
no_turbo limits the driver to selecting P-state below the turbo frequency range.
turbo_pct displays the percentage of the total performance supported by hardware that is in
the turbo range. This number is independent of whether turbo has been disabled or not.
num_pstates displays the number of P-states that are supported by hardware. This number
is independent of whether turbo has been disabled or not.
Additional resources
cpufreq_performance
It forces the CPU to use the highest possible clock frequency. This frequency is statically set and
does not change. As such, this particular governor offers no power saving benefit. It is only suitable for hours of heavy workload, and even then, only during times wherein the CPU is rarely or never idle.
cpufreq_powersave
It forces the CPU to use the lowest possible clock frequency. This frequency is statically set and
does not change. This governor offers maximum power savings, but at the cost of the lowest CPU
performance. The term "powersave" can sometimes be deceiving though, since in principle a slow
CPU on full load consumes more power than a fast CPU that is not loaded. As such, while it may be
advisable to set the CPU to use the powersave governor during times of expected low activity, any
unexpected high loads during that time can cause the system to actually consume more power. The
Powersave governor is more of a speed limiter for the CPU than a power saver. It is most useful in
systems and environments where overheating can be a problem.
cpufreq_ondemand
It is a dynamic governor, using which you can enable the CPU to achieve maximum clock frequency
when the system load is high, and also minimum clock frequency when the system is idle. While this
allows the system to adjust power consumption accordingly with respect to system load, it does so at
the expense of latency between frequency switching. As such, latency can offset any performance or
power saving benefits offered by the ondemand governor if the system switches between idle and
heavy workloads too often. For most systems, the ondemand governor can provide the best
compromise between heat emission, power consumption, performance, and manageability. When
the system is only busy at specific times of the day, the ondemand governor automatically switches
between maximum and minimum frequency depending on the load without any further intervention.
cpufreq_userspace
It allows user-space programs, or any process running as root, to set the frequency. Of all the
governors, userspace is the most customizable and depending on how it is configured, it can offer
the best balance between performance and consumption for your system.
cpufreq_conservative
Similar to the ondemand governor, the conservative governor also adjusts the clock frequency
according to usage. However, the conservative governor switches between frequencies more
gradually. This means that the conservative governor adjusts to a clock frequency that it considers
best for the load, rather than simply choosing between maximum and minimum. While this can
possibly provide significant savings in power consumption, it does so at an even greater latency than the ondemand governor.
NOTE
You can enable a governor using cron jobs. This allows you to automatically set specific
governors during specific times of the day. As such, you can specify a low-frequency
governor during idle times, for example, after work hours, and return to a higher-
frequency governor during hours of heavy workload.
For instructions on how to enable a specific governor, see Setting up CPUfreq governor.
Using the cpupower frequency-info --governors command as root, you can view the available CPUfreq governors.
NOTE
The Intel P-state driver can operate in the following three different modes:
Active mode with hardware-managed P-states (HWP)
performance: With the performance governor, the driver instructs internal CPU logic to be
performance-oriented. The range of allowed P-states is restricted to the upper boundary of
the range that the driver is allowed to use.
powersave: With the powersave governor, the driver instructs internal CPU logic to be
powersave-oriented.
Active mode without HWP
performance: With the performance governor, the driver chooses the maximum P-state it
is allowed to use.
powersave: With the powersave governor, the driver chooses P-states proportional to the
current CPU utilization. The behavior is similar to the ondemand CPUfreq core governor.
Passive mode
When the passive mode is used, the Intel P-state driver functions the same as the traditional
CPUfreq scaling driver. All available generic CPUFreq core governors can be used.
Prerequisites
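The cpupower utility is part of the kernel-tools package; if it is missing, it can typically be installed with:
# yum install kernel-tools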
Procedure
1. View which governors are available for use for a specific CPU:
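A minimal sketch of this step and of enabling a governor on all CPUs (the performance governor is used here only as an example):
$ cpupower frequency-info --governors
# cpupower frequency-set --governor performance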
Replace the performance governor with the cpufreq governor name as per your requirement.
To only enable a governor on specific cores, use -c with a range or comma-separated list of
CPU numbers. For example, to enable the userspace governor for CPUs 1-3 and 5, use:
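A possible invocation (the governor name depends on the scaling driver in use):
# cpupower -c 1-3,5 frequency-set --governor userspace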
NOTE
If the kernel-tools package is not installed, the CPUfreq settings can be viewed in the
/sys/devices/system/cpu/cpuid/cpufreq/ directory. Settings and values can be changed
by writing to these tunables. For example, to set the minimum clock speed of cpu0 to
360 MHz, use:
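A sketch of such a change; the scaling_min_freq tunable takes a value in kHz:
# echo 360000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq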
Verification
# cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 400 MHz - 4.20 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 400 MHz and 4.20 GHz.
The governor "performance" may decide which speed to use within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 3.88 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
The current policy displays the recently enabled cpufreq governor. In this case, it is
performance.
Additional resources
CHAPTER 18. GETTING STARTED WITH PERF
Procedure
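On Red Hat Enterprise Linux 8, the perf tool is provided by the perf package, which you can install, for example, with:
# yum install perf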
Additional resources
CHAPTER 19. PROFILING CPU USAGE IN REAL TIME WITH PERF TOP
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
# perf top
Samples: 8K of event 'cycles', 2000 Hz, Event count (approx.): 4579432780 lost: 0/0 drop:
0/0
Overhead Shared Object Symbol
2.20% [kernel] [k] do_syscall_64
2.17% [kernel] [k] module_get_kallsym
1.49% [kernel] [k] copy_user_enhanced_fast_string
1.37% libpthread-2.29.so [.] pthread_mutex_lock
1.31% [unknown] [.] 0000000000000000
1.07% [kernel] [k] psi_task_change
1.04% [kernel] [k] switch_mm_irqs_off
0.94% [kernel] [k] fget
0.74% [kernel] [k] entry_SYSCALL_64
0.69% [kernel] [k] syscall_return_via_sysret
0.69% libxul.so [.] 0x000000000113f9b0
0.67% [kernel] [k] kallsyms_expand_symbol.constprop.0
0.65% firefox [.] moz_xmalloc
0.65% libpthread-2.29.so [.] __pthread_mutex_unlock_usercnt
0.60% firefox [.] free
0.60% libxul.so [.] 0x000000000241d1cd
0.60% [kernel] [k] do_sys_poll
In this example, the kernel function do_syscall_64 is using the most CPU time.
Additional resources
To display the function names or symbols in such a situation, the debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC).
NOTE
It is not necessary to re-run the perf record command after installing the debuginfo
associated with an executable. Simply re-run the perf report command.
Additional Resources
Procedure
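A sketch of this step using subscription-manager (the repository IDs follow the usual RHEL 8 naming pattern and may differ on your system):
# subscription-manager repos --enable rhel-8-for-$(uname -i)-baseos-debug-rpms \
--enable rhel-8-for-$(uname -i)-appstream-debug-rpms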
The $(uname -i) part is automatically replaced with a matching value for the architecture of your system.
Prerequisites
The application or library you want to debug must be installed on the system.
GDB and the debuginfo-install tool must be installed on the system. For details, see Setting up
to debug applications.
Repositories providing debuginfo and debugsource packages must be configured and enabled
on the system. For details, see Enabling debug and source repositories.
Procedure
1. Start GDB attached to the application or library you want to debug. GDB automatically
recognizes missing debugging information and suggests a command to run.
$ gdb -q /bin/ls
Reading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no
debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: dnf debuginfo-install coreutils-8.30-6.el8.x86_64
(gdb)
2. Exit GDB:
(gdb) q
3. Run the command suggested by GDB to install the required debuginfo packages:
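Continuing the example above, the suggested command is:
# dnf debuginfo-install coreutils-8.30-6.el8.x86_64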
The dnf package management tool provides a summary of the changes, asks for confirmation
and once you confirm, downloads and installs all the necessary files.
4. In case GDB is not able to suggest the debuginfo package, follow the procedure described in
Getting debuginfo packages for an application or library manually .
Additional resources
How can I download or install debuginfo packages for RHEL systems? — Red Hat
Knowledgebase solution
CHAPTER 20. COUNTING EVENTS DURING PROCESS EXECUTION WITH PERF STAT
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Running the perf stat command without root access will only count events occurring in the
user space:
$ perf stat ls
As you can see in the previous example, when perf stat runs without root access the event
names are followed by :u, indicating that these events were counted only in the user-space.
To count both user-space and kernel-space events, you must have root access when
running perf stat:
# perf stat ls
To count events over all CPUs, add the -a option:
# perf stat -a ls
Additional resources
3. When related metrics are available, a ratio or percentage is displayed after the hash sign (#) in
the right-most column.
For example, when running in default mode, perf stat counts both cycles and instructions and,
therefore, calculates and displays instructions per cycle in the right-most column. You can see
similar behavior with regard to branch-misses as a percent of all branches since both events are
counted by default.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
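One possible form of this step (ID1, ID2, and seconds are placeholders):
# perf stat -p ID1,ID2 sleep seconds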
The previous example counts events in the processes with the IDs of ID1 and ID2 for a time
period of seconds seconds as dictated by using the sleep command.
Additional resources
CHAPTER 21. RECORDING AND ANALYZING PERFORMANCE PROFILES WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
If you do not specify a command for perf record to record during, it will record until you manually stop
the process by pressing Ctrl+C. You can attach perf record to specific processes by passing the -p
option followed by one or more process IDs. You can run perf record without root access, however,
doing so will only sample performance data in the user space. In the default mode, perf record uses
CPU cycles as the sampling event and operates in per-thread mode with inherit mode enabled.
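For example, a minimal sketch of attaching perf record to an already running process for 30 seconds (1234 is a placeholder process ID):
# perf record -p 1234 sleep 30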
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Replace command with the command you want to sample data during. If you do not specify a
command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Replace command with the command you want to sample data during. If you do not specify a
command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
Replace command with the command you want to sample data during. If you do not specify a
command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
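One possible form of this step, using the frame pointer method (command is a placeholder for your workload):
# perf record --call-graph fp command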
Replace command with the command you want to sample data during. If you do not specify
a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
fp
Uses the frame pointer method. Depending on compiler optimization, such as with
binaries built with the GCC option --fomit-frame-pointer, this may not be able to unwind
the stack.
dwarf
Uses DWARF Call Frame Information to unwind the stack.
lbr
Uses the last branch record hardware on Intel processors.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
If the perf.data file was created with root access, you need to run perf report with root access
too.
Procedure
# perf report
Additional resources
In default mode, the functions are sorted in descending order with those with the highest overhead
displayed first.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
The kernel debuginfo package is installed. For more information, see Getting debuginfo
packages for an application or library using GDB.
Procedure
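A sketch of such a recording (seconds is a placeholder):
# perf record -a --call-graph fp sleep seconds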
This example would generate a perf.data over the entire system for a period of seconds
seconds as dictated by the use of the sleep command. It would also capture call graph data
using the frame pointer method.
# perf archive
Verification steps
Verify that the archive file has been generated in your current active directory:
# ls perf.data*
The output will display every file in your current directory that begins with perf.data. The archive
file will be named either:
perf.data.tar.gz
or
perf.data.tar.bz2
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
A perf.data file and associated archive file generated on a different device are present on the
current device being used.
Procedure
1. Copy both the perf.data file and the archive file into your current active directory.
2. Extract the archive file into the ~/.debug directory:
# mkdir -p ~/.debug
# tar xf perf.data.tar.bz2 -C ~/.debug
NOTE
# perf report
To display the function names or symbols in such a situation, the debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC).
NOTE
It is not necessary to re-run the perf record command after installing the debuginfo
associated with an executable. Simply re-run the perf report command.
Additional Resources
Procedure
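A sketch of this step using subscription-manager (the repository IDs follow the usual RHEL 8 naming pattern and may differ on your system):
# subscription-manager repos --enable rhel-8-for-$(uname -i)-baseos-debug-rpms \
--enable rhel-8-for-$(uname -i)-appstream-debug-rpms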
The $(uname -i) part is automatically replaced with a matching value for the architecture of your system.
Prerequisites
The application or library you want to debug must be installed on the system.
GDB and the debuginfo-install tool must be installed on the system. For details, see Setting up
to debug applications.
Repositories providing debuginfo and debugsource packages must be configured and enabled
on the system. For details, see Enabling debug and source repositories.
Procedure
1. Start GDB attached to the application or library you want to debug. GDB automatically
recognizes missing debugging information and suggests a command to run.
$ gdb -q /bin/ls
Reading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no
debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: dnf debuginfo-install coreutils-8.30-6.el8.x86_64
(gdb)
2. Exit GDB:
(gdb) q
3. Run the command suggested by GDB to install the required debuginfo packages:
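Continuing the example above, the suggested command is:
# dnf debuginfo-install coreutils-8.30-6.el8.x86_64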
The dnf package management tool provides a summary of the changes, asks for confirmation
and once you confirm, downloads and installs all the necessary files.
4. In case GDB is not able to suggest the debuginfo package, follow the procedure described in
Getting debuginfo packages for an application or library manually .
Additional resources
How can I download or install debuginfo packages for RHEL systems? — Red Hat
Knowledgebase solution
CHAPTER 22. INVESTIGATING BUSY CPUS WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
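One possible form of this step (seconds is a placeholder):
# perf stat -a -A sleep seconds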
The previous example displays counts of a default set of common hardware and software
events recorded over a time period of seconds seconds, as dictated by using the sleep
command, over each individual CPU in ascending order, starting with CPU0. As such, it may be
useful to specify an event such as cycles:
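For example (a sketch):
# perf stat -a -A -e cycles sleep seconds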
Prerequisites
You have the perf user space tool installed as described in Installing perf.
There is a perf.data file created with perf record in the current directory. If the perf.data file
was created with root access, you need to run perf report with root access too.
Procedure
Display the contents of the perf.data file for further analysis while sorting by CPU:
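One possible way to do this (a sketch):
# perf report --sort cpu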
You can sort by CPU and command to display more detailed information about where CPU
time is being spent:
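For example (a sketch):
# perf report --sort cpu,comm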
This example will list commands from all monitored CPUs by total overhead in descending
order of overhead usage and identify the CPU the command was executed on.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
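One possible form of this step (a sketch):
# perf top --sort cpu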
This example will list CPUs and their respective overhead in descending order of overhead
usage in real time.
You can sort by CPU and command for more detailed information of where CPU time is
being spent:
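For example (a sketch):
# perf top --sort cpu,comm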
This example will list commands by total overhead in descending order of overhead usage
and identify the CPU the command was executed on in real time.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
1. Sample and record the performance data in the specific CPUs, generating a perf.data file:
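A sketch of recording on CPUs 0 and 1 (seconds is a placeholder):
# perf record -C 0,1 sleep seconds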
The previous example samples and records data in CPUs 0 and 1 for a period of seconds
seconds as dictated by the use of the sleep command.
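A range of CPUs can be given as well, for example:
# perf record -C 0-2 sleep seconds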
The previous example samples and records data in all CPUs from CPU 0 to 2 for a period of
seconds seconds as dictated by the use of the sleep command.
# perf report
This example will display the contents of perf.data. If you are monitoring several CPUs and want
to know which CPU data was sampled on, see Displaying which CPU samples were taken on with
perf report.
CHAPTER 23. MONITORING APPLICATION PERFORMANCE WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
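One possible form of this step (ID1, ID2, and seconds are placeholders):
# perf record -p ID1,ID2 sleep seconds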
The previous example samples and records performance data of the processes with the process
ID’s ID1 and ID2 for a time period of seconds seconds as dictated by using the sleep
command. You can also configure perf to record events in specific threads:
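For example (a sketch; TID1 and TID2 are placeholder thread IDs):
# perf record -t TID1,TID2 sleep seconds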
NOTE
When using the -t flag and stipulating thread ID’s, perf disables inheritance by
default. You can enable inheritance by adding the --inherit option.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
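One possible form of this step, using the frame pointer method (command is a placeholder for your workload):
# perf record --call-graph fp command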
Replace command with the command you want to sample data during. If you do not specify
a command, then perf record will sample data until you manually stop it by pressing Ctrl+C.
fp
Uses the frame pointer method. Depending on compiler optimization, such as with
binaries built with the GCC option --fomit-frame-pointer, this may not be able to unwind
the stack.
dwarf
Uses DWARF Call Frame Information to unwind the stack.
lbr
Uses the last branch record hardware on Intel processors.
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
If the perf.data file was created with root access, you need to run perf report with root access
too.
Procedure
# perf report
Additional resources
CHAPTER 24. CREATING UPROBES WITH PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
1. Create the uprobe in the process or application you are interested in monitoring at a location of
interest within the process or application:
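A sketch of creating such a uprobe (the executable path and function name are placeholders):
# perf probe -x /path/to/executable -a function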
Additional resources
Prerequisites
You have the perf user space tool installed as described in Installing perf.
NOTE
To do this, the debuginfo package of the executable must be installed or, if the
executable is a locally developed application, the application must be compiled
with debugging information, the -g option in GCC.
Procedure
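One possible first step is to list the lines within the function where a uprobe can be placed, for example (the binary path is illustrative and taken from the listing below):
# perf probe -x /home/user/my_executable -L main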
<main@/home/user/my_executable:0>
0 int main(int argc, const char **argv)
1 {
int err;
const char *cmd;
char sbuf[STRERR_BUFSIZE];
/* libsubcmd init */
7 exec_cmd_init("perf", PREFIX, PERF_EXEC_PATH,
EXEC_PATH_ENVIRONMENT);
8 pager_init(PERF_PAGER_ENVIRONMENT);
# perf script
my_prog 1367 [007] 10802159.906593: probe_my_prog:isprime: (400551) a=2
my_prog 1367 [007] 10802159.906623: probe_my_prog:isprime: (400551) a=3
my_prog 1367 [007] 10802159.906625: probe_my_prog:isprime: (400551) a=4
my_prog 1367 [007] 10802159.906627: probe_my_prog:isprime: (400551) a=5
my_prog 1367 [007] 10802159.906629: probe_my_prog:isprime: (400551) a=6
my_prog 1367 [007] 10802159.906631: probe_my_prog:isprime: (400551) a=7
my_prog 1367 [007] 10802159.906633: probe_my_prog:isprime: (400551) a=13
my_prog 1367 [007] 10802159.906635: probe_my_prog:isprime: (400551) a=17
my_prog 1367 [007] 10802159.906637: probe_my_prog:isprime: (400551) a=19
CHAPTER 25. PROFILING MEMORY ACCESSES WITH PERF MEM
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
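One possible form of this step (seconds is a placeholder):
# perf mem record -a sleep seconds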
This example samples memory accesses across all CPUs for a period of seconds seconds as
dictated by the sleep command. You can replace the sleep command for any command during
which you want to sample memory access data. By default, perf mem samples both memory
loads and stores. You can select only one memory operation by using the -t option and
specifying either "load" or "store" between perf mem and record. For loads, information over
the memory hierarchy level, TLB memory accesses, bus snoops, and memory locks is captured.
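To open the recorded data for analysis, a possible next step is:
# perf mem report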
Available samples
35k cpu/mem-loads,ldlat=30/P
54k cpu/mem-stores/P
The cpu/mem-loads,ldlat=30/P line denotes data collected over memory loads and the
cpu/mem-stores/P line denotes data collected over memory stores. Highlight the category of
interest and press Enter to view the data:
Alternatively, you can sort your results to investigate different aspects of interest when
displaying the data. For example, to sort data over memory loads by type of memory accesses
occurring during the sampling period in descending order of overhead they account for:
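An illustrative invocation (the sort key is an assumption and may need adjusting for your perf version):
# perf mem -t load report --sort=mem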
Additional resources
IMPORTANT
Oftentimes, due to dynamic allocation of memory or stack memory being accessed, the
'Data Symbol' column will display a raw address.
In default mode, the functions are sorted in descending order with those with the highest overhead
displayed first.
CHAPTER 26. DETECTING FALSE SHARING
False sharing occurs when a processor core modifies a data item on a cache line that is also in use by other processors accessing unrelated data items on the same line. This initial modification requires that the other processors using the cache line invalidate their copy and request an updated one, despite the processors not needing, or even necessarily having access to, an updated version of the modified data item.
You can use the perf c2c command to detect false sharing.
Cache-line contention occurs when a processor core on a Symmetric Multi Processing (SMP) system
modifies data items on the same cache line that is in use by other processors. All other processors using
this cache-line must then invalidate their copy and request an updated one. This can lead to degraded
performance.
The perf c2c command supports the same options as perf record as well as some options exclusive to
the c2c subcommand. The recorded data is stored in a perf.data file in the current directory for later
analysis.
Prerequisites
The perf user space tool is installed. For more information, see installing perf.
Procedure
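One possible form of this step (seconds is a placeholder):
# perf c2c record -a sleep seconds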
This example samples and records cache-line contention data across all CPUs for a period of seconds as dictated by the sleep command. You can replace the sleep command with any command you want to collect cache-line contention data over.
Additional resources
Prerequisites
The perf user space tool is installed. For more information, see Installing perf.
A perf.data file recorded using the perf c2c command is available in the current directory. For
more information, see Detecting cache-line contention with perf c2c.
Procedure
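One possible form of this step:
# perf c2c report --stdio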
This command visualizes the perf.data file into several graphs within the terminal:
=================================================
Trace Event Information
=================================================
Total records : 329219
Locked Load/Store Operations : 14654
Load Operations : 69679
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 3972
Loads - no mapping : 0
Load Fill Buffer Hit : 11958
Load L1D hit : 17235
Load L2D hit : 21
Load LLC hit : 14219
Load Local HITM : 3402
Load Remote HITM : 12757
Load Remote HIT : 5295
Load Local DRAM : 976
Load Remote DRAM : 3246
Load MESI State Exclusive : 4222
Load MESI State Shared : 0
Load LLC Misses : 22274
LLC Misses to Local DRAM : 4.4%
LLC Misses to Remote DRAM : 14.6%
LLC Misses to Remote cache (HIT) : 23.8%
LLC Misses to Remote cache (HITM) : 57.3%
Store Operations : 259539
Store - uncacheable : 0
Store - no mapping : 11
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 55
Load HITs on shared lines : 55454
Fill Buffer Hits on shared lines : 10635
L1D hits on shared lines : 16415
L2D hits on shared lines : 0
LLC hits on shared lines : 8501
Locked Access on shared lines : 14351
Store HITs on shared lines : 109953
Store L1D hits on shared lines : 109449
Total Merged records : 126112
=================================================
c2c details
=================================================
Events : cpu/mem-loads,ldlat=30/P
: cpu/mem-stores/P
Cachelines sort on : Remote HITMs
Cacheline data groupping : offset,pid,iaddr
=================================================
Shared Data Cache Line Table
=================================================
#
# Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load
Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
# Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss
Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
# ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... .......
....... ....... ........ ........
#
0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468
727 2657 13747 40400 5355 16154 0 2875 529
1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65
200 3749 12128 5096 108 0 2056 652
2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1
1 15 99 25 50 0 6 1
3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7
20 156 50 59 0 27 4
4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10
25 62 11 1 0 24 7
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# ----- HITM ----- -- Store Refs -- Data address ---------- cycles --
-------- cpu Shared
# Num Rmt Lcl L1 Hit L1 Miss Offset Pid Code address rmt hitm lcl
-------------------------------------------------------------
1 2832 1119 0 0 0x602100
-------------------------------------------------------------
29.13% 36.19% 0.00% 0.00% 0x20 14604 0x400bb3 1964
1230 1788 2 [.] read_write_func no_false_sharing.exe
false_sharing_example.c:155 1{122} 2{144}
43.68% 34.41% 0.00% 0.00% 0x28 14604 0x400bcd 2274
1566 1793 2 [.] read_write_func no_false_sharing.exe
false_sharing_example.c:159 2{53} 3{170}
27.19% 29.40% 0.00% 0.00% 0x30 14604 0x400be7 2045
1247 2011 2 [.] read_write_func no_false_sharing.exe
false_sharing_example.c:163 0{96} 3{171}
This table provides a one line summary for the hottest cache lines where false sharing is detected
and is sorted in descending order by the amount of remote Hitm detected per cache line by default.
Shared Cache Line Distribution Pareto
This table provides a variety of information about each cache line experiencing contention:
The virtual address of each cache line is contained in the Data address Offset column and
followed subsequently by the offset into the cache line where different accesses occurred.
The Code Address column contains the instruction pointer code address.
The columns under the cycles label show average load latencies.
The cpu cnt column displays how many different CPUs samples came from (essentially, how
many different CPUs were waiting for the data indexed at that given location).
The Shared Object column displays the name of the ELF image where the samples come
from (the name [kernel.kallsyms] is used when the samples come from the kernel).
The Source:Line column displays the source file and line number.
The Node{cpu list} column displays which specific CPUs samples came from for each node.
Prerequisites
The perf user space tool is installed. For more information, see installing perf.
A perf.data file recorded using the perf c2c command is available in the current directory. For
more information, see Detecting cache-line contention with perf c2c.
Procedure
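1. Open the perf.data file, for example:
# perf c2c report --stdio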
2. In the "Trace Event Information" table, locate the row containing the values for LLC Misses to
Remote Cache (HITM):
The percentage in the value column of the LLC Misses to Remote Cache (HITM) row represents the percentage of LLC misses that were occurring across NUMA nodes in modified cache-lines and is a key indicator that false sharing has occurred.
3. Inspect the Rmt column of the LLC Load Hitm field of the Shared Data Cache Line Table:
=================================================
Shared Data Cache Line Table
=================================================
#
# Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- ---
Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit --
# Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss
Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt
# ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... .......
....... ....... ........ ........
#
0 0x602180 149904 77.09% 12103 2269 9834 109504 109036
468 727 2657 13747 40400 5355 16154 0 2875 529
1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65
200 3749 12128 5096 108 0 2056 652
2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1
1 15 99 25 50 0 6 1
3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7
20 156 50 59 0 27 4
4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0
10 25 62 11 1 0 24 7
This table is sorted in descending order by the amount of remote Hitm detected per cache line.
A high number in the Rmt column of the LLC Load Hitm section indicates false sharing and
requires further inspection of the cache line on which it occurred to debug the false sharing
activity.
CHAPTER 27. GETTING STARTED WITH FLAMEGRAPHS
Sampling stack traces is a common technique for profiling CPU performance with the perf tool.
Unfortunately, the results of profiling stack traces with perf can be extremely verbose and labor-
intensive to analyze. Flamegraphs are visualizations created from data recorded with perf to make
identifying hot code-paths faster and easier.
Procedure
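Flamegraph generation with perf relies on the js-d3-flame-graph package, which can typically be installed as follows (a sketch):
# yum install js-d3-flame-graph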
Prerequisites
Procedure
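One possible form of this step (the sampling frequency of 99 Hz is illustrative):
# perf script flamegraph -a -F 99 sleep 60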
This command samples and records performance data over the entire system for 60 seconds,
as stipulated by use of the sleep command, and then constructs the visualization which will be
stored in the current active directory as flamegraph.html. The command samples call-graph
data by default and takes the same arguments as the perf tool, in this particular case:
-a
Stipulates to record data over the entire system.
-F
To set the sampling frequency per second.
Verification steps
# xdg-open flamegraph.html
Prerequisites
Procedure
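One possible form of this step (ID1 and ID2 are placeholder process IDs; the sampling frequency of 99 Hz is illustrative):
# perf script flamegraph -a -F 99 -p ID1,ID2 sleep 60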
This command samples and records performance data of the processes with the process ID’s
ID1 and ID2 for 60 seconds, as stipulated by use of the sleep command, and then constructs
the visualization which will be stored in the current active directory as flamegraph.html. The
command samples call-graph data by default and takes the same arguments as the perf tool, in
this particular case:
-a
Stipulates to record data over the entire system.
-F
To set the sampling frequency per second.
-p
To stipulate specific process ID’s to sample and record data over.
Verification steps
# xdg-open flamegraph.html
The children of a stack in a given row are displayed based on the number of samples taken of each
respective function in descending order along the x-axis; the x-axis does not represent the passing of
time. The wider an individual box is, the more frequent it was on-CPU or part of an on-CPU ancestry at
the time the data was being sampled.
Procedure
To reveal the names of functions which may not have been displayed previously and to further investigate the data, click on a box within the flamegraph to zoom into the stack at that given location:
IMPORTANT
Additional resources
CHAPTER 28. MONITORING PROCESSES FOR PERFORMANCE BOTTLENECKS USING PERF CIRCULAR BUFFERS
The --overwrite option makes perf record store all data in an overwritable circular buffer. When the
buffer gets full, perf record automatically overwrites the oldest records which, therefore, never get
written to a perf.data file.
Using the --overwrite and --switch-output-event options together configures a circular buffer that
records and dumps data continuously until it detects the --switch-output-event trigger event. The
trigger event signals to perf record that something of interest to the user has occurred and to write the
data in the circular buffer to a perf.data file. This collects specific data you are interested in while
simultaneously reducing the overhead of the running perf process by not writing data you do not want
to a perf.data file.
Prerequisites
You have the perf user space tool installed as described in Installing perf.
You have placed a uprobe in the process or application you are interested in monitoring at a
location of interest within the process or application:
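For example (a sketch; the executable path and function name are placeholders):
# perf probe -x /path/to/executable -a function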
Procedure
Create the circular buffer with the uprobe as the trigger event:
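An illustrative sketch, assuming the uprobe created above is named probe_executable:function and ./executable is the program being monitored:
# perf record --overwrite -e cycles --switch-output-event probe_executable:function ./executable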
This example initiates the executable and collects cpu cycles, specified after the -e option, until
perf detects the uprobe, the trigger event specified after the --switch-output-event option. At
that point, perf takes a snapshot of all the data in the circular buffer and stores it in a unique
perf.data file identified by timestamp. This example produced a total of 2 snapshots, the last
perf.data file was forced by pressing Ctrl+c.
CHAPTER 29. ADDING AND REMOVING TRACEPOINTS FROM A RUNNING PERF COLLECTOR WITHOUT STOPPING OR RESTARTING PERF
Prerequisites
You have the perf user space tool installed as described in Installing perf.
Procedure
2. Run perf record with the control file setup and events you are interested in enabling:
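An illustrative sketch of preparing the control and ack named pipes and starting the collector with events initially disabled; all option values here are examples:
# mkfifo control ack
# perf record --control=fifo:control,ack -D -1 -e 'sched:*' -o perf.data -- sleep 3600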
In this example, declaring 'sched:*' after the -e option starts perf record with scheduler events.
Starting the read side of the control pipe triggers the following message in the first terminal:
Events disabled
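One possible command, written to the control pipe from a second terminal, to enable a tracepoint:
# echo 'enable sched:sched_process_fork' > control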
This command triggers perf to scan the current event list in the control file for the declared
event. If the event is present, the tracepoint is enabled and the following message appears in
the first terminal:
Once the tracepoint is enabled, the second terminal displays the output from perf detecting the
tracepoint:
Prerequisites
You have the perf user space tool installed as described in Installing perf.
You have added tracepoints to a running perf collector via the control pipe interface. For more
information, see Adding tracepoints to a running perf collector without stopping or restarting
perf.
Procedure
NOTE
This example assumes you have previously loaded scheduler events into the
control file and enabled the tracepoint sched:sched_process_fork.
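One possible command, written to the control pipe, to disable the tracepoint:
# echo 'disable sched:sched_process_fork' > control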
This command triggers perf to scan the current event list in the control file for the declared
event. If the event is present, the tracepoint is disabled and the following message appears in
the terminal used to configure the control pipe:
CHAPTER 30. PROFILING MEMORY ALLOCATION WITH NUMASTAT
The numastat tool displays data for each NUMA node separately. You can use this information to
investigate memory performance of your system or the effectiveness of different memory policies on
your system.
numa_hit
The number of pages that were successfully allocated to this node.
numa_miss
The number of pages that were allocated on this node because of low memory on the intended node.
Each numa_miss event has a corresponding numa_foreign event on another node.
numa_foreign
The number of pages initially intended for this node that were allocated to another node instead.
Each numa_foreign event has a corresponding numa_miss event on another node.
interleave_hit
The number of interleave policy pages successfully allocated to this node.
local_node
The number of pages successfully allocated on this node by a process on this node.
other_node
The number of pages allocated on this node by a process on another node.
NOTE
High numa_hit values and low numa_miss values (relative to each other) indicate
optimal performance.
Prerequisites
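The numastat tool is provided by the numactl package, which can typically be installed as follows (a sketch):
# yum install numactl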
Procedure
$ numastat
node0 node1
Additional resources
CHAPTER 31. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE CPU UTILIZATION
turbostat tool prints counter results at specified intervals to help administrators identify
unexpected behavior in servers, such as excessive power usage, failure to enter deep sleep
states, or system management interrupts (SMIs) being created unnecessarily.
numactl utility provides a number of options to manage processor and memory affinity. The
numactl package includes the libnuma library which offers a simple programming interface to
the NUMA policy supported by the kernel, and can be used for more fine-grained tuning than
the numactl application.
numastat tool displays per-NUMA node memory statistics for the operating system and its
processes, and shows administrators whether the process memory is spread throughout a
system or is centralized on specific nodes. This tool is provided by the numactl package.
numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and
resource usage within a system in order to dynamically improve NUMA resource allocation and
management.
/proc/interrupts file displays the interrupt request (IRQ) number, the number of similar
interrupt requests handled by each processor in the system, the type of interrupt sent, and a
comma-separated list of devices that respond to the listed interrupt request.
pqos utility is available in the intel-cmt-cat package. It monitors CPU cache and memory
bandwidth on recent Intel processors. It monitors:
The size in kilobytes that the program executing in a given CPU occupies in the LLC.
taskset tool is provided by the util-linux package. It allows administrators to retrieve and set
the processor affinity of a running process, or launch a process with a specified processor
affinity.
Additional resources
The following are the two primary types of topology used in modern computing:
SMP (Symmetric Multi-Processor) topology, where all processors access memory in the same amount of time.
NUMA (Non-Uniform Memory Access) topology, where processors are grouped into nodes and access to a node's local memory is faster than access to memory on other nodes.
Multi-threaded applications that are sensitive to performance may benefit from being configured to
execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends
on your system and the requirements of your application. If multiple application threads access the
same cached data, then configuring those threads to execute on the same processor may be
suitable. However, if multiple threads that access and cache different data execute on the same
processor, each thread may evict cached data accessed by a previous thread. This means that each
thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in
the cache. Use the perf tool to check for an excessive number of cache misses.
Procedure
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36
node 0 size: 65415 MB
node 0 free: 43971 MB
[...]
To gather the information about the CPU architecture, such as the number of CPUs, threads,
cores, sockets, and NUMA nodes:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.204
BogoMIPS: 4787.85
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s): 2,6,10,14,18,22,26,30,34,38
NUMA node2 CPU(s): 1,5,9,13,17,21,25,29,33,37
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39
Additional resources
By default, Red Hat Enterprise Linux 8 uses a tickless kernel, which does not interrupt idle CPUs in order
to reduce power usage and allow new processors to take advantage of deep sleep states.
Red Hat Enterprise Linux 8 also offers a dynamic tickless option, which is useful for latency-sensitive
workloads, such as high performance computing or realtime computing. By default, the dynamic tickless
option is disabled. Red Hat recommends using the cpu-partitioning TuneD profile to enable the
dynamic tickless option for cores specified as isolated_cores.
This procedure describes how to manually and persistently enable dynamic tickless behavior.
Procedure
1. To enable dynamic tickless behavior in certain cores, specify those cores on the kernel
command line with the nohz_full parameter. On a 16 core system, enable the nohz_full=1-15
kernel option:
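For example, a minimal sketch of setting the option persistently, assuming the grubby tool is used to edit the kernel command line for all installed kernels:
# grubby --update-kernel=ALL --args="nohz_full=1-15"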
This enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the
only unspecified core (core 0).
2. When the system boots, manually move the rcu threads to the non-latency-sensitive core, in
this case core 0:
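A minimal sketch of such a move, assuming the RCU kernel threads can be matched with pgrep and pinned with taskset (the exact thread names vary by kernel version):
# for pid in $(pgrep rcu); do taskset -pc 0 $pid; done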
3. Optional: Use the isolcpus parameter on the kernel command line to isolate certain cores from
user-space tasks.
4. Optional: Set the CPU affinity for the kernel’s write-back bdi-flush threads to the
housekeeping core:
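A sketch, assuming the writeback workqueue exposes its CPU mask in sysfs; the value 1 is a CPU mask that selects core 0:
# echo 1 > /sys/bus/workqueue/devices/writeback/cpumask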
Verification steps
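A sketch of such a measurement, assuming the perf tool and the irq_vectors:local_timer_entry tracepoint shown in the output below:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3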
This command measures ticks on CPU 1 while telling CPU 1 to sleep for 3 seconds.
The default kernel timer configuration shows around 3100 ticks on a regular CPU:
3,107 irq_vectors:local_timer_entry
With the dynamic tickless kernel configured, you should see around 4 ticks instead:
4 irq_vectors:local_timer_entry
Additional resources
How to verify the list of "isolated" and "nohz_full" CPU information from sysfs? Red Hat
Knowledgebase article
Because an interrupt halts normal operation, high interrupt rates can severely degrade system
performance. It is possible to reduce the amount of time taken by interrupts by configuring interrupt
affinity or by sending a number of lower priority interrupts in a batch (coalescing a number of interrupts).
Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that
handle the interrupt request. To improve application performance, assign interrupt affinity and process
affinity to the same processor, or processors on the same core. This allows the specified interrupt and
application threads to share cache lines.
On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request
sets up the hardware so that the decision to service an interrupt with a particular processor is made at
the hardware level with no intervention from the kernel.
Procedure
1. Check which devices correspond to the interrupt requests that you want to configure.
2. Find the hardware specification for your platform. Check if the chipset on your system supports
distributing interrupts.
a. If it does, you can configure interrupt delivery as described in the following steps.
Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes
have options to configure interrupt delivery.
b. If it does not, your chipset always routes all interrupts to a single, static CPU. You cannot
configure which CPU is used.
3. Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your
system:
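For example, one way to check this is to search the boot messages for APIC routing information (the exact wording of the message varies):
$ journalctl --dmesg | grep APIC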
In the output:
If your system uses a mode other than flat, you can see a line similar to Setting APIC
routing to physical flat.
If you can see no such message, your system uses flat mode.
If your system uses x2apic mode, you can disable it by adding the nox2apic option to the
kernel command line in the bootloader configuration.
Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This
mode is available only for systems that have up to 8 CPUs.
4. Calculate the smp_affinity mask. For more information about how to calculate the
smp_affinity mask, see Setting the smp_affinity mask.
Additional resources
The default value of the mask is f, which means that an interrupt request can be handled on any
processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt.
Procedure
1. In binary, use the value 1 for CPUs that handle the interrupts. For example, to set CPU 0 and
CPU 7 to handle interrupts, use 0000000010000001 as the binary code:
CPU     15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Binary   0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  1
2. Convert the binary code to hexadecimal. In this example, 0000000010000001 in binary is 0x81 in hexadecimal.
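For example, a quick way to do the conversion, assuming a bash shell:
$ printf '%x\n' "$((2#0000000010000001))"
81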
On systems with more than 32 processors, you must delimit the smp_affinity values for
discrete 32 bit groups. For example, if you want only the first 32 processors of a 64 processor
system to service an interrupt request, use 0xffffffff,00000000.
3. The interrupt affinity value for a particular interrupt request is stored in the associated
/proc/irq/irq_number/smp_affinity file. Set the smp_affinity mask in this file:
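For example, a sketch that applies the 0x81 mask calculated above to a hypothetical interrupt request number 32:
# echo 81 > /proc/irq/32/smp_affinity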
Additional resources
CHAPTER 32. TUNING SCHEDULING POLICY
For example, say an application on a NUMA system is running on Node A when a processor on Node B
becomes available. To keep the processor on Node B busy, the scheduler moves one of the
application’s threads to Node B. However, the application thread still requires access to memory on
Node A. This memory now takes longer to access because the thread is running on Node B and Node A memory is no longer local to the thread. Thus, it may take longer for the thread to finish running
on Node B than it would have taken to wait for a processor on Node A to become available, and then to
execute the thread on the original node with local memory access.
Normal policies
Normal threads are used for tasks of normal priority.
Realtime policies
Realtime policies are used for time-sensitive tasks that must complete without interruptions.
Realtime threads are not subject to time slicing. This means a realtime thread runs until it blocks, exits, voluntarily yields, or is preempted by a higher priority thread.
The lowest priority realtime thread is scheduled before any thread with a normal policy. For more
information, see Static priority scheduling with SCHED_FIFO and Round robin priority scheduling
with SCHED_RR.
Additional resources
When SCHED_FIFO is in use, the scheduler scans the list of all the SCHED_FIFO threads in order of
priority and schedules the highest priority thread that is ready to run. The priority level of a
SCHED_FIFO thread can be any integer from 1 to 99, where 99 is treated as the highest priority.
Red Hat recommends starting with a lower number and increasing priority only when you identify latency
issues.
WARNING
Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority of 99. This keeps your process at the same priority
level as migration and watchdog threads; if your thread goes into a computational
loop and these threads are blocked, they will not be able to run. Systems with a
single processor will eventually hang in this situation.
Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from
initiating realtime tasks that monopolize the processor.
/proc/sys/kernel/sched_rt_period_us
This parameter defines the time period, in microseconds, that is considered to be one hundred
percent of the processor bandwidth. The default value is 1000000 µs, or 1 second.
/proc/sys/kernel/sched_rt_runtime_us
This parameter defines the time period, in microseconds, that is devoted to running real-time
threads. The default value is 950000 µs, or 0.95 seconds.
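For example, a sketch of displaying and lowering the realtime bandwidth with sysctl; the value 900000 is only illustrative:
# sysctl kernel.sched_rt_period_us kernel.sched_rt_runtime_us
# sysctl -w kernel.sched_rt_runtime_us=900000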
Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The
scheduler scans the list of all SCHED_RR threads in order of priority and schedules the highest priority
thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are
scheduled in a round-robin style within a certain time slice.
You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms kernel
parameter in the /proc/sys/kernel/sched_rr_timeslice_ms file. The lowest value is 1 millisecond.
When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value
of each process thread. Administrators can change the niceness value of a process, but cannot change
the scheduler’s dynamic priority list directly.
Procedure
# ps
Use the --pid or -p option with the ps command to view the details of the particular PID.
# chrt -p 468
pid 468's current scheduling policy: SCHED_FIFO
pid 468's current scheduling priority: 85
# chrt -p 476
pid 476's current scheduling policy: SCHED_OTHER
pid 476's current scheduling priority: 0
a. For example, to set the process with PID 1000 to SCHED_FIFO, with a priority of 50:
# chrt -f -p 50 1000
b. For example, to set the process with PID 1000 to SCHED_OTHER, with a priority of 0:
# chrt -o -p 0 1000
c. For example, to set the process with PID 1000 to SCHED_RR, with a priority of 10:
# chrt -r -p 10 1000
d. To start a new application with a particular policy and priority, specify the name of the
application:
# chrt -f 36 /bin/my-app
Additional resources
The following table describes the appropriate policy options, which can be used to set the scheduling
policy of a process.
The boot process priority change is done by using the following directives in the service section:
CPUSchedulingPolicy=
Sets the CPU scheduling policy for executed processes. It is used to set other, fifo, and rr policies.
CPUSchedulingPriority=
Sets the CPU scheduling priority for executed processes. The available priority range depends on
the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest
priority) and 99 (highest priority) can be used.
The following procedure describes how to change the priority of a service, during the boot process,
using the mcelog service.
Prerequisites
Procedure
# tuna --show_threads
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
1 OTHER 0 0xff 3181 292 systemd
2 OTHER 0 0xff 254 0 kthreadd
3 OTHER 0 0xff 2 0 rcu_gp
4 OTHER 0 0xff 2 0 rcu_par_gp
6 OTHER 0 0 9 0 kworker/0:0H-kblockd
7 OTHER 0 0xff 1301 1 kworker/u16:0-events_unbound
8 OTHER 0 0xff 2 0 mm_percpu_wq
9 OTHER 0 0 266 0 ksoftirqd/0
[...]
2. Create a supplementary mcelog service configuration directory file and insert the policy name
and priority in this file:
# cat << EOF > /etc/systemd/system/mcelog.service.d/priority.conf
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20
EOF
# systemctl daemon-reload
Verification steps
# tuna -t mcelog -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
826 FIFO 20 0,1,2,3 13 0 mcelog
Additional resources
The following table describes the priority range, which can be used while setting the scheduling policy of
a process.
Prior to Red Hat Enterprise Linux 8, the low-latency Red Hat documentation described the numerous
low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 8, you can perform
low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily
customizable according to the requirements for individual low-latency applications.
The following figure is an example to demonstrate how to use the cpu-partitioning profile. This
example uses the CPU and node layout.
The list of isolated CPUs is comma-separated or you can specify a range using a dash, such as 3-5.
This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping
CPU.
Specifying the no_balance_cores option is optional, however any CPUs in this list must be a subset
of the CPUs listed in the isolated_cores list.
Application threads using these CPUs need to be pinned individually to each CPU.
Housekeeping CPUs
Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a
housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable
kernel threads, interrupt handlers, and kernel timers are permitted to execute.
Additional resources
One dedicated reader thread that reads data from the network will be pinned to CPU 2.
A large number of threads that process this network data will be pinned to CPUs 4-23.
A dedicated writer thread that writes the processed data to the network will be pinned to CPU
3.
Prerequisites
You have installed the cpu-partitioning TuneD profile by using the yum install tuned-profiles-
cpu-partitioning command as root.
Procedure
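A minimal sketch of the configuration that typically precedes the reboot, assuming the isolated_cores variable in /etc/tuned/cpu-partitioning-variables.conf described earlier and CPUs 2-23 from this example:
# echo "isolated_cores=2-23" >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning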
3. Reboot
After rebooting, the system is tuned for low-latency, according to the isolation in the cpu-
partitioning figure. The application can use taskset to pin the reader and writer threads to CPUs
2 and 3, and the remaining application threads on CPUs 4-23.
Additional resources
For example, the cpu-partitioning profile sets the CPUs to use cstate=1. To use the cpu-partitioning profile but additionally change the CPU C-state from C1 to C0, the following procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile and then sets the C-state to C0.
Procedure
# mkdir /etc/tuned/my_profile
2. Create a tuned.conf file in this directory, and add the following content:
# vi /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized tuning on top of cpu-partitioning
include=cpu-partitioning
[cpu]
force_latency=cstate.id:0|1
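To activate the customized profile, a sketch assuming the tuned-adm command used elsewhere in this document:
# tuned-adm profile my_profile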
NOTE
In the shared example, a reboot is not required. However, if the changes in the my_profile
profile require a reboot to take effect, then reboot your machine.
Additional resources
CHAPTER 33. FACTORS AFFECTING I/O AND FILE SYSTEM PERFORMANCE
I/O and file system performance can be affected by any of the following factors:
Sequential or random
Buffered or Direct IO
Block size
Pre-fetching data
File fragmentation
Resource contention
vmstat tool reports on processes, memory, paging, block I/O, interrupts, and CPU activity
across the entire system. It can help administrators determine whether the I/O subsystem is
responsible for any performance issues. If analysis with vmstat shows that the I/O subsystem is
responsible for reduced performance, administrators can use the iostat tool to determine the
responsible I/O device.
iostat reports on I/O device load in your system. It is provided by the sysstat package.
blktrace provides detailed information about how time is spent in the I/O subsystem. The
companion utility blkparse reads the raw output from blktrace and produces a human readable
summary of input and output operations recorded by blktrace.
btt analyzes blktrace output and displays the amount of time that data spends in each area of
the I/O stack, making it easier to spot bottlenecks in the I/O subsystem. This utility is provided
as part of the blktrace package. Some of the important events tracked by the blktrace mechanism and analyzed by btt include queueing, dispatch, and completion events.
iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block
Address (LBA) of disk I/O, throughput in megabytes per second, the number of seeks per
second, and I/O operations per second. This can help to identify when you are hitting the
operations-per-second limit of a device.
BPF Compiler Collection (BCC) is a library, which facilitates the creation of the extended
Berkeley Packet Filter (eBPF) programs. The eBPF programs are triggered on events, such as
disk I/O, TCP connections, and process creations. The BCC tools are installed in the
/usr/share/bcc/tools/ directory. The following bcc-tools help to analyze performance:
biolatency summarizes the latency in block device I/O (disk I/O) in a histogram. This allows
the distribution to be studied, including two modes for device cache hits and for cache
misses, and latency outliers.
biosnoop is a basic block I/O tracing tool for displaying each I/O event along with the
issuing process ID, and the I/O latency. Using this tool, you can investigate disk I/O
performance issues.
ext4slower, nfsslower, and xfsslower are tools that show file system operations slower
than a certain threshold, which defaults to 10ms.
For more information, see Analyzing system performance with BPF Compiler Collection.
bpftrace is a tracing language for eBPF used for analyzing performance issues. It also provides
trace utilities like BCC for system observation, which is useful for investigating I/O performance
issues.
The following SystemTap scripts may be useful in diagnosing storage or file system
performance problems:
disktop.stp: Checks the status of reading or writing disk every 5 seconds and outputs the
top ten entries during that period.
iotime.stp: Prints the amount of time spent on read and write operations, and the number
of bytes read and written.
traceio.stp: Prints the top ten executable based on cumulative I/O traffic observed, every
second.
traceio2.stp: Prints the executable name and process identifier as reads and writes to the
specified device occur.
inodewatch.stp: Prints the executable name and process identifier each time a read or
write occurs to the specified inode on the specified major or minor device.
inodewatch2.stp: Prints the executable name, process identifier, and attributes each time
the attributes are changed on the specified inode on the specified major or minor device.
Additional resources
vmstat(8), iostat(1), blktrace(8), blkparse(1), btt(1), bpftrace, and iowatcher(1) man pages
The following are the options available before formatting a storage device:
Size
Create an appropriately-sized file system for your workload. Smaller file systems require less time
and memory for file system checks. However, if a file system is too small, its performance suffers
from high fragmentation.
Block size
The block is the unit of work for the file system. The block size determines how much data can be
stored in a single block, and therefore the smallest amount of data that is written or read at one time.
The default block size is appropriate for most use cases. However, your file system performs better
and stores data more efficiently if the block size or the size of multiple blocks is the same as or
slightly larger than the amount of data that is typically read or written at one time. A small file still
uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime
overhead.
Additionally, some file systems are limited to a certain number of blocks, which in turn limits the
maximum size of the file system. Block size is specified as part of the file system options when
formatting a device with the mkfs command. The parameter that specifies the block size varies with
the file system.
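For example, a sketch of how the block size parameter differs between file systems, using a hypothetical device /dev/sdX:
# mkfs.ext4 -b 4096 /dev/sdX
# mkfs.xfs -b size=4096 /dev/sdX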
Geometry
File system geometry is concerned with the distribution of data across a file system. If your system
uses striped storage, like RAID, you can improve performance by aligning data and metadata with the
underlying storage geometry when you format the device.
Many devices export recommended geometry, which is then set automatically when the devices are
formatted with a particular file system. If your device does not export these recommendations, or you
want to change the recommended settings, you must specify geometry manually when you format
the device with the mkfs command.
The parameters that specify file system geometry vary with the file system.
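For example, a sketch of specifying stripe geometry manually for a hypothetical striped device /dev/sdX with a 64 KiB stripe unit across 4 data disks:
# mkfs.xfs -d su=64k,sw=4 /dev/sdX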
External journals
Journaling file systems document the changes that will be made during a write operation in a journal
file prior to the operation being executed. This reduces the likelihood that a storage device will
become corrupted in the event of a system crash or power failure, and speeds up the recovery
process.
NOTE
Red Hat does not recommend using the external journals option.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more
memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a
device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as,
or faster than, the primary storage.
WARNING
Ensure that external journals are reliable. Losing an external journal device causes
file system corruption. External journals must be created at format time, with
journal devices being specified at mount time.
Additional resources
Access Time
Every time a file is read, its metadata is updated with the time at which access occurred (atime). This
involves additional write I/O. The relatime is the default atime setting for most file systems.
However, if updating this metadata is time consuming, and if accurate access time data is not
required, you can mount the file system with the noatime mount option. This disables updates to
metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata
when a directory is read.
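For example, a minimal sketch with a hypothetical device and mount point:
# mount -o noatime /dev/sdX /mnt/data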
NOTE
Disabling atime updates by using the noatime mount option can break applications that
rely on them, for example, backup programs.
Read-ahead
Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon
and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The
higher the read-ahead value, the further ahead the system pre-fetches data.
Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects
about your file system. However, accurate detection is not always possible. For example, if a storage
array presents itself to the system as a single LUN, the system detects the single LUN, and does not
set the appropriate read-ahead value for an array.
Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values.
The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead
value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
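For example, a sketch of checking and raising the read-ahead value through sysfs for a hypothetical device sdX; the value is in KiB:
# cat /sys/block/sdX/queue/read_ahead_kb
# echo 4096 > /sys/block/sdX/queue/read_ahead_kb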
Additional resources
Batch discard
This type of discard is part of the fstrim command. It discards all unused blocks in a file system that
match criteria specified by the administrator. Red Hat Enterprise Linux 8 supports batch discard on
XFS and ext4 formatted devices that support physical discard operations.
Online discard
This type of discard operation is configured at mount time with the discard option, and runs in real
time without user intervention. However, it only discards blocks that are transitioning from used to
free. Red Hat Enterprise Linux 8 supports online discard on XFS and ext4 formatted devices.
Red Hat recommends batch discard, except where online discard is required to maintain
performance, or where batch discard is not feasible for the system’s workload.
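For example, a sketch of a batch discard run on a hypothetical mount point:
# fstrim --verbose /mnt/data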
Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This
can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 8
supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from
pre-allocating space by using the fallocate(2) glibc call.
Additional resources
Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The
degree of degradation varies by vendor, but all devices experience degradation in this circumstance.
Enabling discard behavior can help to alleviate this degradation. For more information, see Types of
discarding unused blocks.
The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the
following factors when configuring settings that can affect SSD performance:
I/O Scheduler
Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage
type, Red Hat recommends benchmarking to determine the optimal configuration for a given
workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking
particular workloads. For instructions on how to switch between I/O schedulers, see the
/usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.
For single queue HBA, the default I/O scheduler is deadline. For multiple queue HBA, the default
I/O scheduler is none. For information about how to set the I/O scheduler, see Setting the disk
scheduler.
Virtual Memory
Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSDs, you can try turning down the vm.dirty_background_ratio and vm.dirty_ratio settings, as increased write-out activity does not usually have a negative impact on the latency of
other operations on the disk. However, this tuning can generate more overall I/O, and is therefore
not generally recommended without workload-specific testing.
Swap
An SSD can also be used as a swap device, and is likely to produce good page-out and page-in
performance.
The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all
I/O schedulers:
add_random
Some I/O events contribute to the entropy pool for /dev/random. You can set this parameter to 0 if the overhead of these contributions becomes measurable.
iostats
By default, iostats is enabled and the default value is 1. Setting iostats value to 0 disables the
gathering of I/O statistics for the device, which removes a small amount of overhead with the I/O
path. Setting iostats to 0 might slightly improve performance for very high performance devices,
such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless
otherwise specified for the given storage model by the vendor.
If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The /proc/diskstats file is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.
max_sectors_kb
Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum
value for this parameter is determined by the logical block size of the storage device. The maximum
value for this parameter is determined by the value of the max_hw_sectors_kb.
Red Hat recommends that max_sectors_kb always be a multiple of the optimal I/O size and the internal erase block size. If either of these values is zero or not specified by the storage device, use the logical_block_size value instead.
nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for
debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To
disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges
to 2.
nr_requests
This is the maximum number of queued I/O requests. If the current I/O scheduler is none, this number can only be reduced; otherwise, the number can be increased or reduced.
optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported,
Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size
wherever possible.
read_ahead_kb
Defines the maximum number of kilobytes that the operating system may read ahead during a
sequential read operation. As a result, the necessary information is already present within the kernel
page cache for the next sequential read, which improves read I/O performance.
Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be
mapped is a good starting point, but increasing the read_ahead_kb value up to request queue’s
max_sectors_kb of the disk might improve performance in application environments where
sequential reading of large files takes place.
rotational
Some solid-state disks do not correctly advertise their solid-state status, and are mounted as
traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-
reducing logic in the scheduler.
rq_affinity
The default value of rq_affinity is 1. With this value, I/O operations are completed on a CPU core that is in the same CPU group as the core that issued the I/O. To perform completions only on the processor that issued the I/O request, set rq_affinity to 2. To disable both of these behaviors, set it to 0.
scheduler
To set the scheduler or scheduler preference order for a particular storage device, edit the
/sys/block/devname/queue/scheduler file, where devname is the name of the device you want to
configure.
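For example, a sketch for a hypothetical device sdX; the schedulers that are available depend on the kernel configuration:
# cat /sys/block/sdX/queue/scheduler
[mq-deadline] kyber bfq none
# echo kyber > /sys/block/sdX/queue/scheduler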
CHAPTER 34. TUNING THE NETWORK PERFORMANCE
34.1.1. Increasing the ring buffer size to reduce a high packet drop rate by using
nmcli
Increase the size of an Ethernet device’s ring buffers if the packet drop rate causes applications to
report a loss of data, timeouts, or other issues.
Receive ring buffers are shared between the device driver and network interface controller (NIC). The
card assigns a transmit (TX) and receive (RX) ring buffer. As the name implies, the ring buffer is a
circular buffer where an overflow overwrites existing data. There are two ways to move data from the
NIC to the kernel, hardware interrupts and software interrupts, also called SoftIRQs.
The kernel uses the RX ring buffer to store incoming packets until the device driver can process them.
The device driver drains the RX ring, typically by using SoftIRQs, which puts the incoming packets into a
kernel data structure called an sk_buff or skb to begin its journey through the kernel and up to the
application that owns the relevant socket.
The kernel uses the TX ring buffer to hold outgoing packets which should be sent to the network. These
ring buffers reside at the bottom of the stack and are a crucial point at which packet drop can occur,
which in turn will adversely affect network performance.
Procedure
# ethtool -S enp1s0
...
rx_queue_0_drops: 97326
rx_queue_1_drops: 63783
...
Note that the output of the command depends on the network card and the driver.
High values in discard or drop counters indicate that the available buffer fills up faster than the
kernel can process the packets. Increasing the ring buffers can help to avoid such loss.
# ethtool -g enp1s0
Ring parameters for enp1s0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 16320
TX: 4096
Current hardware settings:
RX: 255
RX Mini: 0
RX Jumbo: 0
TX: 255
If the values in the Pre-set maximums section are higher than in the Current hardware
settings section, you can change the settings in the next steps.
IMPORTANT
Depending on the driver your NIC uses, changing the ring buffer size can briefly interrupt the network connection.
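A sketch of increasing both ring buffers, assuming a connection profile named Example-Connection and a NetworkManager version that supports the ethtool.ring-rx and ethtool.ring-tx properties:
# nmcli connection modify Example-Connection ethtool.ring-rx 4096 ethtool.ring-tx 4096
# nmcli connection up Example-Connection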
Additional resources
34.1.2. Tuning the network device backlog queue to avoid packet drops
When a network card receives packets and before the kernel protocol stack processes them, the kernel
stores these packets in backlog queues. The kernel maintains a separate queue for each CPU core.
If the backlog queue for a core is full, the kernel drops all further incoming packets that the
netif_receive_skb() kernel function assigns to this queue. If the server contains a 10 Gbps or faster
network adapter or multiple 1 Gbps adapters, tune the backlog queue size to avoid this problem.
Prerequisites
Procedure
1. To determine whether tuning the backlog queue is needed, display the counters in the
/proc/net/softnet_stat file:
# awk '{for (i=1; i<=NF; i++) printf strtonum("0x" $i) (i==NF?"\n":" ")}'
/proc/net/softnet_stat | column -t
221951548 0 0 0 0 0 0 0 0 0 0 0 0
192058677 18862 0 0 0 0 0 0 0 0 0 0 1
455324886 0 0 0 0 0 0 0 0 0 0 0 2
...
This awk command converts the values in /proc/net/softnet_stat from hexadecimal to decimal
format and displays them in table format. Each line represents a CPU core starting with core 0.
Second column: The number of dropped frames because of a full backlog queue
2. If the values in the second column of the /proc/net/softnet_stat file increment over time,
increase the size of the backlog queue:
a. Display the current default backlog queue size:
# sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000
b. Create the /etc/sysctl.d/10-netdev_max_backlog.conf file and, for example, double the value:
net.core.netdev_max_backlog = 2000
c. Load the setting from the file:
# sysctl -p /etc/sysctl.d/10-netdev_max_backlog.conf
Verification
# awk '{for (i=1; i<=NF; i++) printf strtonum("0x" $i) (i==NF?"\n":" ")}'
/proc/net/softnet_stat | column -t
If the values still increase, double the net.core.netdev_max_backlog value again. Repeat this
process until the packet drop counters no longer increase.
34.1.3. Increasing the transmit queue length of a NIC to reduce the number of
transmit errors
The kernel stores packets in a transmit queue before transmitting them. The default length (1000
packets) is typically sufficient for 10 Gbps, and often also for 40 Gbps networks. However, in faster
networks, or if you encounter an increasing number of transmit errors on an adapter, increase the queue
length.
Procedure
In this example, the transmit queue length (qlen) of the enp1s0 interface is 1000.
2. Monitor the dropped packets counter of a network interface’s software transmit queue:
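For example, a sketch using the tc utility, which reports a dropped counter for the interface's queueing discipline:
# tc -s qdisc show dev enp1s0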
3. If you encounter a high or increasing transmit error count, set a higher transmit queue length:
Create the /etc/NetworkManager/dispatcher.d/99-set-tx-queue-length-up NetworkManager dispatcher script with the following content:
#!/bin/bash
# Set TX queue length on enp1s0 to 2000
if [ "$1" == "enp1s0" ] && [ "$2" == "up" ]; then
    ip link set dev "$1" txqueuelen 2000
fi
Make the script executable:
# chmod +x /etc/NetworkManager/dispatcher.d/99-set-tx-queue-length-up
Verification
If the dropped counter still increases, double the transmit queue length again. Repeat this
process until the counter no longer increases.
The hard interrupt handler then leaves the majority of packet reception to a software interrupt request
(SoftIRQ) process. The kernel can schedule these processes more fairly.
The kernel stores the interrupt counters in the /proc/interrupts file. To display the counters for a
specific NIC, such as enp1s0, enter:
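For example, assuming the enp1s0 interface:
# grep -E 'CPU|enp1s0' /proc/interrupts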
Each queue has an interrupt vector in the first column assigned to it. The kernel initializes these
vectors when the system boots or when a user loads the NIC driver module. Each receive (RX) and
transmit (TX) queue is assigned a unique vector that informs the interrupt handler which NIC or
queue the interrupt is coming from. The columns represent the number of incoming interrupts for
every CPU core.
Software interrupt requests (SoftIRQs) clear the receive ring buffers of network adapters. The kernel
schedules SoftIRQ routines to run at a time when other tasks will not be interrupted. On Red Hat
Enterprise Linux, processes named ksoftirqd/cpu-number run these routines and call driver-specific
code functions.
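For example, a sketch of watching the network-related SoftIRQ counters, assuming the watch utility:
# watch -n1 'grep -E "NET_RX|NET_TX" /proc/softirqs'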
The command dynamically updates the output. Press Ctrl+C to interrupt the output.
Under normal operation, the kernel issues an initial hard interrupt, followed by a soft interrupt request
(SoftIRQ) handler that polls the network card using NAPI routines. To prevent SoftIRQs from
monopolizing a CPU core, the polling routine has a budget that determines the CPU time the SoftIRQ
can consume. On completion of the SoftIRQ poll routine, the kernel exits the routine and schedules it to
run again at a later time to repeat the process of receiving packets from the network card.
If irqbalance is not running, usually the CPU core 0 handles most of the interrupts. Even at moderate
load, this CPU core can become busy trying to handle the workload of all the hardware in the system. As
a consequence, interrupts or interrupt-based work can be missed or delayed. This can result in low
network and storage performance, packet loss, and potentially other issues.
IMPORTANT
On systems with only a single CPU core, the irqbalance service provides no benefit and exits on its own.
By default, the irqbalance service is enabled and running on Red Hat Enterprise Linux. To re-enable the
service if you disabled it, enter:
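A sketch of the command, assuming systemd manages the irqbalance service:
# systemctl enable --now irqbalance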
Additional resources
If softirqd processes could not retrieve all packets from interfaces in one NAPI polling cycle, it is an
indicator that the SoftIRQs do not have enough CPU time. This could be the case on hosts with fast
NICs, such as 10 Gbps and faster. If you increase the values of the net.core.netdev_budget and
net.core.netdev_budget_usecs kernel parameters, you can control the time and number of packets
softirqd can process in a polling cycle.
Procedure
# awk '{for (i=1; i<=NF; i++) printf strtonum("0x" $i) (i==NF?"\n":" ")}'
/proc/net/softnet_stat | column -t
221951548 0 0 0 0 0 0 0 0 0 0 0 0
192058677 0 20380 0 0 0 0 0 0 0 0 0 1
455324886 0 0 0 0 0 0 0 0 0 0 0 2
...
This awk command converts the values in /proc/net/softnet_stat from hexadecimal to decimal
format and displays them in the table format. Each line represents a CPU core starting with core
0.
Third column: The number of times the softirqd processes could not retrieve all packets from interfaces in one NAPI polling cycle.
2. If the counters in the third column of the /proc/net/softnet_stat file increment over time, tune
the system:
a. Create the /etc/sysctl.d/10-netdev_budget.conf file with, for example, the following content:
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 4000
b. Load the settings from the file:
# sysctl -p /etc/sysctl.d/10-netdev_budget.conf
Verification
# awk '{for (i=1; i<=NF; i++) printf strtonum("0x" $i) (i==NF?"\n":" ")}'
/proc/net/softnet_stat | column -t
For example, if the latency is higher when the server is idle than under heavy load, CPU power
management settings could influence the latency.
IMPORTANT
Disabling CPU power management features can cause a higher power consumption and
heat loss.
34.3.1. How the CPU power states influence the network latency
The consumption state (C-states) of CPUs optimize and reduce the power consumption of computers.
The C-states are numbered, starting at C0. In C0, the processor is fully powered and executing. In C1,
the processor is fully powered but not executing. The higher the number of the C-state, the more
components the CPU turns off.
Whenever a CPU core is idle, the built-in power saving logic steps in and attempts to move the core
from the current C-state to a higher one by turning off various processor components. If the CPU core
must process data, Red Hat Enterprise Linux (RHEL) sends an interrupt to the processor to wake up the
core and set its C-state back to C0.
Moving out of deep C-states back to C0 takes time due to turning power back on to various
components of the processor. On multi-core systems, it can also happen that many of the cores are
simultaneously idle and, therefore, in deeper C-states. If RHEL tries to wake them up at the same time,
the kernel can generate a large number of Inter-Processor Interrupts (IPIs) while all cores return from
deep C-states. Due to locking that is required while processing interrupts, the system can then stall for
some time while handling all the interrupts. This can result in large delays in the application response to
events.
The Idle Stats page in the PowerTOP application displays how much time the CPU cores spend in
each C-state.
Additional resources
intel_idle: This is the default driver on hosts with an Intel CPU and ignores the C-state settings
from the EFI firmware.
acpi_idle: RHEL uses this driver on hosts with CPUs from vendors other than Intel and if
intel_idle is disabled. By default, the acpi_idle driver uses the C-state settings from the EFI
firmware.
Additional resources
/usr/share/doc/kernel-doc-<version>/Documentation/admin-guide/pm/cpuidle.rst provided
by the kernel-doc package
Prerequisites
Procedure
# tuned-adm active
Current active profile: network-latency
# mkdir /etc/tuned/network-latency-custom/
Create the /etc/tuned/network-latency-custom/tuned.conf file with the following content:
[main]
include=network-latency
[cpu]
force_latency=cstate.id:1|2
This custom profile inherits all settings from the network-latency profile. The force_latency
TuneD parameter specifies the latency in microseconds (µs). If the C-state latency is higher
than the specified value, the idle driver in Red Hat Enterprise Linux prevents the CPU from
moving to a higher C-state. With force_latency=cstate.id:1|2, TuneD first checks if the
/sys/devices/system/cpu/cpu<number>/cpuidle/state<cstate.id>/ directory exists. In
this case, TuneD reads the latency value from the latency file in this directory. If the directory
does not exist, TuneD uses 2 microseconds as a fallback value.
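To activate the customized profile, a sketch assuming the tuned-adm command shown above:
# tuned-adm profile network-latency-custom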
Additional resources
Use this method to test whether the latency of applications on a host are being affected by C-states. To
not hard code a specific state, consider using a more dynamic solution. See Disabling C-states by using a
custom TuneD profile.
Prerequisites
The tuned service is not running or configured to not update C-state settings.
Procedure
# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
2. If the host uses the intel_idle driver, set the intel_idle.max_cstate kernel parameter to define
the highest C-state that CPU cores should be able to use:
Setting intel_idle.max_cstate=0 disables the intel_idle driver. Consequently, the kernel uses
the acpi_idle driver that uses the C-state values set in the EFI firmware. For this reason, also
set processor.max_cstate to override these C-state settings.
3. On every host, independent from the CPU vendor, set the highest C-state that CPU cores
should be able to use:
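A sketch of setting both parameters persistently, assuming the grubby tool and the values that appear in the verification below:
# grubby --update-kernel=ALL --args="intel_idle.max_cstate=0"
# grubby --update-kernel=ALL --args="processor.max_cstate=1"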
IMPORTANT
# reboot
Verification
# cat /sys/module/processor/parameters/max_cstate
1
2. If the host uses the intel_idle driver, display the maximum C-state:
# cat /sys/module/intel_idle/parameters/max_cstate
0
Additional resources
/usr/share/doc/kernel-doc-<version>/Documentation/admin-guide/pm/cpuidle.rst provided
by the kernel-doc package
Consider employing jumbo frames to save overhead if hosts on your network often send numerous
contiguous data streams, such as backup servers or file servers hosting numerous huge files. Jumbo
frames are non-standardized frames that have a larger Maximum Transmission Unit (MTU) than the
standard Ethernet payload size of 1500 bytes. For example, if you configure jumbo frames with the
maximum allowed MTU of 9000 bytes payload, the overhead of each frame reduces to 0.2%.
Depending on the network and services, it can be beneficial to enable jumbo frames only in specific
parts of a network, such as the storage backend of a cluster. This avoids packet fragmentation.
Prerequisites
All network devices on the transmission path must support jumbo frames and use the same Maximum
Transmission Unit (MTU) size. Otherwise, you can face the following problems:
Dropped packets.
Increased risk of packet loss caused by fragmentation. For example, if a router fragments a
single 9000-bytes frame into six 1500-bytes frames, and any of those 1500-byte frames are
lost, the whole frame is lost because it cannot be reassembled.
In the following diagram, all hosts in the three subnets must use the same MTU if a host from network A
sends a packet to a host in network C:
Higher throughput: Each frame contains more user data while the protocol overhead is fixed.
Lower CPU utilization: Jumbo frames cause fewer interrupts and, therefore, save CPU cycles.
Increased memory buffer usage: Larger frames can fill buffer queue memory more quickly.
Jumbo frames are network packets with a payload of between 1500 and 9000 bytes. All devices in the
same broadcast domain have to support those frames.
Prerequisites
You already configured a connection profile for the network with the divergent MTU.
Procedure
# ip link show
...
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP
mode DEFAULT group default qlen 1000
link/ether 52:54:00:74:79:56 brd ff:ff:ff:ff:ff:ff
...
3. Set the MTU in the profile that manages the connection to the network with the divergent MTU:
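A sketch, assuming a connection profile named Example-Connection:
# nmcli connection modify Example-Connection 802-3-ethernet.mtu 9000
# nmcli connection up Example-Connection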
Verification
# ip link show
...
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP
mode DEFAULT group default qlen 1000
link/ether 52:54:00:74:79:56 brd ff:ff:ff:ff:ff:ff
...
Calculate the value for the -s packet size option as follows: MTU size - 8 bytes ICMP header
- 20 bytes IPv4 header = packet size
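For example, a sketch that verifies a 9000-byte MTU path to a hypothetical remote host with fragmentation prohibited (9000 - 8 - 20 = 8972):
# ping -c 3 -M do -s 8972 remote-server.example.com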
NOTE
The throughput of applications depends on many factors, such as the buffer sizes that
the application uses. Therefore, the results measured with testing utilities, such as iperf3,
can be significantly different from those of applications on a server under production
workload.
Prerequisites
No other services on either host cause network traffic that substantially affects the test result.
For 40 Gbps and faster connections, the network card supports Accelerated Receive Flow
Steering (ARFS) and the feature is enabled on the interface.
Procedure
1. Optional: Display the maximum network speed of the network interface controller (NIC) on both
the server and client:
2. On the server:
a. Temporarily open the default iperf3 TCP port 5201 in the firewalld service:
# firewall-cmd --add-port=5201/tcp
# firewall-cmd --reload
# iperf3 --server
3. On the client:
--time <seconds>: Defines the time in seconds when the client stops the transmission.
Set this parameter to a value that you expect to work and increase it in later
measurements. If the server sends packets at a faster rate than the devices on the
transmit path or the client can process, packets can be dropped.
--zerocopy: Enables a zero copy method instead of using the write() system call. You
require this option only if you want to simulate a zero-copy-capable application or to
reach 40 Gbps and more on a single stream.
--client <server>: Enables the client mode and sets the IP address or name of the
server that runs the iperf3 server.
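Putting the listed options together, a sketch of the client command with placeholder values (a 60 second run and the documentation address 192.0.2.1); add --zerocopy only in the cases described above:
# iperf3 --time 60 --client 192.0.2.1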
4. Wait until iperf3 completes the test. Both the server and the client display statistics every
second and a summary at the end. For example, the following is a summary displayed on a client:
5. On the server:
# firewall-cmd --remove-port=5201/tcp
# firewall-cmd --reload
Additional resources
The read socket buffer holds packets that the kernel has received but which the application has
not read yet.
The write socket buffer holds packets that an application has written to the buffer but which the
kernel has not passed to the IP stack and network driver yet.
If a TCP packet is too large and exceeds the buffer size, or packets are sent or received at too fast a rate, the kernel drops any new incoming TCP packet until the data is removed from the buffer. In this
case, increasing the socket buffers can prevent packet loss.
Both the net.ipv4.tcp_rmem (read) and net.ipv4.tcp_wmem (write) socket buffer kernel settings
contain three values:
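For example, to display the current three-value settings:
# sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem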
The displayed values are in bytes and Red Hat Enterprise Linux uses them in the following way:
The first value is the minimum buffer size. New sockets cannot have a smaller size.
The second value is the default buffer size. If an application sets no buffer size, this is the
default value.
The third value is the maximum size of automatically tuned buffers. Using the setsockopt()
function with the SO_SNDBUF socket option in an application disables this maximum buffer
size.
Note that the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem parameters set the socket sizes for both
the IPv4 and IPv6 protocols.
IMPORTANT
Setting too large buffer sizes wastes memory. Each socket can be set to the size that the
application requests, and the kernel doubles this value. For example, if an application
requests a 256 KiB socket buffer size and opens 1 million sockets, the system can use up
to 512 GB RAM (512 KiB x 1 million) only for the potential socket buffer space.
Additionally, a too large value for the maximum buffer size can increase the latency.
Prerequisites
Procedure
1. Determine the latency of the connection. For example, ping from the client to server to measure
the average Round Trip Time (RTT):
# ping -c 10 server.example.com
...
--- server.example.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9014ms
rtt min/avg/max/mdev = 117.056/117.208/119.333/0.616 ms
2. Use the following formula to calculate the Bandwidth Delay Product (BDP) for the traffic you
want to tune:
connection speed in bytes per second * RTT in seconds = BDP in bytes
For example, to calculate the BDP for a 10 Gbps connection that has a 117 ms latency:
(10 * 1000 * 1000 * 1000 / 8) * 0.117 = 146,250,000 bytes
3. Create the /etc/sysctl.d/10-tcp-socket-buffers.conf file and either set the maximum read or
write buffer size, or both, based on your requirements:
Specify the values in bytes. Use the following rule of thumb when you try to identify optimized
values for your environment:
Default buffer size (second value): Increase this value only slightly or set it to 524288 (512
KiB) at most. A too high default buffer size can cause buffer collapsing and, consequently,
latency spikes.
Maximum buffer size (third value): A value double to triple of the BDP is often sufficient.
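A sketch of such a file, assuming the 10 Gbps / 117 ms example above and a maximum of roughly twice the BDP; these values are illustrative, not recommendations:
net.ipv4.tcp_rmem = 4096 262144 292500000
net.ipv4.tcp_wmem = 4096 262144 292500000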
# sysctl -p /etc/sysctl.d/10-tcp-socket-buffers.conf
5. Configure your applications to use a larger socket buffer size. The third value in the
net.ipv4.tcp_rmem and net.ipv4.tcp_wmem parameters defines the maximum buffer size that
the setsockopt() function in an application can request.
For further details, see the documentation of the programming language of your application. If
you are not the developer of the application, contact the developer.
Verification
2. Monitor the packet drop statistics using the same method that you used when you encountered
the packet drops.
If packet drops still occur but at a lower rate, increase the buffer sizes further.
Additional resources
For example, on a 1 Gbps connection with 1.5 ms Round Trip Time (RTT):
With TCP Window Scaling enabled, approximately 630 Mbps are realistic.
With TCP Window Scaling disabled, the throughput goes down to 380 Mbps.
One of the features TCP provides is flow control. With flow control, a sender can send as much data as
the receiver can receive, but no more. To achieve this, the receiver advertises a window value, which is
the amount of data a sender can send.
TCP originally supported window sizes up to 64 KiB, but at high Bandwidth Delay Products (BDP), this
value becomes a restriction because the sender cannot send more than 64 KiB at a time. High-speed
connections can transfer much more than 64 KiB of data at a given time. For example, a 10 Gbps link
with 1 ms of latency between systems can have more than 1 MiB of data in transit at a given time. It
would be inefficient if a host sends only 64 KiB, then pauses until the other host receives that 64 KiB.
To remove this bottleneck, the TCP Window Scaling extension allows the TCP window value to be
arithmetically shifted left to increase the window size beyond 64 KiB. For example, the largest window
value of 65535 shifted 7 places to the left, resulting in a window size of almost 8 MiB. This enables
transferring much more data at a given time.
TCP Window Scaling is negotiated during the three-way TCP handshake that opens every TCP
connection. Both sender and receiver must support TCP Window Scaling for the feature to work. If
either or both participants do not advertise window scaling ability in their handshake, the connection falls back to the original 16-bit window size, which limits the window to 64 KiB.
You can check whether TCP Window Scaling is enabled:
# sysctl net.ipv4.tcp_window_scaling
net.ipv4.tcp_window_scaling = 1
If TCP Window Scaling is disabled (0) on your server, revert the setting in the same way as you set it.
Additional resources
In TCP transmissions, the receiver sends an ACK packet to the sender for every packet it receives. For
example, a client sends the TCP packets 1-10 to the server but the packets number 5 and 6 get lost.
Without TCP SACK, the server drops packets 7-10, and the client must retransmit all packets from the
point of loss, which is inefficient. With TCP SACK enabled on both hosts, the client must re-transmit
only the lost packets 5 and 6.
IMPORTANT
Disabling TCP SACK decreases the performance and causes a higher packet drop rate on
the receiver side in a TCP connection.
# sysctl net.ipv4.tcp_sack
1
If TCP SACK is disabled (0) on your server, revert the setting in the same way as you set it.
Additional resources
With appropriate tuning, you can reach reliable communication over UDP with a throughput rate that is close to the maximum speed of the network interface controller (NIC).
Note that you can ignore a very small rate of dropped packets. However, if you encounter a significant
rate, consider tuning measures.
NOTE
The kernel drops network packets if the networking stack cannot handle the incoming
traffic.
Procedure
# ethtool -S enp1s0
NIC statistics:
...
rx_queue_0_drops: 17657
...
The naming of the statistics and if they are available depend on the NIC and the driver.
2. Identify UDP protocol-specific packet drops due to too small socket buffers or slow application
processing:
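For example, a sketch using the nstat utility, assuming the UDP error counters exported by the kernel (counter availability can vary):
# nstat -az UdpInErrors UdpRcvbufErrors UdpSndbufErrors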
Additional resources
NOTE
The throughput of applications depends on many factors, such as the buffer sizes that the
application uses. Therefore, the results measured with testing utilities, such as iperf3, can
significantly be different from those of applications on a server under production
workload.
Prerequisites
No other services on either host cause network traffic that substantially affects the test result.
Optional: You increased the maximum UDP socket sizes on both the server and the client. For
details, see Increasing the system-wide UDP socket buffers .
Procedure
1. Optional: Display the maximum network speed of the network interface controller (NIC) on both
the server and client:
2. On the server:
a. Display the maximum UDP socket read buffer size, and note the value:
# sysctl net.core.rmem_max
net.core.rmem_max = 16777216
b. Temporarily open the default iperf3 port 5201 in the firewalld service:
Note that iperf3 opens only a TCP socket on the server. If a client wants to use UDP, it
first connects to this TCP port, and then the server opens a UDP socket on the same port
number for performing the UDP traffic throughput test. For this reason, you must open port
5201 for both the TCP and UDP protocol in the local firewall.
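A sketch that opens the port for both protocols in the runtime firewall configuration:
# firewall-cmd --add-port=5201/tcp --add-port=5201/udp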
# iperf3 --server
3. On the client:
a. Display the Maximum Transmission Unit (MTU) of the interface that the client will use for
the connection to the server, and note the value:
b. Display the maximum UDP socket write buffer size, and note the value:
# sysctl net.core.wmem_max
net.core.wmem_max = 16777216
--time <seconds>: Defines the time in seconds when the client stops the transmission.
--window <size>: Sets the UDP socket buffer size. Ideally, the sizes are the same on
both the client and server. In case that they are different, set this parameter to the value
that is smaller: net.core.wmem_max on the client or net.core.rmem_max on the
server.
--length <size>: Sets the length of the buffer to read and write. Set this option to the
largest unfragmented payload. Calculate the ideal value as follows: MTU - IP header
(20 bytes for IPv4 and 40 bytes for IPv6) - 8 bytes UDP header.
--bitrate <rate>: Limits the bit rate to the specified value in bits per second. You can
specify units, such as 2G for 2 Gbps.
Set this parameter to a value that you expect to work and increase it in later
measurements. If the server sends packets at a faster rate than the devices on the
transmit path or the client can process them, packets can be dropped.
--client <server>: Enables the client mode and sets the IP address or name of the
server that runs the iperf3 server.
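Putting the options together, a hedged sketch of the full client command. The --udp option selects the UDP test; the values and the server address 192.0.2.1 are examples only:
# iperf3 --udp --time 60 --window 16M --length 1472 --bitrate 2G --client 192.0.2.1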
4. Wait until iperf3 completes the test. Both the server and the client display statistics every
second and a summary at the end. For example, the following is a summary displayed on a client:
In this example, the average bit rate was 2 Gbps, and no packets were lost.
5. On the server:
Additional resources
Jumbo frames are non-standardized frames that have a larger Maximum Transmission Unit (MTU) than
the standard Ethernet payload size of 1500 bytes. For example, if you configure jumbo frames with the
maximum allowed MTU of 9000 bytes payload, the overhead of each frame is reduced to about 0.2%.
IMPORTANT
All network devices on the transmission path and the involved broadcast domains must
support jumbo frames and use the same MTU. Packet fragmentation and reassembly due
to inconsistent MTU settings on the transmission path reduces the network throughput.
IP over InfiniBand (IPoIB) in datagram mode: The MTU is limited to 4 bytes less than the
InfiniBand MTU.
In-memory networking commonly supports larger MTUs. For details, see the respective
documentation.
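As a hedged sketch, you could set a larger MTU persistently in a NetworkManager connection profile; the profile name Example is a placeholder:
# nmcli connection modify Example 802-3-ethernet.mtu 9000
# nmcli connection up Example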
For example, on a tuned host with a high Maximum Transmission Unit (MTU) and large socket buffers, a
3 GHz CPU can process the traffic of a 10 GBit NIC that sends or receives UDP traffic at full speed.
However, you can expect about 1-2 Gbps speed loss for every 100 MHz CPU speed under 3 GHz when
you transmit UDP traffic. Also, if a CPU speed of 3 GHz can closely achieve 10 Gbps, the same CPU
restricts UDP traffic on a 40 GBit NIC to roughly 20-25 Gbps.
The read socket buffer holds packets that the kernel has received but which the application has
not read yet.
The write socket buffer holds packets that an application has written to the buffer but which the
kernel has not passed to the IP stack and network driver yet.
If a UDP packet is too large and exceeds the buffer size, or if packets are sent or received at too high a
rate, the kernel drops any new incoming UDP packets until the data is removed from the buffer. In this
case, increasing the socket buffers can prevent packet loss.
IMPORTANT
Setting too large buffer sizes wastes memory. Each socket can be set to the size that the
application requests, and the kernel doubles this value. For example, if an application
requests a 256 KiB socket buffer size and opens 1 million sockets, the system requires 512
GB RAM (512 KiB x 1 million) only for the potential socket buffer space.
Prerequisites
Procedure
1. Create the /etc/sysctl.d/10-udp-socket-buffers.conf file and either set the maximum read or
write buffer size, or both, based on your requirements:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
Specify the values in bytes. The values in this example set the maximum size of buffers to 16
MiB. The default values of both parameters are 212992 bytes (208 KiB).
# sysctl -p /etc/sysctl.d/10-udp-socket-buffers.conf
For further details, see the documentation of the programming language of your application. If
you are not the developer of the application, contact the developer.
Verification
Monitor the packet drop statistics using the same method as you used when you encountered
the packet drops.
If packet drops still occur but at a lower rate, increase the buffer sizes further.
Additional resources
If collapsing fails to free sufficient space for additional traffic, the kernel prunes new data that arrives.
This means that the kernel removes the data from the memory and the packet is lost.
To avoid collapsing and pruning operations, monitor whether TCP buffer collapsing and pruning
happens on your server and, in this case, tune the TCP buffers.
Procedure
1. Use the nstat utility to query the TcpExtTCPRcvCollapsed and TcpExtRcvPruned counters:
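A minimal sketch of this query; repeat it after a while so that you can compare the values:
# nstat -az TcpExtTCPRcvCollapsed TcpExtRcvPruned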
3. If the values of the counters have increased compared to the first run, tuning is required:
If the application uses the setsockopt(SO_RCVBUF) call, consider removing it. With this
call, the application only uses the receive buffer size specified in the call and turns off the
socket’s ability to auto-tune its size.
If the application does not use the setsockopt(SO_RCVBUF) call, tune the default and
maximum values of the TCP read socket buffer.
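For the second case, a hedged sketch of such tuning in a sysctl drop-in file. The net.ipv4.tcp_rmem parameter takes three values in bytes (minimum, default, maximum); the values and the file name are examples only:
# echo "net.ipv4.tcp_rmem = 4096 262144 16777216" > /etc/sysctl.d/10-tcp-socket-buffers.conf
# sysctl -p /etc/sysctl.d/10-tcp-socket-buffers.conf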
# ss -nti
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 192.0.2.1:443 192.0.2.125:41574
:7,7 ... lastrcv:543 ...
ESTAB 78 0 192.0.2.1:443 192.0.2.56:42612
:7,7 ... lastrcv:658 ...
ESTAB 88 0 192.0.2.1:443 192.0.2.97:40313
:7,7 ... lastrcv:5764 ...
...
5. Run the ss -nt command multiple times with a few seconds of waiting time between each run.
If the output lists only one case of a high value in the Recv-Q column, the application was
between two receive operations. However, if the value in Recv-Q stays constant while lastrcv
continually grows, or if Recv-Q continually increases over time, one of the following problems can
be the cause:
The application does not check its socket buffers often enough. Contact the application
vendor for details about how you can solve this problem.
The application does not get enough CPU time. To further debug this problem:
# ps -eo pid,tid,psr,pcpu,stat,wchan:20,comm
PID TID PSR %CPU STAT WCHAN COMMAND
...
44594 44594 5 0.0 Ss do_select httpd
44595 44595 3 0.0 S skb_wait_for_more_pa httpd
44596 44596 5 0.0 Sl pipe_read httpd
44597 44597 5 0.0 Sl pipe_read httpd
44602 44602 5 0.0 Sl pipe_read httpd
...
The PSR column displays the CPU cores the process is currently assigned to.
ii. Identify other processes running on the same cores and consider assigning them to
other cores.
Additional resources
34.8.1. Tuning the TCP listen backlog to process a high number of TCP connection
attempts
When an application opens a TCP socket in LISTEN state, the kernel limits the number of accepted
client connections this socket can handle. If clients try to establish more connections than the
application can process, the new connections get lost or the kernel sends SYN cookies to the client.
If the system is under normal workload and too many connections from legitimate clients cause the
kernel to send SYN cookies, tune Red Hat Enterprise Linux (RHEL) to avoid them.
Prerequisites
The high number of connection attempts are from valid sources and not caused by an attack.
Procedure
1. To verify whether tuning is required, display the statistics for the affected port:
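For example, a hedged sketch for a service listening on port 443; the port is a placeholder:
# ss -ltn 'sport = :443'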
If the current number of connections in the backlog (Recv-Q) is larger than the socket backlog
(Send-Q), the listen backlog is not large enough and tuning is required.
# sysctl net.core.somaxconn
net.core.somaxconn = 4096
3. Create the /etc/sysctl.d/10-socket-backlog-limit.conf file, and set a larger listen backlog limit:
net.core.somaxconn = 8192
Note that applications can request a larger listen backlog than specified in the
net.core.somaxconn kernel parameter but the kernel limits the application to the number you
set in this parameter.
# sysctl -p /etc/sysctl.d/10-socket-backlog-limit.conf
If the application provides a config option for the limit, update it. For example, the Apache
HTTP Server provides the ListenBacklog configuration option to set the listen backlog
limit for this service.
Verification
1. Monitor the Systemd journal for further occurrences of possible SYN flooding on port
<port_number> error messages.
2. Monitor the current number of connections in the backlog and compare it with the socket
backlog:
If the current number of connections in the backlog (Recv-Q) is larger than the socket backlog
(Send-Q), the listen backlog is not large enough and further tuning is required.
Additional resources
Listening TCP server ignores SYN or ACK for new connection handshake solution
Significant contention on the receive buffer, which can cause packet drops and higher CPU
usage.
With the SO_REUSEPORT or SO_REUSEPORT_BPF socket option, multiple sockets on one host can
bind to the same port:
Red Hat Enterprise Linux provides a code example of how to use the SO_REUSEPORT socket options
in the kernel sources. To access the code example:
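A hedged sketch of one way to obtain this file, assuming the debug repositories are enabled; the package name pattern is an assumption:
# yum install kernel-debuginfo-common-$(uname -m)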
Additional resources
/usr/src/debug/kernel-<version>/linux-
<version>/tools/testing/selftests/net/reuseport_bpf_cpu.c
Certain drivers, such as ixgbe, i40e, and mlx5, automatically configure XPS. To identify whether the
driver supports this capability, consult the documentation of your NIC driver. If the driver does not
support XPS auto-tuning, you can manually assign CPU cores to the transmit queues.
NOTE
Red Hat Enterprise Linux does not provide an option to permanently assign transmit
queues to CPU cores. Use the commands in a script and run it when the system boots.
Prerequisites
Procedure
# ethtool -l enp1s0
Channel parameters for enp1s0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 1
The Pre-set maximums section shows the total number of queues, and Current hardware
settings shows the number of queues that are currently assigned to the receive, transmit, other, or
combined queues.
2. Optional: If you require queues on specific channels, assign them accordingly. For example, to
assign the 4 queues to the Combined channel, enter:
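A sketch of the corresponding command, using the interface name from the example above:
# ethtool -L enp1s0 combined 4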
3. Display to which Non-Uniform Memory Access (NUMA) node the NIC is assigned:
# cat /sys/class/net/enp1s0/device/numa_node
0
If the file is not found or the command returns -1, the host is not a NUMA system.
4. If the host is a NUMA system, display which CPUs are assigned to which NUMA node:
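For example, a sketch using lscpu; the output shown is illustrative and matches the example that follows, where NUMA node 0 uses cores 0-3:
# lscpu | grep "NUMA node"
NUMA node(s):        1
NUMA node0 CPU(s):   0-3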
5. In the example above, the NIC has 4 queues and the NIC is assigned to NUMA node 0. This node
uses the CPU cores 0-3. Consequently, map each transmit queue to one of the CPU cores from
0-3:
If the number of CPU cores and transmit (TX) queues is the same, use a 1 to 1 mapping to avoid
any kind of contention on the TX queue. Otherwise, if you map multiple CPUs to the same TX
queue, transmit operations on different CPUs cause TX queue lock contention and
negatively impact the transmit throughput.
Note that you must pass the bitmap, containing the CPU’s core numbers, to the queues. Use
the following command to calculate the bitmap:
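A hedged sketch: for a single core, the bitmap is 1 shifted left by the core number, written in hexadecimal to the queue's xps_cpus file. The queue and core numbers are examples only:
# printf '%x\n' $((1 << 2))
4
# echo 4 > /sys/class/net/enp1s0/queues/tx-2/xps_cpus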
Verification
# pidof <process_name>
12345 98765
# tc -s qdisc
qdisc fq_codel 0: dev enp10s0u1 root refcnt 2 limit 10240p flows 1024 quantum 1514 target
5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
Sent 125728849 bytes 1067587 pkt (dropped 0, overlimits 0 requeues 30)
backlog 0b 0p requeues 30
...
If the requeues counter no longer increases at a significant rate, TX queue lock contention no
longer happens.
Additional resources
/usr/share/doc/kernel-doc-_<version>/Documentation/networking/scaling.rst
34.9.3. Disabling the Generic Receive Offload feature on servers with high UDP
traffic
Applications that use high-speed UDP bulk transfer should enable and use UDP Generic Receive
Offload (GRO) on the UDP socket. However, you can disable GRO to increase the throughput if the
following conditions apply:
The application does not support GRO and the feature cannot be added.
WARNING
Prerequisites
The host does not use UDP tunnel protocols, such as VXLAN.
Procedure
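A hedged sketch of disabling GRO in the NetworkManager connection profile. The profile name Example is a placeholder, and the availability of the ethtool.feature-gro property depends on your NetworkManager version:
# nmcli connection modify Example ethtool.feature-gro off
# nmcli connection up Example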
Verification
2. Monitor the throughput on the server. Re-enable GRO in the NetworkManager profile if the
setting has negative side effects to other applications on the host.
Additional resources
Certain kernel modules that provide NIC drivers support parameters to tune and optimize the device driver and the NIC. For example, if the driver
supports delaying the generation of receive interrupts, you can reduce the value of the corresponding
parameter to avoid running out of receive descriptors.
NOTE
Not all modules support custom parameters, and the features depend on the hardware,
as well as the driver and firmware version.
IMPORTANT
If you set parameters on a kernel module, RHEL applies these settings to all devices that
use this driver.
Prerequisites
The kernel module that provides the driver for the NIC supports the required tuning feature.
You are logged in locally or using a network interface that is different from the one that uses the
driver for which you want to change the parameters.
Procedure
# ethtool -i enp0s31f6
driver: e1000e
version: ...
firmware-version: ...
...
Note that certain features can require a specific driver and firmware version.
# modinfo -p e1000e
...
SmartPowerDownEnable:Enable PHY smart power down (array of int)
parm:RxIntDelay:Receive Interrupt Delay (array of int)
For further details on the parameters, see the kernel module’s documentation. For modules in
RHEL, see the documentation in the /usr/share/doc/kernel-
doc-<version>/Documentation/networking/device_drivers/ directory that is provided by the
kernel-doc package.
3. Create the /etc/modprobe.d/nic-parameters.conf file and specify the parameters for the
module:
For example, to enable the port power saving mechanism and set the generation of receive
interrupts to 4 units, enter:
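A sketch of the corresponding options line, using the parameter names from the modinfo output above:
options e1000e SmartPowerDownEnable=1 RxIntDelay=4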
# modprobe -r e1000e
WARNING
Unloading the kernel module immediately interrupts all network connections on interfaces that use this driver.
# modprobe e1000e
Verification
# dmesg
...
[35309.225765] e1000e 0000:00:1f.6: Transmit Interrupt Delay set to 16
[35309.225769] e1000e 0000:00:1f.6: PHY Smart Power Down Enabled
...
Note that not all modules log parameter settings to the kernel ring buffer.
2. Certain kernel modules create files for each module parameter in the
/sys/module/<driver>/parameters/ directory. Each of these files contains the current value of
the parameter. You can display these files to verify a setting:
# cat /sys/module/<driver_name>/parameters/<parameter_name>
Offload features move certain processing tasks from the CPU to the NIC, which reduces the
CPU load.
By default, most offloading features in Red Hat Enterprise Linux are enabled. Only disable them in the
following cases:
Permanently disable offload features when a specific feature negatively impacts your host.
If a performance-related offload feature is not enabled by default in a network driver, you can enable it
manually.
If you temporarily enable or disable an offload feature, it returns to its previous value on the next reboot.
Prerequisites
Procedure
1. Display the interface’s available offload features and their current state:
# ethtool -k enp1s0
...
esp-hw-offload: on
ntuple-filters: off
rx-vlan-filter: off [fixed]
...
The output depends on the capabilities of the hardware and its driver. Note that you cannot
change the state of features that are flagged with [fixed].
For example, to temporarily disable IPsec Encapsulating Security Payload (ESP) offload on
the enp10s0u1 interface, enter:
For example, to temporarily enable accelerated Receive Flow Steering (aRFS) filtering on
the enp10s0u1 interface, enter:
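Hedged sketches of both commands; ntuple is the ethtool keyword for the ntuple-filters feature shown in the output:
# ethtool -K enp10s0u1 esp-hw-offload off
# ethtool -K enp10s0u1 ntuple on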
Verification
# ethtool -k enp1s0
...
esp-hw-offload: off
ntuple-filters: on
...
2. Test whether the problem you encountered before changing the offload feature still exists.
If the problem no longer exists, consider permanently setting the offload feature until a fix is
available.
If the problem still exists:
i. Reset the setting to its previous state by using the ethtool -K <interface> <feature>
[on|off] command.
ii. Enable or disable a different offload feature to narrow down the problem.
Additional resources
If you permanently enable or disable an offload feature, NetworkManager ensures that the feature still
has this state after a reboot.
Prerequisites
You identified a specific offload feature that limits the performance on your host.
Procedure
1. Identify the connection profile that uses the network interface on which you want to change the
state of the offload feature:
For example, to permanently disable IPsec Encapsulating Security Payload (ESP) offload in
the Example connection profile, enter:
For example, to permanently enable accelerated Receive Flow Steering (aRFS) filtering in
the Example connection profile, enter:
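Hedged sketches using NetworkManager's ethtool offload properties. The exact property names, such as ethtool.feature-esp-hw-offload and ethtool.feature-ntuple, are assumptions that depend on your NetworkManager version:
# nmcli connection modify Example ethtool.feature-esp-hw-offload off
# nmcli connection modify Example ethtool.feature-ntuple on
# nmcli connection up Example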
Verification
# ethtool -k enp1s0
...
esp-hw-offload: off
ntuple-filters: on
...
Additional resources
Tuning the interrupt coalescence settings involves adjusting the parameters that control:
IMPORTANT
The optimal coalescence settings depend on the specific network conditions and
hardware in use. Therefore, it might take several attempts to find the settings that work
best for your environment and needs.
You can adjust the settings on your network card to increase or decrease the number of packets that
are combined into a single interrupt. As a result, you can achieve improved throughput or latency for
your traffic.
Procedure
# ethtool -S enp1s0
NIC statistics:
rx_packets: 1234
tx_packets: 5678
rx_bytes: 12345678
tx_bytes: 87654321
rx_errors: 0
tx_errors: 0
rx_missed: 0
tx_dropped: 0
coalesced_pkts: 0
coalesced_events: 0
coalesced_aborts: 0
Identify the packet counters containing "drop", "discard", or "error" in their name. These
particular statistics measure the actual packet loss at the network interface card (NIC) packet
buffer, which can be caused by NIC coalescence.
NOTE
When identifying a network bottleneck, also consider other important factors, for example CPU
usage, memory usage, and disk I/O.
# ethtool enp1s0
Settings for enp1s0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
MDI-X: Unknown
Supports Wake-on: g
Wake-on: g
Current message level: 0x00000033 (51)
drv probe link
Link detected: yes
In this output, monitor the Speed and Duplex fields. These fields display information about the
network interface operation and whether it is running at its expected values.
# ethtool -c enp1s0
Coalesce parameters for enp1s0:
Adaptive RX: off
Adaptive TX: off
RX usecs: 100
RX frames: 8
RX usecs irq: 100
RX frames irq: 8
TX usecs: 100
TX frames: 8
TX usecs irq: 100
TX frames irq: 8
The usecs values refer to the number of microseconds that the receiver or transmitter
waits before generating an interrupt.
The frames values refer to the number of frames that the receiver or transmitter waits
before generating an interrupt.
The irq values are used to configure the interrupt moderation when the network interface is
already handling an interrupt.
NOTE
Not all network interface cards support reporting and changing all values
from the example output.
The Adaptive RX/TX value represents the adaptive interrupt coalescence mechanism,
which adjusts the interrupt coalescence settings dynamically. Based on the packet
conditions, the NIC driver auto-calculates coalesce values when Adaptive RX/TX are
enabled (the algorithm differs for every NIC driver).
Users concerned with low latency (sub-50us) should not enable Adaptive-RX.
Users concerned with throughput can probably enable Adaptive-RX with no harm. If
they do not want to use the adaptive interrupt coalescence mechanism, they can try
setting large values, such as 100 or 250 microseconds, for ethtool.coalesce-rx-usecs, as
shown in the sketch after this list.
Users unsure about their needs should not modify this setting until an issue occurs.
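A hedged sketch of such a manual setting, using ethtool directly for a runtime-only change; the interface name and the value are examples only:
# ethtool -C enp1s0 adaptive-rx off rx-usecs 100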
Verification steps
# ethtool -S enp1s0
NIC statistics:
rx_packets: 1234
tx_packets: 5678
rx_bytes: 12345678
tx_bytes: 87654321
rx_errors: 0
tx_errors: 0
rx_missed: 0
tx_dropped: 0
coalesced_pkts: 12
coalesced_events: 34
coalesced_aborts: 56
...
The value of the rx_errors, rx_dropped, tx_errors, and tx_dropped fields should be 0 or close
to it (up to a few hundred, depending on the network traffic and system resources). A high value
in these fields indicates a network problem. Your counters can have different names. Closely
monitor packet counters containing "drop", "discard", or "error" in their name.
The value of the rx_packets, tx_packets, rx_bytes, and tx_bytes should increase over time. If
the values do not increase, there might be a network problem. The packet counters can have
different names, depending on your NIC driver.
IMPORTANT
The ethtool command output can vary depending on the NIC and driver in use.
Users with focus on extremely low latency can use application-level metrics or the kernel packet
time-stamping API for their monitoring purposes.
Additional resources
Timestamping
Additionally, TCP Timestamps provide an alternative method to determine the age and order of a
segment, and protect against wrapped sequence numbers. TCP packet headers record the sequence
number in a 32-bit field. On a 10 Gbps connection, the value of this field can wrap after 1.7 seconds.
Without TCP Timestamps, the receiver could not determine whether a segment with a wrapped
sequence number is a new segment or an old duplicate. With TCP Timestamps, however, the receiver
can make the correct choice to receive or discard the segment. Therefore, enabling TCP Timestamps on
systems with fast network interfaces is essential.
The net.ipv4.tcp_timestamps kernel parameter can have one of the following values:
0: TCP Timestamps are disabled.
1: TCP Timestamps are enabled (default).
2: TCP Timestamps are enabled, but without random offsets.
IMPORTANT
By default, TCP Timestamps are enabled in Red Hat Enterprise Linux and use random offsets for each
connection instead of only storing the current time:
# sysctl net.ipv4.tcp_timestamps
net.ipv4.tcp_timestamps = 1
If the net.ipv4.tcp_timestamps parameter has a different value than the default ( 1), revert the setting
in the same way as you set it.
Additional resources
The flow control mechanism manages data transmission across the Ethernet link where each sender and
receiver has different sending and receiving capacities. To avoid packet loss, the Ethernet flow control
mechanism temporarily suspends the packet transmission to manage a higher transmission rate from a
switch port. Note that routers do not forward pause frames beyond a switch port.
When receive (RX) buffers become full, a receiver sends pause frames to the transmitter. The
transmitter then stops data transmission for a short sub-second time frame, while continuing to buffer
incoming data during this pause period. This duration provides enough time for the receiver to empty its
interface buffers and prevent buffer overflow.
NOTE
Either end of the Ethernet link can send pause frames to the other end. If the receive
buffers of a network interface are full, the network interface sends pause frames to the
switch port. Similarly, when the receive buffers of a switch port are full, the switch port
sends pause frames to the network interface.
By default, most of the network drivers in Red Hat Enterprise Linux have pause frame support enabled.
To display the current settings of a network interface, enter:
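For example (the interface name is a placeholder; the output shows whether pause frame autonegotiation, RX, and TX are enabled):
# ethtool -a enp1s0
Pause parameters for enp1s0:
Autonegotiate:  on
RX:             on
TX:             on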
Verify with your switch vendor to confirm if your switch supports pause frames.
Additional resources
What is network link flow control and how does it work in Red Hat Enterprise Linux?
CHAPTER 35. CONFIGURING AN OPERATING SYSTEM TO OPTIMIZE MEMORY ACCESS
The vmstat tool, provided by the procps-ng package, displays reports of a system's processes,
memory, paging, block I/O, traps, disks, and CPU activity. It provides an instantaneous report of
the average of these events since the machine was last turned on, or since the previous report.
The valgrind framework provides instrumentation to user-space binaries. Install this tool using the
yum install valgrind command. It includes a number of tools that you can use to profile and
analyze program performance, such as:
The memcheck option is the default valgrind tool. It detects and reports a number of
memory errors that can be difficult to detect and diagnose, such as:
Pointer overlap
Memory leaks
NOTE
Memcheck can only report these errors, it cannot prevent them from
occurring. However, memcheck logs an error message immediately
before the error occurs.
The cachegrind option simulates application interaction with a system's cache hierarchy and
branch predictor. It gathers statistics for the duration of the application's execution and outputs
a summary to the console.
The massif option measures the heap space used by a specified application. It measures both
useful space and any additional space allocated for bookkeeping and alignment purposes.
Additional resources
/usr/share/doc/valgrind-version/valgrind_manual.pdf file
The Linux Kernel is designed to maximize the utilization of a system’s memory resources (RAM). Due to
these design characteristics, and depending on the memory requirements of the workload, part of the
system’s memory is in use within the kernel on behalf of the workload, while a small part of the memory is
free. This free memory is reserved for special system allocations, and for other low or high priority
system services.
The rest of the system’s memory is dedicated to the workload itself, and divided into the following two
categories:
File memory
Pages added in this category represent parts of files in permanent storage. These pages, from the
page cache, can be mapped or unmapped in an application’s address spaces. You can use
applications to map files into their address space using the mmap system calls, or to operate on files
via the buffered I/O read or write system calls.
Buffered I/O system calls, as well as applications that map pages directly, can re-utilize unmapped
pages. As a result, these pages are stored in the cache by the kernel, especially when the system is
not running any memory intensive tasks, to avoid re-issuing costly I/O operations over the same set
of pages.
Anonymous memory
Pages in this category are dynamically allocated by a process or are not related to files in
permanent storage. This set of pages backs the in-memory control structures of each task, such as
the application stack and heap areas.
vm.dirty_ratio
Is a percentage value. When this percentage of the total system memory is modified, the system
begins writing the modifications to the disk with the pdflush operation. The default value is 20
percent.
vm.dirty_background_ratio
A percentage value. When this percentage of total system memory is modified, the system begins
writing the modifications to the disk in the background. The default value is 10 percent.
vm.overcommit_memory
Defines the conditions that determine whether a large memory request is accepted or denied. The
default value is 0.
By default, the kernel checks whether a virtual memory allocation request fits into the present
amount of memory (total + swap) and rejects only large requests. Otherwise, virtual memory
allocations are granted, which means they allow memory overcommitment.
When this parameter is set to 1, the kernel performs no memory overcommit handling. This
increases the possibility of memory overload, but improves performance for memory-
intensive tasks.
When this parameter is set to 2, the kernel denies requests for memory equal to or larger
than the sum of the total available swap space and the percentage of physical RAM specified
in the overcommit_ratio. This reduces the risk of overcommitting memory, but is
recommended only for systems with swap areas larger than their physical memory.
vm.overcommit_ratio
Specifies the percentage of physical RAM considered when overcommit_memory is set to 2. The
default value is 50.
vm.max_map_count
Defines the maximum number of memory map areas that a process can use. The default value is
65530. Increase this value if your application needs more memory map areas.
vm.min_free_kbytes
Sets the size of the reserved free pages pool. It is also responsible for setting the min_page,
low_page, and high_page thresholds that govern the behavior of the Linux kernel’s page reclaim
algorithms. It also specifies the minimum number of kilobytes to keep free across the system. This
calculates a specific value for each low memory zone, each of which is assigned a number of reserved
free pages in proportion to their size.
Setting the vm.min_free_kbytes parameter’s value:
Increasing the parameter value effectively reduces the application working set usable
memory. Therefore, you might want to use it for only kernel-driven workloads, where driver
buffers need to be allocated in atomic contexts.
Decreasing the parameter value might render the kernel unable to service system requests,
if memory becomes heavily contended in the system.
WARNING
The vm.min_free_kbytes parameter also sets a page reclaim watermark, called min_pages.
This watermark is used as a factor when determining the two other memory watermarks,
low_pages and high_pages, that govern page reclaim algorithms.
/proc/PID/oom_adj
In the event that a system runs out of memory, and the panic_on_oom parameter is set to 0, the
oom_killer function kills processes, starting with the process that has the highest oom_score, until
the system recovers.
The oom_adj parameter determines the oom_score of a process. This parameter is set per process
identifier. A value of -17 disables the oom_killer for that process. Other valid values range from -16
to 15.
NOTE
vm.swappiness
The swappiness value, ranging from 0 to 200, controls the degree to which the system favors
reclaiming memory from the anonymous memory pool, or the page cache memory pool.
Setting the swappiness parameter’s value:
Higher values favor file-mapped workloads and swap out the less actively accessed
anonymous memory of processes from RAM. This is useful for file servers or
streaming applications that depend on data from files in the storage residing in memory to
reduce I/O latency for the service requests.
Lower values favor anonymous-mapped workloads and instead reclaim the page cache (file-
mapped memory). This setting is useful for applications that do not depend heavily on the
file system information and heavily utilize dynamically allocated and private memory, such as
mathematical and number-crunching applications, and some hardware virtualization
hypervisors, such as QEMU.
The default value of the vm.swappiness parameter is 60.
WARNING
force_cgroup_v2_swappiness
This control is used to deprecate the per-cgroup swappiness value that is available only in cgroups v1.
Most system and user processes run within a cgroup, and cgroup swappiness values default to 60.
This can lead to situations where the system-wide swappiness value has little effect on the swap
behavior of the system. If you do not care about the per-cgroup swappiness feature, configure
the system with force_cgroup_v2_swappiness=1 to get more consistent swappiness behavior
across the whole system.
Additional resources
aio-max-nr
Defines the maximum allowed number of events in all active asynchronous input/output contexts.
The default value is 65536, and modifying this value does not pre-allocate or resize any kernel data
structures.
file-max
Determines the maximum number of file handles for the entire system. The default value on Red Hat
Enterprise Linux 8 is either 8192 or one tenth of the free memory pages available at the time the
kernel starts, whichever is higher.
Raising this value can resolve errors caused by a lack of available file handles.
Additional resources
The following are the available kernel parameters used to set up limits for the msg* and shm* System V
IPC (sysvipc) system calls:
msgmax
Defines the maximum allowed size in bytes of any single message in a message queue. This value
must not exceed the size of the queue (msgmnb). Use the sysctl msgmax command to determine
the current msgmax value on your system.
msgmnb
Defines the maximum size in bytes of a single message queue. Use the sysctl msgmnb command to
determine the current msgmnb value on your system.
msgmni
Defines the maximum number of message queue identifiers, and therefore the maximum number of
queues. Use the sysctl msgmni command to determine the current msgmni value on your system.
shmall
Defines the total amount of shared memory pages that can be used on the system at one time. For
example, a page is 4096 bytes on the AMD64 and Intel 64 architecture. Use the sysctl shmall
command to determine the current shmall value on your system.
shmmax
Defines the maximum size in bytes of a single shared memory segment allowed by the kernel. Shared
memory segments up to 1Gb are now supported in the kernel. Use the sysctl shmmax command to
determine the current shmmax value on your system.
shmmni
Defines the system-wide maximum number of shared memory segments. The default value is 4096
on all systems.
Additional resources
This procedure describes how to set a memory-related kernel parameter temporarily and persistently.
Procedure
To temporarily set the memory-related kernel parameters, write to the respective files in the /proc
file system or use the sysctl utility.
For example, to temporarily set the vm.overcommit_memory parameter to 1:
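A minimal sketch, using either form:
# sysctl -w vm.overcommit_memory=1
or
# echo 1 > /proc/sys/vm/overcommit_memory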
To persistently set the memory-related kernel parameter, edit the /etc/sysctl.conf file and
reload the settings.
For example, to persistently set the vm.overcommit_memory parameter to 1:
vm.overcommit_memory=1
# sysctl -p
Additional resources
However, specific applications can benefit from using larger page sizes in certain cases. For example, an
application that works with a large and relatively fixed data set of hundreds of megabytes or even
dozens of gigabytes can have performance issues when using 4 KB pages. Such data sets can require a
huge amount of 4 KB pages, which can lead to overhead in the operating system and the CPU.
This section provides information about huge pages available in RHEL 8 and how you can configure
them.
The following are the huge page methods, which are supported in RHEL 8:
HugeTLB pages
HugeTLB pages are also called static huge pages. There are two ways of reserving HugeTLB pages:
At boot time: It increases the possibility of success because the memory has not yet been
significantly fragmented. However, on NUMA machines, the number of pages is automatically
split among the NUMA nodes.
For more information about parameters that influence HugeTLB page behavior at boot time, see
Parameters for reserving HugeTLB pages at boot time. For details about how to use these parameters
to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time. A minimal kernel
command-line sketch follows this list.
At run time: It allows you to reserve the huge pages per NUMA node. If the run-time reservation
is done as early as possible in the boot process, the probability of memory fragmentation is
lower.
For more information about parameters that influence HugeTLB page behavior at run time, see
Parameters for reserving HugeTLB pages at run time. For details about how to use these parameters
to configure HugeTLB pages at run time, see Configuring HugeTLB at run time.
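For the boot-time reservation mentioned above, a hedged sketch of adding the relevant kernel command-line options with grubby; the 1 GB page size and the page count of 2 are examples only:
# grubby --update-kernel=ALL --args="default_hugepagesz=1G hugepagesz=1G hugepages=2"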
system-wide: Here, the kernel tries to assign huge pages to a process whenever it is possible
to allocate the huge pages and the process is using a large contiguous virtual memory area.
per-process: Here, the kernel only assigns huge pages to the memory areas of individual
processes which you can specify using the madvise() system call.
NOTE
For more information about enabling or disabling transparent huge pages, see Enabling transparent
hugepages and Disabling transparent hugepages.
For more information about how to use these parameters to configure HugeTLB pages at boot time, see
Configuring HugeTLB at boot time.
Procedure
[Unit]
Description=HugeTLB Gigantic Pages Reservation
DefaultDependencies=no
Before=dev-hugepages.mount
ConditionPathExists=/sys/devices/system/node
ConditionKernelCommandLine=hugepagesz=1G
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/hugetlb-reserve-pages.sh
[Install]
WantedBy=sysinit.target
3. Create a new file called hugetlb-reserve-pages.sh in the /usr/lib/systemd/ directory and add
the following content:
While adding the following content, replace number_of_pages with the number of 1GB pages
you want to reserve, and node with the name of the node on which to reserve these pages.
#!/bin/sh
nodes_path=/sys/devices/system/node/
if [ ! -d $nodes_path ]; then
echo "ERROR: $nodes_path does not exist"
exit 1
fi
reserve_pages()
{
echo $1 > $nodes_path/$2/hugepages/hugepages-1048576kB/nr_hugepages
}
reserve_pages number_of_pages node
For example, to reserve two 1 GB pages on node0 and one 1GB page on node1, replace the
number_of_pages with 2 for node0 and 1 for node1:
reserve_pages 2 node0
reserve_pages 1 node1
# chmod +x /usr/lib/systemd/hugetlb-reserve-pages.sh
NOTE
Reserving static huge pages can effectively reduce the amount of memory
available to the system, and prevents it from properly utilizing its full memory
capacity. Although a properly sized pool of reserved huge pages can be
beneficial to applications that utilize it, an oversized or unused pool of reserved
huge pages will eventually be detrimental to overall system performance. When
setting a reserved huge page pool, ensure that the system can properly utilize its
full memory capacity.
Additional resources
/usr/share/doc/kernel-doc-kernel_version/Documentation/vm/hugetlbpage.txt file
For more information about how to use these parameters to configure HugeTLB pages at run time, see
Configuring HugeTLB at run time .
node2 with the node on which you wish to reserve the pages.
Procedure
Verification steps
Additional resources
Procedure
# cat /sys/kernel/mm/transparent_hugepage/enabled
2. Enable THP:
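For example, for a runtime-only, system-wide setting:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled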
3. To prevent applications from allocating more memory resources than necessary, disable the
system-wide transparent huge pages and only enable them for the applications that explicitly
request it through the madvise:
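A sketch of that setting:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled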
NOTE
Sometimes, providing low latency to short-lived allocations has higher priority than
immediately achieving the best performance with long-lived allocations. In such cases,
you can disable direct compaction while leaving THP enabled.
Direct compaction is a synchronous memory compaction during the huge page allocation.
Disabling direct compaction provides no guarantee of saving memory, but can decrease
the risk of higher latencies during frequent page faults. Note that if the workload benefits
significantly from THP, the performance decreases. Disable direct compaction:
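A hedged sketch; with madvise, synchronous compaction is attempted only for madvise regions, and defer is another possible value depending on your needs:
# echo madvise > /sys/kernel/mm/transparent_hugepage/defrag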
Additional resources
Procedure
# cat /sys/kernel/mm/transparent_hugepage/enabled
2. Disable THP:
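For example, for a runtime-only setting:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled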
If a requested address mapping is not in the TLB, called a TLB miss, the system still needs to read the
page table to determine the virtual-to-physical address mapping. Because of the relationship between
application memory requirements and the size of pages used to cache address mappings, applications
with large memory requirements are more likely to suffer performance degradation from TLB misses
than applications with minimal memory requirements. It is therefore important to avoid TLB misses
wherever possible.
Both HugeTLB and Transparent Huge Page features allow applications to use pages larger than 4 KB.
This allows addresses stored in the TLB to reference more memory, which reduces TLB misses and
improves application performance.
CHAPTER 37. GETTING STARTED WITH SYSTEMTAP
As an application developer, you can use SystemTap to monitor in fine detail how your application
behaves within the Linux system.
SystemTap aims to supplement the existing suite of Linux monitoring tools by providing users with the
infrastructure to track kernel activity, and it combines this capability with two attributes:
Flexibility
the SystemTap framework enables you to develop simple scripts for investigating and monitoring a
wide variety of kernel functions, system calls, and other events that occur in kernel space. With this,
SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific
forensic and monitoring tools.
Ease-of-Use
SystemTap enables you to monitor kernel activity without having to recompile the kernel or reboot
the system.
Prerequisites
You have enabled debug repositories as described in Enabling debug and source repositories.
Procedure
a. Using stap-prep:
# stap-prep
b. If stap-prep does not work, install the required kernel packages manually:
$(uname -i) is automatically replaced with the hardware platform of your system and
$(uname -r) is automatically replaced with the version of your running kernel.
Verification steps
If the kernel to be probed with SystemTap is currently in use, test if your installation was
successful:
The last three lines of output (beginning with Pass 5) indicate that:
1 SystemTap successfully created the instrumentation to probe the kernel and ran the
instrumentation.
2 SystemTap detected the specified event (in this case, A VFS read).
3 SystemTap executed a valid handler (printed text and then closed it with no errors).
To allow users to run SystemTap without root access, add users to both of these user groups:
stapdev
Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap
instrumentation modules.
Running stap involves compiling SystemTap scripts into kernel modules and loading them into the
kernel. This requires elevated privileges to the system, which are granted to stapdev members.
Unfortunately, such privileges also grant effective root access to stapdev members. As such, only
grant stapdev group membership to users who can be trusted with root access.
stapusr
Members of this group can only use staprun to run SystemTap instrumentation modules. In addition,
they can only run those modules from the /lib/modules/kernel_version/systemtap/ directory. This
directory must be owned only by the root user, and must only be writable by the root user.
Sample scripts that are distributed with the installation of SystemTap can be found in the
/usr/share/systemtap/examples directory.
Prerequisites
1. SystemTap and the associated required kernel packages are installed as described in Installing
Systemtap.
2. To run SystemTap scripts as a normal user, add the user to the SystemTap groups:
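A sketch of the command; the user name is a placeholder:
# usermod --append --groups stapdev,stapusr user_name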
Procedure
This command instructs stap to run the script passed by echo to standard input. To add
stap options, insert them before the - character. For example, to make the results from this
command more verbose, the command is:
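A hedged sketch of such a one-liner; the probe shown, which simply exits after one second, is an example only:
# echo "probe timer.s(1) {exit()}" | stap -v -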
From a file:
# stap file_name
Normally, SystemTap scripts can run only on systems where SystemTap is deployed. To run SystemTap
on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this might be
neither feasible nor desired. For example, corporate policy might prohibit you from installing packages
that provide compilers or debug information about specific machines, which will prevent the deployment
of SystemTap.
The kernel information packages for various machines can be installed on a single host machine.
IMPORTANT
Kernel packaging bugs may prevent the installation. In such cases, the kernel-
debuginfo and kernel-devel packages for the host system and target system
must match. If a bug occurs, report the bug at https://bugzilla.redhat.com/.
Each target machine needs only one package to be installed to use the generated SystemTap
instrumentation module: systemtap-runtime.
IMPORTANT
The host system must be the same architecture and running the same
distribution of Linux as the target system in order for the built instrumentation
module to work.
TERMINOLOGY
instrumentation module
The kernel module built from a SystemTap script; the SystemTap module is built on
the host system, and will be loaded on the target kernel of the target system.
host system
The system on which the instrumentation modules (from SystemTap scripts) are
compiled, to be loaded on target systems.
target system
The system for which the instrumentation module is built (from SystemTap
scripts) and on which it is loaded and run.
target kernel
The kernel of the target system. This is the kernel that loads and runs the
instrumentation module.
Prerequisites
Both the host system and target system are the same architecture.
Both the host system and target system are running the same major version of Red Hat
Enterprise Linux (such as Red Hat Enterprise Linux 8); they can be running different minor
versions (such as 8.1 and 8.2).
IMPORTANT
Procedure
1. Determine the version of the target kernel on each target system:
$ uname -r
2. On the host system, install the target kernel and related packages for each target system by the
method described in Installing Systemtap.
3. Build an instrumentation module on the host system, and copy this module to and run it on
the target system, either:
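a. Remotely, for example (a hedged sketch; the user, host, and script names are placeholders):
# stap --remote=user@target.example.com example.stp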
This command remotely implements the specified script on the target system. You must
ensure an SSH connection can be made to the target system from the host system for this
to be successful.
b. Manually:
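i. A hedged sketch of building the instrumentation module on the host system; the kernel version, script, and module names are placeholders:
# stap -r kernel_version example.stp -m module_name -p4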
Here, kernel_version refers to the version of the target kernel determined in step 1,
script refers to the script to be converted into an instrumentation module, and
module_name is the desired name of the instrumentation module. The -p4 option tells
SystemTap to not load and run the compiled module.
ii. Once the instrumentation module is compiled, copy it to the target system and load it
using the following command:
# staprun module_name.ko
CHAPTER 39. MONITORING NETWORK ACTIVITY WITH SYSTEMTAP
PID
The ID of the listed process.
UID
User ID. A user ID of 0 refers to the root user.
DEV
Which ethernet device the process used to send or receive data (for example, eth0, eth1).
XMIT_PK
The number of packets transmitted by the process.
RECV_PK
The number of packets received by the process.
XMIT_KB
The amount of data sent by the process, in kilobytes.
RECV_KB
The amount of data received by the service, in kilobytes.
Prerequisites
Procedure
[...]
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
0 0 eth0 0 5 0 0 swapper
11178 0 eth0 2 0 0 0 synergyc
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
2886 4 eth0 79 0 5 0 cups-polld
11362 0 eth0 0 61 0 5 firefox
0 0 eth0 3 32 0 3 swapper
2886 4 lo 4 4 0 0 cups-polld
11178 0 eth0 3 0 0 0 synergyc
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
0 0 eth0 0 6 0 0 swapper
2886 4 lo 2 2 0 0 cups-polld
11178 0 eth0 3 0 0 0 synergyc
3611 0 eth0 0 1 0 0 Xorg
PID UID DEV XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
0 0 eth0 3 42 0 2 swapper
11178 0 eth0 43 1 3 0 synergyc
11362 0 eth0 0 7 0 0 firefox
3897 0 eth0 0 1 0 0 multiload-apple
Prerequisites
Procedure
A 3-second excerpt of the output of the socket-trace.stp script looks similar to the following:
[...]
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 Xorg(3611): -> sock_poll
3 Xorg(3611): <- sock_poll
0 gnome-terminal(11106): -> sock_poll
5 gnome-terminal(11106): <- sock_poll
0 scim-bridge(3883): -> sock_poll
3 scim-bridge(3883): <- sock_poll
0 scim-bridge(3883): -> sys_socketcall
4 scim-bridge(3883): -> sys_recv
8 scim-bridge(3883): -> sys_recvfrom
12 scim-bridge(3883):-> sock_from_file
16 scim-bridge(3883):<- sock_from_file
20 scim-bridge(3883):-> sock_recvmsg
24 scim-bridge(3883):<- sock_recvmsg
28 scim-bridge(3883): <- sys_recvfrom
31 scim-bridge(3883): <- sys_recv
35 scim-bridge(3883): <- sys_socketcall
[...]
The dropwatch.stp SystemTap script uses kernel.trace("kfree_skb") to trace packet discards; the
script summarizes which locations discard packets in every 5-second interval.
Prerequisites
Procedure
Running the dropwatch.stp script for 15 seconds results in output similar to the following:
NOTE
[...]
ffffffff8024c5cd T unlock_new_inode
ffffffff8024c5da t unix_stream_sendmsg
ffffffff8024c920 t unix_stream_recvmsg
ffffffff8024cea1 t udp_v4_lookup_longway
[...]
ffffffff8044addc t arp_process
ffffffff8044b360 t arp_rcv
ffffffff8044b487 t parp_redo
ffffffff8044b48c t arp_solicit
[...]
Prerequisites
Procedure
This script takes the targeted kernel function as an argument. You can use wildcards in the
argument to target multiple kernel functions to a certain extent.
The output of the script, in alphabetical order, contains the names of the functions called and
how many times each was called during the sample time.
where:
-w : Suppresses warnings.
-c command : Tells SystemTap to count function calls during the execution of a command, in
this example being /bin/true.
The output should look similar to the following:
[...]
__vma_link 97
__vma_link_file 66
__vma_link_list 97
__vma_link_rb 97
__xchg 103
add_page_to_active_list 102
add_page_to_inactive_list 19
add_to_page_cache 19
add_to_page_cache_lru 7
all_vm_events 6
alloc_pages_node 4630
alloc_slabmgmt 67
anon_vma_alloc 62
anon_vma_free 62
anon_vma_lock 66
anon_vma_prepare 98
anon_vma_unlink 97
anon_vma_unlock 66
arch_get_unmapped_area_topdown 94
arch_get_unmapped_exec_area 3
arch_unmap_area_topdown 97
atomic_add 2
atomic_add_negative 97
atomic_dec_and_test 5153
atomic_inc 470
atomic_inc_and_test 1
[...]
Prerequisites
Procedure
2. An optional trigger function, which enables or disables tracing on a per-thread basis. Tracing in
each thread will continue as long as the trigger function has not exited yet.
where:
-w : Suppresses warnings.
-c command : Tells SystemTap to count function calls during the execution of a command, in
this example being /bin/true.
[...]
Prerequisites
Procedure
This script will display the top 20 processes taking up CPU time during a 5-second period, along
with the total number of CPU ticks made during the sample. The output of this script also notes
the percentage of CPU time each process used, as well as whether that time was spent in kernel
space or user space.
Prerequisites
Procedure
This script will track how many times each application uses the following system calls over time:
poll
select
epoll
itimer
futex
nanosleep
signal
In this example output you can see which process used which system call and how many times.
Prerequisites
Procedure
--------------------------------------------------------------
SYSCALL COUNT
gettimeofday 1857
read 1821
ioctl 1568
poll 1033
close 638
open 503
select 455
write 391
writev 335
futex 303
recvmsg 251
socket 137
clock_gettime 124
rt_sigprocmask 121
sendto 120
setitimer 106
stat 90
time 81
sigreturn 72
fstat 66
--------------------------------------------------------------
Prerequisites
Procedure
Prerequisites
Procedure
The script displays the top ten processes responsible for the heaviest reads or writes to a disk.
UID
User ID. A user ID of 0 refers to the root user.
PID
The ID of the listed process.
PPID
The process ID of the listed process’s parent process.
CMD
The name of the listed process.
DEVICE
Which storage device the listed process is reading from or writing to.
T
The type of action performed by the listed process, where W refers to write, and R refers to
read.
BYTES
The amount of data read to or written from disk.
[...]
Mon Sep 29 03:38:28 2008 , Average: 19Kb/sec, Read: 7Kb, Write: 89Kb
UID PID PPID CMD DEVICE T BYTES
0 26319 26294 firefox sda5 W 90229
0 2758 2757 pam_timestamp_c sda5 R 8064
0 2885 1 cupsd sda5 W 1678
Mon Sep 29 03:38:38 2008 , Average: 1Kb/sec, Read: 7Kb, Write: 1Kb
41.2. TRACKING I/O TIME FOR EACH FILE READ OR WRITE WITH
SYSTEMTAP
You can use the iotime.stp SystemTap script to monitor the amount of time it takes for each process to
read from or write to any file. This helps you to determine what files are slow to load on a system.
Prerequisites
Procedure
The script tracks each time a system call opens, closes, reads from, or writes to a file. For each
file any system call accesses, it counts the number of microseconds it takes for any reads or
writes to finish and tracks the amount of data, in bytes, read from or written to the file.
A timestamp, in microseconds
[...]
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
[...]
You can use the traceio.stp SystemTap script to track the cumulative amount of I/O to the system.
Prerequisites
Procedure
The script prints the top ten executables generating I/O traffic over time. It also tracks the
cumulative amount of I/O reads and writes done by those executables. This information is
tracked and printed out in 1-second intervals, and in descending order.
[...]
Xorg r: 583401 KiB w: 0 KiB
floaters r: 96 KiB w: 7130 KiB
multiload-apple r: 538 KiB w: 537 KiB
sshd r: 71 KiB w: 72 KiB
pam_timestamp_c r: 138 KiB w: 0 KiB
staprun r: 51 KiB w: 51 KiB
snmpd r: 46 KiB w: 0 KiB
pcscd r: 28 KiB w: 0 KiB
irqbalance r: 27 KiB w: 4 KiB
cupsd r: 4 KiB w: 18 KiB
Xorg r: 588140 KiB w: 0 KiB
floaters r: 97 KiB w: 7143 KiB
multiload-apple r: 543 KiB w: 542 KiB
sshd r: 72 KiB w: 72 KiB
pam_timestamp_c r: 138 KiB w: 0 KiB
staprun r: 51 KiB w: 51 KiB
snmpd r: 46 KiB w: 0 KiB
pcscd r: 28 KiB w: 0 KiB
irqbalance r: 27 KiB w: 4 KiB
cupsd r: 4 KiB w: 18 KiB
Prerequisites
Procedure
This script takes the whole device number as an argument. To find this number you can use:
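A hedged sketch, using stat to print the device number in hexadecimal; the path is a placeholder:
# stat -c "0x%D" /home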
[...]
synergyc(3722) vfs_read 0x800005
synergyc(3722) vfs_read 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
[...]
Prerequisites
Procedure
805 1078319
where:
805 is the base-16 (hexadecimal) device number. The last two digits are the minor device
number, and the remaining digits are the major number.
When passing the first two arguments to the script, you must use the 0x prefix for these base-16 numbers.
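As a sketch of how such values are typically obtained and passed to the script, assuming the monitored file is /etc/crontab, the script is saved locally as inodewatch.stp, and the script expects the major number, minor number, and inode number in that order (the file path, script name, and argument order are assumptions):
$ stat -c '%D %i' /etc/crontab
805 1078319
# stap inodewatch.stp 0x8 0x05 1078319
Here 0x8 is the major device number, 0x05 is the minor device number, and 1078319 is the inode number of the monitored file.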
CHAPTER 42. ANALYZING SYSTEM PERFORMANCE WITH BPF COMPILER COLLECTION
Procedure
1. Install bcc-tools.
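On RHEL 8, the package is typically installed with yum:
# yum install bcc-tools
2. Optionally, inspect the installed tools: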
# ll /usr/share/bcc/tools/
...
-rwxr-xr-x. 1 root root 4198 Dec 14 17:53 dcsnoop
-rwxr-xr-x. 1 root root 3931 Dec 14 17:53 dcstat
-rwxr-xr-x. 1 root root 20040 Dec 14 17:53 deadlock_detector
-rw-r--r--. 1 root root 7105 Dec 14 17:53 deadlock_detector.c
drwxr-xr-x. 3 root root 8192 Mar 11 10:28 doc
-rwxr-xr-x. 1 root root 7588 Dec 14 17:53 execsnoop
-rwxr-xr-x. 1 root root 6373 Dec 14 17:53 ext4dist
-rwxr-xr-x. 1 root root 10401 Dec 14 17:53 ext4slower
...
The doc directory in the listing above contains documentation for each tool.
Prerequisites
Root permissions
Procedure
1. Run the execsnoop program in one terminal:
# /usr/share/bcc/tools/execsnoop
2. In another terminal, run a short-lived command, for example:
$ ls /usr/share/bcc/tools/doc/
3. The terminal running execsnoop shows output similar to the following:
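An illustrative example of the kind of output you might see, reconstructed from the field descriptions that follow (the ARGS value is a placeholder that assumes the ls command from the previous step):
PCOMM            PID    PPID   RET ARGS
ls               8382   8287     0 /usr/bin/ls --color=auto /usr/share/bcc/tools/doc/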
The execsnoop program prints a line of output for each new process that consumes system
resources. It even detects processes of very short-lived programs, such as ls, which most
monitoring tools would not register.
PCOMM
The parent process name. (ls)
PID
The process ID. (8382)
PPID
The parent process ID. (8287)
RET
The return value of the exec() system call, which loads program code into new processes. (0)
ARGS
The location of the started program with arguments.
To see more details, examples, and options for execsnoop, refer to the
/usr/share/bcc/tools/doc/execsnoop_example.txt file.
1. Run the opensnoop program in one terminal:
# /usr/share/bcc/tools/opensnoop -n uname
The command above prints output only for files opened by the process of the uname command.
2. In another terminal, run:
$ uname
The command above opens certain files, which are captured in the next step.
3. The terminal running opensnoop shows output similar to the following:
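An illustrative example of the kind of output you might see, reconstructed from the field descriptions that follow; the PATH entries are illustrative examples of files that uname commonly opens:
PID    COMM               FD ERR PATH
8596   uname               3   0 /etc/ld.so.cache
8596   uname               3   0 /lib64/libc.so.6
8596   uname               3   0 /usr/lib/locale/locale-archive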
The opensnoop program watches the open() system call across the whole system, and prints a
line of output for each file that uname tried to open along the way.
PID
The process ID. (8596)
COMM
The process name. (uname)
FD
The file descriptor: a value that open() returns to refer to the open file. (3)
ERR
Any errors.
PATH
The location of files that open() tried to open.
If a command tries to read a non-existent file, then the FD column returns -1 and the ERR
column prints a value corresponding to the relevant error. As a result, opensnoop can help you
identify an application that does not behave properly.
To see more details, examples, and options for opensnoop, refer to the
/usr/share/bcc/tools/doc/opensnoop_example.txt file.
1. Run the biotop program in one terminal:
# /usr/share/bcc/tools/biotop 30
The command enables you to monitor the top processes that perform I/O operations on the
disk. The 30 argument ensures that the command produces a 30-second summary.
2. In another terminal, run, for example:
# dd if=/dev/vda of=/dev/zero
The command above reads content from the local hard disk device and writes it to the
/dev/zero file. This step generates I/O traffic to illustrate biotop.
3. The terminal running biotop shows output similar to the following:
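A simplified reconstruction based on the field descriptions that follow; real biotop output contains additional columns (such as the read/write flag and the major and minor device numbers) that are omitted here:
PID    COMM             DISK        I/O  Kbytes      AVGms
9568   dd               vda       16294  14440636     3.69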
PID
The process ID. (9568)
COMM
The process name. (dd)
DISK
The disk performing the read operations. (vda)
I/O
The number of read operations performed. (16294)
Kbytes
The amount of data, in KB, read by the operations. (14,440,636)
AVGms
The average I/O time of the read operations, in milliseconds. (3.69)
To see more details, examples, and options for biotop, refer to the
/usr/share/bcc/tools/doc/biotop_example.txt file.
1. Run the xfsslower program in one terminal:
# /usr/share/bcc/tools/xfsslower 1
The command above measures the time the XFS file system spends performing read, write,
open, or sync (fsync) operations. The 1 argument ensures that the program shows only
operations slower than 1 ms.
2. In another terminal, run, for example:
$ vim text
The command above opens a file named text in the vim editor in order to initiate interaction
with the XFS file system when the file is saved.
3. Upon saving the file from the previous step, the terminal running xfsslower shows output
similar to the following:
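An illustrative reconstruction; only the COMM, T, and OFF_KB values come from the field descriptions that follow, and the remaining values are placeholders:
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
00:08:23 b'bash'        4754   R 256     0           7.11 b'text'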
Each line above represents an operation in the file system that took more time than the specified
threshold. xfsslower is good at exposing possible file system problems, which can take the form of
unexpectedly slow operations.
COMM
The process name. (b'bash')
T
The operation type. (R) It is one of the following:
Read
Write
Sync
OFF_KB
The file offset in KB. (0)
FILENAME
The file being read, written, or synced.
To see more details, examples, and options for xfsslower, refer to the
/usr/share/bcc/tools/doc/xfsslower_example.txt file.