
TECS OpenStack

Tulip Elastic Cloud System OpenStack


Troubleshooting

Version: V7.23.40

ZTE CORPORATION
ZTE Plaza, Keji Road South, Hi-Tech Industrial Park,
Nanshan District, Shenzhen, P.R.China
Postcode: 518057
Tel: +86-755-26771900
URL: http://support.zte.com.cn
E-mail: [email protected]
LEGAL INFORMATION
Copyright © 2024 ZTE CORPORATION.
The contents of this document are protected by copyright laws and international treaties. Any reproduction or distribution of this document or any portion of this document, in any form by any means, without the prior written consent of ZTE CORPORATION is prohibited. Additionally, the contents of this document are protected by contractual confidentiality obligations.
All company, brand and product names are trade or service marks, or registered trade or service marks, of ZTE CORPORATION or of their respective owners.
This document is provided "as is", and all express, implied, or statutory warranties, representations or conditions are disclaimed, including without limitation any implied warranty of merchantability, fitness for a particular purpose, title or non-infringement. ZTE CORPORATION and its licensors shall not be liable for damages resulting from the use of or reliance on the information contained herein.
ZTE CORPORATION or its licensors may have current or pending intellectual property rights or applications covering the subject matter of this document. Except as expressly provided in any written license between ZTE CORPORATION and its licensee, the user of this document shall not acquire any license to the subject matter herein.
ZTE CORPORATION reserves the right to upgrade or make technical changes to this product without further notice. Users may visit the ZTE technical support website http://support.zte.com.cn to inquire for related information. The ultimate right to interpret this product resides in ZTE CORPORATION.
Statement on the Use of Third-Party Embedded Software:
If third-party embedded software such as Oracle, Sybase/SAP, Veritas, Microsoft, VMware, and Redhat is delivered together with this product of ZTE, the embedded software must be used only as a component of this product. If this product is discarded, the licenses for the embedded software become void as well and must not be transferred. ZTE will provide technical support for the embedded software of this product.

Revision History

Revision No. Revision Date Revision Reason


R1.0 2024-01-20 First edition

Serial Number: SJ-20240124113225-026

Publishing Date: 2024-01-20 (R1.0)


Contents

1 Fault Handling Overview....................................................................................... 1


1.1 Introduction to Faults...........................................................................................................1
1.2 Requirements for Maintenance Engineers.......................................................................... 3
1.3 Precautions for Fault Handling............................................................................................4
1.4 Fault Location Thinking and Method Descriptions.............................................................. 5
2 Fault Handling Procedures................................................................................... 8
2.1 Common Fault Handling Procedure.................................................................................... 8
2.2 Emergency Fault Handling Procedure.............................................................................. 11
3 Cluster Faults of the Control Node.................................................................... 14
3.1 Service Resources in Failed Status in the Cluster............................................................14
3.2 Failed to Start File System Resources in the Cluster....................................................... 15
3.3 Multiple Times of Automatic Switchover During Cluster Startup....................................... 17
3.4 Cannot Display Some Resources..................................................................................... 19
3.5 HA Suspends a Node....................................................................................................... 20
3.6 Clusters Co-exist but Cannot Find Each Other.................................................................21
3.7 Pacemaker Fails to Operate............................................................................................. 24
3.8 HA Node Restart............................................................................................................... 25
3.9 HA Cluster Split-Brain....................................................................................................... 26
4 Database Faults....................................................................................................29
4.1 Database Is Read-Only or Fails to Execute Commands...................................................29
4.2 Database Login Failure..................................................................................................... 30
4.3 Cannot Start the Database................................................................................................31
4.4 Cannot Back Up the Database......................................................................................... 32
5 System Environment Faults................................................................................34
5.1 Keystone-Related Faults................................................................................................... 34
5.1.1 Keystone Prompts Too Many Database Connections.......................................... 34
5.1.2 Keystone Authentication Failure........................................................................... 36
5.1.3 Keystone Authorization Failure.............................................................................36
5.1.4 Connections to the Server Cannot Be Established Due to Keystone
Authorization Failure.......................................................................................... 38
5.1.5 Fails to Create a User..........................................................................................39
5.1.6 Cloud Environment Access Is Improper............................................................... 39

5.1.7 Virtual Resource Page Prompts That The Current User Needs to Be Bound
With a Project.................................................................................................... 40
5.2 Nova Service Faults.......................................................................................................... 41
5.2.1 NOVA Fails to Be Connected to RabbitMQ......................................................... 41
5.3 Neutron Service Failure.....................................................................................................42
5.3.1 Neutron Server Error............................................................................................ 42
5.3.2 Neutron Agent Error............................................................................................. 43
5.3.3 Network Service Startup Failure...........................................................................44
5.4 Rabbitmq-Related Faults................................................................................................... 45
5.4.1 Failed to Start rabbitmq-server.............................................................................45
5.4.2 Message Server Connection Failure.................................................................... 46
5.4.3 General Rabbitmq-Related Fault Location........................................................... 47
5.4.4 Nova Cannot be Connected to Rabbitmq............................................................ 48
5.5 Automatic Restart Every Other Minute in a New Physical Environment..........49
6 Faults Related to Virtual Resources.................................................................. 50
6.1 Cannot Create a Cloud Drive............................................................................................50
6.1.1 Cannot Create a Cloud Drive With a Mirror......................................................... 50
6.1.2 Cannot Create a Cloud Drive (Based on a Fujitsu Disk Array).............................52
6.1.3 Cannot Create a Cloud Drive With a Mirror (Based on an IPSAN Disk Array)...... 55
6.1.4 The Volume With Images Fails to be Created Due to "Failed to Copy Image
to Volume"......................................................................................................... 55
6.1.5 The Volumes With Images Fail to Be Created in Batches................................... 56
6.1.6 The Volume With Images Fails to Be Created on a Fujitsu Disk Array.................57
6.1.7 The Volume Fails to Be Created and the Status of the Volume Is "error,
volume service is down or disabled"................................................................. 58
6.1.8 The Volume With Images Fails to be Created Due to "_is_valid_iscsi_ip, iscsi
ip:() is invalid".................................................................................................... 59
6.2 Cloud Drive Deletion Failure............................................................................................. 60
6.2.1 Cannot Delete a Cloud Drive, the Status of the Cloud Drive is "Error-
Deleting".............................................................................................................60
6.2.2 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "Failed to
signin.with ret code:1466"..................................................................................61
6.2.3 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "error-deleting"..... 63
6.2.4 No Response and Log Are Returned After a Volume Is Deleted......................... 63
6.3 VM Cannot Mount a Cloud Drive......................................................................................64
6.3.1 Cannot Mount a Cloud Drive When a Fujitsu Disk Array Is Used........................ 64
6.3.2 Cannot Mount a Cloud Drive When IPSAN Back-End Storage Is Used............... 65

6.4 Cannot Unmount a Cloud Drive........................................................................................ 67
6.5 Cannot Upload a Mirror.....................................................................................................68
6.5.1 Mirror Server Space Insufficient........................................................................... 68
6.5.2 Insufficient Permissions on the Mirror Storage Directory..................................... 69
6.6 Security Group Faults........................................................................................................70
6.6.1 Network Congestion Caused by Security Groups................................................ 70
7 VM Life Cycle Management Faults.....................................................................72
7.1 VM Deployment Faults...................................................................................................... 72
7.1.1 Deployment Fault Handling Entrance...................................................................72
7.1.2 No valid host was found.......................................................................................73
7.1.3 Failed to Deploy a VM on a Compute Node........................................................ 78
7.2 Hot Migration Faults.......................................................................................................... 88
7.2.1 Hot Migration Is Allowed Only in One Direction................................................... 88
7.2.2 Inter-AZ Hot Migration of VM Fails.......................................................................89
7.2.3 Destination Host Has Not Enough Resources (Not Referring to Disk Space).......90
7.2.4 Destination Host Has Not Enough Disk Space.................................................... 91
7.2.5 Source Computing Service Unavailable............................................................... 91
7.2.6 VM Goes into Error Status After Live Migration................................................... 92
7.3 Cold Migration and Resizing Faults.................................................................................. 93
7.3.1 Authentication Fails During Migration...................................................................93
7.3.2 Error "No valid host was found" Reported During Migration.................................94
7.3.3 Error "Unable to resize disk down" Reported During Resizing............................. 94
7.3.4 VM Always in "verify_resize" Status After Cold Migration or Resizing..................95
7.3.5 Mirror Error Reported During Cold Migration or Resize Operation....................... 96
7.4 Cannot Delete VM............................................................................................................. 97
7.4.1 Deletion Error Caused by Abnormal Compute Node Service...............................97
7.4.2 Control Node's cinder-volume Service Abnormal................................................. 97
7.4.3 Network Service Abnormal................................................................................... 98
8 VM Operation Failure.........................................................................................100
8.1 VM OS Startup Failure....................................................................................................100
8.1.1 Some Services of the VM are not Started......................................................... 100
8.1.2 Failed to Start the VM Due to Loss of grub Information..................................... 101
8.1.3 Too Long VM Startup Time Due to Too Large Disk...........................................101
8.1.4 Failed to Start the VM After Power Off.............................................................. 102
8.1.5 Failed to Start the VM OS, no bootable device..................................................104
8.1.6 Error Status of VM............................................................................................. 105
8.1.7 Cannot Power on the VM After Restart..............................................................106

8.1.8 Failed to Start the VM, Insufficient Memory....................................................... 106
8.1.9 VM File System Read-Only Due to Disk Array Network Interruption.................. 107
8.2 Network Disconnection (Non-SDN Scenario, VLAN)...................................................... 108
8.2.1 Cannot Ping the VM From the External Debugging Machine............................. 108
8.2.2 Cannot Ping the External Debugging Machine From the VM............................. 110
8.2.3 Cannot Ping Ports on a VLAN........................................................................... 111
8.2.4 OVS VM Cannot Be Connected.........................................................................112
8.2.5 Floating IP Address Cannot Be Pinged..............................................................113
8.2.6 The Service VM Media Plane Using the SRIOV Port Cannot Be Connected......116
8.2.7 VM (OVS+DPDK Type) Communication Failure................................................ 118
8.3 Network Disconnection (SDN Scenario, VXLAN)............................................................130
8.3.1 OVS (User Mode) VMs Not Connected............................................................. 130
8.3.2 Failed to Obtain an IP Address..........................................................................131
8.4 DHCP Faults....................................................................................................................132
8.4.1 Cannot Obtain IP Addresses Distributed by DHCP............................................132
8.4.2 Connection Failure If the Address Distributed by DHCP Is Not Used.................134
8.5 VM's NIC Unavailable......................................................................................................135
8.6 Control Console Cannot Connect to VM.........................................................................136
8.7 VM Restart Due to Invoked OOM-Killer.......................................................................... 137
9 O&M System Faults........................................................................................... 139
9.1 TECS Interface-Related Faults........................................................................................139
9.1.1 Image Uploading Queued...................................................................................139
9.1.2 Database Server Startup Failure Due to the Damaged Data File.......................140
9.1.3 Account Locked Due to Incorrect Passwords.....................................................143
9.2 Performance Index Collection Faults.............................................................................. 143
9.2.1 Cannot Obtain Performance Indexes of a Physical Machine............................. 143
9.2.2 Performance Data Record Failure......................................................................144
10 Troubleshooting Records................................................................................147
Glossary..................................................................................................................148

Chapter 1
Fault Handling Overview
Table of Contents
Introduction to Faults....................................................................................................................1
Requirements for Maintenance Engineers...................................................................................3
Precautions for Fault Handling.....................................................................................................4
Fault Location Thinking and Method Descriptions....................................................................... 5

1.1 Introduction to Faults


Fault Classification

Faults refer to phenomena in which the equipment or system software loses specified functions or creates hazards during operation. Based on the services affected by the faults and the fault impact ranges, faults can be classified as critical faults and minor faults.
• Critical faults
Critical faults are the faults that seriously affect system services and operations, including serious decline of system key performance indicators (KPIs), large-area or even full interruption of services, and abnormal charging.
• Minor faults
Minor faults are the faults that have minor impacts on services and operations, excluding the critical faults.

Sources for Discovery of Faults

The sources for discovery of faults can be divided into the following three categories:
• Complaints of terminal users
Services cannot be used properly, so users complain.
• Alarms on EMS pages
Due to equipment or software faults, the system reports alarms to the EMS. Audible and visual alarms are raised on EMS pages.
• Routine maintenance and inspections
Maintenance engineers detect equipment or system faults during routine maintenance and inspections.

Common Causes of Faults

Faults of the TECS OpenStack are generally caused by the following reasons:
• Faults in the hardware
Contact hardware platform engineers to resolve the faults.
• Faults in software and the system
– Faults of the operating system
The operating system has memory management and security problems.
– Database faults
Improper settings and usage of databases also cause various problems in usage and security.
– Program or software faults
A module of the TECS OpenStack may be faulty, or an unmatched version of software is used.
• Faults caused by environmental changes
– Stable system operation has strict requirements on the environment. If the temperature or humidity does not meet the requirements, the system generates alarms.
– The occurrence of natural disasters or accidents can also cause alarms, such as lightning, fire, and infrared sensor alarms.
• Setting or configuration faults
– Problems in settings
Problems in the settings of equipment interfaces, system rights, and files or folders.
– Problems in configuration
Invalid or unreasonable configuration reduces system performance and capacity. Alarm thresholds must be configured appropriately.
– Command errors
An invalid command is entered, or a program with error scripts is executed.
• Network faults
– Local connection problems
The VM is not installed properly, the ports of the network cable have contact problems, or the IP address, subnet mask, default gateway, and route are not set correctly. These normally cause network connection problems.
– Faults of network equipment
Faults occur in the network equipment, routing and switching equipment, or intermediate links on the Internet.
– Network application problems
The FTP and NTP services required by the TECS OpenStack are unavailable. Network problems also include network security problems, and congestion and voice quality problems caused by cyber storms.
• Faults in interconnected equipment
Errors in the peer equipment also cause faults in the system. Check the operational status of the interconnected equipment.

1.2 Requirements for Maintenance Engineers


Description

Mastering the required skills is a prerequisite for maintenance engineers to successfully conduct troubleshooting.

Fundamental Knowledge

• Be familiar with the basic knowledge of computer networks such as Ethernet and TCP/IP.
• Be familiar with the basic knowledge of the MariaDB and MongoDB databases.
• Be familiar with the basic knowledge of the Linux system.
• Be familiar with basic virtualization knowledge.

Network Architecture and Operating Environment

• Be familiar with the network architecture and IP planning of the TECS OpenStack.
• Be familiar with the connection relations between the TECS OpenStack and other devices in the network.

Device Operation Requirements

• Be good at the use of commands in the Linux system.
• Be good at the use of the MariaDB and MongoDB database commands.
• Be good at the use of SSH commands.
• Be good at the operation methods of the management portal of the TECS OpenStack.
• Be familiar with the operations that may interrupt some or all services.
• Be familiar with the operations that may damage the device.
• Be familiar with the operations that may cause user complaints.
• Be familiar with the emergency measures and backup measures.
• Grasp the determination, location, and handling methods of critical faults. Skillfully use these methods to handle major faults.
• Collect and save the on-site data. On-site data collection includes periodic data collection during proper device operation and data collection when device faults occur. Generally, acquire and save the on-site data before fault handling.

1.3 Precautions for Fault Handling


Precautions

• During routine maintenance
To raise the efficiency of fault handling, make the following preparations during routine maintenance:
– Draft the diagram of physical connections between on-site equipment.
– Make a table of component/equipment communication, interconnection, and rights information, including VLANs, IP addresses, interconnected port numbers, firewall configurations, user names, and passwords.
– Make software archives, and record the software configuration, version number, and any change information.
– Check the remote access equipment and the fault diagnosis tools periodically to ensure their proper operation, including the test instruments and packet capture tools.
– Put contact information of ZTE technical support in a noticeable place for timely contact. Update the contact details periodically.
• During the occurrence of the fault
– Handle a fault promptly when it occurs. Contact ZTE technical support if the problem is hard to resolve, especially in one of the following cases:

No. Case
1 A critical fault occurs, and part or all of the services are interrupted.
2 The problem cannot be solved by using the known fault handling methods.
3 The problem cannot be solved by using the previous handling methods of similar faults.

– For any problem during maintenance, record the raw information in detail, including the symptom of the fault, operations performed before the occurrence of the fault, versions, and data changes.
• During the handling of faults
– Observe operation regulations and industry safety regulations strictly to ensure personal and equipment safety.
– During component replacement and maintenance, take antistatic measures and wear antistatic wrist straps.
– Do not connect or disconnect unknown network cables.
– Keep a detailed fault handling and trace log, and record the fault handling steps for analysis and handling. For a long-term fault handling process, keep a shift record to define the responsibilities clearly.
– Record all major operations, such as restarting a process, deleting a file, and modifying a configuration file. Before each operation, verify its feasibility. Major operations must be implemented by qualified operators after backup, emergency, and safety measures are prepared.

Dangerous Operations

The following are dangerous operations that must be implemented with caution during fault handling:
• Modifying service parameters.
• Deleting the configuration file of the system.
• Modifying the configuration of the network equipment.
• Altering the network architecture.

Conditions for Contacting ZTE Engineers

The following information is required to be provided to ZTE technical support engineers:
• Fault details: time, place, and event.
• Log and ticket query results.
• Captured packets.
• Implemented operations, commands, and results of the commands.
• Telnet methods and contact numbers.
• For service faults, the caller number, called number, fault occurrence time, and the number of affected users.

1.4 Fault Location Thinking and Method Descriptions


Fault Location Thinking

• Principle 1: If data was modified before the fault occurred, restore the data immediately.
– Quickly determine whether the fault is related to the operation in accordance with the operation contents, operation time, and fault occurrence time.
– After the preliminary determination, perform the corresponding restoration operation in accordance with the operation performed before the fault occurred.
• Principle 2: If a fault occurs during equipment room construction, check whether the fault is related to the construction.
– In accordance with on-site conditions, determine whether the fault could have been caused by the construction. For example, an internal cable of the system was disconnected by mistake.
– Determine the operational status of the system in accordance with alarm management and board indicators. Focus on internal cable connections.
• Principle 3: Verify that physical machines are in normal state.
All physical machines must be in normal state. You can check their status by viewing blade indicators. Ensure that the physical machines operate properly.
• Principle 4: Ensure that the control nodes in two-server cluster mode operate properly.
Run the crm_mon -1 command to check whether the two-server cluster is in normal state.
• Principle 5: Verify that VMs are in normal state.
Through the TECS OpenStack, verify that all VMs are in the "active running" state.
• Principle 6: Verify that the network of physical machines is normal (see the sketch after this list).
– On a physical machine, ping another physical machine, and check the ping result.
– If there is a disk array, ping the disk array management interface from the control node, and ping the disk array service interface from the computing node.
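
The checks in Principles 4 and 6 can be combined into a quick health sweep. The following is a minimal sketch; the host and disk-array addresses are placeholders that must be replaced with the values from your own IP plan:

#!/bin/bash
# Quick health sweep for the fault-location principles above.
# The IP addresses below are examples only; substitute your own plan.
HOSTS="192.168.1.11 192.168.1.12"       # other physical machines
DISK_ARRAY="192.168.2.100"              # disk array management interface
crm_mon -1 | head -n 20                 # Principle 4: two-server cluster state
for ip in $HOSTS $DISK_ARRAY; do        # Principle 6: basic reachability
    if ping -c 3 -W 2 "$ip" > /dev/null 2>&1; then
        echo "$ip reachable"
    else
        echo "$ip UNREACHABLE - check cabling, VLAN, and routes"
    fi
done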

Fault Location Method Descriptions

If a fault occurs in the system, analyze it in accordance with the alarm information and the
generated error log to find the fault cause and fix the fault.

Determining the Fault Cause Through Alarm Information

On the TECS OpenStack, check all the current alarms in the system, and analyze and determine the fault cause in accordance with the alarm information. Keys to querying and collecting alarm information:
• In the collected alarm information, focus on the alarm level, alarm code, location, time, and details.
• After the current alarm information is collected, determine whether to collect historical alarm information as required.
After the current alarm information is collected, perform analysis as follows:
1. Preliminary analysis and determination: In accordance with the keys of the current alarm information (for example, whether the alarm is a critical or major alarm), determine the fault cause and impact.
2. Alarm relationship analysis: Analyze the sequence and codes of the current alarms, and clarify the relationships between the alarms. In this way, the fault occurrence procedure can be known.


Locating the Fault Cause Through Log Analysis

In the quick troubleshooting procedure, the logs of the TECS OpenStack are an important tool. After the log information is collected, the fault can be analyzed quickly.
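
As a sketch of such log analysis, the following commands scan recent service logs for error lines; the log directories are the usual OpenStack locations and may differ on a TECS OpenStack installation:

# Scan recent OpenStack service logs for errors (paths are assumptions).
for d in /var/log/nova /var/log/neutron /var/log/cinder /var/log/keystone; do
    [ -d "$d" ] || continue
    grep -iE "ERROR|CRITICAL|Traceback" "$d"/*.log 2>/dev/null | tail -n 20
done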



Chapter 2
Fault Handling Procedures
Table of Contents
Common Fault Handling Procedure.............................................................................................8
Emergency Fault Handling Procedure....................................................................................... 11

Fault handling includes common fault handling and emergency fault handling, which have
different procedures.
When a fault occurs, on-site maintenance personnel must determine whether the fault is an
emergency fault. If it is an emergency fault, follow the emergency fault handling procedure. If it
is a common fault, follow the common fault handling procedure.

2.1 Common Fault Handling Procedure


Fault Handling Procedure

To locate a fault to the specific module during the troubleshooting, you need to analyze the flow
and the network element (NE). During the fault location, you should start with the flow and the
system composition, analyze and determine the fault in accordance with the symptoms, exclude
normal modules and determine the fault module.
Figure 2-1 shows the common procedure for handling a fault.


Figure 2-1 Common Fault Handling Procedure

Fault Handling Operations

During a fault handling procedure, the troubleshooting engineers should perform the following
steps in turn:
1. Determine the situations
If a fault occurs, perform a simple test to know the situation of the fault.
2. Collect source information
If a fault occurs, record detailed information about the fault, including the symptom, alarms
and operating information displayed on the TECS OpenStack window, operations that you
have performed to handle this fault, and other information that you can collect with the
maintenance tools (such as performance management).
3. Classify the fault
Analyze the fault initially and classify it in accordance with the symptom and the information
that you have collected with the maintenance tools.
4. Locate the fault
Locate the fault and determine the possible causes by analyzing the flow and NEs.
5. Remove the fault
Remove the fault in accordance with the located fault causes.
6. Record the fault handling information
Record details about the fault handling, including the symptom and the handling methods.
Such information is a helpful reference for the handling of similar faults. It is recommended
that the sheet shown in 10 Troubleshooting Records be used to record the fault handling
information, and you can also record the fault handling information with a sheet designed by
yourself.

Precautions

• Make rules and regulations for fault handling and tracing for all maintenance personnel to follow. Only authorized and relevant persons are allowed to participate in the troubleshooting, to avoid worse faults caused by misoperations.
• Perform operations and maintenance by following the instructions in the documents of the TECS OpenStack.
• Back up service data and system operating parameters before the fault handling. Make a detailed record of the fault symptoms, versions, configuration changes, and operations that you have performed. Collect other data about the fault for analyzing and removing the fault.
• Trace and record the detailed fault handling procedure. For a fault that may last for days, make detailed shift records to clarify the responsibilities.
• Handle every fault promptly. If there is any fault that you cannot remove, contact ZTE technical support. In any of the following situations, you should contact ZTE technical support:
– Emergency faults, for example, all services or some services are interrupted.
– Faults that you cannot remove with the methods described in this document.
– Faults that you cannot remove with your own knowledge.
– Faults that you cannot remove by referring to similar fault removal cases.
• Paste a list of ZTE contacts in a conspicuous place, and remember to confirm and update the contacts frequently.


• When you are contacting ZTE for technical support, you may be required to provide the following information:
– Detailed symptoms of the fault, including the time, place, and events.
– Alarm management data, performance management data, signaling tracing results, and failure observation results.
– Operations that you have performed after the fault occurred.
– The way to remotely log in to the system and the telephone numbers of contact persons.

2.2 Emergency Fault Handling Procedure


Definition of Emergency Fault

When an emergency fault occurs, the device cannot provide basic services or operate properly for more than 30 minutes, or the device causes human safety hazards. All emergency faults must be handled immediately.
Emergency faults of the TECS OpenStack can be classified into the following types:
• Failure to provide basic services due to multiple causes, such as equipment breakdown, power-off, system crash, or environmental or human factors, where the faults are not removed after the preliminary handling and need to be handled immediately.
• The rate of successful service handling operations declines by 5% or more, or many subscribers or important customers complain about the interruption or poor quality of services.
• Failure to access subscriber data, or damage to subscriber data completeness and consistency.
• Failure to maintain the device with the TECS OpenStack window.
• Influences on basic service provisioning by other equipment.
• Hazards to human safety caused by use of the product.

Principles for Emergency Fault Handling

Once emergency faults on the equipment are reported or found, to restore the system as soon as possible, handle the fault in accordance with the procedure shown in Figure 2-2, and contact the local ZTE technical support.
Collect statistics on power supply failures, network failures, and other faults, and troubleshoot the fault based on the statistical result. When you have confirmed that the power and communication are normal, you can use the alarm management function to locate the node where the fault possibly lies.


Procedure of Emergency Fault Handling

Once an emergency fault happens, contact and organize relevant persons and departments to perform the emergency fault handling, and notify the authority department and supervisors immediately. After recovery, submit a written failure report to the authority departments and supervisors. Also organize relevant technicians, departments, and equipment suppliers to locate the causes, so that lessons can be drawn and effective measures can be taken in subsequent operations to avoid its recurrence. After the recovery, make a detailed emergency fault record and archive it.
Figure 2-2 shows the procedure of handling emergency faults.


Figure 2-2 Procedure of Emergency Fault Handling



Chapter 3
Cluster Faults of the Control Node
Table of Contents
Service Resources in Failed Status in the Cluster.................................................................... 14
Failed to Start File System Resources in the Cluster................................................................15
Multiple Times of Automatic Switchover During Cluster Startup................................................17
Cannot Display Some Resources.............................................................................................. 19
HA Suspends a Node................................................................................................................ 20
Clusters Co-exist but Cannot Find Each Other......................................................................... 21
Pacemaker Fails to Operate...................................................................................................... 24
HA Node Restart........................................................................................................................ 25
HA Cluster Split-Brain................................................................................................................ 26

3.1 Service Resources in Failed Status in the Cluster


Symptom

When you run the crm_mon -1 command to query resource status, a TECS service resource is
in failed status. For example, the following message is displayed:

openstack-nova-api (ocf::heartbeat:systemd-ctl): FAILED host127

Note
This section uses the openstack-nova-api resource as an example to describe the fault symptom and
troubleshooting procedure.

Probable Cause

Probable causes of this fault are as follows:
• A server fails to be started.
• The service is aborted.


Action

1. Run the following command as the root user to disable the monitoring of the openstack-nova-api resource in the HA:
   pcs resource unmanage openstack-nova-api
2. Run the following command to start the resource service:
   systemctl start openstack-nova-api.service
3. Run the following command to check the service status and logs, and handle possible problems according to the prompts:
   crm_mon -1
4. Run the following command to enable the monitoring of the openstack-nova-api resource in the HA:
   pcs resource meta openstack-nova-api is-managed=true
   Check whether the fault is removed. (Steps 1 to 4 are combined in the sketch after this procedure.)
   • Yes → End.
   • No → Step 5.
5. Contact ZTE technical support.
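
Steps 1 to 4 can be run as one sequence. A minimal sketch, using the openstack-nova-api example from this section; substitute the actual failed resource name shown by crm_mon -1:

RES=openstack-nova-api                      # failed resource reported by crm_mon -1
pcs resource unmanage "$RES"                # step 1: disable HA monitoring
systemctl start "$RES".service              # step 2: start the service manually
systemctl status "$RES".service --no-pager  # step 3: check the service status
crm_mon -1 | grep "$RES"                    # step 3: check the resource status
pcs resource meta "$RES" is-managed=true    # step 4: re-enable HA monitoring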

Expected Result

• The cluster is started properly.
• When you run the crm_mon -1 command, no resource is in failed status.

3.2 Failed to Start File System Resources in the Cluster


Symptom

When the cluster is started and you run the crm_mon command, a file system resource (Filesystem) fails to be started. For example, the following information is displayed:

mysql_fs (ocf::heartbeat:Filesystem): FAILED host127

Note
This section uses the mysql_fs resource as an example to describe the fault symptom and
troubleshooting procedure.

Action

1. Perform the following steps as the root user to check whether the disk to be mounted and the mounting point exist.
   a. Run the pcs resource show mysql_fs command. For the command result, see Figure 3-1.

   Figure 3-1 pcs resource show mysql_fs Command Result

   b. Run the mount |grep mysql command. For the command result, see Figure 3-2.

   Figure 3-2 mount |grep mysql Command Result

   c. Check the Attribute line in Figure 3-1. device indicates the path and name of the device to be mounted, and directory indicates the directory where the device is mounted. Compare the check result with the information displayed in Figure 3-2 to see whether they are consistent.
      • Yes → The disk to be mounted and the mounting point exist. Go to Step 4.
      • No → The disk to be mounted and the mounting point do not exist or have errors. Go to Step 2.
2. Run the mount command to attempt to manually mount the disk.
   Example: mount -t ext4 /dev/mapper/vg_db-lv_db /var/lib/mysql, where:
   • -t ext4: file system type.
   • /dev/mapper/vg_db-lv_db: device to be mounted.
   • /var/lib/mysql: mounting point of the device.
3. Run the df |grep mysql command to check whether the name of the mounted device is the same as the device parameter (a scripted version of this comparison is sketched after this procedure). For the command result, see Figure 3-3.

   Figure 3-3 df |grep mysql Command Result

   • Yes → End.
   • No → Step 4.
4. Contact ZTE technical support.
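
The comparison in steps 1 and 3 can be scripted. This sketch assumes that the pcs resource show output contains device= and directory= attributes on the Attribute line, as in Figure 3-1; verify the exact output format of your pcs version:

RES=mysql_fs
# Extract the configured device and mount point from the resource attributes.
DEV=$(pcs resource show "$RES" | grep -o 'device=[^ ]*' | cut -d= -f2)
DIR=$(pcs resource show "$RES" | grep -o 'directory=[^ ]*' | cut -d= -f2)
echo "configured: $DEV mounted on $DIR"
# Compare with what is actually mounted.
mount | grep "$DIR" || echo "$DIR is not mounted - try: mount -t ext4 $DEV $DIR"
df | grep mysql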

Expected Result

The file system resources are started successfully, and the cluster is started successfully.


3.3 Multiple Times of Automatic Switchover During Cluster Startup

Symptom

When the cluster is started, the active/standby switchover notification is repeatedly reported from upper-layer services. Run the crm_mon -1 command multiple times to check whether the cluster is automatically switched over (the started host changes). For example:

[root@NJIROS2 ~]# crm_mon -1
Last updated: Wed Jul 22 11:11:49 2015
Last change: Thu Jul 16 16:33:43 2015 via crm_attribute on host5
Stack: corosync
Current DC: host4 (1) - partition with quorum
Version: 1.1.10.36791-1.el7-f2d0cbc
2 Nodes configured
41 Resources configured
Online: [ host4 host5 ]  //host4 and host5 are the two servers that form the cluster.
Clone Set: bond0-clone [bond0]
    Started: [ host4 host5 ]
database_float_ip (ocf::heartbeat:IPaddr2): Started host5
//Run the command multiple times to check whether the services are switched between the two servers.
Resource Group: DB
    mysql_fs (ocf::heartbeat:Filesystem): Started host5
    mariadb (ocf::heartbeat:systemd-ctl): Started host5
Resource Group: AMQP
    rabbitmq-server (ocf::heartbeat:systemd-ctl): Started host5
openstack-keystone (ocf::heartbeat:systemd-ctl): Started host5
neutron-server (ocf::heartbeat:systemd-ctl): Started host5
neutron-l3-agent (ocf::heartbeat:systemd-ctl): Started host5
neutron-dhcp-agent (ocf::heartbeat:systemd-ctl): Started host5
glance_fs (ocf::heartbeat:Filesystem): Started host5
openstack-glance-api (ocf::heartbeat:systemd-ctl): Started host5
openstack-glance-registry (ocf::heartbeat:systemd-ctl): Started host5
openstack-cinder-api (ocf::heartbeat:systemd-ctl): Started host5
openstack-cinder-scheduler (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-api (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-conductor (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-scheduler (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-cert (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-consoleauth (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-novncproxy (ocf::heartbeat:systemd-ctl): Started host5
openstack-nova-monitor (ocf::heartbeat:systemd-ctl): Started host5
httpd (ocf::heartbeat:systemd-ctl): Started host5
opencos-alarmmanager (ocf::heartbeat:systemd-ctl): Started host5
opencos-alarmagent (ocf::heartbeat:systemd-ctl): Started host5
openstack-heat-api (ocf::heartbeat:systemd-ctl): Started host5
openstack-heat-engine (ocf::heartbeat:systemd-ctl): Started host5
openstack-heat-api-cfn (ocf::heartbeat:systemd-ctl): Started host5
openstack-heat-api-cloudwatch (ocf::heartbeat:systemd-ctl): Started host5
mongod (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-api (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-central (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-alarm-evaluator (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-alarm-notifier (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-notification (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-collector (ocf::heartbeat:systemd-ctl): Started host5
openstack-ceilometer-mend (ocf::heartbeat:systemd-ctl): Started host5

Action

1. Run the following command as the root user to temporarily disable service restart upon failure:
   pcs resource op defaults on-fail=ignore
   With on-fail=ignore, the HA does not restart a service that is in failed status. This avoids the active/standby switchover that is triggered when a failed service has been restarted by the HA a specific number of times.
2. Run the crm_mon -1 command and check the output information.

   If...                                        Then...
   A service resource is in failed status.      Refer to 3.1 Service Resources in Failed Status in the Cluster.
   A file system resource cannot be started.    Refer to 3.2 Failed to Start File System Resources in the Cluster.
   Resources are normal ("failure ignored"      Go to Step 3.
   can be seen).

   To confirm whether switchover is still recurring, see the sampling sketch after this procedure.
3. Check whether the fault is removed.
   • Yes → Step 4.
   • No → Step 5.
4. Run the following command to restore the environment:
   pcs resource op defaults on-fail=restart
5. Contact ZTE technical support.
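
The sampling sketch referenced in step 2: print the node hosting a representative resource at intervals; a changing host name indicates that switchover is still occurring. database_float_ip is taken from the example output above; use any single-instance resource from your own cluster:

# Sample the hosting node of database_float_ip every 10 seconds.
while true; do
    crm_mon -1 | grep database_float_ip
    sleep 10
done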

Expected Result

Resources are started successfully, and the cluster is started successfully.

3.4 Cannot Display Some Resources


Symptom

After the cluster is started, some functions cannot be used, for example, alarms are not
reported. When you run the crm_mon -1 command, some resources are not displayed, for
example, opencos-alarmmanager.

Action

1. Run the following command as the root user to check whether the stonith-enabled property is already set to false. If it is, "stonith-enabled: false" is returned.
   pcs property show |grep stonith
   • Yes → Step 4.
   • No → Step 2.
2. Run the following command to set stonith-enabled to false:
   pcs property set stonith-enabled=false
3. Run the following command to check whether all the configured resources can be displayed (for example, the opencos-alarmmanager resource is displayed after configuration), and whether no resource is in Offline, Stopped, or Failed status:
   crm_mon -1
   • Yes → End.
   • No → Step 4.
4. Run the following command to check whether any resource is disabled by the HA:
   crm_mon -! |grep opencos-alarmmanager |grep Stopped
   For example, if a resource is disabled by the HA, the following message is displayed:

   <nvpair id="opencos-alarmagent-meta_attributes-target-role" name="target-role" value="Stopped"/>

   • Yes → Step 5.
   • No → Step 7.
5. Run the following command to enable the resource:
   pcs resource enable opencos-alarmmanager

   Note
   Repeat this step to enable all the other resources that are not displayed. A sketch that enables all such resources in one pass follows this procedure.

6. Run the following command to check whether all the configured resources can be displayed and no resource is in Offline, Stopped, or Failed status. For example, a resource that was disabled and in Stopped status before is now in Started status.
   crm_mon -1
   • Yes → End.
   • No → Step 7.
7. Contact ZTE technical support.
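
The sketch referenced in the note of step 5: enable every resource that the HA has disabled. It assumes, as in the example of step 4, that disabled resources appear in the crm_mon -! output as single-line nvpair entries with target-role set to Stopped:

# List resources disabled by the HA (target-role=Stopped) and enable them.
crm_mon -! | grep 'name="target-role" value="Stopped"' \
    | sed 's/.*nvpair id="//;s/-meta_attributes.*//' \
    | while read -r res; do
          echo "enabling $res"
          pcs resource enable "$res"
      done
crm_mon -1      # verify that all resources are displayed and started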

Expected Result

• The cluster is started successfully.
• When you run the crm_mon -1 command, all the resources are displayed.

3.5 HA Suspends a Node


Symptom

After the cluster is started, run the crm_mon -! | grep disable_fence_reboot command on a node and check whether any result is returned. If a result is returned, the node has been suspended by the HA, and the node does not run any resources.
Example:

[root@A10157785 init.d]# crm_mon -! | grep disable_fence_reboot
<nvpair id="status-1-disable_fence_reboot" name="disable_fence_reboot"
value="ClusterIP2(IGN_FA:0x00000001) stop failed 12 times on pc1(threshold:10)"

Note
When detecting that a node is repeatedly restarted within a specific period due to resource errors, the HA
suspends the node. If there is no manual intervention, the node will be automatically restored to normal
30 minutes later.


Action

1. Run the following command as the root user to manually restore the node to normal:
   crmadmin -chost_name

   Note
   host_name is the node name; there is no space between -c and host_name.

2. Run the reboot command to restart the node server.
3. Run the following command to check whether the node is restored to normal. The node is normal when no output is returned. (A combined check-and-restore sketch follows this procedure.)
   crm_mon -! | grep disable_fence_reboot
   • Output is still returned → Step 4.
   • No output → End.
4. Contact ZTE technical support.
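
The check-and-restore sketch referenced in step 3, run on the suspended node itself. The crmadmin invocation follows the command form described in step 1 (no space between -c and the node name):

# Restore this node if the HA has suspended it.
if crm_mon -! | grep -q disable_fence_reboot; then
    crmadmin -c"$(hostname)"    # manually clear the suspension (step 1)
    reboot                      # restart the node server (step 2)
fi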

Expected Result

When you run the crm_mon -! | grep disable_fence_reboot command, no output information
is displayed.

3.6 Clusters Co-exist but Cannot Find Each Other


Symptom

The pacemaker is operating properly, but the cluster nodes fail to find each other on the pacemaker.
For example, run the crm_mon -1 command on host-2018-abcd-abcd-1234-4321-5678-8765-12aa, and the execution result shows that host-2018-abcd-abcd-1234-4321-5678-8765-12cc is OFFLINE. Run the command on host-2018-abcd-abcd-1234-4321-5678-8765-12cc, and the execution result shows that host-2018-abcd-abcd-1234-4321-5678-8765-12aa is OFFLINE.

[root@host-2018-abcd-abcd-1234-4321-5678-8765-12aa vtu]# crm_mon -1
Last updated: Sat May 16 14:08:21 2020
Last change: Sat May 16 11:03:58 2020 via cibadmin on host-2018-abcd-abcd-1234-4321-5678-8765-12aa
Stack: corosync
Current DC: host-2018-abcd-abcd-1234-4321-5678-8765-12aa (1) - partition with quorum-180408
2 Nodes configured
83 Resources configured
Online: [ host-2018-abcd-abcd-1234-4321-5678-8765-12aa ]
OFFLINE: [ host-2018-abcd-abcd-1234-4321-5678-8765-12cc ]

Action

1. Run the corosync-cmapctl | grep member command on both nodes that cannot find each other, and perform the following operations in accordance with the output result.
   • If the following result is returned, the corosync has exited abnormally and the pcsd monitoring has failed to restart it. In this case, the pacemaker cannot operate normally. Collect the coredump log and black box data for further analysis, and go to Step 6.

     Failed to initialize the cmap API. Error CS_ERR_LIBRARY

   • If the following result is displayed, the peer end is not found.

     runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
     runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(128.0.0.14) r(1) ip(129.0.0.14) r(2) ip(130.0.0.14)
     runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
     runtime.totem.pg.mrp.srp.members.1.status (str) = joined

   • If the following result is returned, the local end has found the peer end, but the peer end has left.

     runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
     runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(128.0.0.14) r(1) ip(129.0.0.14) r(2) ip(130.0.0.14)
     runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
     runtime.totem.pg.mrp.srp.members.1.status (str) = joined
     runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
     runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(128.0.0.15) r(1) ip(129.0.0.15) r(2) ip(130.0.0.15)
     runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 2
     runtime.totem.pg.mrp.srp.members.2.status (str) = left

2. If the two ends can find each other, the pacemaker is faulty. Go to Step 6.
3. If the two ends cannot find each other, the network may be disconnected or the heartbeat IP configuration may be incorrect. Run the corosync-cfgtool -s command to check whether the heartbeat IP addresses are the same as the configurations.

   [root@zteha ~]# corosync-cfgtool -s
   Printing ring status.
   Local node ID 1
   RING ID 0
       id = 128.0.0.15
       status = ring 0 active with no faults
   RING ID 1
       id = 129.0.0.15
       status = ring 1 active with no faults
   RING ID 2
       id = 130.0.0.15
       status = ring 2 active with no faults

   In the output, "ring 0 active with no faults" indicates that the heartbeat link is normal; this status is meaningful only when the two sides can find each other.
4. Check the configuration file. The configurations are as follows:

   nodelist {
       node {
           ring0_addr: 128.0.0.14
           ring1_addr: 129.0.0.14
           ring2_addr: 130.0.0.14
           name: test1
           nodeid: 1
       }
       node {
           ring0_addr: 128.0.0.15
           ring1_addr: 129.0.0.15
           ring2_addr: 130.0.0.15
           name: test2
           nodeid: 2
       }
   }

   The contents displayed in the configuration file must be consistent with the result returned by the corosync-cfgtool -s command.
   If the heartbeat IP addresses match the configurations but the two ends still cannot find each other, the heartbeat link may be broken. Run the ping and route commands to check whether the heartbeat links are normal and whether the routes are correct (see the sketch after this procedure).
5. If the heartbeat links are normal, shut down the firewall and check again. If the fault persists, collect all information for further analysis.
6. Contact ZTE technical support.
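
Steps 3 and 4 can be partially automated: read the ring addresses from corosync.conf and ping each one. A minimal sketch, assuming the nodelist format shown in step 4; the local node's own ring addresses should always answer, and an unreachable peer address points to a broken heartbeat link:

# Verify local ring status, then ping every configured ring address.
corosync-cfgtool -s
grep -oE 'ring[0-9]_addr: *[0-9.]+' /etc/corosync/corosync.conf \
    | awk '{print $2}' \
    | while read -r ip; do
          ping -c 2 -W 2 "$ip" > /dev/null 2>&1 \
              && echo "heartbeat $ip reachable" \
              || echo "heartbeat $ip UNREACHABLE"
      done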


Expected Result

Each node in the cluster on the pacemaker can find the peer end.
Run the crm_mon -1 command on all nodes. If the following result is displayed, all nodes in the cluster are online.

[root@host-2018-abcd-abcd-1234-4321-5678-8765-12cc vtu]# crm_mon -1
Last updated: Sat May 16 14:27:06 2020
Last change: Sat May 16 11:03:58 2020 via cibadmin on host-2018-abcd-abcd-1234-4321-5678-8765-12aa
Stack: corosync
Current DC: host-2018-abcd-abcd-1234-4321-5678-8765-12aa (1) - partition with quorum-180408
2 Nodes configured
83 Resources configured
Online: [ host-2018-abcd-abcd-1234-4321-5678-8765-12aa host-2018-abcd-abcd-1234-4321-5678-8765-12cc ]

3.7 Pacemaker Fails to Operate


Symptom

Run the corosync-cmapctl | grep member command on both sides of the cluster. The following result indicates that the corosync exits abnormally and the pcsd monitoring fails to restart the corosync.

[root@cent6c corosync]# corosync-cmapctl | grep member
Failed to initialize the cmap API. Error CS_ERR_LIBRARY

Action

1. Check whether the configuration file of Corosync is correct and whether the heartbeat IP address is correctly configured for the network port. Run the corosync -t command to check whether the configuration file is valid.
   • If the following result is displayed, the configuration file is damaged or incorrectly configured. Collect the configuration file for troubleshooting.

     [root@cent6c corosync]# corosync -t
     info [MAIN ] Maximum core file size is: 18446744073709551615
     Jan 30 15:53:58 error [MAIN ] parse error in config: No multicast address specified
     Jan 30 15:53:58 error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1416.

   • If the following result is displayed, the configuration file is correct, but the network port may not be configured with an IP address.

     [root@cent6c corosync]# corosync -t
     info [MAIN ] Maximum core file size is: 18446744073709551615
     Jan 30 16:35:20 notice [MAIN ] Corosync Cluster Engine exiting normally

2. Run the corosync -f command to manually run Corosync in the foreground to find the causes. If the following information is displayed, the network port is not configured with an IP address. Configure a correct heartbeat IP address for the network port and restart the two-server cluster. If other information is displayed, collect all related data for further analysis. (A combined validation sketch follows this procedure.)

     [root@cent6c corosync]# corosync -f
     info [MAIN ] Maximum core file size is: 18446744073709551615
     Jan 30 16:38:12 notice [MAIN ] Corosync Cluster Engine ('2.3.4.3'): started and ready to provide service.
     Jan 30 16:38:12 info [MAIN ] Corosync built-in features: pie relro bindnow
     Jan 30 16:38:12 warning [TOTEM ] bind token socket failed: Cannot assign requested address (99)
     Jan 30 16:38:12 error [TOTEM ] totemudpu_create_sending_socket error:-1.
     Jan 30 16:38:12 error [MAIN ] Corosync Cluster Engine exiting with status 15 at totemudpu.c:1237.

3. Contact ZTE technical support.

Expected Result

Pacemaker can properly operate.

3.8 HA Node Restart


Symptom

The HA nodes are restarted.

Action

1. Check the faulty resources. For details, refer to 3.1 Service Resources in Failed Status in
the Cluster.
2. Check whether there are heartbeat faults between the nodes, especially heartbeat disconnection. While both nodes are running, disconnection of a heartbeat line can cause both of them to operate as active nodes. After the fault is resolved, the HA software restarts the node with fewer resources.


3. Check whether the HA processes operate properly. During normal operation of the two nodes, if the HA process of one node is faulty, the healthy node restarts the faulty one.
4. Collect log information. To collect memory log information, run the crm_mon -1 | grep _reboot command. File logs are saved in the /var/lib/pacemaker/pengine/crm_status_save.xml file.
5. Contact ZTE technical support.

Expected Result

The HA nodes operate properly.

3.9 HA Cluster Split-Brain


Symptom

Run the crm_mon -1 command on a node. If split-brain occurs, the other nodes are displayed as offline. If the number of split-brain nodes in the cluster is smaller than half of the total number of nodes in the cluster, single-instance services do not operate on the split-brain nodes; multi-instance services operate properly, but only standby instances are running.

[root@host-192-168-32--ab27 tecs]# crm_mon -1

Last updated: Fri Dec 6 15:45:38 2019

Last change: Fri Dec 6 10:01:47 2019 via cibadmin on host-192-168-32--ab27

Stack: corosync

Current DC: host-192-168-32--ab27 (1) - partition WITHOUT quorum-180408

Version: 1.1.10.40182-1.el7.centos-f2d0cbc

3 Nodes configured

16 Resources configured

Online: [ host-192-168-32--ab27 ]

OFFLINE: [ host-192-168-32--ab28 host-192-168-32--ab29 ]

Clone Set: MANAGEMENT-clone [MANAGEMENT]

Started: [ host-192-168-32--ab27 ]

Stopped: [ host-192-168-32--ab28 host-192-168-32--ab29 ]

Status: [ host-192-168-32--ab27,0 ]

Master/Slave Set: mongod-master [mongod]

Slaves: [ host-192-168-32--ab27 ]

Stopped: [ host-192-168-32--ab28 host-192-168-32--ab29 ]

Master score: [ host-192-168-32--ab27=null host-192-168-32--ab29=null


host-192-168-32--ab28=null ]

Clone Set: mongod_fs-clone [mongod_fs]

Started: [ host-192-168-32--ab27 ]

Stopped: [ host-192-168-32--ab28 host-192-168-32--ab29 ]

Clone Set: lv_mongodb_plug_agent-clone [lv_mongodb_plug_agent]

Started: [ host-192-168-32--ab27 ]

Stopped: [ host-192-168-32--ab28 host-192-168-32--ab29 ]

Action

1. Check the /etc/corosync/corosync.conf configuration file to obtain the HA heartbeat addresses, and run the ping command to check whether all heartbeat links are broken (see the example after this step). As long as one heartbeat link is normal, split-brain does not occur.

[root@host-2018-abcd-abcd-1234-4321-5678-8765-12aa vtu]# cat /etc/corosync/corosync.conf

totem {
    crypto_hash: none
    token_retransmits_before_loss_const: 30
    netmtu: 1500
    crypto_cipher: none
    cluster_name: HA_Cluster
    token: 30000
    version: 2
    ip_version: ipv4
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 2018:abcd:abcd:1234:4321:5678:8765:12aa
        name: host-2018-abcd-abcd-1234-4321-5678-8765-12aa
        nodeid: 1
    }
    node {
        ring0_addr: 2018:abcd:abcd:1234:4321:5678:8765:12cc
        name: host-2018-abcd-abcd-1234-4321-5678-8765-12cc
        nodeid: 2
    }
}


 Yes → Step 2.
 No → Step 3.
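For example, from node 1, check the peer heartbeat address taken from the nodelist above (the addresses here are IPv6, hence ping6; substitute the actual values):

# ping6 -c 3 2018:abcd:abcd:1234:4321:5678:8765:12cc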
2. Make the heartbeat addresses connected, and check whether the split-brain alarm is
cleared.
 Yes → End.
 No → Step 3.
3. Check whether port 5405 is restricted by the firewall:
iptables -S | grep 5405
-A INPUT -p udp -m udp --dport 5405 -j ACCEPT
 Yes → Step 4.
 No → Step 5.
4. Run the iptables -A INPUT -p udp --dport 5405 -j ACCEPT command to open the port, and check whether the split-brain alarm is cleared.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

The split-brain fault of the HA node is removed. After the recovery, the HA cluster software restarts the nodes with fewer resources. You can configure the cluster so that these nodes are not restarted after split-brain recovery:
pcs property set pcmk_option=0x08



Chapter 4
Database Faults
Table of Contents
Database Is Read-Only or Fails to Execute Commands........................................................... 29
Database Login Failure.............................................................................................................. 30
Cannot Start the Database........................................................................................................ 31
Cannot Back Up the Database.................................................................................................. 32

4.1 Database Is Read-Only or Fails to Execute Commands


Symptom

The database is read-only or fails to execute commands.

Probable Cause

In most cases, the database enters split-brain status because the network between control nodes is disconnected, more than half of the control nodes are down, or the database service is stopped abnormally. In this state the database does not provide any service; you need to wait until more than half of the nodes recover.

Action

1. Verify that more than half of the control nodes are in Online status:

# crm_mon -1

Online: [ host-192-168-0-140 host-192-168-0-141 ]

Master/Slave Set: mariadb-master [mariadb]

Stopped: [ host-192-168-0-140 host-192-168-0-141 ]

2. Restart the database:

# pcs resource disable mariadb

//wait for 3 minutes and then execute

# pcs resource enable mariadb

3. After MySQL restarts, run the following command to check whether the database is started properly.

# crm_mon -1


Online: [ host-192-168-0-140 host-192-168-0-141 ]

Master/Slave Set: mariadb-master [mariadb]

Master score: [ host-192-168-0-140=100 host-192-168-0-141=100 ]

 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The database is properly started.

4.2 Database Login Failure


Symptom

Failed to log in to the database.

Probable Cause

 The database service is abnormal.


 The username and password for logging in to the database are incorrect.
 When you access the database through the floating IP, access fails because the floating IP is unreachable.

Action

1. Run the crm_mon -1 command to check whether the database service is normal.

Master/Slave Set: mariadb-master [mariadb]

Masters: [ host-192-168-0-140 host-192-168-0-141 ]

Master score: [ host-192-168-0-140=100 host-192-168-0-141=100 ]

 Yes → Step 3.
 No → Step 2.
2. Restart MySQL by using the following method, and check whether the database is started properly.

# pcs resource disable mariadb

//Wait for 3 minutes and then perform the following operation.

# pcs resource enable mariadb

3. Use the following method to log in to the MySQL and check whether you can log in by using
the username and password.

# mysql -uusername -ppassword


 Yes → Step 5.
 No → Step 4.
4. Set the correct username and password, and then check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Check whether you can access the database through the floating IP address.

# mysql -uusername -ppassword -hfloat_ip

 Yes → End.
 No → Step 6.
6. Confirm the network status, make sure you can ping the floating IP from the local end, and
check whether the fault is fixed.
 Yes → End.
 No → Step 7.
7. Contact ZTE technical support.

Expected Result

The database can be accessed properly.

4.3 Cannot Start the Database


Symptom

The database cannot be started.

Probable Cause

 The database space is full.


 Database files are damaged.

Action

1. Run the df -h command to check whether the database space is full.

# df -h

/dev/mapper/vg_local-lv_db 15G 1.4G 13G 10% /var/lib/mysql

 Yes → Step 2.
 No → Step 3.
2. Clear the database space, and check whether the database is started successfully.

# rm -f /var/lib/mysql/mariadb-bin.*

 Yes → End.


 No → Step 3.
3. Check whether the database logs contain abnormal printing that indicates damaged files.

# tail -100 /var/log/mariadb/mariadb.log |grep "mysqld got signal 6"

 Yes → Wait for the database to be automatically restored.


 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The database can be started properly.

4.4 Cannot Back Up the Database


Symptom

The database cannot be backed up. The provider raises an alarm about automatic database
backup failure.

Probable Cause

The automatic backup function is not configured properly.

Action

1. Check the installation of the automatic database backup function.

# dbmanager job-list

| ddb579c58d214bae93fd906d584027a1 | mysql_backup.py | cron:minute@7

| 3 | success | scheduled | mysql backup |

2. Check whether the database backup task (such as ddb579c58d214bae93fd906d584027a1) is in scheduled status.
 Yes → End.
 No → Step 3.
3. Configure automatic database backup:

# dbmanager job-enable ddb579c58d214bae93fd906d584027a1

4. Configure the automatic database backup correctly, and then check whether the fault is
fixed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.


Expected Result

The database can be backed up properly.



Chapter 5
System Environment Faults
Table of Contents
Keystone-Related Faults
Nova Service Faults
Neutron Service Failure
Rabbitmq-Related Faults
Automatic Restart Every Few Minutes in a New Physical Environment

5.1 Keystone-Related Faults

5.1.1 Keystone Prompts Too Many Database Connections


Symptom

An error occurs during login to the TECS, indicating an authentication error and prompting the user to try again later. The Keystone log shows that there are too many MySQL connections:
"Can not connect to MySQL server. Too many connections" (MySQL error 1040)

Probable Cause

The database configuration parameters do not meet the actual conditions of each component.

Action

1. Run the mysql command to enter the database. Run the show variables like '%conn%'; command to view the following variables:
 max_connections: maximum number of connections allowed by the MySQL server
 max_user_connections: maximum number of connections for each database user
Check whether these variables are 0. If a value is 0, the number of connections is not limited.


 Yes → Step 4.
 No → Step 2.
2. Check the status of the following parameters:

MariaDB [(none)]> show status like '%connect%';

+----------------------+-------+

| Variable_name | Value |

+----------------------+-------+

| Aborted_connects | 0 |

| Connections | 5 |

| Max_used_connections | 1 |

| Threads_connected | 1 |

+----------------------+-------+

 Threads_connected: current number of connections


 Connections: number of attempts to connect to the MySQL server (no matter whether
the connection is successful)
 Max_used_connections: maximum number of connections (maximum number of
concurrent connections) that have been used concurrently after the server is started
In normal cases:
 Max_used_connections <= max_connections + 1 (MySQL keeps an extra connection for
users with super rights to ensure that the administrator can connect to the database and
check the system at any time)
 Threads_connected <= max_connections
Check whether the above state values are normal.
 Yes → Step 4.
 No → Step 3.
3. Methods for modifying max_connections (see the sketch after this list):
a. Run the SET GLOBAL max_connections = <number of connections>; SQL statement.
b. Modify the value of max_connections in the /etc/my.cnf file.
c. Run the systemctl restart mariadb.service command to restart the MySQL service.
Run the openstack token issue command to check whether the fault is fixed.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.
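A sketch of step 3 with an illustrative value (2000 is an assumption; choose a value suited to the deployment):

MariaDB [(none)]> SET GLOBAL max_connections = 2000;

# /etc/my.cnf (persistent setting; restart mariadb afterwards)
[mysqld]
max_connections = 2000

The SET GLOBAL statement takes effect immediately but is lost on restart; the /etc/my.cnf entry persists across restarts.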

Expected Result

Keystone authentication is successful.


5.1.2 Keystone Authentication Failure


Symptom

Keystone authentication failure occurs when the following operations are performed.
1. The login webpage can be opened, but after you enter the username and password, an authentication error is displayed.
2. After the username and password in the keystonerc file (loaded with source keystonerc) are modified, the webpage is normal, but an authorization error is reported when a command is executed on the command line.

Probable Cause

The token configuration item in the keystone configuration file is incorrect.

Action

1. Modify the parameters in the /etc/keystone/keystone.conf file:

[token]

provider=fernet

2. Run the systemctl restart httpd.service command to restart Keystone.


3. If the fault persists, contact ZTE technical support.

Expected Result

Keystone authentication is successful.

5.1.3 Keystone Authorization Failure


Symptom

When you log in, an authentication error occurs and you are prompted to try again later.
When you use the openstack user list command in the keystone command line, the following information is displayed:

[root@controller11 ~(keystone_admin)]# source /home/tecs/keystonerc

Enter the username and password as prompted.

[root@controller11 ~(keystone_admin)]# openstack user list

Authorization Failed: An unexpected error prevented the server from fulfilling your request.

(HTTP 500)

View the /var/log/keystone/keystone.log of Keystone. The following information is displayed:

OperationalError: (OperationalError) (2003, "Can't connect to mysql server on

'192.168.7.201' (113)") None None


Probable Cause

 There is insufficient disk space for proper operation of the mariadb service.
 Keystone is not correctly installed.

Action

1. Check whether the mariadb service operates properly in the following way:

[root@controller11 ]# service mariadb status

Redirecting to /bin/systemctl status mariadb.service

mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled)
Active: failed (Result: exit-code) since Fri 2015-04-17 17:58:09 CST; 2 days ago
Apr 17 17:58:06 controller11 mariadb_safe[5126]: /usr/bin/mariadb_safe: line 138: echo: write error: No space left on device

 Yes → Step 3.
 No → Step 2.
2. Clear disk space, and then restart the mariadb service.

[root@controller11 ]# service mariadb restart
Redirecting to /bin/systemctl restart mariadb.service
Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.
[root@controller11 ]# journalctl -xn
-- Logs begin at Thu 2015-04-09 13:48:14 CST, end at Mon 2015-04-20 09:20:15 CST. --
Apr 20 09:20:15 controller11 nova-conductor[4660]: self.flush()
Apr 20 09:20:15 controller11 nova-conductor[4660]: File "/usr/lib64/python2.7/logging/__init__.py", line 835, in flush
Apr 20 09:20:15 controller11 nova-conductor[4660]: self.stream.flush()
Apr 20 09:20:15 controller11 nova-conductor[4660]: IOError: [Errno 28] No space left on device

Check whether the fault is resolved.


 Yes → End.
 No → Step 3.
3. Check whether the tables in the Keystone database are in good condition.

MariaDB [(none)]> use keystone

Reading table information for completion of table and column names

You can turn off this feature to get a quicker startup with -A

Database changed


MariaDB [keystone]>

MariaDB [keystone]> show tables; Empty set (0.00 sec)

 Yes → Step 5.
 No → Step 4.
4. There is no table in the keystone database. The database table may be deleted by mistake,
resulting in data loss. Contact technical support to check whether database tables are
backed up and whether they can be restored.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

Keystone authentication is successful.

5.1.4 Connections to the Server Cannot Be Established Due to Keystone Authorization Failure

Symptom

The keystone authorization fails, and it prompts that the server cannot be connected. The
following information is displayed:

[root@opencos_all ~(keystone_admin)]# source /root/keystonerc

[root@opencos_all ~(keystone_admin)]# keystone role-list

Authorization Failed: Unable to establish connection to http://10.43.175.2:5000/v2.0/tokens

Probable Cause

The Keystone service starts improperly.

Action

1. Check whether the Keystone service operates properly.

[root@opencos_all ]# systemctl status openstack-keystone
openstack-keystone.service - OpenStack Identity Service (code-named Keystone)
Loaded: loaded (/usr/lib/systemd/system/openstack-keystone.service; enabled)
Active: failed (Result: start-limit) since Mon 2015-04-20 14:12:35 CST; 29min ago

 Yes → Step 3.
 No → Step 2.
2. Set an executable permission for the /var/log/keystone/keystone.log file.

[root@opencos_all (keystone_admin)]# chmod 777 /var/log/keystone/keystone.log


3. Run the systemctl restart openstack-keystone command to restart the Keystone service.
4. If the fault persists, contact ZTE technical support.

Expected Result

Keystone authentication is successful.

5.1.5 Fails to Create a User


Symptom

The system fails to create a user and "create openstack user failed" is displayed on the screen.

Probable Causes

 The cloud environment access is improper.


 The user already exists in the cloud environment.

Action

1. Check whether the cloud environment is imported properly.


 Yes → Step 2.
 No → Step 3.
2. Re-import the cloud environment, and check whether the fault is removed.
 Yes → End.
 No → Step 3.
3. Check whether the user already exists in the cloud environment.
 Yes → Step 4.
 No → Step 5.
4. Recreate a new user with a different name. Check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

The user is created successfully.

5.1.6 Cloud Environment Access Is Improper


Symptom

No resource information exists in the cloud environment overview of cloud management, and when you search for endpoints in the log file in the /var/zte-log/zte-api/logs/all directory, the endpoints contain the "public-vip: port number" information.


Probable Causes

A floating IP address that cannot be identified by the TECS is set when the cloud environment
is deployed.

Action

Add an association between the floating and actual IP addresses in the /etc/hosts
configuration file of the TECS host.
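For illustration, a hypothetical /etc/hosts entry (the address is a placeholder; use the actual floating IP and the host name that the endpoints resolve to, such as the public-vip name seen in the logs):

<floating-IP-address>    public-vip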

Expected Result

The cloud environment access is proper.

5.1.7 Virtual Resource Page Prompts That the Current User Needs to Be Bound With a Project

Symptom

When you open the virtual resource page, a "Please Bind Project &User First!" message is
displayed on the screen.

Probable Causes

 The current login user is not bound with a project.
 The cloud environment has not been completely imported.

Action

1. In the project list, bind the user with the expected project, and check whether the fault is
removed.
 Yes → End.
 No → Step 2.
2. Check whether the cloud environment has been completely imported.
 Yes → Step 4.
 No → Step 3.
3. Wait until the cloud environment has been completely imported, and check whether the fault
is removed.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The virtual resource page is displayed properly.


5.2 Nova Service Faults

5.2.1 NOVA Fails to Be Connected to RabbitMQ


Symptom

NOVA fails to be connected to RabbitMQ because the heat process creates thousands of
queues.

Probable Cause

The heat process creates a message queue beginning with "heat" whenever it is restarted, and a queue without a client connection is not deleted automatically. A random UUID is appended to the queue name, so the name differs after each restart; when a new queue is generated, the old queues are not deleted. For example, if the heat process initiates 16 processes, 16 heat queues are added to the RabbitMQ server during each HA node switchover. After several switchovers, the number of RabbitMQ queues reaches the threshold, running out of memory and limiting the number of connections.

Action

1. Set automatic deletion by setting the message queue strategy. For example, if a queue
beginning with heat-engine-listener is not connected in more than one hour, it is
automatically deleted.

rabbitmqctl set_policy ha-all "." '{"ha-mode":"all",

"ha-sync-mode":"automatic"}' --apply-to all --priority 0

rabbitmqctl set_policy heat_rpc_expire "^heat-engine-listener\\." '{"expires":

3600000,"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to all --priority 1

Note
The commands can be executed during the installation of RabbitMQ. After RabbitMQ is installed, execute the statements to apply the message queue policy to the RabbitMQ server.
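To confirm that the policies from step 1 are in place, they can be listed with the standard rabbitmqctl command (the output format depends on the RabbitMQ version):

# rabbitmqctl list_policies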

2. If the fault persists, contact ZTE technical support.

Expected Result

NOVA is successfully connected to the RabbitMQ service.


5.3 Neutron Service Failure

5.3.1 Neutron Server Error


Symptom

When the neutron agent-list command is executed, the following information is displayed:

[root@slot2 network-scripts(admin)]# neutron agent-list

Connection to neutron failed: Maximum attempts reached

Action

1. Log in to the control node and check whether the status of the Neutron server is "active". If not, an error has occurred.

[root@slot2 network-scripts(admin)]# systemctl status neutron-server.service

neutron-server.service - OpenStack Neutron Server

Loaded: loaded (/usr/lib/systemd/system/neutron-server.service; enabled)

Active: active (running) since Tue 2015-08-25 13:15:10 CST; 2min 35s ago

Main PID: 21972 (neutron-server)

CGroup: /system.slice/neutron-server.service

2. If the Neutron server fails to be started, troubleshoot in the following way:


a. If the fault is caused by configuration file errors:

[root@slot2 network-scripts(admin)]# systemctl status neutron-server.service -l

Run the command to show error information about the configuration file, and modify the file as prompted.

Note
There must not be any blank space left at the beginning of a line in the configuration file.

b. If the fault is caused by database mismatch:


View the server.log file in the /var/log/neutron directory for database-related
error information. The fault is generally caused by unsuccessful update. Contact ZTE tec
hnical support for troubleshooting.
3. If the fault persists even though the server status is "active":
The rabbit_hosts = $rabbit_host:$rabbit_port setting is present in the /etc/neutron/neutron.conf file. This fault is caused by message service errors or configuration errors. Contact ZTE technical support for troubleshooting.
4. If the Neutron server is configured with several nodes, set one of them to be "active".


In normal cases, the result of the openstack-status command executed on both the control
node and the compute node shows that only one node is "active" and others are "disabled".
5. If the fault persists, contact ZTE technical support.

Expected Result

The Neutron server operates properly.

5.3.2 Neutron Agent Error


Symptom

The execution result of the neutron agent-list command shows that the alive status of neutron-openvswitch-agent on the board concerned is xxx instead of :-).

[root@tecs220 (keystone_admin)]# neutron agent-list

+--------------------------------------+----------------------+---------+-------+

| id | agent_type | host | alive |

+--------------------------------------+----------------------+---------+-------+

| 50b31ba5-04b6-4e80-818b-4eefca06b706 | PCI NIC Switch agent | tecs220 | :-) |

| 62ac7bfd-cfa6-4962-8e08-711472650d8d | DHCP agent | tecs220 | :-) |

| 77356695-e56f-428b-aada-6ab97e137006 | Metadata agent | tecs220 | :-) |

| 96d95407-2ed3-4b24-8e31-29070339fe9f | Open vSwitch agent | tecs220 | xxx |

| 9b212b92-9b8f-439e-a5d8-d1b066ddfe87 | L3 agent | tecs220 | :-) |

+--------------------------------------+----------------------+---------+-------+

Action

1. Log in to the node where the faulty service resides and check whether the service status is "active". If not, the service does not operate properly. For example,

[root@slot2 (keystone_admin)]# systemctl status neutron-openvswitch-agent.service

neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent

Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; enabled)

Active: active (running) since Tue 2015-08-25 10:49:38 CST; 2h 7min ago

Main PID: 31660 (neutron-openvsw)

 Yes → End.
 No → Step 2.
2. Run the date command to check whether the time of the control node is synchronous with
that of the compute node.
 Yes → Step 5.
 No → Step 3.


3. On the control node and compute node, run the systemctl status chronyd command to
check whether the service is Active: Active (running).
 Yes → Step 5.
 No → Step 4.
4. Run the systemctl restart chronyd command to restart the service, and check whether the service status is running.
 Yes → Step 5.
 No → Step 7.
5. Check whether the time is synchronized.
 Yes → Step 6.
 No → Step 7.
6. Run the following command while no service traffic is running, and check whether the alarm is cleared.
systemctl restart neutron-openvswitch-agent.service
 Yes → End.
 No → Step 7.
7. Contact ZTE technical support.

Expected Result

The Neutron agent operates properly.

5.3.3 Network Service Startup Failure


Symptom

The network service fails to be started. The execution result of the systemctl status network
command shows that the status of the service is "fail".

Probable Cause

The configuration files for the network adapters in the /etc/sysconfig/network-scripts directory are incorrect.

Action

1. Check whether the network adapters in the configuration files are consistent with those
shown in the execution result of the ifconfig command and perform the following operations
as required:

 If too many network adapters are configured in the configuration files, delete the invalid configuration files.
 If the number of network adapters shown in the execution result of the ifconfig command is more than that in the configuration files, check information about the network adapters with the ip link | grep <network adapter name> command and contact ZTE technical support.
 If the number of network adapters shown in the execution result of the ifconfig command is equal to that in the configuration files, check whether there is DHCP configuration in the configuration file. If yes, change the DHCP attribute to "static" (see the sketch after this list).
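A minimal sketch of the DHCP-to-static change in an ifcfg file (the file name and device are illustrative):

# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static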

2. If the fault persists, contact ZTE technical support.

Expected Result

The network service can be successfully started after the systemctl restart network command
is executed.

5.4 Rabbitmq-Related Faults

5.4.1 Failed to Start rabbitmq-server


Symptom

The rabbitmq-server cannot be started. The storage space of the /var/lib/ directory may be full: large images uploaded to the /var/lib/ directory fill up the space, so the rabbitmq files cannot be written.

Action

1. Run the systemctl start rabbitmq-server command to start the rabbitmq-server service.
2. Run the journalctl -xe command to view the related errors.
3. Check the /var/log/rabbitmq/rabbit@<host name>.log file, and check the end of the file for the following information.

Unable to recover vhost <<"/">> data. Reason {badmatch,{error,{{{badmatch,{error,

{not_a_dets_file,"/var/lib/rabbitmq/mnesia/rabbit@rabbitmq/msg_stores/vhosts/

628WB79CIFDYO9LJI6DKMI09L/recovery.dets"}}}

Yes → Step 4.
No → Step 6.
4. Delete the /var/lib/rabbitmq/mnesia/rabbit@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9
LJI6DKMI09L/recovery.dets file.
5. Run the systemctl restart rabbitmq-server command to restart the rabbitmq-server service.
6. Contact ZTE technical support.


Expected Result

The rabbitmq-server can be started normally.

5.4.2 Message Server Connection Failure


Symptom

The message server fails to be connected although the RabbitMQ service is already started.

Probable Cause

 Domain name resolution or the firewall is abnormal.


 The rabbitmq configuration file is missing.
 The network is abnormal.
 Some nodes in the cluster are down.

Action

Open the /var/log/rabbitmq/rabbit@<host name>.log file (for example, with vim). If there is a large amount of {handshake_timeout, handshake} printing, proceed as follows:
1. Find a connection that reports this error and SSH to the corresponding node.
2. Run the ping <domain name> command or the ping6 <domain name> command.
 For the value of <domain name>, you can obtain the value of transport_url from /etc/nova/nova.conf.
 Execute python /usr/lib/python2.7/site-packages/oslo_config/aes.py decrypt <transport_url> to see the domain name used by the nova service.
Run the ping command to check the packet loss.
 If packet loss occurs after command execution, the network is faulty.
 If no packet is lost after command execution, run the curl <domain name>:5672 command and the curl <ip>:5672 command (see the sketch after this list). If the curl <ip>:5672 command returns a result faster than the curl <domain name>:5672 command, the domain name resolution is faulty. This is especially likely in a dual-controller environment, because the domain name resolves to the two control nodes (you can view the /etc/resolv.conf file); if the first control node configured for domain name resolution is faulty, this error occurs.
 If the ping command does not return a result, the domain name cannot be resolved. This error may be caused by the firewall. Run the systemctl stop iptables command on all control nodes to check whether the error can be fixed. If not, contact ZTE technical support. At this time, service logs print errors related to domain name resolution.


Some nodes may be down and fail to be connected. You can run the rabbitmqctl
cluster_status command to view the cluster status.

Cluster status of node rabbit@gltest-ctrl ...

[{nodes,[{disc,['rabbit@gltest-ctrl']}]},

{running_nodes,['rabbit@gltest-ctrl']},

{cluster_name,<<"rabbit@gltest-ctrl">>},

{partitions,[]},

{alarms,[{'rabbit@gltest-ctrl',[]}]}]

In normal cases, the "running_nodes" column contains all nodes. If some nodes are missing, they are not running; you can run the systemctl restart rabbitmq-server command on them to start the service.
3. If the fault persists, contact ZTE technical support.
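A sketch of the timing comparison from step 2 (the domain name and address are placeholders):

# time curl <domain name>:5672
# time curl <ip address>:5672

If the second command completes noticeably faster, domain name resolution is the bottleneck.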

Expected Result

The message server can be successfully connected.

5.4.3 General Rabbitmq-Related Fault Location


Symptom

 Users have no consumption records.


 The service log contains the printed information about disconnection.

Probable Cause

 This may be caused by a message backlog in the service queue. Run the rabbitmqctl list_queues --local | grep <service queue name> command. The second column after the service queue name is the quantity of consumers. If it is 0, no user is consuming.
 The service processing is slow. Check the system load (top) and the memory and CPU
usage.

Action

1. Check whether there is a backlog of messages. Run the rabbitmqctl list_queues --local|awk
'$2>50' command on all control nodes. If a value is returned, this indicates that there is a
backlog of messages. If the backlog persists for five minutes, contact ZTE technical support.
2. Check the service log. If the information such as "has taken %ss to process msg with"
is output before link disconnection, this indicates that the service processing is too slow.
Check whether the corresponding service is operating properly. If the fault persists, contact
ZTE technical support.


Expected Result

The message server can be connected properly, and there is no backlog of messages.

5.4.4 Nova Cannot be Connected to Rabbitmq


Symptom

The heat process has created thousands of queues, so nova cannot be connected to rabbitmq.

Probable Cause

Each time the heat process restarts, it creates a message queue starting with heat, and a queue with no client connection is not deleted automatically. The queue name generated at each restart is followed by a random UUID, so the names differ; when the heat process is restarted, a new queue is generated and the previous queues are not deleted. For example, if the heat process has created 16 queues, 16 queues are added to the rabbitmq server each time the HA node is switched over. After multiple switchovers, the rabbitmq queues quickly reach the limit, exhausting the memory and restricting connections.

Action

1. Configure a message queue policy to enable the automatic deletion function. For example, a message queue starting with heat-engine-listener that has had no connection for more than one hour is deleted automatically. The method is as follows:

rabbitmqctl set_policy ha-all "." '{"ha-mode":"all", "ha-sync-mode":"automatic"}' --apply-to all --priority 0

rabbitmqctl set_policy heat_rpc_expire "^heat-engine-listener\\." '{"expires": 3600000,"ha-mode":"all","ha-sync-mode":"automatic"}' --apply-to all --priority 1

Note
The commands can be executed during the rabbitmq installation. After the rabbitmq software is installed and the rabbitmq service is started, execute these statements to configure the message queue policy on the rabbitmq server.

2. If the fault persists, contact ZTE technical support.

Expected Result

The nova is successfully connected to the rabbitmq service.


5.5 Automatic Restart Every Few Minutes in a New Physical Environment

Symptom

After manual restart, the physical machine is automatically restarted every few minutes, and there is systemd-logind information in the /var/log/messages log file.

Probable Cause

The physical blade is not inserted tightly or the ejectors are not in place.

Action

1. Check whether the physical blade is inserted tightly and the ejectors are in place.
 Yes → Step 3.
 No → Step 2.
2. Insert the physical blade tightly and put the ejectors in place. Check whether the fault is
resolved.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.

Expected Result

The physical machine operates properly.



Chapter 6
Faults Related to Virtual Resources
Table of Contents
Cannot Create a Cloud Drive
Cloud Drive Deletion Failure
VM Cannot Mount a Cloud Drive
Cannot Unmount a Cloud Drive
Cannot Upload a Mirror
Security Group Faults

6.1 Cannot Create a Cloud Drive


6.1.1 Cannot Create a Cloud Drive With a Mirror
Symptom

When you run the cinder list command to check whether a cloud drive with a mirror is
successfully created, the status of the cloud drive is "downloading" and then becomes "error".
When you check /var/log/cinder/volume.log on the control node, the log shows that
the cloud drive cannot be created, see Figure 6-1.

Figure 6-1 volume.log

Probable Cause

The mirror description contains Chinese characters.


Action

1. Check whether the mirror property description of the cloud drive contains Chinese
characters.
On the control node, run the glance image-show test command ("test" is the mirror name).
Figure 6-2 shows an example of the check result.

Figure 6-2 Checking Mirror Property Information (Chinese Characters Are Found)

In the result, the value of Property "description" contains Chinese characters, which
cannot be resolved.
 Yes → Step 2.
 No → Step 3.
2. Run the following command to modify the mirror property and re-create the cloud drive with
a mirror:
glance image-update --property description="test" test
For a description of the parameters, refer to the following table.

Parameter             Meaning
property              Indicates that the mirror property is to be modified.
description="test"    Indicates that the property description is modified to "test".
test                  Mirror name or ID.

3. Check whether the mirror property description of the cloud drive contains Chinese
characters.
On the control node, run the glance image-show test command (test is the mirror name).
Figure 6-3 shows an example of the check result.


Figure 6-3 Checking Mirror Property Information (No Chinese Characters Found)

 Yes → Step 4.
 No→ End.
4. Contact ZTE technical support.

Expected Result

When you run the cinder list command to check whether the cloud drive with a mirror is
successfully created, the status of the cloud drive is "available".

6.1.2 Cannot Create a Cloud Drive (Based on a Fujitsu Disk Array)


Symptom

When you run the cinder list command to check whether a cloud drive is successfully created,
the status of the cloud drive is "creating" and then becomes "error".
When you check /var/log/cinder/volume.log on the control node, the following
information is displayed:

Return code:4097,Error:Size Not Supported

Probable Cause

Continuous storage space on the disk array is insufficient.

Action

1. On the control node, run the following command to obtain the address of the disk array
management interface:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Figure 6-4 shows an example of the output of this command.


Figure 6-4 Checking the Address of the Disk Array Management Interface

In Figure 6-4, EternusIP is 10.43.230.23, that is, the address of the disk array
management interface.
2. Enter the address of the disk array management interface in the address bar of an IE
browser to log in to the disk array management page.
3. Select RAID GROUP. The RAID GROUP page is displayed.
4. Enter the corresponding RAID group. Figure 6-4 shows an example that the RAID group is
CG_raid_04.
5. Click the Volume Layout tab, see Figure 6-5.


Figure 6-5 Volume Layout Tab

6. Check whether the disk array has sufficient continuous space. If the size of the cloud drive is
greater than the maximum value of Free, the cloud drive cannot be created.
 Yes → Step 9.
 No → Step 7.
7. Check whether a cloud drive whose size is smaller than or equal to Free can satisfy the
requirements of the user.
 Yes → Step 9.
 No → Step 8.
8. Create a cloud drive whose size is smaller than or equal to Free, and check whether the
cloud drive is successfully created.
 Yes → End.
 No → Step 9.
9. Contact ZTE technical support.

Expected Result

When you run the cinder list command to check whether the cloud drive is successfully
created, the status of the cloud drive is "available".


6.1.3 Cannot Create a Cloud Drive With a Mirror (Based on an IPSAN Disk Array)

Symptom

When you run the cinder list command to check whether a cloud drive with a mirror is
successfully created, the status of the cloud drive is "downloading" and then becomes "error".
When you check /var/log/cinder/volume.log on the control node, the following
information is displayed:

_is_valid_iscsi_ip, iscsi ip:(162.161.1.208) is invalid. _is_valid_iscsi_ip, iscsi ip:

(162.162.1.208) is invalid. _is_valid_iscsi_ip, iscsi ip:(162.161.1.209) is invalid.

_is_valid_iscsi_ip, iscsi ip:(162.162.1.209) is invalid Not connect iSCSI device of volume

Probable Cause

The link between the service interface of the disk array and the control node is abnormal.

Action

1. On the control node, ping the address of the service interface of the disk array and check
whether it can be pinged successfully. The service interface (for example, 162.161.1.208) is
stored in /var/log/cinder/volume.log.
 Yes → Step 4.
 No → Step 2.
2. Check whether the control node and the service interface of the disk array are properly
connected.
3. Check whether a cloud drive with a mirror can be successfully created.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

When you run the cinder list command to check whether the cloud drive with a mirror is
successfully created, the status of the cloud drive is "available".

6.1.4 The Volume With Images Fails to be Created Due to "Failed to Copy Image to Volume"

Symptom

The volume with images fails to be created and the status of the volume is "error". In this case,
the following information is displayed:


2015-04-07 15:15:31.155 3599 ERROR cinder.volume.flows.manager.create_volume

[req-b5e20b82-7755-4712-ad45-9141d1512945 d75bb947cc7e450b83d4b006fc7656ef

ae74753aa2034c4aba49400a514d110e - - -] Failed to copy image to volume:

77179410-36ca-4d1f-8d21-fb092009d67d, error: Image 43a17a9f-95d7-4152-ba5f-6476f917a534

is unacceptable: Size is 48GB and doesn't fit in a volume of size 30GB.

2015-04-07 15:15:31.203 3599 ERROR cinder.volume.flows.manager.create_volume

[req-b5e20b82-7755-4712-ad45-9141d1512945 d75bb947cc7e450b83d4b006fc7656ef

ae74753aa2034c4aba49400a514d110e - - -]

Volume 77179410-36ca-4d1f-8d21-fb092009d67d: create failed

Run the following command to view the image information in the /var/lib/glance/images directory:

[root@sbcr13 images(keystone_admin)]# qemu-img info 43a17a9f-95d7-4152-ba5f-6476f917a534

image: 43a17a9f-95d7-4152-ba5f-6476f917a534

file format: qcow2

virtual size: 40G (42949672960 bytes)

disk size: 2.4G

cluster_size: 65536

Format specific information:

compat: 1.1

lazy refcounts: false

Probable Cause

The virtual size of the image is 40 G, while the size of the volume is only 30 G, less than the
virtual size of the image.

Action

Recreate the volume with a size of at least 48 GB (1.2 times the virtual size of the image), as sketched below.
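A sketch using the cinder CLI of this era (the volume name is illustrative, the image ID is taken from the log above, and flag names can differ between cinder versions):

# cinder create --image-id 43a17a9f-95d7-4152-ba5f-6476f917a534 --display-name test-vol 48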

Expected Result

The volume with images is successfully created.

6.1.5 The Volumes With Images Fail to Be Created in Batches


Symptom

The volumes with images fail to be created in batches and the following information is displayed
in the volume.log file of the cinder:

ImageCopyFailure: Failed to copy image to volume:


mke2fs 1.42.9 (28-Dec-2013)\n\nWarning, had trouble writing out superblocks.\n]

Probable Cause

If a single volume can be created successfully while several volumes fail to be created in batches, the fault is caused by residual device paths.

Action

If multipath is used, run the multipath -f <device name> command to remove the residual device, as sketched below.
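A sketch of the cleanup with the standard multipath commands (the device name is a placeholder):

# multipath -ll
# multipath -f <device name>

The first command lists multipath devices so that the residual one can be identified; the second flushes its device map.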

Expected Result

The volumes with images are successfully created in batches.

6.1.6 The Volume With Images Fails to Be Created on a Fujitsu Disk Array
Symptom

The volume with images fails to be created on a Fujitsu disk array and the log information is as
follows:
error: qemu-img: /dev/disk/by-path/ip-192.168.113.11:3260-iscsi-iqn.2000-09.com.fujitsu:

storage-system.eternus-dxl:000000:port010-lun-8: error while converting raw: Device is

too small

Probable Cause

The size of the volume on the Fujitsu disk array is less than the virtual size of the image.

Action

1. Check whether the size of the volume is less than the virtual size of the image in the
following way:

[root@2c514-1-13-SBCJ images]# qemu-img info 5be510e2-09a8-47ff-8d56-8ecfdf7465d0

image: 5be510e2-09a8-47ff-8d56-8ecfdf7465d0

file format: qcow2

virtual size: 200G (214748364800 bytes)

disk size: 7.1G

cluster_size: 65536

Format specific information:

compat: 1.1

lazy refcounts: false

 Yes → Step 2.
 No → Step 3.


2. Modify the size of the volume to a value larger than or equal to the virtual size of the image.
Check whether the fault is resolved.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.

Expected Result

The volume with images can be successfully created on the Fujitsu disk array.

6.1.7 The Volume Fails to Be Created and the Status of the Volume Is "error, volume service is down or disabled"

Symptom

The volume fails to be created and the status of the volume is "error, volume service is down or
disabled". In this case, the following information is displayed in the cinder-scheduler log:

File"/usr/lib/python2.7/site-packages/cinder/volume/drivers/zte/zte_ks.py", line 225, in

_create_volume\n raise exception.CinderException(err_msg)\n', u'CinderException:

_create_volume:Failed to Create volume. volume name:OpenCos_7354260518372572139. ret code:

17109062\n']

2015-04-21 06:23:22.084 10981 ERROR cinder.scheduler.flows.create_volume

[req-114bc608-a921-4fb7-af26-1880535f2c40 bd9ffbdac44d48f780343de74ebd5913

348daede79974e7488006415a2a12a6f - - -] Failed to schedule_create_volume:

No valid host was found. Exceeded max scheduling attempts 3 for volume None

2015-04-21 06:24:10.105 10981 WARNING cinder.scheduler.host_manager

[req-f969943a-51b2-4a40-9aad-23910767a158 bd9ffbdac44d48f780343de74ebd5913

348daede79974e7488006415a2a12a6f - - -] volume service is down or disabled. (host: sbcr13)

Probable Cause

Run the cinder service-list command to check the status of the cinder-volume services in the back end. The cinder-volume service corresponding to the old host (sbcr13) is down, while the service for the new host (cinder) configured in the cinder.conf file is enabled. However, sbcr13 is still selected during volume-creation scheduling, resulting in the failure of volume creation.

Action

Modify the status of the cinder-volume service corresponding to sbcr13 to "disabled".

cinder service-disable sbcr13 cinder-volume


Expected Result

The volume is successfully created.

6.1.8 The Volume With Images Fails to be Created Due to "_is_valid_iscsi_ip, iscsi ip:() is invalid"

Symptom

The volume with images fails to be created due to "_is_valid_iscsi_ip, iscsi ip:() is invalid". In this case, the log information is as follows:

2015-05-06 09:25:30.531 476 INFO cinder.volume.drivers.fujitsu_eternus_dx_iscsi

[req-f442e802-113a-45b4-b7c0-ca54db02c588 19ae1a4a1efc47bab9e5a40efd278af6 c6ce862a2c5c

45369eb300c2dac6b4aa - - -] initialize_connection,Exit method

2015-05-06 09:25:30.587 476 WARNING cinder.brick.initiator.connector [req-f442e802-113a

-45b4-b7c0-ca54db02c588 19ae1a4a1efc47bab9e5a40efd278af6 c6ce862a2c5c45369eb300c2dac6b4aa

- - -] _is_valid_iscsi_ip, iscsi ip:(162.161.1.208) is invalid.

2015-05-06 09:25:30.638 476 WARNING cinder.brick.initiator.connector [req-f442e802-113a-

45b4-b7c0-ca54db02c588 19ae1a4a1efc47bab9e5a40efd278af6 c6ce862a2c5c45369eb300c2dac6b4aa

- - -] _is_valid_iscsi_ip, iscsi ip:(162.162.1.208) is invalid.

Probable Cause

The log shows that an error occurs when the service ports of the disk array are pinged. However, when you ping the addresses manually, the execution result shows that all the service ports are in good condition. This indicates that the execution permission of the ping command is incorrect.

Action

1. Add an execution permission for the ping command in the /usr/share/cinder/rootwrap/volume.filters configuration file.

# cinder/volume/driver.py: 'dd', 'if=%s' % srcstr, 'of=%s' % deststr,...

dd: CommandFilter, dd, root

ping: CommandFilter, ping, root

2. Recreate the volume. If the fault persists, contact ZTE technical support.

Expected Result

The volume with images is successfully created.


6.2 Cloud Drive Deletion Failure

6.2.1 Cannot Delete a Cloud Drive, the Status of the Cloud Drive is "Error-Deleting"

Symptom

When you attempt to delete a cloud drive, the status of the cloud drive is "error-deleting", and
the cloud drive cannot be deleted.

Probable Cause

The message sent by cinder-api to cinder-volume is lost, and no response is returned.

Action

1. On the control node, run the following command to check whether the volume service status
of cinder is active:
systemctl status openstack-cinder-volume.service
Figure 6-6 shows an example of the output of this command. If the Active field is "active", it
indicates that the service is successfully started. Otherwise, it indicates that the service is
not successfully started.

Figure 6-6 Checking the Cinder Volume Service Status

 Yes → Step 4.
 No → Step 2.
2. Run the following command to reset the status of the cloud drive:
cinder reset-state test_reset, where test_reset is the cloud drive name.
3. Check whether the cloud drive can be successfully deleted.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The cloud drive is successfully deleted.


6.2.2 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "Failed to signin.with ret code:1466"

Symptom

A volume fails to be deleted from a ZTE disk array due to "Failed to signin.with ret code:1466".

2015-04-20 09:53:25.288 19975 INFO cinder.volume.manager [-] Updating volume status

2015-04-20 09:53:25.288 19975 WARNING cinder.volume.manager [-] Unable to update stats,

ZteISCSIDriver -N/A (config name IPSAN) driver is uninitialized.

2015-04-20 09:53:25.289 19975 DEBUG cinder.openstack.common.periodic_task [-] Running periodic

task VolumeManager._report_capcity_alarm run_periodic_tasks /usr/lib/python2.7/site-packages/

cinder/openstack/common/periodic_task.py:178

2015-04-20 09:53:25.289 19975 DEBUG cinder.volume.drivers.zte.zte_ks [-] Updating volume

status _update_volume_status /usr/lib/python2.7/site-packages/cinder/volume/drivers/zte/

zte_ks.py:1225

2015-04-20 09:53:25.295 19975 ERROR cinder.volume.drivers.zte.zte_ks [-] _get_sessionid:

<Fault -506: "Method '__unicode__' not defined">

2015-04-20 09:53:25.295 19975 DEBUG cinder.volume.drivers.zte.zte_ks [-] Change zte server is

http://129.0.63.28:8080/RPC2 _change_xmlrpc_server /usr/lib/python2.7/site-packages/cinder/

volume/drivers/zte/zte_ks.py:99

2015-04-20 09:53:28.312 19975 ERROR cinder.volume.drivers.zte.zte_ks [-] zte iscsi init:

Failed to signin.with ret code: 1466

2015-04-20 09:53:28.342 19975 ERROR cinder.volume.manager [-] Get storage info failed.

2015-04-20 09:53:28.342 19975 TRACE cinder.volume.manager Traceback (most recent call last):

2015-04-20 09:53:28.342 19975 TRACE cinder.volume.manager File "/usr/lib/python2.7/

site-packages/cinder/volume/manager.py", line 342, in _report_capcity_alarm

Check the user name and password in the /etc/cinder/cinder_zte_conf.xml file in the
following way:

[root@control2 cinder]# cat cinder_zte_conf.xml

<?xml version='1.0' encoding='UTF-8'?>

<config>

<Storage>

<ControllerIP0>10.43.16.21</ControllerIP0>

<ControllerIP1 />

<LocalIP>10.43.179.42</LocalIP>

<UserName>!$$$YWRtaW4=</UserName> //user name

<UserPassword>!$$$YWRtaW4=</UserPassword> //user password

</Storage>


<LUN>

<ChunkSize>4</ChunkSize>

<AheadReadSize>8</AheadReadSize>

<CachePolicy>1</CachePolicy>

<StorageVd>nas_vd</StorageVd>

<StorageVd>san_vd</StorageVd>

<SnapshotPercent>50</SnapshotPercent>

</LUN>

<iSCSI>

</iSCSI>

<VOLUME>

<Volume_Allocation_Ratio>20</Volume_Allocation_Ratio>

</VOLUME>

</config>

Probable Cause

The user name and password are incorrectly configured in the /etc/cinder/cinder_zte_conf.xml file.

Action

1. Modify the <UserName> and <UserPassword> fields in the configuration file, for example:

<UserName>admin</UserName>

<UserPassword>admin</UserPassword>

2. Change the file execution permission.

[root@control2 cinder]# chmod +x cinder_zte_conf.xml

[root@control2 cinder]# ll

total 812

-rwxr-x--x 1 cinder cinder 664 Apr 2 11:09 cinder_zte_conf.xml

3. Restart the cinder-volume service.

systemctl restart openstack-cinder-volume.service

4. If the fault persists, contact ZTE technical support.

Expected Result

The volume is successfully deleted from the ZTE disk array.


6.2.3 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "error-deleting"

Symptom

A volume fails to be deleted from a ZTE disk array due to "error-deleting". In the volume.log file, error code 16917030 is returned after you run the cinder command.

Probable Cause

The volume does not properly exit the mapping group after it is added to the group. This is
because the stored resources are not properly released when you delete the residual VMs by
manually modifying database tables.

Action

1. Log in to the disk array management interface and find the volume concerned.
2. Manually remove the mapping group from the volume.
3. Run the cinder reset-state --state error volume_id command to reset the status of the
volume.
4. Delete the volume.
5. If the fault persists, contact ZTE technical support.

Expected Result

The volume is successfully deleted from the ZTE disk array.

6.2.4 No Response and Log Are Returned After a Volume Is Deleted


Symptom

The cinder service operates properly, but no response or log is returned after a volume is mounted, deleted, or dismounted.

Probable Cause

 The fault occurs if the host name or the host field in the cinder.conf file is modified
because the volume belongs to the original host.
 If the volume service name changes because a new host is created under the volume service or the host field in the cinder.conf file is modified, operations on the volume fail: a message sent to the original service cannot be processed.

Action

1. Run the cinder show volume_id command to find the os-vol-host-attr:host field, for
example, opencos263ed0ae9a0440eca446d6155b56b946@IPSAN.


2. Run the cinder service-list command to check whether the volume service corresponding
to the host goes down.
If yes and the new service is named after a new host name, the fault is caused by the
modification of the host name.
3. Modify host=** to the original value (host=opencos263ed0ae9a0440eca446d6155b56b946) in the /etc/cinder/cinder.conf file.
4. Run the systemctl restart openstack-cinder-volume command to restart the volume
service.
5. Delete or dismount the volume.
6. Repeat Step 3 to change the host value back to the new host name (tecs1 in this example); otherwise, the volume corresponding to the host named tecs1 cannot be operated.
7. If the fault persists, contact ZTE technical support.
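For reference, the host attribute in Step 1 can be checked as follows; the output shows the example host value used above:

cinder show volume_id | grep os-vol-host-attr:host
| os-vol-host-attr:host | opencos263ed0ae9a0440eca446d6155b56b946@IPSAN |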

Expected Result

The volume can be successfully deleted.

6.3 VM Cannot Mount a Cloud Drive


6.3.1 Cannot Mount a Cloud Drive When a Fujitsu Disk Array Is Used
Symptom

When you run the cinder list command to check whether a cloud drive is successfully mounted
to the VM, the status of the cloud drive changes from "attaching" to "available".
When you check /var/log/cinder/volume.log on the control node, error information is displayed; see Figure 6-7.

Figure 6-7 Cannot Mount a Cloud Drive When a Fujitsu Disk Array Is Used

Probable Cause

The affinity group of the service interface on the disk array is set to off.

Action

1. On the control node, run the following command to obtain the address of the disk array
management interface:


cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Figure 6-8 shows an example of the output of this command. EternusIP is 10.43.230.23,
that is, the address of the disk array management interface.

Figure 6-8 Obtaining the Address of the Disk Array Management Interface

2. Telnet to the disk array management interface by using the username root and password root (use the actual password for your environment).
3. Run the following command to set the affinity group of the disk array to enable:
set iscsi-parameters -port all -host-affinity enable
4. Check whether the cloud drive can be successfully mounted.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

When you run the cinder list command to check whether the cloud drive is successfully
mounted, the status of the cloud drive is "in-use".

6.3.2 Cannot Mount a Cloud Drive When IPSAN Back-End Storage Is Used
Symptom

When you run the cinder list command to check whether a cloud drive is successfully mounted
to the VM, the status of the cloud drive changes from "attaching" to "available".
When you check /var/log/nova/nova-compute.log on the host where the VM is located,
the following information is displayed:

NovaException: Not connect iSCSI device of volume

Probable Cause

The link between the service interface of the disk array and the computing node is abnormal.


Action

1. Telnet to the control node, and run the following command to obtain the service interface
addresses of the disk array:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml

Note
The management interface address configuration files of different types of disk arrays have different
file names, file contents, and management page styles. Refer to the actual conditions.

Figure 6-9 shows an example of the output of this command. EternusISCSIIP indicates a
service interface address of the disk array.

Figure 6-9 Obtaining the Address of the Service Interface of the Disk Array

2. On the computing node, ping a service interface address of the disk array and check
whether it can be pinged successfully.
 Yes → Step 5.
 No → Step 3.
3. Restore the link between the computing node and the service interface of the disk array, for
example, by replacing faulty cables, so that their addresses can be pinged successfully from
each other.
4. Check whether the cloud drive can be successfully mounted.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.
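For reference, the connectivity check in Step 2 can be performed on the computing node as follows (replace <EternusISCSIIP> with the actual service interface address obtained in Step 1):

ping -c 4 <EternusISCSIIP>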


Expected Result

When you run the cinder list command to check whether the cloud drive is successfully
mounted, the status of the cloud drive is "in-use".

6.4 Cannot Unmount a Cloud Drive


Symptom

When you run the cinder list command to check whether a cloud drive is successfully
unmounted from the VM, the status of the cloud drive is always "detaching".
When you check /var/log/cinder/volume.log on the control node, error information is displayed; see Figure 6-10.

Figure 6-10 Cannot Unmount a Cloud Drive

Probable Cause

An admin user logs in to the disk array management page and does not log out properly.

Action

1. On the control node, run the following command to obtain the address of the disk array
management interface:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Figure 6-11 shows an example of the output of this command. EternusIP is
10.43.230.23, that is, the address of the disk array management interface.


Figure 6-11 Obtaining the Address of the Disk Array Management Interface

2. Enter the address of the disk array management interface in the address bar of an IE
browser and check whether you can log in to the disk array management page as the
admin user.
 Yes → Step 4.
 No → Step 3.
3. Troubleshoot connection problems. After the problems are solved, you can log in to the disk
array management page as the admin user.
4. On the disk array management page, click logout.
5. Check whether the cloud drive can be successfully unmounted.
 Yes → End.
 No → Step 6.
6. Contact ZTE technical support.

Expected Result

When you run the cinder list command to check whether the cloud drive is successfully
unmounted from the VM, the status of the cloud drive is "available".

6.5 Cannot Upload a Mirror

6.5.1 Mirror Server Space Insufficient


Symptom

After a mirror is uploaded, its status changes from "saving" to "killed".


When you check /var/log/glance/api.log on the control node, the following information
is displayed:

There is no enough disk space left on the image


Probable Cause

The storage space of the mirror server is insufficient.

Action

1. Run the following command to check whether the storage space of the mirror server (Avail
parameter in Figure 6-12) can meet the requirement for uploading the mirror file. Figure 6-12
shows an example of the output of this command.
df -h /var/lib/glance/images

Figure 6-12 Check the Storage Space of the Mirror Server

 Yes → Step 4.
 No → Step 2.
2. Delete unwanted files from the /var/lib/glance/images directory.
3. Check whether the mirror can be successfully uploaded.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.
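For reference, Steps 1 and 2 can be combined as follows; the du command lists the image files by size so that unwanted large files can be identified before deletion:

df -h /var/lib/glance/images
du -sh /var/lib/glance/images/*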

Expected Result

The mirror is successfully uploaded, and its status becomes "active".

6.5.2 Insufficient Permissions on the Mirror Storage Directory


Symptom

After a mirror is uploaded, its status changes from "saving" to "killed".


When you check /var/log/glance/api.log on the control node, the following information
is displayed:

Insufficient permissions on image storage media

Probable Cause

The authorized user of the /var/lib/glance/images directory is not glance, and thus you
cannot store the mirror as the glance user.


Action

1. On the control node, run the following command to check whether the user of the mirror
storage directory is glance:
ll /var/lib/glance/
Figure 6-13 shows an example of the output of this command. In this example, the user is
root, not glance.

Figure 6-13 Checking the User of the Mirror Storage Directory

2. Perform the following steps to modify the user of /var/lib/glance/images.


a. Run the following command to modify the directory user to glance:
chown -hR glance /var/lib/glance/images
b. Run the following command to modify the owner user group to glance:
chgrp -hR glance /var/lib/glance/images
3. Check whether the mirror can be successfully uploaded.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.
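For reference, the two commands in Step 2 can also be combined into a single command that sets both the owner and the group:

chown -R glance:glance /var/lib/glance/images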

Expected Result

The mirror is successfully uploaded, and its status becomes "active".

6.6 Security Group Faults


6.6.1 Network Congestion Caused by Security Groups
Symptom

The network is congested due to the security groups.

Probable Cause

If the IP address and the MAC address in a VM are inconsistent with the external ones, security
groups may cause VM communication failures. Therefore, in some cases, security groups need
to be disabled.

Action

1. Modify the following contents of the control node:


a. Modify the /etc/neutron/plugin.ini file.


openstack-config --set /etc/neutron/plugin.ini securitygroup enable_security_group False
b. If port_security exists in extension_drivers in the /etc/neutron/plugin.ini file,
delete it.
c. Restart the service when there are no other services.
openstack-service restart
2. Modify the following contents of the compute node:
a. Change enable_security_group in the /etc/neutron/plugins/ml2/openvswitch_agent.ini file to False.
openstack-config --set /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup enable_security_group False
b. Modify the firewall_driver.
openstack-config --set /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup firewall_driver neutron.agent.firewall.NoopFirewallDriver
c. Restart the service when there are no other services.
openstack-service restart
3. If the fault persists, contact ZTE technical support.
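To verify the modifications in Steps 1 and 2, the values can be read back with openstack-config, for example:

openstack-config --get /etc/neutron/plugin.ini securitygroup enable_security_group
openstack-config --get /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup firewall_driver

The expected outputs are False and neutron.agent.firewall.NoopFirewallDriver respectively.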

Expected Result

The security groups are successfully disabled and the VM operates properly.



Chapter 7
VM Life Cycle Management Faults
Table of Contents
VM Deployment Faults...............................................................................................................72
Hot Migration Faults................................................................................................................... 88
Cold Migration and Resizing Faults........................................................................................... 93
Cannot Delete VM......................................................................................................................97

7.1 VM Deployment Faults


7.1.1 Deployment Fault Handling Entrance
Symptom

The VM deployment fails, and the following information is displayed:

No valid host was found.

Probable Cause

 The nova-compute status of the compute node is down.


 The requested resources cannot be met.
 The destination node network is abnormal.

Action

1. Check whether the state of each compute node is normal. Log in to the controller node in SSH mode and view the status by using the nova service-list command. If the state is up, the compute node is normal.
 Yes → Step 2.
 No → Refer to 7.1.2.1 Nova-compute Status of the Compute Node is Down.
2. If the requested resources cannot be met, the candidate destination nodes are filtered out by the scheduler filters. Check the filter logs to determine which filter failed.


Method for viewing logs:


a. Log in to each controller node in SSH mode.
b. Search for the /var/log/nova/nova-scheduler.log file in accordance with the VM ID. For
example:
cat /var/log/nova/nova-scheduler.log |grep cc41f1a2-17b0-4322-8689-3ca5116b2821
c. Check the filter messages in /var/log/nova/nova-scheduler.log.
 Filter CoreFilter returned 0 hosts or Filter AggregateCoreFilter returned 0 hosts →
Refer to 7.1.2.2 Insufficient CPU Resources for the Compute Nodes.
 Filter RamFilter returned 0 hosts or Filter AggregateRamFilter returned 0 hosts →
Refer to 7.1.2.3 Insufficient Memory Resources for the Compute Nodes.
 Filter DiskFilter returned 0 hosts or Filter AggregateDiskFilter returned 0 hosts →
Refer to 7.1.2.4 Insufficient Disk Resources for the Compute Nodes.
 Filter NetworkFilter returned 0 hosts → Refer to 7.1.2.5 Compute Node Network
Fault.
d. If no filter reports that 0 hosts were returned, view the name of the selected compute node. In this case, log in to the selected node in SSH mode and view the information → Refer to 7.1.3.1 Failed to Deploy a VM on a Compute Node.

7.1.2 No valid host was found


The VM deployment fails, and the following information is displayed:
No valid host was found.

7.1.2.1 Nova-compute Status of the Compute Node is Down


Probable Cause

The nova-compute status of the compute node is down.

Action

1. Log in to the host whose service status is down through SSH.


2. Run the following command to restart the service.
systemctl restart openstack-nova-compute
3. Check whether the service status is Up.
 Yes → Step 4.
 No → Step 5.
4. Re-deploy the VM, and check whether the fault is fixed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.


Expected Result

The VM is deployed successfully.

7.1.2.2 Insufficient CPU Resources for the Compute Nodes


Probable Cause

The CPU resources for the compute nodes are insufficient.

Action

1. Add a compute node.


2. Re-deploy the VM, and check whether the fault is fixed.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.

Expected Result

The VM is deployed successfully.

7.1.2.3 Insufficient Memory Resources for the Compute Nodes


Probable Cause

The memory resources for the compute nodes are insufficient.

Action

1. Add a compute node or add a memory bar to an existing compute node.


2. Re-deploy the VM, and check whether the fault is fixed.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.

Expected Result

The VM is deployed successfully.

7.1.2.4 Insufficient Disk Resources for the Compute Nodes


Probable Cause

The disk resources for the compute nodes are insufficient.

Action

1. Add a compute node or add a disk to an existing compute node.


2. Re-deploy the VM, and check whether the fault is fixed.


 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.

Expected Result

The VM is deployed successfully.

7.1.2.5 Compute Node Network Fault


Probable Cause

This fault occurs when resources are allocated to VMs. Generally, the fault that no network resources are allocated (mainly for macvtap and SR-IOV VMs) is caused by configuration errors.

Action

1. Check whether the node on which VMs will be deployed supports the VM type (direct or
macvtap). To confirm it, you can view all the agents:

[root@control5 ~(admin)]# neutron agent-list

-------------------------------------+------------------+-----------+-------+------------+

id | agent_type | host |alive|admin_state_up|

-------------------------------------+------------------+-----------+-------+------------+

014fe311-b50a-471c-b5ad-44dd981fb2f5| Open vSwitch agent| compute5_1|:-) |True |

07b759e3-d9ef-454b-b36e-b0f34a4f1fe8| NIC Switch agent | compute5_1|:-) |True |

2037ffba-4b1b-4ce3-a128-1f2eb393e0d7| L3 agent | control5 |:-) |True |

87f6a2fb-29bd-4aa3-8c76-f4eb1048748d| Metadata agent | control5 |:-) |True |

96deaeb6-6c4b-4594-9d87-0e005fe8be7a| Open vSwitch agent| control5 |:-) |True |

a7e1fe82-891d-44a2-b5ac-d8893697a8f6| DHCP agent | control5 |:-) |True |

c6934a55-9e36-4879-a55b-914d8387a795| NIC Switch agent | control5 |:-) |True |

-------------------------------------+------------------+-----------+-----+--------------+

2. Check whether the host on which VMs will be deployed is in correct macvtap or sriov mode.
If not, modify it. You can check the NIC switch agents:

[root@control5 ~(admin)]# neutron agent-show c6934a55-9e36-4879-a55b-914d8387a795

+---------------------+--------------------------------------+

| Field | Value |

+---------------------+--------------------------------------+

| admin_state_up | True |

| agent_type | NIC Switch agent |

| alive | True |


| binary | neutron-sriov-nic-agent |

| configurations | { |

| | "sriov_vnic_type": "direct", |

| | "devices": 0, |

| | "device_mappings": { |

| | "physnet3": "enp132s0f0", |

| | "physnet2": "enp2s0f0" |

| | } |

| | } |

| created_at | 2015-07-27 12:30:53 |

| description | |

| heartbeat_timestamp | 2015-08-25 02:53:15 |

| host | control5 |

| id | c6934a55-9e36-4879-a55b-914d8387a795 |

| started_at | 2015-08-19 03:36:46 |

| topic | N/A |

3. Check the configuration file. When direct or macvtap VMs are deployed in a VLAN, the network plane configurations of nova and neutron must be the same. For example, there are three physical planes in the nova configuration:

[root@control5 ~(admin)]# vi /etc/nova/nova.conf

pci_passthrough_whitelist= [{ "address":"0000:81:00.1","physical_network":"physnet1" },

{ "address":" 0000:02:00.0","physical_network":"physnet2" },

{ "address":"0000:84:00.0","physical_network":"physnet3" }]:

There should also be three physical planes in the neutron configuration file. Otherwise, when nova requests resources from neutron, no resources will be returned due to the mismatch of planes.

vi /etc/neutron/plugins/ml2/ml2_conf.ini

network_vlan_ranges =physnet1:2001:2050,physnet2:2001:2050,physnet3:2001:2050

The configuration file of sriov agent should also be configured correspondingly.

vi /etc/neutron/plugins/sriovnicagent/sriov_nic_plugin.ini

physical_device_mappings = physnet2:enp2s0f0,physnet3:enp132s0f0

To confirm it, perform the following steps:


a. Check the network of the port used by the VM.

[root@control5 ~(admin)]# neutron port-show ZTE-UMAC-83-UIPB1-S_vMAC_55_SIPI_port

+-----------------------+---------------------------------------+


| Field | Value |

+-----------------------+---------------------------------------+

| admin_state_up | True |

| allowed_address_pairs | |

| bandwidth | 0 |

| binding:host_id | |

| binding:profile | {} |

| binding:vif_details | {} |

| binding:vif_type | unbound |

| binding:vnic_type | direct |

| bond | 0 |

| device_id | |

| device_owner | |

| extra_dhcp_opts | |

| fixed_ips | |

| id | 9ada8ab5-dcca-449b-be95-27122bcce840 |

| mac_address | 00:d0:d0:6e:00:83 |

| name | ZTE-UMAC-83-UIPB1-S_vMAC_55_SIPI_port |

| network_id | 1ec95950-09a1-4e13-84b9-37d640fd05f4 |

| security_groups | |

| status | DOWN |

| tenant_id | 0d6a1d6602db4021899a29b1e98b3d89

b. Check the physical plane corresponding to the used network.

[root@control5 ~(admin)]# neutron net-show 1ec95950-09a1-4e13-84b9-37d640fd05f4

+---------------------------+--------------------------------------+

| Field | Value |

+---------------------------+--------------------------------------+

| admin_state_up | True |

| attached_port_num | 0 |

| bandwidth | 0 |

| id | 1ec95950-09a1-4e13-84b9-37d640fd05f4 |

| max_server_num | 50 |

| mtu | 1500 |

| name | vMAC_55_SIPI |

| provider:network_type | vlan |

| provider:physical_network | physnet3 |

| provider:segmentation_id | 2009 |


| router:external | False |

| shared | True |

| status | ACTIVE |

| subnets | |

| tenant_id | 0d6a1d6602db4021899a29b1e98b3d89 |

| vlan_transparent | False |

+---------------------------+--------------------------------------+

c. Confirm whether the physical network plane is correct in each configuration file. If not,
modify the configuration file as required.
4. If the fault persists, contact ZTE technical support.
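For reference, a quick way to compare the physical plane names checked in Step 3 is to grep the three configuration files and confirm that the plane names (physnet1, physnet2, and physnet3 in this example) match:

grep physical_network /etc/nova/nova.conf
grep network_vlan_ranges /etc/neutron/plugins/ml2/ml2_conf.ini
grep physical_device_mappings /etc/neutron/plugins/sriovnicagent/sriov_nic_plugin.ini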

Expected Result

A VM is successfully created.

7.1.3 Failed to Deploy a VM on a Compute Node


7.1.3.1 Failed to Deploy a VM on a Compute Node
Symptom

A VM group fails to be deployed on a compute node.

Action

1. Check the /var/log/nova/nova-compute.log or /var/log/libvirt/libvirt.


log, which includes error information.
 For unexpected vif_type=binding_failed, refer to 7.1.3.2 Binding Failed.
 For internal error: no supported architecture for os type 'hvm', refer to 7.1.3.4 HVM Not
Supported.
 For error: unsupported configuration: host doesn't support legacy PCI passthrough, refer
to 7.1.3.3 Failed to Start a VM with an SR-IOV NIC.
2. For VXLAN, refer to 7.1.3.5 Failed to Deploy a VM when Network Type is VXLAN.

7.1.3.2 Binding Failed

7.1.3.2.1 Unexpected vif_type=binding_failed

Symptom

A VM fails to be deployed. The following information is displayed on the TECS management page:
Error: Failed to launch instance "testvm1":

Please try again later [Error: Unexpected vif_type=binding_failed]


Check the /var/log/nova/nova-compute.log. The following information is displayed:


Unexpected vif_type=binding_failed

Probable Cause

 The time of the compute node is different from that of the controller node, resulting in the
vif_type=binding_failed error.
 The network service status is abnormal.
 The configuration file is incorrect.

Action

1. Run the date command on the compute node and the controller node respectively to check whether their system time is the same.
 Yes → Step 4.
 No → Step 2.
2. Check whether the chronyd service is started on the controller node and the compute node. The status should be active; in the output below, Active: active (running) indicates that the service is started successfully.

systemctl status chronyd

chronyd.service - NTP client/server

Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)

Active: active (running) since Thu 2019-12-05 16:52:11 CST; 23h ago

Docs: man:chronyd(8)

man:chrony.conf(5)

Main PID: 20126 (chronyd)

Tasks: 1

Memory: 348.0K

CGroup: /system.slice/chronyd.service

└─20126 /usr/sbin/chronyd

Dec 05 16:52:11 host-2018-abcd-abcd-1234-4321-5678-8765-12cc systemd[1]: Starting NTP

client/server...

Dec 05 16:52:11 host-2018-abcd-abcd-1234-4321-5678-8765-12cc chronyd[20126]:

chronyd version 3.2 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH

+SI...+DEBUG)

Dec 05 16:52:11 host-2018-abcd-abcd-1234-4321-5678-8765-12cc chronyd[20126]:

Frequency -1.541 +/- 1.409 ppm read from /var/lib/chrony/drift

Dec 05 16:52:11 host-2018-abcd-abcd-1234-4321-5678-8765-12cc systemd[1]: Started NTP

client/server.

Hint: Some lines were ellipsized, use -l to show in full.


 Yes → Step 4.
 No → Step 3.
3. If the chronyd service is not in running status, start it. Method:
a. Run the systemctl restart chronyd command to start the chronyd service.
b. Check the time of all nodes to ensure that the time is the same.
 Yes → Next step.
 No → Step 8.
c. Run the following command to start the neutron-openvswitch-agent service.
systemctl start neutron-openvswitch-agent.service
4. Telnet to the host where the VM is located. Perform the following operations in accordance
with VM types.

If… Then…

It is an OVS VM
a. Run the following command to view the status of the neutron-openvswitch-agent service.
systemctl status neutron-openvswitch-agent.service
b. If the service fails to be started, run the following command to restart it.
systemctl restart neutron-openvswitch-agent.service

It is an SR-IOV VM
a. Run the following command to view the status of the neutron-sriov-nic-switch-agent service.
systemctl status neutron-sriov-nic-switch-agent
b. If the service fails to be started, run the following command to restart it.
systemctl restart neutron-sriov-nic-switch-agent

Note
Check the service status. In the execution result, if Active is displayed, this means the service is
normally started. If other states are displayed, this means the service fails to be started.

5. Redeploy the VM and check whether the VM can be deployed properly.


 Yes → End.
 No → Step 6.
6. Check the configuration information in the openvswitch_agent.ini or pci_nic_switch_agent.ini file. Check whether the bridge_mappings and phynic_mappings parameters are configured correctly, including whether the bridge is configured correctly and whether the bridge is bound with an NIC.


Note
In the case of an SR-IOV VM, it is also necessary to check the /etc/nova/nova.conf file to see if
the network port bus_info specified in pci_passthrough_whitelist is the same as the actual NIC value.

 Yes → Step 8.
 No → Step 7.
7. Modify the configuration. Restart the openstack service. Check whether the fault is fixed.
 Yes → End.
 No → Step 8.
8. Contact ZTE technical support.
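In addition to comparing the output of the date command in Step 1, the chronyd synchronization state can be checked with the chronyc tool (part of the chrony package), for example:

chronyc sources
chronyc tracking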

Expected Result

VMs can be deployed properly.

7.1.3.2.2 Failed to Create the VM, Binding Failed Displayed

Symptom

A VM fails to be created. The execution result of the nova show command shows binding
failed.

Action

1. Run the neutron agent-list command to view the service status of neutron.

[root@control5 ~(admin)]# neutron agent-list

-------------------------------------+------------------+-----------+-------+----------+

id | agent_type | host |alive|admin_state

_up |

-------------------------------------+------------------+-----------+-------+----------+

014fe311-b50a-471c-b5ad-44dd981fb2f5| Open vSwitch agent| compute5_1|:-) |True |

07b759e3-d9ef-454b-b36e-b0f34a4f1fe8| NIC Switch agent | compute5_1|:-) |True |

2037ffba-4b1b-4ce3-a128-1f2eb393e0d7| L3 agent | control5 |:-) |True |

87f6a2fb-29bd-4aa3-8c76-f4eb1048748d| Metadata agent | control5 |:-) |True |

96deaeb6-6c4b-4594-9d87-0e005fe8be7a| Open vSwitch agent| control5 |:-) |True |

a7e1fe82-891d-44a2-b5ac-d8893697a8f6| DHCP agent | control5 |:-) |True |

c6934a55-9e36-4879-a55b-914d8387a795| NIC Switch agent | control5 |:-) |True |

-------------------------------------+------------------+-----------+-----+------------+

If the alive field is not :-) but XX, the probable errors are as follows:


 If some services are XX, run the date command to check whether the time of the
compute node and the controller node is the same. If not, manually configure the same
time or enable the NTP service.

[root@tecs162 home]# date

Sat Oct 10 16:47:32 CST 2015

 If all services are XX, possibly the message service is abnormal. Check and restart the
qpid or rabbitmq-server service (or contact the message-related personnel to locate the
fault).

[root@tecs162 home]# systemctl status rabbitmq-server.service

rabbitmq-server.service - LSB: Enable AMQP service provided by RabbitMQ broker

Loaded: loaded (/etc/rc.d/init.d/rabbitmq-server)

Active: active (running) since Thu 2015-10-08 15:52:01 CST; 2 days ago

CGroup: /system.slice/rabbitmq-server.service

7732 /bin/sh /etc/rc.d/init.d/rabbitmq-server start

7950 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/sbin/rabbitmq-server

7951 /bin/sh /usr/sbin/rabbitmq-server

Oct 08 15:51:54 tecs162 systemd[1]: Starting LSB: Enable AMQP service provided by

RabbitMQ broker...

Oct 08 15:51:54 tecs162 su[5970]: (to rabbitmq) root on none

Oct 08 15:51:55 tecs162 su[7765]: (to rabbitmq) root on none

Oct 08 15:51:55 tecs162 su[7967]: (to rabbitmq) root on none

Oct 08 15:52:01 tecs162 rabbitmq-server[5882]: Starting rabbitmq-server: SUCCESS

Oct 08 15:52:01 tecs162 rabbitmq-server[5882]: rabbitmq-server.

Oct 08 15:52:01 tecs162 systemd[1]: Started LSB: Enable AMQP service provided by

RabbitMQ broker.

For this problem, view the logs of any agent. You can see related prompts of message
service failure.

tail -n 200 /var/log/neutron/sriov-nic-switch-agent.log

2015-10-10 16:52:21.038 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 1 seconds.

2015-10-10 16:52:21.997 5983 INFO neutron.openstack.common.rpc.common [-]

Reconnecting to AMQP server on 10.43.166.162:5672

2015-10-10 16:52:22.006 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 3 seconds.


2015-10-10 16:52:22.011 5983 INFO neutron.openstack.common.rpc.common [-]

Reconnecting to AMQP server on 10.43.166.162:5672

2015-10-10 16:52:22.019 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 3 seconds.

2015-10-10 16:52:22.038 5983 INFO neutron.openstack.common.rpc.common [-]

Reconnecting to AMQP server on 10.43.166.162:5672

2015-10-10 16:52:22.046 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 3 seconds.

2015-10-10 16:52:25.009 5983 INFO neutron.openstack.common.rpc.common [-]

Reconnecting to AMQP server on 10.43.166.162:5672

2015-10-10 16:52:25.017 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 5 seconds.

2015-10-10 16:52:25.020 5983 INFO neutron.openstack.common.rpc.common [-]

Reconnecting to AMQP server on 10.43.166.162:5672

2015-10-10 16:52:25.027 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 5 seconds.

2015-10-10 16:52:25.047 5983 INFO neutron.openstack.common.rpc.common [-]

Reconnecting to AMQP server on 10.43.166.162:5672

2015-10-10 16:52:25.055 5983 ERROR neutron.openstack.common.rpc.common [-] AMQP

server on 10.43.166.162:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying

again in 5 seconds.

 If some services on some boards are XX, those services may be faulty. Check the service status directly; if the status is not active, restart the service and confirm that it becomes active.

[root@control5 ~admin)]#systemctl status neutron-sriov-nic-switch-agent.service

neutron-sriov-nic-switch-agent.service - OpenStack Neutron SR-IOV

Loaded: loaded (/usr/lib/systemd/system/neutron-sriov-nic-switch-agent.service;

enabled)

Active: active (running) since Wed 2015-08-19 11:35:19 CST; 5 days ago

Main PID: 7357 (neutron-sriov-n) CGroup: /system.slice/

neutron-sriov-nic-switch-agent.service

7357 /usr/bin/python /usr/bin/neutron-sriov-nic-switch-agent --config-file

/usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf...

Warning: Journal has been rotated since unit was started. Log output is incomplete


or unavailable.

2. It may be that you want to deploy an SR-IOV VM but have configured a macvtap VM. You can use the following method to determine the VM type.
 Run the ip link show command. If there are many enp interfaces, the VM type is macvtap. Otherwise, the VM type is sriov.

[root@opencos_slot4_macvtap ~]# ip link show

143: enp8s17f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 72:d8:0c:93:31:f2 brd ff:ff:ff:ff:ff:ff

144: enp8s17f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 7e:42:20:44:e7:e8 brd ff:ff:ff:ff:ff:ff

145: enp8s17f5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 06:f2:1a:38:da:5b brd ff:ff:ff:ff:ff:ff

146: enp8s17f7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 46:c8:26:c5:11:82 brd ff:ff:ff:ff:ff:ff

147: enp8s18f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 0a:c2:7d:49:1b:91 brd ff:ff:ff:ff:ff:ff

148: enp8s18f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 42:61:02:cd:d3:06 brd ff:ff:ff:ff:ff:ff

149: enp16s16f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether ca:8f:bb:5e:1e:e6 brd ff:ff:ff:ff:ff:ff

150: enp8s18f5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether ea:50:71:14:79:6d brd ff:ff:ff:ff:ff:ff

407: phy-br-macvtap: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4000 qdisc

pfifo_fast master ovs-system state UP mode DEFAULT qlen 1000

link/ether de:48:a0:c3:63:12 brd ff:ff:ff:ff:ff:ff

151: enp16s16f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode

DEFAULT qlen 1000

link/ether 0e:98:c7:60:20:13 brd ff:ff:ff:ff:ff:ff

408: int-br-macvtap: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4000 qdisc


pfifo_fast master ovs-system state UP mode DEFAULT qlen 1000

link/ether ea:32:d1:26:03:1f brd ff:ff:ff:ff:ff:ff

152: enp8s18f7: <BROADCAST,MULTICAST> mtu 1500 q

 Check the configuration file to see whether it is macvtap or sriov.

[root@tecs162 ~(admin)]# cat /etc/sriov.conf

[DEFAULT]

# Sriov config options define

# Vif type options(macvtap/direct), default is macvtap

sriov_vnic_type = direct

# Vf number of one nic, e82599 default is 63, e82576 default is 7

ixgbe_vf_num = 63

igb_vf_num = 7

ixgbe_num = 16

3. Change the VM type to sriov.


4. If the fault persists, contact ZTE technical support.

Expected Result

A VM is successfully created.

7.1.3.3 Failed to Start a VM with an SR-IOV NIC


Symptom

A VM using an SR-IOV NIC fails to be started, and the status of the VM displayed on the TECS
management portal is failed.

Probable Cause

The SR-IOV NIC uses the VT-d function. If the VT-d function is not enabled, this error will
occur.

Action

1. Log in to the compute node, and run the following command to check the libvirt log records:
cat /var/log/libvirt/libvirtd.log
Check whether the following error information exists:

error: unsupported configuration: host doesn't support legacy PCI passthrough

 Yes → Step 2.
 No → Step 4.


2. The VT-d option is not enabled in the BIOS configuration of the server. You need to enable
the corresponding option in the BIOS. For different servers, the path is different. For
example,
chipset > North Bridge > IOH Configuration > Intel VT for Directed I/O Configuration >
Intel VT-d
3. Restart the VM, and check whether the VM can be started properly. When the VM is started
normally, the status of the VM displayed on the TECS management portal is running.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

7.1.3.4 HVM Not Supported

Symptom

When the VM is started, the status of the VM displayed in the provider is failed. Check /var/log/libvirt/libvirtd.log on the compute node. There is a record that shows hvm is not supported:

internal error: no supported architecture for os type 'hvm'

Probable Cause

The VT function is not enabled on the physical machine.

Action

1. Set the VT function in the BIOS configuration and enable it.


2. If the fault persists, contact ZTE technical support.

Expected Result

The VM is started properly.

7.1.3.5 Failed to Deploy a VM when Network Type is VXLAN

Symptom

A VM fails to be deployed when the network type is VXLAN.

Probable Cause

In the openvswitch_agent.ini configuration file on the compute node, only the VLAN network type is configured, so a VXLAN network cannot be used.


Action

1. On the active and standby controller nodes, modify the /etc/neutron/plugins/ml2/ml2_conf.ini file as follows:

type_drivers = vlan,vxlan //vlan is the default configuration during installation.

vxlan is added here.

tenant_network_types = vlan,vxlan

//If vxlan is placed before vlan, when a network is created, the default network type

//is vxlan.

vni_ranges =1:10000 //Add the vni_ranges configuration. This is only an example.

2. Run the following command to restart the neutron-server service on the controller
node.
systemctl restart neutron-server.service
3. On the compute node, modify the /etc/neutron/plugins/ml2/openvswitch_agent.ini file as follows:

[OVS]

local_ip = 10.0.0.3 //the IP address here is the management port IP address of the local

//computer or other IP addresses that can be used for communication.

[AGENT]

tunnel_types = vxlan

4. Run the following command to restart the neutron-openvswitch-agent service on the


compute node.
systemctl restart neutron-openvswitch-agent.service
5. On the compute node, run the following command to add a network bridge. For example,
the name of the network bridge is br-fabric.
ovs-vsctl add-br br-fabric
6. On the compute node, run the following command to mount the network interface (for
example, fabricright) to the bridge (for example, br-fabric).
ovs-vsctl add-port br-fabric fabricright
7. On the compute node, run the following command to restart the neutron-openvswitch-agent service.
systemctl restart neutron-openvswitch-agent.service
8. (Optional) On the compute node, run the following command to check the bridge
configuration:
ovs-vsctl show
An example of the output result is as follows:


# ovs-vsctl show

7d875903-6472-49c4-9b66-d830cd740ecd

Bridge br-fabric //new network bridge

Port fabricright

Interface fabricright //network interface of the new network bridge

Port br-fabric

Interface br-fabric

type: internal

Bridge br-tun

Port patch-int

Interface patch-int

………………

9. Re-deploy the VM and check whether the VM is deployed successfully.


 Yes → End.
 No → Step 10.
10. Contact ZTE technical support.

Expected Result

The VM is deployed successfully.

7.2 Hot Migration Faults


7.2.1 Hot Migration Is Allowed Only in One Direction
Symptom

After a VM is migrated from physical machine A to physical machine B successfully, the VM fails to be migrated from physical machine B to physical machine A.
Check the /var/log/nova/nova-conductor.log on the controller node. The following
information is displayed:
“Unacceptable CPU info: CPU doesn't have compatibility.”

2015-07-01 09:23:56.937 7968 WARNING nova.scheduler.utils

[req-c0cdccd2-3703-42f7-a471-614fee84ab53 3aee999c7bf4418d84881ba1f1fb4b3c

12705e09249149ae8d23e6d57df108df] Failed to compute_task_migrate_server:

Unacceptable CPU info: CPU doesn't have compatibility.

Probable Cause

A VM cannot migrate from a physical machine with higher CPU performance to another
physical machine with lower CPU performance.


Action

1. Confirm the CPU information of the source end and the destination end. Run the following command on both ends and view the Flags field, which indicates the CPU capabilities.
a. Log in to the host where the VM is located and the destination host in SSH mode.
b. Run the lscpu |grep Flags command.
c. Compare the flags fields. The flags field of the destination host must contain that of the
node where the VM is located.
2. A VM cannot migrate from a physical machine with higher CPU performance to another
physical machine with lower CPU performance. If you need to migrate the VM to a node with
a lower CPU type, you can perform cold migration.
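A minimal sketch of the flags comparison in Step 1, assuming SSH access from the node where the VM is located to the destination host (dest-host is an illustrative host name); comm -23 prints the flags present on the source but missing on the destination, so empty output means the destination is compatible:

lscpu | grep Flags | tr ' ' '\n' | sort > /tmp/flags_src.txt
ssh dest-host "lscpu | grep Flags | tr ' ' '\n' | sort" > /tmp/flags_dst.txt
comm -23 /tmp/flags_src.txt /tmp/flags_dst.txt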

Expected Result

If the CPU type meets the migration requirements, the live migration operation is successful.

7.2.2 Inter-AZ Hot Migration of VM Fails


Symptom

Hot migration of a VM within an AZ succeeds, but hot migration of the VM to another AZ fails.

Probable Cause

An AZ is specified upon the deployment of the VM, and AvailabilityZoneFilter is enabled for hot
migration. Thus, physical devices of other AZs are filtered, and no host can be selected during
the hot migration of the VM.

Action

1. On the control node, modify "target_host_with_filters=True" to "target_host_with_filters=


False" in the /etc/nova/nova.conf file.
2. Run the following command to restart the nova-scheduler service:
systemctl restart openstack-nova-scheduler.service
3. Perform inter-AZ hot migration again, and check whether the fault is removed.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

Inter-AZ hot migration of the VM succeeds.


7.2.3 Destination Host Does Not Have Enough Resources (Not Referring to Disk Space)
Symptom

During hot migration, it is detected that the destination host does not have enough resources, and thus
the migration fails. When you check the /var/log/nova/nova-conductor.log file on the
control node, the following information is displayed:

2015-04-14 17:28:52.070 1103 WARNING nova.scheduler.utils

[req-f9bc9950-528d-49a2-8ad0-39d4ac8bf542 2ee45765eb0f49c2a1a0b47b8e5b2b71

b91ab11a20ea4b24a6fc3694b7df5238] Failed to compute_task_migrate_ser

ver: Migration pre-check error: Unable to migrate

ff17befb-2685-4de9-91fd-13a56a7109ea to NJopencos2:

Lack of memory(host:2772 <= instance:16384)

Probable Cause

The destination host does not have enough resources. The log shows "Lack of memory".

Action

1. On the control node, run the following command to obtain the destination hypervisor name.
In most cases, the destination hypervisor name is consistent with the destination host name.
nova hypervisor-list
2. Run the command to check whether the destination host lacks resources. You can learn the
resource information from ram and vcpus in the flavor parameter of the VM.
nova hypervisor-show <destination hypervisor name>
 Yes → Step 3.
 No → Step 5.
3. Delete redundant VMs from the destination host, or add more computing nodes to satisfy
migration requirements.
4. Perform hot migration again, and check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.
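For reference, the checks in Steps 1 and 2 can focus on the memory, disk, and CPU fields of the destination hypervisor, for example (NJopencos2 is the destination host from the log above):

nova hypervisor-list
nova hypervisor-show NJopencos2 | egrep 'free_ram_mb|free_disk_gb|vcpus'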

Expected Result

Hot migration succeeds.


7.2.4 Destination Host Does Not Have Enough Disk Space


Symptom

Hot migration of a VM fails, and the following information is displayed:


ERROR:Migration pre-check error: Unable to migrate

7db73e23-cd45-4cf2-b03f-da24be6f3bfe: Disk of instance is too large

(available on destination host:27917287424 < need:32212254720) (HTTP 400)

(Request-ID: req-0a9e7d2f-19ae-49a4-bb6c-8ce4deff7d58)

Probable Cause

The destination host does not have enough disk space.

Action

Perform the following operations as needed.

If... Then...

The value of "available on destination host" is a positive number.
 If you perform the migration through CLI, add the --disk-over-commit parameter when you run the nova live-migration command.
 If you perform the migration through GUI, select Disk Over Commit.

The value of "available on destination host" is a negative number.
Select another destination host, or clear the disk space of the current destination host until it is sufficient for hot migration.
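For reference, a CLI migration with disk overcommitment enabled looks as follows (the VM and host names are placeholders):

nova live-migration --disk-over-commit <VM's uuid or name> <destination host>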

Expected Result

Hot migration succeeds.

7.2.5 Source Computing Service Unavailable


Symptom

Hot migration of a VM fails, and the following information is displayed:


Failed to compute_task_migrate_server:

Compute service of SBCR13 is unavailable at this time.

Probable Cause

The nova-compute service of the computing node on the source host is down.

Action

1. On the source host, run the following command to restart the nova-compute service:


systemctl restart openstack-nova-compute.service


2. On the control node, run the following command to check whether the nova-compute service
of the computing node on the source host is up:
nova service-list
 Yes → Step 4.
 No → Step 3.
3. Ask ZTE technical support to analyze the /var/log/nova/nova-compute.log file on
the source host.
4. Perform hot migration again, and check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

The nova-compute service of the source host is restored to normal, and hot migration
succeeds.

7.2.6 VM Goes into Error Status After Live Migration


Symptom

After the live migration, the status of the VM changes to error. After the nova show command is executed, the fault field displays the following error information:

| fault | {"message": "

Unexpected vif_type=binding_failed", "code": 500, "details": "

File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\",

line 1482, in _build_instance |

Probable Cause

Live migration fails due to network problems.

Action

1. Troubleshoot the network connection fault → Refer to 7.1.3.2 Binding Failed.


2. Perform live migration again after the network problem is solved. Check whether the fault is
fixed.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.


Expected Result

After the network problem is solved, the live migration operation is successful.

7.3 Cold Migration and Resizing Faults

7.3.1 Authentication Fails During Migration


Symptom

During cold migration or resizing, the following information is displayed in the /var/log/nova/nova-compute.log file on the source host.

u'Permission denied, please try again.\r\nPermission denied,

please try again.\r\nPermission denied (publickey,password).\r\n'

Probable Cause

To perform cold migration or resizing, the nova user must be able to log in through SSH without a password. The probable cause of this fault is that the nova user is not correctly configured.

Action

1. Perform the following operations on each computing node:


a. In the /etc/nova/nova.conf configuration file, modify the value of "update_nova_ssh_pub_key_interval" to 60.
b. Run the following command to restart the nova-compute service:
systemctl restart openstack-nova-compute
c. Run the following command to check whether the nova-compute service is successfully
started. In the output of this command, if the Active field is "active", it indicates that the
service is successfully started. Otherwise, it indicates that the service is not successfully
started.
systemctl status openstack-nova-compute
 Yes → Step 2.
 No → Step 3.
2. After the nova-compute service is started, wait for about 70 seconds, and perform cold
migration or resizing again. Check whether the fault is removed.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.
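To verify the configuration after Step 1, you can test a passwordless SSH login as the nova user from one computing node to another; this is a sketch, assuming the nova user has no login shell by default (dest-host is an illustrative host name):

su -s /bin/bash nova -c "ssh dest-host hostname"

If the command prints the destination host name without prompting for a password, the nova user is correctly configured.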


Expected Result

Cold migration or resizing succeeds.

7.3.2 Error "No valid host was found" Reported During Migration
Symptom

There is only one computing node, or the computing node and control node are co-located.
During cold migration or resizing, the following information is displayed in the /var/log/nova/nova-api.log file on the source host.

nova.api.openstack NoValidHost: No valid host was found.

No valid host found for resize

Probable Cause

It is not allowed to perform cold migration or resizing on the same computing node.

Action

1. On the control node, modify "allow_resize_to_same_host=false" to "allow_resize_to_same_host=True" in the /etc/nova/nova.conf file.
2. Run the following command to restart the nova-compute service:
systemctl restart openstack-nova-compute.service
3. Run the following command to restart the nova-api service:
systemctl restart openstack-nova-api.service
4. Perform cold migration or resizing again. Check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.
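For reference, Step 1 can also be performed with the openstack-config tool used elsewhere in this document; this assumes the parameter is in the [DEFAULT] section, which is its standard location:

openstack-config --set /etc/nova/nova.conf DEFAULT allow_resize_to_same_host True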

Expected Result

Cold migration or resizing succeeds.

7.3.3 Error "Unable to resize disk down" Reported During Resizing


Symptom

Resizing fails, and the following information is displayed in the /var/log/nova/nova-compute.log file on the source host.

ERROR oslo.messaging.rpc.dispatcher [-] Exception during message handling:

Resize error: Unable to resize disk down.


Probable Cause

It is not allowed to resize a disk to a smaller size.

Action

1. Log in to the TECS portal. Select Cloud Mgmt. > Compute > Instance. The Instance page
is displayed, on which VM specifications and other information are displayed.
2. Check the Specification column to obtain the disk value currently used by the VM (root
disk, temporary disk).
3. During the resize operation, select the VM specifications whose disk value is larger than the
current disk value. Check whether the alarm is cleared.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

Resizing succeeds.

7.3.4 VM Always in "verify_resize" Status After Cold Migration or Resizing


Symptom

After cold migration or resizing is performed on a VM, the VM is always in "verify_resize" status.

Probable Cause

In the /etc/nova/nova.conf configuration file of a computing node, "resize_confirm_window" configures whether to automatically confirm the "verify_resize" status. If this parameter is disabled and the status is not manually confirmed, the VM is always in "verify_resize" status.

Action

1. Perform the following operations as needed.

If... Then...

You want to solve this problem for this time only. a. On the control node, run the following command:
nova resize-confirm <VM's uuid or name>
b. Go to Step 4.

You want to solve this problem permanently. Go to Step 2.

2. In the /etc/nova/nova.conf configuration file of the computing node, set "resize_confirm_window=10". That is, after the VM status becomes "verify_resize", the status is automatically confirmed and changed to "active" within 10 seconds.


Note
"resize_confirm_window=10" means the confirmation time is 10 seconds. You can modify the value as
required. The default value is 0, meaning no automatic confirmation.

3. On the computing node, run the following command to restart the nova-compute service:
systemctl restart openstack-nova-compute
4. Perform cold migration or resizing again. Check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.
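For reference, Step 2 can also be performed with openstack-config; this assumes resize_confirm_window is in the [DEFAULT] section, which is its standard location:

openstack-config --set /etc/nova/nova.conf DEFAULT resize_confirm_window 10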

Expected Result

When cold migration or resizing succeeds, the status of the VM is restored to "active".

7.3.5 Mirror Error Reported During Cold Migration or Resize Operation


Symptom

Resizing a VM fails, and a libvirtError error is displayed in the /var/log/nova/nova-compute.log file on the source host. Keywords of the error are as follows:

libvirtError: internal error: process exited while connecting to monitor

qcow2: Image is corrupt; cannot be opened read/write

Probable Cause

The mirror of the VM is damaged. Thus, cold migration or resizing cannot be performed on the
VM.

Action

Stop the cold migration or resizing on the VM, or use a new mirror to create a VM and then try
again.

Expected Result

After a VM is created with the new mirror, cold migration or resizing can be performed on the
VM.


7.4 Cannot Delete VM

7.4.1 Deletion Error Caused by Abnormal Compute Node Service


Symptom

After a VM is deleted, the VM remains in the deleting state.

Probable Cause

Check the nova-compute service on the compute node where the VM is located. It is possible
that the compute service is abnormal.

Action

1. Log in to the controller node and check whether the nova-compute service of the
corresponding node is up.
nova service-list
 Yes → Step 3.
 No → Step 2.
2. Log in to the compute node through SSH, and check whether the nova-compute service
status is active.
systemctl status openstack-nova-compute
 Yes → Step 4.
 No → Step 3.
3. Log in to the compute node in the SSH mode, restart the nova-compute service, and check
whether the VM is deleted.
systemctl restart openstack-nova-compute
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The nova-compute service of the node where the VM is located operates properly and the VM
is deleted successfully.

7.4.2 Control Node's cinder-volume Service Abnormal


Symptom

Deleting a VM fails. When you check /var/log/nova/nova-compute.log on the host where the VM is located, the following information is displayed:


InvalidBDMVolume: Block Device Mapping is Invalid: failed

to get volume 11739888-315a-4661-86bd-4dee926c1346

Probable Cause

The cinder-volume service of the control node is down.

Action

1. On the control node, run the following command to restart the cinder-volume service:
systemctl restart openstack-cinder-volume.service
2. Run the following command to check whether the cinder-volume service is successfully
started. In the output of this command, if the Active field is "active", it indicates that the
service is successfully started. Otherwise, it indicates that the service is not successfully
started.
systemctl status openstack-cinder-volume.service
 Yes → Step 3.
 No → Step 4.
3. Try again and check whether the VM can be deleted.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The cinder-volume service is normal, and the VM is successfully deleted.

7.4.3 Network Service Abnormal


Symptom

When you check /var/log/nova/nova-compute.log on the host where the VM is located, the following information is displayed:
"Connection to neutron failed: Maximum attempts reached", "code": 500, "details"

Probable Cause

The communication process is abnormal. Thus, the neutron service cannot be connected to
release network resources when you attempt to delete the VM.

Action

1. On the controller node, run the following command to check the /etc/neutron/neutron.conf file.


cat /etc/neutron/neutron.conf | egrep 'rabbit_host'


The result is as follows:

# Deprecated group/name - [DEFAULT]/rabbit_host

#rabbit_host = localhost

# Deprecated group/name - [DEFAULT]/rabbit_hosts

#rabbit_hosts = $rabbit_host:$rabbit_port

In the result, lines preceded by # are comments and do not take effect; lines without # take effect. Here the qpid-related configuration is commented out by #, while the rabbit configuration is enabled. By default, RabbitMQ communication is used.
2. In the case of RabbitMQ communication, run the following command to restart the rabbitmq service.
systemctl restart rabbitmq-server
3. Delete the VM again, and check whether the VM is deleted successfully.
 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.
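Before restarting in Step 2, you can also confirm whether the RabbitMQ broker itself is healthy, for example:

systemctl status rabbitmq-server
rabbitmqctl status

If rabbitmqctl status reports a running node, the fault is more likely in the connection between the services and the broker than in the broker itself.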

Expected Result

The network service is normal, and the VM is successfully deleted.



Chapter 8
VM Operation Failure
Table of Contents
VM OS Startup Failure.............................................................................................................100
Network Disconnection (Non-SDN Scenario, VLAN)............................................................... 108
Network Disconnection (SDN Scenario, VXLAN).................................................................... 130
DHCP Faults.............................................................................................................................132
VM's NIC Unavailable.............................................................................................................. 135
Control Console Cannot Connect to VM................................................................................. 136
VM Restart Due to Invoked OOM-Killer...................................................................................137

8.1 VM OS Startup Failure


8.1.1 Some Services of the VM Are Not Started
Symptom

After a VM is started, some services are not started or fail to be started, for example, the
network fails to be started.

Probable Cause

Services are started in sequence. If some services are not started, the subsequent services fail
to be started. For example, when the network is started, if the corresponding network port name
is not ready, the network fails to be started.

Action

1. After the system is started, manually start or restart the service (such as the network service) with the following commands.
Command for starting the service: systemctl start network.service
Command for restarting the service: systemctl restart network.service
2. Run the following command to check whether the service is started properly.
systemctl status network.service
If the following information is displayed, this indicates that the system is started properly.


network.service - LSB: Bring up/down networking

Loaded: loaded (/etc/rc.d/init.d/network)

Active: active since Fri 2015-07-10 12:05:47 CST; 6 days ago

//here, the Active field value is active, indicating normal startup.

If the following information is displayed, this indicates that the system is not started
successfully.

network.service - LSB: Bring up/down networking

Loaded: loaded (/etc/rc.d/init.d/network)

Active: failed (Result: exit-code) since Fri 2015-07-10 00:13:17 CST; 10s ago

//here, the Active field value is failed, not active, indicating startup failure.

Process: 3434 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=1/FAILURE)

 Yes → End.
 No → Step 3.
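Note
If the status in step 2 is failed, the recent log of the unit usually shows the reason (a brief
sketch using standard systemd tooling; the line count is arbitrary):
journalctl -u network.service -n 50 --no-pager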
3. Contact ZTE technical support.

Expected Result

The service is started properly.

8.1.2 Failed to Start the VM Due to Loss of grub Information


Symptom

The file system is damaged. As a result, the grub information is lost, the VM operating system
cannot be started, and the "No bootable device" message is displayed.

Action

1. Delete the faulty image.


2. Restart the VM.
3. If the fault persists, re-create a VM image.
4. If the fault persists, contact ZTE technical support.

Expected Result

The VM is started properly.

8.1.3 Too Long VM Startup Time Due to Too Large Disk


Symptom

The VM startup time is too long. The VM status in the TECS management portal is spawning.
After the normal startup time expires, the startup is not completed yet. In normal cases, the


VM is started within 10 minutes, and the status in the TECS management portal changes to
running.

Probable Cause

If the hard disk space is large, it may take a long time or even several hours to repair the file
system of the hard disk.
The system may be performing the FSCK operation, which takes a long time. After the FSCK is
completed, the system can be powered on normally.

Action

1. Connect the VM console and check whether the FSCK operation is being performed.
If the following information is displayed, this indicates that the FSCK operation is being
performed.

fsck from util-linux 2.23.2

e2fsck 1.42.9 (28-Dec-2013)

Pass 1: Checking inodes, blocks, and sizes

Pass 2: Checking directory structure

Pass 3: Checking directory connectivity

Pass 4: Checking reference counts

Pass 5: Checking group summary information

 Yes → Step 2.
 No → Step 3.
2. After the FSCK operation is completed, check whether the VM can be started normally.
If the TECS management portal shows that the VM status is running, the VM is started
properly.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.

Expected Result

The VM is started properly.

8.1.4 Failed to Start the VM After Power Off


Symptom

The services of TECS are operating properly, and the cloud management system is normal.
After a VM that is operating properly is powered off, it cannot be powered on again.


Probable Cause

The image of the VM is abnormal, for example, the image is damaged due to an abnormal
operation.

Action

1. Log in to the Provider GUI, find the corresponding VM, enter the VM management page,
and check the console log to determine whether the problem is inside the VM.
 Yes → Step 2.
 No → Step 3.
2. Perform one of the following operations as required.

If… Then…

The VM uses local storage (the incremental image is not damaged or the damage is not serious).
a. Use the local basic image and the incremental image to form a new image.
b. Redeploy a VM by using the new image.
This method preserves the previous properly-operating state of the VM to the maximum extent.

The VM uses local storage (the incremental image is damaged and cannot be used).
Import the original image through the rebuild method to generate a new duplicate VM.
This method restores the state in which the VM was just deployed.

The VM uses local storage (the original image is also damaged and cannot be used).
a. Redeploy the VM.
b. Reconfigure the VM information, including the network information.

The VM uses a volume.
a. Re-create a volume.
b. Redeploy the VM.
3. Log in to the controller node, and then log in to the compute node where the VM is located
through ssh. Run the following command to restart the nova-compute service of the
compute node.
systemctl restart openstack-nova-compute
4. Run the following command to check whether the nova-compute service is started properly.
systemctl status openstack-nova-compute

[root@njopencos1 ~(keystone_admin)]# systemctl status openstack-nova-compute

openstack-nova-compute.service - OpenStack Nova Compute Server

Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; enabled)

Active: active (running) since Thu 2015-07-02 09:04:37 CST; 2 weeks 0 days ago

//The Active field value is active, indicating normal startup.

//If other states are displayed, the startup fails.


Main PID: 2498 (nova-compute)

CGroup: /system.slice/openstack-nova-compute.service

2498 /usr/bin/python /usr/bin/nova-compute

 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

The VM is started properly.

8.1.5 Failed to Start the VM OS, no bootable device


Symptom

When you log in to the VM GUI through the console, "no bootable device" is prompted.

Probable Cause

1. The image used cannot be started.


2. If the volume is faulty (for example, the iSCSI link is faulty), the resources to be accessed by
the VM operating system do not exist.

Action

 Replace the boot image.


 Volume fault
1. Check whether the status indicator of the disk array controller is normal.
 Yes → Step 3.
 No → Step 2.
2. Check whether the status indicator for the iSCSI/FC interface on the disk array is normal.
 Yes → Step 4.
 No → Step 3.
3. Check network cables, optical fibers, and optical modules to troubleshoot network faults.
After the network faults are removed, check whether the fault is fixed.
 Yes → End.
 No → Step 4.
4. Log in to the page for disk array management, and check whether the disk array has any
alarm.
 Yes → Step 5.
 No → Step 6.


5. Collect device information, and contact the technical support of the disk array manufacturer.
Check whether the fault is fixed.
 Yes → End.
 No → Step 6.
6. If there is no alarm for the disk and the connection indicator is normal, check whether
the VM can be started properly. When the VM is started normally, the status of the VM
displayed on the TECS management portal is running.
 Yes → End.
 No → Step 7.
7. Contact ZTE technical support.

Expected Result

The VM is started properly and no error is reported.

8.1.6 Error Status of VM


Symptom

During operation, the VM suddenly goes to the error status. Observed from the service layer,
the VM is down and its status is fault. The VM cannot be pinged or accessed. In the Provider
GUI or in the output of the nova list --all-tenants command, the VM status is error.

Action

1. Log in to the Provider GUI, find the corresponding VM, and select soft restart or hard restart
of the cloud host. Check whether the fault is fixed.
 Yes → End.
 No → Step 2.
2. Log in to the controller node, and run the nova list --all-tenants command to view the VM
uuid.
3. Run the nova reboot command to restart the VM. For example,

nova reboot --hard cc0015d5-9d85-4dc9-b156-d8a85c5bf9d0

4. If the fault persists, contact ZTE technical support.

Expected Result

The VM status is normal.


8.1.7 Cannot Power on the VM After Restart


Symptom

After a manual restart, an abnormal restart, or a global restart of the VM, the VM status is
Running, but the service software in the VM cannot be started properly.

Action

1. Check the power-on console output. It may show that the VM is repeatedly restarted in the
simulated boot phase, or that the VM output is abnormal.
2. Log in to the Provider GUI, find the corresponding VM, and select soft restart or hard restart
of the cloud host.
3. If the fault persists, contact ZTE technical support.

Expected Result

The VM can be powered on properly after being restarted.

8.1.8 Failed to Start the VM, Insufficient Memory


Symptom

The VM fails to be started, and the status of the VM displayed on the TECS management portal
is failed.

Probable Cause

The memory for huge pages is insufficient, resulting in VM startup failure.

Action

1. Log in to the compute node where the VM is located, and run the following command to
check the libvirt log.
cat /var/log/libvirt/libvirtd.log
Check whether the following error information exists:

file_ram_alloc: can't mmap RAM pages: Cannot allocate memory

 Yes → Step 2.
 No → Step 5.
2. Run the following command to check the huge page memory information of the physical
machine.
cat /proc/meminfo
An example of the output result is as follows:

AnonHugePages: 2359296 kB


HugePages_Total: 5120

HugePages_Free: 0

HugePages_Rsvd: 0

HugePages_Surp: 0

Hugepagesize: 2048 kB
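In this example output, HugePages_Free is 0, so no huge-page memory is left for new VMs. As
a worked example with the Hugepagesize of 2048 kB shown above, a VM that requests 8 GB of
huge-page memory needs 8 GB / 2 MB = 4096 free huge pages.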

3. Check the number of free huge pages, and determine whether they meet the memory
requirements of the VM.
 Yes → Step 6.
 No → Step 4.
4. Perform the following steps to increase the huge page memory:
a. Run the vi /etc/grubtool.cfg command to open the grubtool.cfg file.

Note
In the grubtool.cfg file, hugepage_num refers to the number of huge pages, which is an
integer, and hugepage_size refers to the size of a huge page. In principle, 30 G memory needs to
be reserved for the OS on the compute node.

b. Modify the value of hugepage_num. The product hugepage_num × hugepage_size must be
smaller than the remaining memory of the system.
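As a hypothetical sizing illustration: on a compute node with 64 GB of memory, a
hugepage_size of 2 MB, and 30 GB reserved for the OS, hugepage_num should be at most
(64 - 30) GB / 2 MB = 17408.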
5. Restart the VM, and check whether the VM can be started properly. When the VM is started
normally, the status of the VM displayed on the TECS management portal is running.
 Yes → End.
 No → Step 6.
6. Contact ZTE technical support.

8.1.9 VM File System Read-Only Due to Disk Array Network Interruption


Symptom

If the network of the disk array is suddenly interrupted during normal operation, after a period of
time, the VM that is started through the volume may be suspended and cannot be recovered.
The cluster controller node can be recovered automatically.

Action

1. After the network of the disk array is disconnected, the disks mounted from the disk array
to the host or VM are set to read-only by the OS. Run the mount command to view the disk
array. For example, in the following result, if the attribute is not rw, this indicates that the file
system cannot be written. Generally, the read-only mode is ro.

/dev/mapper/VG_Glance-LV_Glance on /var/lib/glance type ext4 (rw,relatime,data=ordered)


/dev/mapper/VG_Backup-LV_Backup on /mybackup type ext4 (rw,relatime,data=ordered)

/dev/mapper/VG_DB-LV_DB on /var/lib/mariadb type ext4 (rw,relatime,data=ordered)

/dev/mapper/spathk on /var/lib/nova/instances/4aeabb8e-2735-427d-91bb-9cd8f41bca63/sysdisk

type ext4 (ro,relatime,data=ordered)

/dev/mapper/spathm on /var/lib/nova/instances/64ea09b1-c195-4957-900f-d731282523bd/sysdisk

type ext4 (rw,relatime,data=ordered)
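To list only the read-only mounts quickly, you can filter the output (a minimal sketch):
mount | grep "(ro"
Any line printed refers to a file system that has been remounted read-only.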

2. On the cluster controller node, for the mariadb and other devices mounted from the disk
array, the cluster can detect the read-only state and perform a switchover. Generally, the VM
can be recovered automatically. If the VM cannot be recovered, restart the cluster controller node.
3. Perform the following operations as required.
 If it is a ZTE disk array using ZXOPENCOS_V01.01.10.P5B1I192 or a later version
developed by ZTE, perform a hard restart on the Provider GUI, or use the nova reboot
--hard <uuid> command to restart the VM.
 For a VM using a volume, if it is a ZTE disk array using a version earlier than
ZXOPENCOS_V01.01.10.P5B1I192 developed by ZTE, upload the script to the
compute node where the VM is located and then hard reboot the VM.
 In the case of a Fujitsu disk array, you can hard reboot the VM from the Provider GUI or
use the nova reboot --hard <uuid> command to restart the VM.
 In the case of Ceph storage, you can hard reboot the VM from the Provider GUI or use
the nova reboot --hard <uuid> command to restart the VM.
4. For the VMs that use cluster software, such as the EMS and MANO, if the VMs still cannot
be recovered after hard reboot, contact the corresponding product support personnel.
5. If the fault persists, contact ZTE technical support.

Expected Result

After the network of the disk array is recovered for a period of time, the VM that uses a volume
operates properly.

8.2 Network Disconnection (Non-SDN Scenario, VLAN)

8.2.1 Cannot Ping the VM From the External Debugging Machine


Symptom

You can successfully ping the external debugging machine from the VM, but cannot ping the
VM from the external debugging machine.

Action

1. Perform the following steps to obtain the addresses of the tap and qvo ports.


a. Check the VM ID.

# nova list
+--------------------------------------+------+--------+------------+-------------+---------------------+
| ID                                   | Name | Status | Task State | Power State | Networks            |
+--------------------------------------+------+--------+------------+-------------+---------------------+
| 50fe1063-3a3a-4a5a-8bdd-40f8c61f3656 | test | ACTIVE | -          | Running     | vlannet=192.168.1.2 |
+--------------------------------------+------+--------+------------+-------------+---------------------+

b. On the control node, run the nova interface-list VM ID command to view all the ports of
the VM.

# nova interface-list 50fe1063-3a3a-4a5a-8bdd-40f8c61f3656
+------------+--------------------------------------+--------------------------------------+--------------+--------------------------------------------+
| Port State | Port ID                              | Net ID                               | IP addresses | MAC Addr                                   |
+------------+--------------------------------------+--------------------------------------+--------------+--------------------------------------------+
| ACTIVE     | 5e0c98c1-9db3-44b7-be12-9a0a3544bd23 | 01a537f2-91fa-4740-bf26-328dae440884 | ...          | ["fa:16:3e:7a:ea:c2", "fa:16:3e:7a:ea:c2"] |
+------------+--------------------------------------+--------------------------------------+--------------+--------------------------------------------+

Find the port ID corresponding to the MAC address of the unreachable port of the VM.
For example, if the MAC address of an unreachable port is fa:16:3e:7a:ea:c2, you can
see that the corresponding port ID is 5e0c98c1-9db3-44b7-be12-9a0a3544bd23. In that
case, the tap port is tap5e0c98c1-9d, and the qvo port is qvo5e0c98c1-9d. The
"5e0c98c1-9d" carried by tap and qvo is the first 11 characters of the port ID.
2. Run the tcpdump command respectively on the tap, qvo, and physical ports, and check
whether ARP packets have responses.
 Yes → Step 5.
 No → Step 3.
3. Check whether the port has firewalls.


Note
The VM uses a Windows operating system, and the firewall function may be enabled on the VM.

 Yes → Step 4.
 No → Step 5.
4. Disable the firewall function on the VM. Check whether the fault is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.

Expected Result

The external debugging machine and VM can be successfully pinged from each other.

8.2.2 Cannot Ping the External Debugging Machine From the VM


Symptom

You can successfully ping the VM from the external debugging machine, but cannot ping the
external debugging machine from the VM.

Action

1. Verify that the switch, port mode, and VLAN configurations used between the VM and
debugging machine are correct.
2. Verify that the internal port of the VM is in up status.
3. Verify that the IP address acquisition mode is correct.
Perform the following operations as needed.

If... Then...

Addresses are allocated by DHCP. Verify that the IP address acquisition mode in the
configuration file of the internal port of the VM is DHCP.

Static IP addresses are used. Verify that correct IP addresses and gateway are configured.

Floating IP addresses are used. Verify that IP addresses in the network segment of the port
are available.

4. Ping the gateway from the VM, attempt to use tcpdump to capture packets on the physical
port, and check whether ARP packets can be captured.
 Yes → Step 5.
 No → Step 7.
5. Capture packets on the switch, and check whether packets are sent out.


 Yes → Step 8.
 No → Step 6.
6. Run the ethtool command to check whether the NIC type is supported by the TECS.
 Yes → Step 8.
 No → Step 9.
7. Capture packets respectively on the tap, qvb, and qvo ports, and check whether ARP
packets can be captured. For example, run the tcpdump -i qvoa1e55c9e-4f command to
capture packets on the qvo port.
 Yes → Step 9.
 No → Step 8.
8. Verify that the firewall has no problem, and check whether the external debugging machine
can be successfully pinged from the VM.
 Yes → End.
 No → Step 9.
9. Contact ZTE technical support.

Expected Result

The external debugging machine and VM can be successfully pinged from each other.

8.2.3 Cannot Ping Ports on a VLAN


Symptom

Ports on a VLAN cannot be successfully pinged.

Probable Cause

The number of VLANs created in the network exceeds 64, which is the maximum number of
VLANs supported by an SR-IOV NIC. Therefore, some VLANs are invalid.

Action

1. Check whether too many VLANs are created in the network.


 Yes → Step 2.
 No → Step 3.
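One rough way to count the networks (and thus VLANs) defined on the controller node (a
sketch; the table borders printed by the CLI add a few extra lines to the count):
neutron net-list | wc -l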
2. Delete redundant VLANs. The number of VLANs configured for an SR-IOV NIC should not
exceed 64. Check whether the fault is removed.
 Yes → End.
 No → Step 3.
3. Contact ZTE technical support.


Expected Result

All the ports on the VLAN can be successfully pinged.

8.2.4 OVS VM Cannot Be Connected


Symptom

The OVS VMs in the same subnetwork of the same network cannot be connected to each
other.

Action

1. Check whether the status of the VMs is active and whether the IP addresses of the VMs are
correctly configured. If there are only the IP addresses allocated by the TECS instead of the
IP address configured, refer to 8.4 DHCP Faults for troubleshooting.
2. Check whether the service status of network.service, openvswitch.service, and
neutron-openvswitch-agent.service is normal.
 Yes → Step 4.
 No → Step 3.
3. Enable network.service, openvswitch.service, and neutron-openvswitch-agent.service and
ensure that the service status of the services is normal.
4. Initiate a ping operation on a VM and capture packets over the TAP port of the VM. The
TAP port name is "tap" plus the first 11 characters of the port ID.

[root@tecs162 (keystone_admin)]# tcpdump -i tapba6b6e42-67

tcpdump: WARNING: tapba6b6e42-67: no IPv4 address assigned

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on tapba6b6e42-67, link-type EN10MB (Ethernet), capture size 65535 bytes

17:36:11.101254 ARP, Request who-has 25.0.0.100 tell 25.0.0.3, length 28

If there is no ARP packet, determine whether the VM has sent packets at all. Some VM
images include tcpdump, in which case you can capture packets inside the VM as well.
5. Capture packets over the QVO port.
 If there are DHCP packets on the TAP port but no DHCP packets on the QVO port, the
packets are filtered by security groups. This is usually because the MAC address of the
port configured on the VM is inconsistent with that shown on the TECS, or the IP address
configured on the VM is inconsistent with that allocated by the TECS (the subnetwork is
configured and DHCP is enabled). In this case, add security group rules, create a port
without security groups, or disable security groups.
 If there is an ICMP response on the QVB port but no ICMP response on the TAP port,
this is probably because the security groups of the tenant filter out packets of certain types.
6. Capture packets over the physical ports of different blades of the two VMs.
a. If there are request packets on the QVO port but no request packets on the physical
port, check whether the QVO port has a tag (the tag is not necessarily the same as the
VLAN ID). If the tag does not exist, check whether neutron-openvswitch-agent.service
is in good condition.

[root@tecs162 (keystone_admin)]# ovs-vsctl show

Bridge br-int

Port "qvo8f301bf7-de"

tag: 4

Interface "qvo8f301bf7-de"

b. Check whether the VLAN in the packets is consistent with that in the network. If not,
check the configuration of the network firewall.

[root@tecs162 (keystone_admin)]# tcpdump -i er0 -xx

tcpdump: WARNING: int-br-data1: no IPv4 address assigned

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on int-br-data1, link-type EN10MB (Ethernet), capture size 65535 bytes

17:40:32.193134 ARP, Request who-has 25.0.0.100 tell 25.0.0.3, length 28

0x0000: ffff ffff ffff fa16 3e15 3a6f 8100 0002

0x0010: 0806 0001 0800 0604 0001 fa16 3e15 3a6f

0x0020: 1900 0003 0000 0000 0000 1900 00644

7. If the fault persists, contact ZTE technical support.

Expected Result

The OVS VMs in the same subnetwork of the same network can be connected to each other.

8.2.5 Floating IP Address Cannot Be Pinged


Symptom

The floating network IP address bound to the VM port and the IP address of the external
network cannot be pinged.


Action

1. Correctly configure the floating IP address of the router. If the router binds the subnetwork
of the external network and the subnetwork of the internal network, bind the floating IP
address to the port of the VM.
2. Check whether network.service, openvswitch.service, neutron-openvswitch-agent.service,
and neutron-l3-agent.service are all enabled. If not, manually enable the services.
3. Run the nova list command to check whether the VM bound to the floating IP address is in
good condition.
 Yes → Step 5.
 No → Step 4.
4. Troubleshoot the VM and make sure that the VM bound to the floating IP address is in good
condition.
5. Run the neutron port-show command to check whether the status of the port of the VM is
"Active".
 Yes → Step 7.
 No → Step 6.
6. Troubleshoot the port and make sure that the port of the VM is "Active".
7. Check whether the type of the network connecting to external networks is "external
network". Run the neutron net-show command to check whether the value of
router:external is "True".
 Yes → Step 9.
 No → Step 8.
8. Configure the network type and make sure that the value of router:external is "True".
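A sketch of one way to set this flag with the neutron CLI of this release (the network name is
a placeholder):
neutron net-update <external-network> --router:external=True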
9. Check whether multiple external networks are configured.
 Yes → Step 10.
 No → Step 11.
10. If multiple external networks are configured, delete the unused ones to ensure that there is
only one external network.
11. Run the ovs-vsctl show command on the network node to check information about the
br-ex bridge.

Bridge br-ex

Port "eth1"

Interface "eth1"

Port br-ex

Interface br-ex

type: internal


Port "qg-1c3627de-1b"

Interface "qg-1c3627de-1b"

type: internal

 If there is no br-ex bridge, run the ovs-vsctl add-br br-ex command to create a br-ex
bridge.
 If there is no port in the bridge, run the ovs-vsctl add-port br-ex eth1 command,
where eth1 indicates the name of the physical network adapter connected to the
external network.
12. On the VM, ping the IP address of the network connected to the external network. Run the
following command to capture packets on the br-ex bridge:

[root@opencos135 (keystone_admin)]# tcpdump -i br-ex host 10.43.166.1 -xx

//10.43.166.1 is the IP address of the network connected to the external network and
can be modified as required.

tcpdump: WARNING: br-ex: no IPv4 address assigned

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on br-ex, link-type EN10MB (Ethernet), capture size 65535 bytes

01:49:27.572516 ARP, Request who-has 10.43.166.1 tell 10.43.167.45, length 28

0x0000: ffff ffff ffff fa16 3e3e 66bf 0806 0001

0x0010: 0800 0604 0001 fa16 3e3e 66bf 0a2b a72d

0x0020: 0000 0000 0000 0a2b a601

If packets can be successfully captured, it indicates that the router and the floating IP
address are correctly configured. Otherwise, check the configuration of the router and the
floating IP address.
13. Capture packets on the external network adapter of the router. If packets are successfully
captured, it indicates that there is no fault in the TECS. Check the connection and
switching settings of the external network.
14. Troubleshoot the connection of the external network and correctly configure the switching
settings.
15. If the fault persists, contact ZTE technical support.

Expected Result

The floating network IP address bound to the VM port and the IP address of the external
network can access each other.


8.2.6 The Service VM Media Plane Using the SRIOV Port Cannot Be
Connected
Symptom

The status of the VM is correct, while the media plane of the board cannot be connected on
the service layer. No obvious exception is found during service troubleshooting. The underlying
network may be congested.

Action

1. Log in to the control node and perform the following operations:


a. Run the source keystonerc command.
b. Run the nova list --all-tenants command to check the UUID of the VM.

[root@controller15 (admin)]# nova list --all-tenants --fields name,host,status

+--------------------------------------+--------------------+------------+--------+

| ID | Name | Host | Status |

+--------------------------------------+--------------------+------------+--------+

| 65f63c73-689f-47c8-8999-457060186c55 | CG1 | computer7 | ACTIVE |

| 9157f4c2-78aa-43c5-b189-c3f290e7a236 | CG2 | computer6 | ACTIVE |

| f3a04950-03af-4c9e-8842-2b3a080686a1 | EMS01 | computer12 | ACTIVE |

| c919c614-2dd5-4a2a-8526-f591adb66bce | EMS02 | computer11 | ACTIVE |

| e2ca8114-f858-4579-a8f7-20935f15993c | MPU_4/20/0 | computer11 | ACTIVE |

| 95849054-f82e-46c3-a035-f897b56049e0 | MPU_4/21/0 | computer12 | ACTIVE |

| df5f603d-c324-4995-b1ad-7eb261706931 | PFU_4/0/0 | computer3 | ACTIVE |

| ffcdaa71-7763-4e6f-8c9e-f7ae2d5d7e69 | PFU_4/1/0 | computer2 | ACTIVE |

+--------------------------------------+--------------------+------------+--------+

c. Run the nova show VM uuid command.

[root@controller16 (admin)]# nova show df5f603d-c324-4995-b1ad-7eb261706931

d. Check the compute node of the VM and the instance name.


2. Log in to the compute node where the VM is located and perform the following operations to
check whether the VLAN of the media plane port of the VM is correct:
a. Run the virsh list command.

[root@computer3 ]# virsh list

Id name status

----------------------------------------------------

2 instance-00000848 running

3 instance-00000849 running

4 instance-0000084e running


5 instance-0000084d running

b. Run the virsh domiflist command.

[root@computer3 ]# virsh domiflist instance-0000084e

interface type source model MAC

-------------------------------------------------------

tapf37d0c76-7d bridge qbrf37d0c76-7d virtio 00:d8:03:50:70:11

- hostdev - - 00:d8:03:50:70:12

- hostdev - - 00:d8:03:50:70:12

c. Run the ip link command.

[root@computer3 ]# ip link | grep 00:d8:03:50:70:12

vf 2 MAC 00:d8:03:50:70:12, vlan 1701, link-state auto

vf 2 MAC 00:d8:03:50:70:12, link-state auto

In this example, the service VM corresponds to two bonded media-plane ports: one VF
has VLAN 1701 and the other has no VLAN. In this case, both packet receiving and
transmitting are abnormal.
3. Perform the following operations as required:

If... Then...

The VLAN is lost.
a. Run the nova reboot command to reboot the VM.
b. Check whether the VLAN exists and whether the service operates properly.
c. If the fault cannot be resolved after the VM is rebooted, run the reboot command to
reboot the compute node.

The VLAN is normal.
Go to the next step.

4. In a scenario supporting SRIOV port bonding, run the ovs-appctl bond/show command
on the compute node where the VM is located to check whether the status of the physical
network adapter is normal. In the following example, make sure that both ens2f0 and ens2f1
are enabled. If they are disabled, the network port or switch is faulty.

[root@computer12 ]# ovs-appctl bond/show

---- bond1 ----

bond_mode: balance-tcp

bond-hash-basis: 0

updelay: 30000 ms

downdelay: 0 ms

next rebalance: 6879 ms

lacp_status: negotiated

slave ens2f0: enabled //ens2f0 is enabled.


may_enable: true

hash 8: 0 kB load

hash 33: 7 kB load

hash 34: 8 kB load

hash 35: 103 kB load

hash 59: 16 kB load

slave ens2f1: enabled //ens2f1 is enabled.

active slave

may_enable: true

hash 98: 0 kB load

hash 139: 48 kB load

hash 150: 8 kB load

hash 183: 7 kB load

hash 207: 16 kB load

hash 212: 24 kB load

5. Run the ifconfig ens2f0 down and then ifconfig ens2f0 up commands on the compute node
and check whether the corresponding network port can be recovered.
6. If the network port cannot be recovered, run the shutdown and then no shutdown
commands on the corresponding blade port on the switch side and try again.
7. If the network port cannot be recovered, restart the compute node.
8. If the fault persists, perform a switchover operation between the active and standby service
VMs and contact ZTE technical support.

Expected Result

The media planes of the service VMs can properly communicate with each other.

8.2.7 VM (OVS+DPDK Type) Communication Failure


Symptom

The VMs in the same subnet in the same network cannot communicate with each other.

Introduction to Common Debugging Commands of OVS+DPDK (DVS)

1. Run the dvs show-dpifstats command. The query result is shown in Figure 8-1.

Figure 8-1 Result of the dvs show-dpifstats Command


For a description of the parameters, refer to Table 8-1.

Table 8-1 Parameter Descriptions


Parameter Description

if_name Port name on the DVS bridge.

Status Port status.

Rx_packets Number of packets that the DVS receives from the corresponding port.

Rx_bytes Number of bytes that the DVS receives from the corresponding port.

Rx_drop Number of packets lost when the DVS receives packets from the corresponding port.
If this counter has a value on a physical port, packets are lost because of the processing
performance of the DVS.

Tx_packets Number of packets that the DVS successfully sends to the corresponding port.

Tx_bytes Number of bytes that the DVS successfully sends to the corresponding port.

Tx_dropped Number of packets discarded when the DVS sends packets to the corresponding
port, including packets discarded because of the MTU restriction and tx_overrun.

Tx_overrun Number of packets discarded when the DVS sends packets to the corresponding
port because the sending queue is full. This counter indicates that the traffic exceeds the
receiving capability of the VM and the queue is full.

2. Run the dvs dump-dpflow br-int command to query the flow table. The result is as follows:

[root@test tecs]# dvs dump-dpflow br-int

ovs-appctl dpctl/dump-flows --names netdev@ovs-netdev

flow-dump from pmd on cpu core: 1

recirc_id(0),in_port(vhu6f1-80),packet_type(ns=0,id=0),eth(src=fa:16:3e:bc:6e:a7,

dst=fa:00:00:12:30:10),eth_type(0x0800),ipv4(frag=no), packets:11462223, bytes:1650560112,

used:0.001s, actions:push_vlan(vid=401,pcp=0),enp33s0f0

flow-dump from pmd on cpu core: 66

recirc_id(0),in_port(enp33s0f0),packet_type(ns=0,id=0),eth(src=fa:00:00:12:30:10,

dst=fa:16:3e:bc:6e:a7),eth_type(0x8100),vlan(vid=402,pcp=0),encap(eth_type(0x0800),


ipv4(frag=no)), packets:11462334, bytes:1696425432, used:0.001s, actions:pop_vlan,

vhu0deea455-6f

For a description of the parameters, refer to Table 8-2.

Table 8-2 Query Result Description


Parameter Description

in_port Port of the DVS for receiving packets, that is, packets enter the DVS through this port.

actions Processing result of a packet.

packets/bytes Numbers of packets and bytes that hit the entry.

eth/eth_type/vlan/ipv4 Packet information, including source and destination MAC addresses,
Ethernet type, VLAN, and IP header.

In this example, there are two flow tables. Take the first one as an example: the DVS has
received packets from port vhu6f1-80, with the keyword information eth(src=fa:16:3e:bc:6e:a7,
dst=fa:00:00:12:30:10), eth_type(0x0800), ipv4(frag=no). The final processing is
actions:push_vlan(vid=401,pcp=0),enp33s0f0, that is, VLAN 401 is added and the packets are
sent out from port enp33s0f0. So far, 11462223 packets/1650560112 bytes in total have hit
this flow entry.
3. Run the dvs_tcpdump command to mirror packets for packet capture.
The dvs_tcpdump command is a debugging command encapsulated by the DVS and used
to reduce the threshold for users to use. It can be used to locate the functional problems in
low-traffic environment. Because packet capture has a great impact on the performance, it is
not recommended to use it in high-traffic environment.
Syntax: dvs_tcpdump -i <port name> [-w /home/tecs/bond1.pcap]
Where,
-w /home/tecs/bond1.pcap is an optional parameter, which means that the captured packets
are saved into the /home/tecs/bond1.pcap file. Select a location with sufficient disk space for
the file; otherwise the capture file may exhaust the partition.
The port name can be the virtual port name or the bond port name (for non-bond interface,
capture packets directly at the physical port). Run the ovs-appctl bond/show command to
query the name of the bond interface. For example, bond1 is the name of the bond interface
in the current environment.

[root@R5300G4-2 tecs]# ovs-appctl bond/show

---- bond1 ----

bond_mode: balance-tcp


bond may use recirculation: no, Recirc-ID : -1

bond-hash-basis: 0

updelay: 0 ms

downdelay: 0 ms

lacp_status: negotiated

lacp_fallback_ab: false

active slave mac: 28:7b:09:c6:23:e3(ens1f1)

slave ens1f1: enabled

active slave

may_enable: true

slave ens4f1: enabled

may_enable: true

VM Communication Scenarios

Figure 8-2 shows two common scenarios for OVS+DPDK (DVS) communication. The
communication of VMs on the same node refers to the communication between VM1 and VM2.
The communication of VMs on different nodes refers to the communication between VM1/VM2
and VM3. This rule will be followed in the following description.

Figure 8-2 OVS+DPDK (DVS) VM Communication Scenarios

As a virtual switch, the main task of the DVS is to send service packets from interface A to
interface B. The basic principle for locating a network connection fault is to check whether the
packets have passed through the path between the VMs. You can use the following methods:
statistics, flow table, and packet capture.

Location of the Fault of VM Communication on the Same Node

1. Check whether the states of the source and destination VMs are active, and whether the
IP addresses of the VMs are configured correctly. If no IP address is configured, but the
TECS allocates the addresses, refer to 8.4 DHCP Faults for troubleshooting.
2. Use the nova command to query the compute node name corresponding to the VM (that
is, hypervisor_hostname, in this example, the name of the compute node is tecs162) and
instance_name (in this example, instance-0000042d).

nova show TEST-NUM0-8C8G |grep -E 'instance_name|hypervisor_hostname'

| OS-EXT-SRV-ATTR:hypervisor_hostname | tecs162

| OS-EXT-SRV-ATTR:instance_name | instance-0000042d

3. Log in to the above compute node, run the following commands as the root user to query
the network interface information corresponding to the VM, and check whether Type is
vhostuser.

[root@tecs162 tecs]# virsh domiflist instance-00000433

Interface Type Source Model MAC

-------------------------------------------------------

vhu4b2-26 vhostuser - virtio fa:16:3e:44:3a:cf

[root@tecs162 tecs]# virsh domiflist instance-0000042d

Interface Type Source Model MAC

-------------------------------------------------------

vhu6f1-80 vhostuser - virtio fa:16:3e:bc:6e:a7

 Yes → Step 4.
 No → Find the corresponding troubleshooting guide in accordance with the NIC type.
4. Check whether the openvswitch.service and neutron-openvswitch-agent.service are in
normal state.
 Yes → Step 6.
 No → Step 5.
5. Start the openvswitch.service and neutron-openvswitch-agent.service and ensure that they
are in normal status, and check whether the VM network is normal.
 Yes → End.
 No → Step 6.
6. Use the debugging function to query port information.

[root@tecs162 tecs]# dvs show-dpifstats


if_name status rx_packets rx_bytes rx_dropped tx_packets tx_bytes tx_dropped tx_overrun
vhu6f1-80 up 0 0 0 258706914811 41190539245134 23325182902 6325182902
vhu4b2-26 up 0 0 0 558706914800 71190539245138 4325182902 7325182902

7. Check whether the corresponding status field is up.


 Yes → Step 10.
 No → Step 8.
8. Check whether the NIC in the VM is enabled.
 Yes → Contact the DVS technical support or try to restart the VM.
 No → Step 9.
9. Enable the NIC in the VM, and check whether the alarm is cleared.
 Yes → End.
 No → Step 10.
10. Check whether the tag configuration on the faulty virtual interface is normal. Check the ovs
bridge, and check whether the tag of the vhu interface is the same as the VLAN of the
network. If there is no tag, check whether the neutron-openvswitch-agent.service status
is normal.
[root@tecs162]# ovs-vsctl show
Bridge br-int
fail_mode: secure
Port "vhu4b2-26"
tag: 2
Interface "vhu4b2-26"
type: dpdkvhostuserclient
options: {vhost-server-path="/var/run/openvswitch/vhu4b2-26"}
Port "vhu6f1-80"
tag: 2
Interface "vhu6f1-80"
type: dpdkvhostuserclient
options: {vhost-server-path="/var/run/openvswitch/vhu6f1-80"}
 Yes → Step 11.
 No → Step 14.
11. Use VM1 to ping VM2. The normal packet interaction is VM1 (vhu6f1-80) <-->DVS<-->
VM2 (vhu4b2-26). On the compute node, check the statistics every 10 seconds to check
whether the rx_packets of the virtual port corresponding to VM1 is increasing.


[root@SUGON-AMD tecs]# dvs show-dpifstats |grep -E "status|vhu6f1-80"
if_name status rx_packets rx_bytes rx_dropped tx_packets tx_bytes tx_dropped tx_overrun
vhu6f1-80 up 0 0 0 558706914800 71190539245138 2325182902 2325182902
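To repeat this check automatically every 10 seconds, you can wrap the query in watch (a
minimal sketch):
watch -n 10 "dvs show-dpifstats | grep -E 'status|vhu6f1-80'"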

 Yes → Step 12.


 No → Check whether packets are sent in the VM.
12. Capture packets on the vhu port vhu6f1-80 of VM1.

[root@tecs162]# dvs_tcpdump -i vhu6f1-80

tcpdump: WARNING: vhuba6b6e42-67: no IPv4 address assigned

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tapba6b6e42-67, link-type EN10MB (Ethernet), capture size 65535 bytes
17:36:11.101254 ARP, Request who-has 25.0.0.100 tell 25.0.0.3, length 28

Check whether there are ARP and ICMP request packets.


 Yes → Step 13.
 No → Check whether packets are sent in VM1 (if the VM supports tcpdump, capture
packets in the VM).
13. In the above captured packets, check whether there are ARP and ICMP response packets.
 Yes → Step 14.
 No → Check whether packets are sent in VM1 (if the VM supports tcpdump, capture
packets in the VM).
14. Capture packets at vhu port vhu4b2-26 of VM2.
[root@tecs162]# dvs_tcpdump -i vhu4b2-26
Check whether there are ARP and ICMP request packets.
 Yes → Step 15.
 No → Contact TECS/DVS technical support.
15. Check whether there are ARP and ICMP response packets in the captured packets at port
vhu4b2-26.
 Yes → Contact TECS/DVS technical support.
 No → Check whether packets are sent in VM2 (if the VM supports tcpdump, capture
packets in the VM).

Location of the Fault of VM Communication on Different Nodes

1. Check whether the states of the source and destination VMs are active, and whether the
IP addresses of the VMs are configured correctly. If no IP address is configured, but the
TECS allocates the addresses, refer to 8.4 DHCP Faults for troubleshooting.


2. Use the nova command to query the compute node names corresponding to the VMs
(that is, hypervisor_hostname, in this example, the names of the compute nodes are
tecs162 and tecs163) and instance_name (in this example, instance-0000042d and
instance-00000433).

nova show TEST-NUM0-8C8G |grep -E 'instance_name|hypervisor_hostname'

| OS-EXT-SRV-ATTR:hypervisor_hostname | tecs162

| OS-EXT-SRV-ATTR:instance_name | instance-0000042d

nova show TEST-NUM0-8C8G |grep -E 'instance_name|hypervisor_hostname'

| OS-EXT-SRV-ATTR:hypervisor_hostname | tecs163

| OS-EXT-SRV-ATTR:instance_name | instance-00000433

3. Log in to the above compute nodes, run the following commands as the root user to query
the network interface information corresponding to the VM, and check whether Type is
vhostuser.

[root@tecs162 tecs]# virsh domiflist instance-00000433

Interface Type Source Model MAC

-------------------------------------------------------

vhu4b2-26 vhostuser - virtio fa:16:3e:44:3a:cf

[root@tecs163 tecs]# virsh domiflist instance-0000042d

Interface Type Source Model MAC

-------------------------------------------------------

vhu6f1-80 vhostuser - virtio fa:16:3e:bc:6e:a7

 Yes → Step 4.
 No → Find the corresponding troubleshooting guide in accordance with the NIC type.
4. Log in to the above two compute nodes, and check whether the openvswitch.service and
neutron-openvswitch-agent.service are in normal state.
 Yes → Step 6.
 No → Step 5.
5. Start the openvswitch.service and neutron-openvswitch-agent.service and ensure that they
are in normal status, and check whether the VM network is normal.
 Yes → End.
 No → Step 6.
6. Use the debugging function to query port information.

[root@tecs162 tecs]# dvs show-dpifstats

if_name status rx_packets rx_bytes rx_dropped tx_packets tx_bytes tx_dropped tx_overrun
vhu6f1-80 up 0 0 0 258706914811 41190539245134 23325182902 6325182902

[root@tecs163 tecs]# dvs show-dpifstats

if_name status rx_packets rx_bytes rx_dropped tx_packets tx_bytes tx_dropped tx_overrun
vhu4b2-26 up 0 0 0 558706914800 71190539245138 4325182902 7325182902

7. Check whether the corresponding status field is up.


 Yes → Step 10.
 No → Step 8.
8. Check whether the NIC in the VM is enabled.
 Yes → Try to restart the VM.
 No → Step 9.
9. Enable the NIC in the VM, and check whether the alarm is cleared.
 Yes → End.
 No → Step 10.
10. Check whether the tag configuration on the faulty virtual interface is normal. Check the ovs
bridge, and check whether the tag of the vhu interface is consistent with the VLAN of the
network. If there is no tag, check whether the neutron-openvswitch-agent.service status is
normal.

[root@tecs163 tecs]# ovs-vsctl show

Bridge br-int

fail_mode: secure

Port "vhu4b2-26"

tag: 2

Interface "vhu4b2-26"

type: dpdkvhostuserclient

options: {vhost-server-path="/var/run/openvswitch/vhu4b2-26"}

Bridge "br-bond1"

Port "bond1"

Interface "ens1f1"

type: dpdk

options: {dpdk-devargs="0000:3b:00.1", n_rxq="2"}

Interface "ens4f1"

type: dpdk

options: {dpdk-devargs="0000:d9:00.1", n_rxq="2"}

[root@tecs162 tecs]# ovs-vsctl show

Bridge br-int


Port "vhu6f1-80"

tag: 33

Interface "vhu6f1-80"

type: dpdkvhostuserclient

options: {vhost-server-path="/var/run/openvswitch/vhu6f1-80"}

Bridge "br-bond1"

Port "bond1"

Interface "ens1f1"

type: dpdk

options: {dpdk-devargs="0000:3b:00.1", n_rxq="2"}

Interface "ens4f1"

type: dpdk

options: {dpdk-devargs="0000:d9:00.1", n_rxq="2"}

 Yes → Step 11.


 No → Contact TECS/NEUTRON technical support.
11. Use VM1 to ping VM3. The normal packet interaction is VM1 (vhu6f1-80)<-->DVS<-->
switch<-->DVS<-->VM3 (vhu4b2-26). On the compute node, check the statistics every
10 seconds to check whether the rx_packets of the virtual port corresponding to VM1 is
increasing.

[root@tecs162 tecs]# dvs show-dpifstats |grep -E "status|vhu6f1-80"
if_name status rx_packets rx_bytes rx_dropped tx_packets tx_bytes tx_dropped tx_overrun
vhu6f1-80 up 0 0 0 558706914800 71190539245138 2325182902 2325182902

 Yes → Step 12.


 No → Check whether packets are sent in the VM.
12. Capture packets on the vhu port vhu6f1-80 of VM1.

[root@tecs162]# dvs_tcpdump -i vhu6f1-80

tcpdump: WARNING: vhuba6b6e42-67: no IPv4 address assigned

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tapba6b6e42-67, link-type EN10MB (Ethernet), capture size 65535 bytes
17:36:11.101254 ARP, Request who-has 25.0.0.100 tell 25.0.0.3, length 28

Check whether there are ARP and ICMP request packets.


 Yes → Step 13.


 No → Check whether packets are sent in VM1 (if the VM supports tcpdump, capture
packets in the VM).
13. In the above captured packets, check whether there are ARP and ICMP response packets.
 Yes → Step 14.
 No → Check whether packets are sent in VM1 (if the VM supports tcpdump, capture
packets in the VM).
14. On the compute node tecs162, check the flow table by filtering the source MAC addresses,
and check whether the sending ports in the flow table of ARP and ICMP response
packets sent through the virtual port of VM1 contain the physical port, and whether the
encapsulated VLAN is correct. For the specific method, refer to the previous introduction to
basic commands.

[root@tecs162 tecs]# dvs dump-dpflow br-int |grep 'fa:16:3e:bc:6e:a7'

recirc_id(0),in_port(vhu6f1-80),packet_type(ns=0,id=0),eth(src=fa:16:3e:bc:6e:a7,

dst=fa:16:3e:44:3a:cf),eth_type(0x0800),ipv4(frag=no), packets:11462223,

bytes:1650560112, used:0.001s, actions:push_vlan(vid=401,pcp=0),ens1f1

 Yes → Step 15.


 No → Step 16.
15. Run the ovs-appctl bond/show command to check whether the bond interface of the DVS
is in normal status, that is, whether the following may_enable fields are true (at least one
of them must be true).

[root@tecs162 tecs]# ovs-appctl bond/show

---- bond1 ----

bond_mode: balance-tcp

bond may use recirculation: no, Recirc-ID : -1

bond-hash-basis: 0

updelay: 0 ms

downdelay: 0 ms

lacp_status: negotiated

lacp_fallback_ab: false

active slave mac: 28:7b:09:c6:23:d1(ens4f1)

slave ens1f1: enabled

may_enable: true

slave ens4f1: enabled

active slave

may_enable: true

 Yes → Contact TECS/DVS technical support.


 No → Check the connection of the compute node. If bond_mode is balance-tcp, check
whether the compute node configuration is consistent with the switch configuration.
16. On the compute node tecs163, check whether the physical interface receives ARP and
ICMP request packets from the peer end, and whether the actions parameter contains
the correct VLAN processing action and the correct virtual interface. For details, see the
previous introduction to basic commands.

[root@tecs162 tecs]# dvs dump-dpflow br-int |grep 'fa:16:3e:bc:6e:a7'

recirc_id(0),in_port(ens1f1),packet_type(ns=0,id=0),eth(src=fa:16:3e:bc:6e:a7,

dst=fa:16:3e:44:3a:cf),eth_type(0x8100),vlan(vid=402,pcp=0),encap(eth_type(0x0800),

ipv4(frag=no)), packets:11462334, bytes:1696425432, used:0.001s,

actions:pop_vlan,vhu4b2-26

Verify that the physical interface receives ARP and ICMP request packets from the peer
end.
 Yes → Step 17.
 No → Contact the intermediate switch maintenance personnel of the compute node to
locate the fault.
17. Verify that the actions in the flow table of the above ARP and ICMP response packet are
correct.
 Yes → Step 18.
 No → Contact TECS/DVS technical support.
18. Filter the MAC address flow table of VM1, and check whether there are ARP or ICMP
response packets coming out of the virtual port of VM3.
 Yes → Step 19.
 No → Contact the VM service processing engineer to locate the fault.
19. Check whether the destination port of the above flow table is the physical port and whether
VLAN encapsulation is correct.
 Yes → Step 20.
 No → Handle the fault by referring to Steps 15 and 16.
20. On the compute node tecs162, refer to Steps 16 and 17.
21. On the compute node tecs162, check whether the destination port of the flow table is the
virtual port of VM1.
a. If the qvo port has request packets but the physical port does not, check the ovs bridge
to see whether the qvo port has a tag (which is not necessarily the same as the VLAN ID
of the network). If there is no tag, check whether the neutron-openvswitch-agent.service
status is normal.

[root@tecs162 ~(keystone_admin)]# ovs-vsctl show
Bridge br-int


Port "qvo8f301bf7-de"

tag: 4

Interface "qvo8f301bf7-de"

b. Check whether the VLAN in the packet is the same as that in the network. If not, check
the network firewall.

[root@tecs162 ~(keystone_admin)]# tcpdump -i er0 -xx
tcpdump: WARNING: int-br-data1: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on int-br-data1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:40:32.193134 ARP, Request who-has 25.0.0.100 tell 25.0.0.3, length 28

0x0000: ffff ffff ffff fa16 3e15 3a6f 8100 0002

0x0010: 0806 0001 0800 0604 0001 fa16 3e15 3a6f

0x0020: 1900 0003 0000 0000 0000 1900 00644

22. If the fault persists, contact ZTE technical support.

Expected Result

OVS VMs in the same subnet in the same network can communicate with each other.

8.3 Network Disconnection (SDN Scenario, VXLAN)


8.3.1 OVS (User Mode) VMs Not Connected
Symptom

The communication between two OVS VMs is interrupted.

Action

1. Check whether the SDN network (SDN topology) is normal. If there is a problem in sending or
receiving packets, it is recommended that you check the SDN topology. The topology can
correctly reflect the status of the tunnels.
 Yes → Step 2.
 No → Step 4.
2. Enter the VM and check whether the peer end has an IP address.
 Yes → Step 3.
 No → Refer to 8.3.2 Failed to Obtain an IP Address.
3. Set the ICMP rules so that the IP address is not filtered by the security group. Check
whether the fault is fixed.
 Yes → End.


 No → Step 4.
4. Contact ZTE technical support.

8.3.2 Failed to Obtain an IP Address


Symptom

The VM cannot obtain an IP address.

Action

1. Check whether the SDN network (SDN topology) is normal. If there is a problem in sending
or receiving packets, it is recommended that you check the SDN topology. The topology can
correctly reflect the status of tunnels.
 Yes → Step 2.
 No → Step 7.
2. Check whether the DHCP function is enabled for the subnet of the network where the VM is
created.
Method:

[root@NFV-D tecs(keystone_v3)]# neutron subnet-show f0b6510e-c782-45b4-9930-962800a2cb48

+-------------------+------------------------------------------+

| Field | Value |

+-------------------+------------------------------------------+

| allocation_pools | {"start": "1.1.1.2", "end": "1.1.1.254"} |

| cidr | 1.1.1.0/24 |

| created_at | 2020-06-03T08:26:49 |

| description | |

| dns_nameservers | |

| enable_dhcp | True |

| gateway_ip | 1.1.1.1 |

| host_routes | |

| id | f0b6510e-c782-45b4-9930-962800a2cb48 |

| ip_version | 4 |

| ipv6_address_mode | |

| ipv6_ra_mode | |

| name | test_subnet |

| network_id | cfa11c3c-915e-4670-8bed-e69a668ff440 |

| subnetpool_id | |

| tenant_id | 2492dad68d3e45798d98b0a2bd3a8300 |


| updated_at | 2020-06-03T08:55:27 |

+-------------------+------------------------------------------+

If enable_dhcp is True, this indicates that DHCP is enabled. If enable_dhcp is False, DHCP
is disabled.
 Yes → Step 4.
 No → Step 3.
3. IP addresses cannot be obtained automatically. You need to manually configure an IP
address. After an IP address is configured, check whether the fault is removed.
 Yes → End.
 No → Step 5.
4. Use dhclient and check whether an IP address can be obtained.
 Yes → End. The fault is fixed. The cause is that the image does not automatically obtain
an IP address when the VM is started.
 No → Step 5.
5. Check whether DHCP packets are allowed by the security group.
 Yes → Step 7.
 No → Step 6.
6. Set the security group so that UDP packets can pass through. Check whether the fault is
fixed.
 Yes → End.
 No → Step 7.
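A sketch of rules that allow DHCP (UDP ports 67 and 68) through a security group, using the
neutron CLI of this release (the security group name is a placeholder):
neutron security-group-rule-create --direction ingress --protocol udp --port-range-min 67 --port-range-max 68 <security-group>
neutron security-group-rule-create --direction egress --protocol udp --port-range-min 67 --port-range-max 68 <security-group>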
7. Contact ZTE technical support.

8.4 DHCP Faults

8.4.1 Cannot Obtain IP Addresses Distributed by DHCP


Symptom

The IP address of a VM cannot be obtained.

Probable Cause

 The DHCP is abnormal.


 Network connection is abnormal.
 The number of VLANs configured for an SR-IOV NIC exceeds 64.


Action

1. On the network node, run the following command to check whether the neutron-dhcp-agent
service is normal. In the result of this command, if the Active field is "active", the service is
normal. Otherwise, the service is abnormal.
systemctl status neutron-dhcp-agent
 Yes → Step 4.
 No → Step 2.
2. Run the following command to restart the neutron-dhcp-agent service:
systemctl restart neutron-dhcp-agent
3. Run the following command to check whether the neutron-dhcp-agent service is normal.
systemctl status neutron-dhcp-agent
 Yes → Step 4.
 No → Step 9.
4. Check whether the connectivity between the network node and the computing node where
the VM is located is normal. You can manually configure an IP address for the VM, and
check whether the VM can be successfully pinged.
 Yes → Step 7.
 No → Step 5.
5. Check whether VLAN configuration or network connection is abnormal.
 Yes → Step 6.
 No → Step 7.
6. Modify VLAN configuration or network connection. Check whether the fault is removed.
 Yes → End.
 No → Step 7.
7. Check whether the number of VLANs exceeds 64.
 Yes → Step 8.
 No → Step 9.
8. Re-plan the network, so that the number of VLANs does not exceed 64. Check whether the
fault is removed.
 Yes → End.
 No → Step 9.
9. Contact ZTE technical support.

Expected Result

The IP address of the VM can be obtained.


8.4.2 Connection Failure If the Address Distributed by DHCP Is Not Used


Symptom

After a VM is started, it is unreachable if the IP address distributed by the DHCP to the VM is
not used.

Probable Cause

Firewall extension checks IP addresses. To use addresses other than those distributed by the
DHCP, you should disable firewall extension.

Action

1. Plan the network properly, and check whether the DHCP function is needed.
 Yes → Step 3.
 No → Step 2.
2. Delete the subnets in the network, and check whether the fault is removed.
 Yes → End.
 No → Step 6.
3. Disable the firewall extension as follows (a read-back check is sketched after this procedure):
 Modification on the control node:
a. Change enable_security_group in the /etc/neutron/plugin.ini file to False:
openstack-config --set /etc/neutron/plugin.ini securitygroup enable_security_group False
b. If port_security exists in extension_drivers in the /etc/neutron/plugin.ini file, delete it.
c. When there are no other services, restart the service:
openstack-service restart
 Modification on the compute node:
a. Change enable_security_group in the /etc/neutron/plugins/ml2/openvswitch_agent.ini file to False:
openstack-config --set /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup enable_security_group False
b. Modify firewall_driver:
openstack-config --set /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup firewall_driver neutron.agent.firewall.NoopFirewallDriver
c. When there are no other services, restart the service:
openstack-service restart
4. Check whether the neutron service status is normal and whether the network is normal.
systemctl status neutron-server
 Yes → End.


 No → Step 5.
5. Run the following command to restart the iptables service on the compute node, and check whether the fault is removed:
service iptables restart
 Yes → End.
 No → Step 6.
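After the modifications in step 3, the new values can be read back before the services are restarted. This is a sketch that assumes the --get option of the openstack-config utility (a crudini wrapper) is available on the nodes.

openstack-config --get /etc/neutron/plugin.ini securitygroup enable_security_group
openstack-config --get /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup enable_security_group
openstack-config --get /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup firewall_driver

The first two commands should print False, and the third should print neutron.agent.firewall.NoopFirewallDriver.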

Expected Result

Communication with the VM is normal.

8.5 VM's NIC Unavailable


Symptom

After a VM is successfully created, its NIC cannot be found.

Probable Cause

 The MAC address of the VM is not unique.
 The tenant port used for creating the VM is not under the same tenant as the tenant port used by the VM.

Action

1. Check whether the MAC address of the VM conflicts with an existing address (see the port search sketched after this procedure).
 Yes → Step 2.
 No → Step 3.
2. Set the MAC address of the VM to be unique. Check whether the fault is removed.
 Yes → End.
 No → Step 3.
3. Check whether the tenant port used for creating the VM is under the same tenant as the
tenant port used by the VM.
 Yes → Step 5.
 No → Step 4.
4. Modify the two tenant ports so that they are under the same tenant. Check whether the fault
is removed.
 Yes → End.
 No → Step 5.
5. Contact ZTE technical support.
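For the check in step 1, the following is a minimal sketch assuming the OpenStack CLI; the MAC address fa:16:3e:aa:bb:cc is illustrative only and must be replaced with the MAC address of the VM.

openstack port list | grep -i "fa:16:3e:aa:bb:cc"

More than one matching port, or a match on a port that belongs to another VM, indicates a MAC address conflict.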


Expected Result

The NIC of the VM is available.

8.6 Control Console Cannot Connect to VM


Symptom

A VM is successfully created and enters "running" status, but the control console cannot
connect to the VM.

Probable Cause

Some configuration is incorrect or a service is abnormal.

Action

1. Run the following command to restart the nova-novncproxy service:
systemctl restart openstack-nova-novncproxy
2. On the control node, run the following command to check whether the nova-novncproxy
service is normal. In the output of this command, if the Active field is "active", it indicates
that the service is successfully started. Otherwise, it indicates that the service is not
successfully started.
systemctl status openstack-nova-novncproxy
 Yes → Step 3.
 No → Step 6.
3. Modify the /etc/nova/nova.conf file on the control node. Set "novncproxy_host" to the same value as "my_ip", and set "novncproxy_port" to 6180. For example:

my_ip=1.2.3.4

...

novncproxy_host=1.2.3.4

novncproxy_port=6180

4. Modify the /etc/nova/nova.conf file on the compute node. Set "vncserver_listen" in the "vnc" section to the same value as "my_ip" in the "default" section, and set "novncproxy_base_url" in the "vnc" section to "https://public-zte.dns-252:6080/vnc_auto.html" (a non-interactive sketch follows this procedure). For example:

[default]

my_ip=1.2.3.4

[vnc]

vncserver_listen=1.2.3.4


novncproxy_base_url=https://public-zte.dns-252:6080/vnc_auto.html

// public-zte.dns-252 is the public-vip address of the control node.

5. Check whether the console can be connected to the VM.


 Yes → End.
 No → Step 6.
6. Contact ZTE technical support.
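The edits in steps 3 and 4 can also be applied non-interactively. This is a sketch, not the only supported method; the section names mirror the examples above and must be verified against the actual file on each node, and 1.2.3.4 is the sample address.

# Control node
openstack-config --set /etc/nova/nova.conf default my_ip 1.2.3.4
openstack-config --set /etc/nova/nova.conf default novncproxy_host 1.2.3.4
openstack-config --set /etc/nova/nova.conf default novncproxy_port 6180
systemctl restart openstack-nova-novncproxy
# Compute node
openstack-config --set /etc/nova/nova.conf default my_ip 1.2.3.4
openstack-config --set /etc/nova/nova.conf vnc vncserver_listen 1.2.3.4
openstack-config --set /etc/nova/nova.conf vnc novncproxy_base_url https://public-zte.dns-252:6080/vnc_auto.html
systemctl restart openstack-nova-compute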

Expected Result

The control console can successfully connect to the VM.

8.7 VM Restart Due to Invoked OOM-Killer


Symptom

The VM in the physical environment is automatically restarted, and oom_kill_process information is recorded in the /var/log/messages log file.

Probable Cause

Some processes (for example, qemu-system-x86) are abnormally restarted due to physical
memory exhaustion.

Action

1. Run the free -m command to view the free memory (all values are in MB).

free -m
              total        used        free      shared  buff/cache   available
Mem:         128074       86841        4347         474       36886       26325
Swap:             0           0           0

Where, "total" means total memory, "used" means used memory, "free" means remaining memory, and "buff/cache" means cache memory that can be released. When "free + buff/cache" is less than 4096 MB, the memory is considered about to be exhausted.
2. Run the cat /proc/cmdline command to check whether huge pages are configured.

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.21.1.el7.x86_64 root=/dev/mapper/vg_sys-lv_root
ro crashkernel=512M rd.lvm.lv=vg_sys/lv_root console=tty0 console=ttyS0,115200n8
intel_iommu=on iommu=pt rdblacklist=igb,ixgbe,igbvf,ixgbevf,tg3,radeon,openvswitch,
i40e modules-load=bonding blacklist=openvswitch acpi_pad.disable=1 spectre_v2=off
nopti idle=poll udev.event-timeout=180 rd.udev.event-timeout=180
default_hugepagesz=1G hugepagesz=1G hugepages=80 intel_pstate=disable
isolcpus=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,29,
30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55

"hugepagesz=1G hugepages=80" indicates that the huge page size is 1 GB and that 80 huge pages are configured. Generally, a memory of 30 GB should be reserved for the system, that is, the total memory minus the huge page memory should be larger than 30 GB. If this requirement cannot be met, modify the huge page configuration and contact ZTE technical support. Command-line checks for the thresholds in steps 1 and 2 are sketched after this procedure.
3. Check whether the memory configured for the VM is too large. Adjust the configuration in accordance with the memory resources of the physical machine. Generally, non-huge-page VMs are not used in a commercial environment. Check the specifications of the VMs to see whether the NUMA feature is added. If not, contact ZTE technical support.
4. Compare the reported memory usage with the actual usage. If a memory leak has occurred, contact ZTE technical support.
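The thresholds from steps 1 and 2 can be checked quickly from the command line. This is a sketch assuming the standard free and /proc/meminfo output formats; the 4096 MB and 30 GB figures come from the guidelines above.

# Warn when free + buff/cache (fields 4 and 6 of the Mem: row) drops below 4096 MB.
free -m | awk '/^Mem:/ { if ($4 + $6 < 4096) print "WARNING: free+buff/cache = " $4 + $6 " MB" }'
# Show the total memory and the huge page reservation; the total memory minus
# (HugePages_Total x Hugepagesize) should stay above roughly 30 GB.
grep -E 'MemTotal|HugePages_Total|Hugepagesize' /proc/meminfo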

Expected Result

The VM operates properly.



Chapter 9
O&M System Faults
Table of Contents
TECS Interface-Related Faults................................................................................................ 139
Performance Index Collection Faults....................................................................................... 143

9.1 TECS Interface-Related Faults


9.1.1 Image Uploading Queued
Symptom

The upload of an image is initiated on the TECS page. The image source is a local image file,
but the image status is queued for a long time.

Probable Cause

The image upload is interrupted because the browser is refreshed or closed and reopened
during the upload.

Action

1. Delete the image whose status is queued for a long time.


2. Re-upload the image. During the upload, do not refresh or exit the browser.

Note
Chrome (version 49 or later) or Firefox (version 43 or later) is recommended.

3. Check whether the fault is removed.


 Yes → End.
 No → Step 4.
4. Contact ZTE technical support.

Expected Result

The image is uploaded successfully and the image status is active.


9.1.2 Database Server Startup Failure Due to the Damaged Data File
Symptom

After the system is powered off and restarted, the TECS login page can be opened, but after the username and password are entered, an "Error" message is displayed.

Probable Cause

The data file is damaged due to abnormal shutdown of the database server.

Action

Notice
This operation is highly risky and should be performed under the guidance of ZTE technical support.

1. Log in to the control node as the root user and run the docker-manage ps command to
query the container name, see Figure 9-1.

Figure 9-1 Querying the Container

The container whose STATUS is Up is the running container, and its NAMES value is provider.
2. Run the docker-manage enter provider command to enter the container. The container
prompt is -bash-4.2#, see Figure 9-2.

Figure 9-2 Entering the Provider Container

3. Run the ps -ef | grep mysql command to check whether the mysql process exists, see
Figure 9-3.

Figure 9-3 Checking the mysql Process

If no mysql process is displayed in Figure 9-3, the mysql process does not exist.
 Yes → Step 12.
 No → Step 4.


4. Run the ls -l /home/Data/vDirector_db/mysql command to check whether the mysql directory contains the fm, jam, and pm database directories. If yes, the mount is normal, as shown in Figure 9-4.

Figure 9-4 Checking the mysql Directory

 Yes → Step 5.
 No → Step 12.
5. Run the /home/ztecms/mysql-5.6.19-x86_64/start_mysql.sh command to start the database, and check whether it starts properly, see Figure 9-5.

Figure 9-5 Starting the Database

 Yes → Step 12.


 No → Step 6.
6. Check whether the database log file /home/Data/vDirector_db/mysql/mysqld.log records an exception, see Figure 9-6.


Figure 9-6 Checking the Database Log File

The content in Figure 9-6 indicates that the data file is damaged.
 Yes → Step 7.
 No → Step 12.
7. Run the ls -l /home/Data/backup/mysql-bak command to check the backup files, which start with mysqlbak and have the ".gz" suffix. The backup files are generated at 00:00 every day, and the latest two files are kept.
8. Select the most recent backup file and run the /usr/local/mysqlbackup/restore.sh <backup file name> command to restore the backup data file (a usage sketch follows this procedure), see Figure 9-7.

Figure 9-7 Restoring the Backup Data File

9. Run the exit command to exit the container and return to the control node.
10. Run the docker-manage restart provider command to restart the provider container.
11. Log in to the TECS page and check whether the fault is solved.
 Yes → End.
 No → Step 12.
12. Contact ZTE technical support.
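A usage sketch for steps 7 and 8; the backup file name below is hypothetical and must be replaced with an actual file listed in step 7.

ls -lt /home/Data/backup/mysql-bak | head
/usr/local/mysqlbackup/restore.sh mysqlbak-20240119.gz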

Expected Result

The TECS page can be logged in to after the correct username and password are entered.


9.1.3 Account Locked Due to Incorrect Passwords


Symptom

The TECS login page shows that a user is locked because the number of times that an
incorrect password is entered exceeds a threshold.

Probable Cause

If a user enters an incorrect password more than three times, the system's self-protection mechanism forbids the user from logging in.

Action

Wait for 5 minutes. The system will automatically unlock the user.

Expected Result

The user can log in to the TECS page again after 5 minutes.

9.2 Performance Index Collection Faults


9.2.1 Cannot Obtain Performance Indexes of a Physical Machine
Symptom

The following performance indexes of a physical machine cannot be obtained:


 CPU usage
 Total disk space
 Used disk space
 Total memory
 Used memory

Probable Cause

 The openstack-nova-compute service of the physical machine is abnormal.
 The openstack-nova-compute service of the physical machine is not correctly configured.

Action

1. Run the following command to check whether the openstack-nova-compute service of the physical node is successfully started. In the output of this command, if the Active field is "active", the service is successfully started; otherwise, it is not.
systemctl status openstack-nova-compute.service
 Yes → Step 2.


 No → Step 8.
2. Wait for about five minutes, and then check whether the performance indexes of the
physical machine can be obtained.
 Yes → End.
 No → Step 3.
3. Run the following command to check whether the openstack-nova-compute service of the
physical node is correctly configured:
cat /etc/nova/nova.conf
If the following information is displayed, it indicates that the configuration is correct.

# cat /etc/nova/nova.conf

compute_monitors=cpu.virt_driver

notification_driver = messagingv2

 Yes → Step 8.
 No → Step 4.
4. Modify the configuration in the /etc/nova/nova.conf file (see the sketch after this procedure).
5. Run the following command to restart the openstack-nova-compute service of the physical
node:
systemctl restart openstack-nova-compute.service
6. Run the following command to check whether the openstack-nova-compute service of the
physical node is successfully started:
systemctl status openstack-nova-compute.service
 Yes → Step 7.
 No → Step 8.
7. Wait for about five minutes, and then check whether the performance indexes of the
physical machine can be obtained.
 Yes → End.
 No → Step 8.
8. Contact ZTE technical support.
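A sketch of the modification in step 4, assuming that the two options shown in step 3 belong to the DEFAULT section of /etc/nova/nova.conf:

openstack-config --set /etc/nova/nova.conf DEFAULT compute_monitors cpu.virt_driver
openstack-config --set /etc/nova/nova.conf DEFAULT notification_driver messagingv2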

Expected Result

The performance indexes of the physical machine can be obtained.

9.2.2 Performance Data Record Failure


Symptom

Ceilometer fails to record performance data and the openstack-ceilometer-collector service log
information is as follows:


2019-07-30 09:32:25.626 7764 ERROR ceilometer.openstack.common.rpc.amqp [req-7c07bb40-dc67-4380-8c9b-178eddb44959 - - - - -] Exception during message handling
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp Traceback (most recent call last):
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp File "/usr/lib/python2.7/site-packages/ceilometer/openstack/common/rpc/amqp.py", line 462, in _process_data
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp **args)
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp File "/usr/lib/python2.7/site-packages/ceilometer/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp File "/usr/lib/python2.7/site-packages/ceilometer/collector.py", line 107, in record_metering_data
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp data=data)
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 243, in map_method
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp method_name, *args, **kwds)
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 213, in map
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp raise RuntimeError('No %s extensions found' % self.namespace)
2019-07-30 09:32:25.626 7764 TRACE ceilometer.openstack.common.rpc.amqp RuntimeError: No ceilometer.dispatcher extensions found

Probable Cause

After receiving performance data, the collector service logs database errors, although the status of the default database (mongodb) is normal. The logs of the collector and mongodb services show that the collector service depends on the mongod service. The mongod service is started before the ceilometer service; however, it takes some time for mongod to become ready, so it is not always available when the ceilometer service starts. If the collector connects to mongod at that moment, the fault may occur.

Action

1. Restart the openstack-ceilometer-collector service in the following way (a preventive ordering sketch follows this procedure):
systemctl restart openstack-ceilometer-collector

2. If the fault persists, contact ZTE technical support.
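If the fault recurs after every reboot, the start-up ordering described in the probable cause can be made explicit with a systemd drop-in. This is a preventive sketch, assuming the database unit on this node is named mongod.service.

mkdir -p /etc/systemd/system/openstack-ceilometer-collector.service.d
cat > /etc/systemd/system/openstack-ceilometer-collector.service.d/order.conf <<'EOF'
[Unit]
After=mongod.service
Wants=mongod.service
EOF
systemctl daemon-reload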


Expected Result

The openstack-ceilometer-collector service operates properly.



Chapter 10
Troubleshooting Records
Device name:                                  Device ID:

Occurrence time (HHDDMMYY):                   Removal time (HHDDMMYY):

Fault type:

Fault source:

Symptoms:

Solution:

Summary:

Signature of the personnel on duty:           Signature of the handling personnel:



Glossary

ARP - Address Resolution Protocol
AZ - Availability Zone
BIOS - Basic Input/Output System
DHCP - Dynamic Host Configuration Protocol
EMS - Element Management System
FC - Fibre Channel
FSCK - File System Check
FTP - File Transfer Protocol
HA - High Availability
ICMP - Internet Control Message Protocol
IP - Internet Protocol
iSCSI - Internet Small Computer System Interface
MAC - Media Access Control
NE - Network Element
NTP - Network Time Protocol
OS - Operating System
OVS - Open vSwitch
RAID - Redundant Array of Independent Disks
SSH - Secure Shell
TECS - Tulip Elastic Cloud System
VLAN - Virtual Local Area Network
VT - Virtual Tributary
ZTE - Zhongxing Telecommunications Equipment
