5-TECS OpenStack (V7.23.40) Troubleshooting
Version: V7.23.40
ZTE CORPORATION
ZTE Plaza, Keji Road South, Hi-Tech Industrial Park,
Nanshan District, Shenzhen, P.R.China
Postcode: 518057
Tel: +86-755-26771900
URL: http://support.zte.com.cn
E-mail: [email protected]
LEGAL INFORMATION
Copyright 2024 ZTE CORPORATION.
The contents of this document are protected by copyright laws and international treaties. Any reproduction or
distribution of this document or any portion of this document, in any form by any means, without the prior written
consent of ZTE CORPORATION is prohibited. Additionally, the contents of this document are protected by
All company, brand and product names are trade or service marks, or registered trade or service marks, of ZTE
This document is provided as is, and all express, implied, or statutory warranties, representations or conditions are
disclaimed, including without limitation any implied warranty of merchantability, fitness for a particular purpose,
title or non-infringement. ZTE CORPORATION and its licensors shall not be liable for damages resulting from the
ZTE CORPORATION or its licensors may have current or pending intellectual property rights or applications
covering the subject matter of this document. Except as expressly provided in any written license between ZTE
CORPORATION and its licensee, the user of this document shall not acquire any license to the subject matter
herein.
ZTE CORPORATION reserves the right to upgrade or make technical change to this product without further notice.
Users may visit the ZTE technical support website http://support.zte.com.cn to inquire for related information.
delivered together with this product of ZTE, the embedded software must be used as only a component of this
product. If this product is discarded, the licenses for the embedded software are also void and must not be
transferred. ZTE will provide technical support for the embedded software of this product.
Revision History
5.1.7 Virtual Resource Page Prompts That The Current User Needs to Be Bound
With a Project.................................................................................................... 40
5.2 Nova Service Faults.......................................................................................................... 41
5.2.1 NOVA Fails to Be Connected to RabbitMQ......................................................... 41
5.3 Neutron Service Failure.....................................................................................................42
5.3.1 Neutron Server Error............................................................................................ 42
5.3.2 Neutron Agent Error............................................................................................. 43
5.3.3 Network Service Startup Failure...........................................................................44
5.4 Rabbitmq-Related Faults................................................................................................... 45
5.4.1 Failed to Start rabbitmq-server.............................................................................45
5.4.2 Message Server Connection Failure.................................................................... 46
5.4.3 General Rabbitmq-Related Fault Location........................................................... 47
5.4.4 Nova Cannot be Connected to Rabbitmq............................................................ 48
5.5 Automatic Restart Every Few Minutes in a New Physical Environment..........................49
6 Faults Related to Virtual Resources.................................................................. 50
6.1 Cannot Create a Cloud Drive............................................................................................50
6.1.1 Cannot Create a Cloud Drive With a Mirror......................................................... 50
6.1.2 Cannot Create a Cloud Drive (Based on a Fujitsu Disk Array).............................52
6.1.3 Cannot Create a Cloud Drive With a Mirror (Based on an IPSAN Disk Array)...... 55
6.1.4 The Volume With Images Fails to be Created Due to "Failed to Copy Image
to Volume"......................................................................................................... 55
6.1.5 The Volumes With Images Fail to Be Created in Batches................................... 56
6.1.6 The Volume With Images Fails to Be Created on a Fujitsu Disk Array.................57
6.1.7 The Volume Fails to Be Created and the Status of the Volume Is "error,
volume service is down or disabled"................................................................. 58
6.1.8 The Volume With Images Fails to be Created Due to "_is_valid_iscsi_ip, iscsi
ip:() is invalid".................................................................................................... 59
6.2 Cloud Drive Deletion Failure............................................................................................. 60
6.2.1 Cannot Delete a Cloud Drive, the Status of the Cloud Drive is "Error-
Deleting".............................................................................................................60
6.2.2 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "Failed to
signin.with ret code:1466"..................................................................................61
6.2.3 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "error-deleting"..... 63
6.2.4 No Response and Log Are Returned After a Volume Is Deleted......................... 63
6.3 VM Cannot Mount a Cloud Drive......................................................................................64
6.3.1 Cannot Mount a Cloud Drive When a Fujitsu Disk Array Is Used........................ 64
6.3.2 Cannot Mount a Cloud Drive When IPSAN Back-End Storage Is Used............... 65
6.4 Cannot Unmount a Cloud Drive........................................................................................ 67
6.5 Cannot Upload a Mirror.....................................................................................................68
6.5.1 Mirror Server Space Insufficient........................................................................... 68
6.5.2 Insufficient Permissions on the Mirror Storage Directory..................................... 69
6.6 Security Group Faults........................................................................................................70
6.6.1 Network Congestion Caused by Security Groups................................................ 70
7 VM Life Cycle Management Faults.....................................................................72
7.1 VM Deployment Faults...................................................................................................... 72
7.1.1 Deployment Fault Handling Entrance...................................................................72
7.1.2 No valid host was found.......................................................................................73
7.1.3 Failed to Deploy a VM on a Compute Node........................................................ 78
7.2 Hot Migration Faults.......................................................................................................... 88
7.2.1 Hot Migration Is Allowed Only in One Direction................................................... 88
7.2.2 Inter-AZ Hot Migration of VM Fails.......................................................................89
7.2.3 Destination Host Has Not Enough Resources (Not Referring to Disk Space).......90
7.2.4 Destination Host Has Not Enough Disk Space.................................................... 91
7.2.5 Source Computing Service Unavailable............................................................... 91
7.2.6 VM Goes into Error Status After Live Migration................................................... 92
7.3 Cold Migration and Resizing Faults.................................................................................. 93
7.3.1 Authentication Fails During Migration...................................................................93
7.3.2 Error "No valid host was found" Reported During Migration.................................94
7.3.3 Error "Unable to resize disk down" Reported During Resizing............................. 94
7.3.4 VM Always in "verify_resize" Status After Cold Migration or Resizing..................95
7.3.5 Mirror Error Reported During Cold Migration or Resize Operation....................... 96
7.4 Cannot Delete VM............................................................................................................. 97
7.4.1 Deletion Error Caused by Abnormal Compute Node Service...............................97
7.4.2 Control Node's cinder-volume Service Abnormal................................................. 97
7.4.3 Network Service Abnormal................................................................................... 98
8 VM Operation Failure.........................................................................................100
8.1 VM OS Startup Failure....................................................................................................100
8.1.1 Some Services of the VM are not Started......................................................... 100
8.1.2 Failed to Start the VM Due to Loss of grub Information..................................... 101
8.1.3 Too Long VM Startup Time Due to Too Large Disk...........................................101
8.1.4 Failed to Start the VM After Power Off.............................................................. 102
8.1.5 Failed to Start the VM OS, no bootable device..................................................104
8.1.6 Error Status of VM............................................................................................. 105
8.1.7 Cannot Power on the VM After Restart..............................................................106
8.1.8 Failed to Start the VM, Insufficient Memory....................................................... 106
8.1.9 VM File System Read-Only Due to Disk Array Network Interruption.................. 107
8.2 Network Disconnection (Non-SDN Scenario, VLAN)...................................................... 108
8.2.1 Cannot Ping the VM From the External Debugging Machine............................. 108
8.2.2 Cannot Ping the External Debugging Machine From the VM............................. 110
8.2.3 Cannot Ping Ports on a VLAN........................................................................... 111
8.2.4 OVS VM Cannot Be Connected.........................................................................112
8.2.5 Floating IP Address Cannot Be Pinged..............................................................113
8.2.6 The Service VM Media Plane Using the SRIOV Port Cannot Be Connected......116
8.2.7 VM (OVS+DPDK Type) Communication Failure................................................ 118
8.3 Network Disconnection (SDN Scenario, VXLAN)............................................................130
8.3.1 OVS (User Mode) VMs Not Connected............................................................. 130
8.3.2 Failed to Obtain an IP Address..........................................................................131
8.4 DHCP Faults....................................................................................................................132
8.4.1 Cannot Obtain IP Addresses Distributed by DHCP............................................132
8.4.2 Connection Failure If the Address Distributed by DHCP Is Not Used.................134
8.5 VM's NIC Unavailable......................................................................................................135
8.6 Control Console Cannot Connect to VM.........................................................................136
8.7 VM Restart Due to Invoked OOM-Killer.......................................................................... 137
9 O&M System Faults........................................................................................... 139
9.1 TECS Interface-Related Faults........................................................................................139
9.1.1 Image Uploading Queued...................................................................................139
9.1.2 Database Server Startup Failure Due to the Damaged Data File.......................140
9.1.3 Account Locked Due to Incorrect Passwords.....................................................143
9.2 Performance Index Collection Faults.............................................................................. 143
9.2.1 Cannot Obtain Performance Indexes of a Physical Machine............................. 143
9.2.2 Performance Data Record Failure......................................................................144
10 Troubleshooting Records................................................................................147
Glossary..................................................................................................................148
Chapter 1
Fault Handling Overview
Table of Contents
Introduction to Faults....................................................................................................................1
Requirements for Maintenance Engineers...................................................................................3
Precautions for Fault Handling.....................................................................................................4
Fault Location Thinking and Method Descriptions....................................................................... 5
Faults refer to the phenomena in which the equipment or system software loses specified
functions or causes dangers during operation due to a certain reason. Based on services
affected by the faults and fault impact ranges, faults can be classified as critical faults and minor
faults.
Critical faults
Critical faults are the faults that seriously affect system services and operations, including
serious decline of system key performance indicators (KPIs), large-area or even full
interruption of services, and abnormal charging.
Minor faults
Minor faults are the faults that have minor impacts on services and operations, excluding the
critical faults.
The sources for discovery of faults can be divided into the following three categories:
Complaints of terminal users
Services cannot be used properly, so users complain.
Alarms on EMS pages
Due to equipment or software faults, the system reports the alarms to the EMS. Audible and
visual alarms are raised on EMS pages.
Routine maintenance and inspections
Maintenance engineers detect equipment or system faults during routine maintenance and
inspections.
Faults of the TECS OpenStack are generally caused by the following reasons:
Faults in the hardware
Contact hardware platform engineers to resolve the faults.
Faults in software and the system
- Faults of the operating system
  The operating system has memory management and security problems.
- Database faults
  Improper settings and usages of databases also cause various problems in usage and security.
- Program or software faults
  A module of the TECS OpenStack may be faulty, or an unmatched version of software is used.
Faults caused by environmental changes
- Stable system operation has strict requirements on the environment. If the temperature or humidity does not meet the requirements, the system generates alarms.
- The occurrence of natural disasters or accidents can also cause alarms, such as lightning, fire, and infrared sensor alarms.
Setting or configuration faults
- Problems in settings
  Problems in settings of equipment interfaces, system rights, files, or folders.
- Problems in configuration
  Invalid or unreasonable configuration reduces system performance and capacity. Alarm thresholds must be configured appropriately.
- Command errors
  An invalid command is entered or a program with error scripts is executed.
Network faults
- Local connection problems
  The VM is not installed correctly, the ports of the network cable have contact problems, or the IP address, subnet mask, default gateway, and route are not set correctly. These problems normally cause network connection failures.
- Faults of network equipment
  Faults occur in the network equipment, routing and switching equipment, or intermediate links on the Internet.
Fundamental Knowledge
Be familiar with the basic knowledge of computer networks such as Ethernet and TCP/IP.
Be familiar with the basic knowledge of the MariaDB and MongoDB databases.
Be familiar with the basic knowledge of the Linux system.
Be familiar with basic virtualization knowledge.
Be familiar with the network architecture and IP planning of the TECS OpenStack.
Be familiar with the connection relations between the TECS OpenStack and other devices in
the network.
Collect and save the on-site data. On-site data collection and saving includes the periodic
data collection during proper device operation and the data collection when device faults
occur. Generally, acquire and save the on-site data before fault handling.
No.  Case
1    A critical fault occurs, and part or all of the services are interrupted.
2    The problem cannot be solved by using the known fault handling methods.
3    The problem cannot be solved by using the previous handling methods of similar faults.
- For any problem during maintenance, record the raw information in detail, including the symptom of the fault, operations before the occurrence of the fault, versions, and data changes.
During the handling of faults
- Observe operation regulations and industry safety regulations strictly to ensure personal and equipment safety.
- During component replacement and maintenance, take antistatic measures and wear antistatic wrist straps.
Dangerous Operations
The following are dangerous operations that must be implemented with caution during fault
handling:
Modifying service parameters.
Deleting the configuration file of the system.
Modifying the configuration of the network equipment.
Altering the network architecture.
Principle 1: If data is modified before a fault occurs, data must be restored immediately.
- Quickly determine whether the fault is related to the operation in accordance with the operation contents, operation time, and fault occurrence time.
- After the preliminary determination, perform the corresponding restoration operation in accordance with the operation performed before the fault occurred.
Principle 2: If a fault occurs in the equipment room construction procedure, check whether
the fault is related to construction.
- In accordance with on-site conditions, determine whether the device fault can be caused by the construction procedure. For example, an internal cable of the system is disconnected by mistake.
- Determine the operational status of the system in accordance with alarm management and board indicators. Focus on internal cable connections.
Principle 3: Verify that physical machines are in normal state.
All physical machines must be in normal state. You can check their status by viewing the blade
indicators. Ensure that the physical machines operate properly.
Principle 4: Ensure that the control nodes in two-server cluster mode operate properly.
Run the crm_mon -1 command to check whether the two-server cluster is in normal state.
Principle 5: Verify that VMs are in normal state.
Through the TECS OpenStack, verify that all VMs are in "active running" state.
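For example, the VM states can also be listed from a control node with the OpenStack command line. This is only a sketch; it assumes that the admin credentials have been loaded (for example by sourcing the keystonerc file referenced elsewhere in this document):
# source keystonerc
# nova list
Check that the Status column is ACTIVE and the Power State column is Running for all VMs.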
Principle 6: Verify that the network of physical machines is normal.
- On a physical machine, ping another physical machine, and check the ping packet result.
- If there is a disk array, ping the disk array management interface from the control node,
  and ping the disk array service interface from the computing node.
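A minimal sketch of these checks (the addresses are placeholders, not values taken from this document):
# ping -c 4 <IP address of another physical machine>
# ping -c 4 <disk array management interface IP>    (run on the control node)
# ping -c 4 <disk array service interface IP>       (run on the computing node)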
If a fault occurs in the system, analyze it in accordance with the alarm information and the
generated error log to find the fault cause and fix the fault.
On the TECS OpenStack, check all the current alarms in the system. Analyze and determine
the fault cause in accordance with the alarm information. Keys to query and collect alarm
information:
In the collected alarm information, focus on the alarm level, alarm code, location, time, and
details.
After the current alarm information is collected, determine whether to collect historical alarm
information as required.
After the current alarm information is collected, perform analysis as follows:
1. Preliminary analysis and determination: In accordance with the keys of the current alarm
information (for example, the alarm is a critical or major alarm), determine the fault cause
and impact.
2. Alarm relationship analysis: Analyze the sequence and codes of the current alarms, and
clarify the relationships between the alarms. In this way, the fault occurrence procedure can
be known.
In the quick troubleshooting procedure, the logs of the TECS OpenStack are important
methods. After the log information is collected, the fault can be analyzed quickly.
Fault handling includes common fault handling and emergency fault handling, which have
different procedures.
When a fault occurs, on-site maintenance personnel must determine whether the fault is an
emergency fault. If it is an emergency fault, follow the emergency fault handling procedure. If it
is a common fault, follow the common fault handling procedure.
To locate a fault to the specific module during the troubleshooting, you need to analyze the flow
and the network element (NE). During the fault location, you should start with the flow and the
system composition, analyze and determine the fault in accordance with the symptoms, exclude
normal modules and determine the fault module.
Figure 2-1 shows the common procedure for handling a fault.
During a fault handling procedure, the troubleshooting engineers should perform the following
steps in turn:
1. Determine the situations
If a fault occurs, perform a simple test to know the situation of the fault.
2. Collect source information
If a fault occurs, record detailed information about the fault, including the symptom, alarms
and operating information displayed on the TECS OpenStack window, operations that you
have performed to handle this fault, and other information that you can collect with the
maintenance tools (such as performance management).
3. Classify the fault
Analyze the fault initially and classify it in accordance with the symptom and the information
that you have collected with the maintenance tools.
4. Locate the fault
Locate the fault and determine the possible causes by analyzing the flow and NEs.
5. Remove the fault
Remove the fault in accordance with the identified fault causes.
6. Record the fault handling information
Record details about the fault handling, including the symptom and the handling methods.
Such information is a helpful reference for the handling of similar faults. It is recommended
that the sheet shown in 10 Troubleshooting Records be used to record the fault handling
information, and you can also record the fault handling information with a sheet designed by
yourself.
Precautions
Make rules and regulations for fault handling and tracing for all maintenance personnel
to follow. Only authorized and relevant persons are allowed to participate in the
troubleshooting, to avoid worse faults caused by misoperations.
Perform operations and maintenance by following the instructions in the documents of the
TECS OpenStack.
Back up service data and system operating parameters before the fault handling. Make
a detailed record about the fault symptoms, versions, and configuration changes and
operations that you have performed. Collect other data about the fault for analyzing and
removing the fault.
Trace and record the detailed fault handling procedure. For a fault that may last for days,
make detailed shift records to clarify the responsibilities.
Handle every fault promptly. If there is any fault that you cannot remove, contact ZTE technical support.
In any of the following situations, you should contact ZTE technical support.
- Emergency faults, for example, all services or some services are interrupted.
- Faults that you cannot remove with the methods described in this document.
- Faults that you cannot remove with your own knowledge.
- Faults that you cannot remove by referring to the similar fault removal cases.
Paste a list of contacts of ZTE in a conspicuous place, and remember to confirm and update
the contacts frequently.
When you are contacting ZTE for technical support, you may be required to provide the
following information:
- Detailed symptoms about the fault, including the time, place, and events.
- Alarm management data, performance management data, signaling tracing result, and failure observation result.
- Operations that you have performed after the fault occurred.
- Way to remotely log in to the system and the telephone numbers of persons for contact.
An emergency fault means that the device cannot provide basic services or operate properly for
more than 30 minutes, or that the device causes human safety hazards. All emergency faults must be
handled immediately.
Emergency faults of the TECS OpenStack can be classified into the following types:
Failing in providing basic services for multiple causes, such as equipment breakdown,
power off, system crash, environmental or human factors. The faults are not removed after
the preliminary handling and need to be handled immediately.
The rate of successful service handling operations declines by 5% or more, or many
subscribers or important customers complain about the interruption or poor quality of
services.
Failing in accessing subscriber data, or subscriber data completeness and consistency are
damaged.
Failing in maintaining the device through the TECS OpenStack window.
Influence on the provisioning of basic services by other equipment.
Hazards to human safety caused by use of the product.
Once emergency faults on the equipment are reported or found, to restore the system as soon
as possible, you should handle the fault in accordance with the procedure shown in Figure 2-2.
You also need to contact the local ZTE technical support.
Collect statistics on network failures and other faults, and troubleshoot the fault based on the
statistical result. When you have confirmed that the power supply and communication are normal,
you can use the alarm management function to locate the node where the fault possibly lies.
Once an emergency fault happens, contact and organize relevant persons and department
to perform the emergency fault handling, and make a call to the authority department and
supervisors immediately. After recovery, you need to submit a written failure report to the
authority departments and supervisors. You also need to organize relevant technicians,
departments, and equipment suppliers to locate causes so that lessons may be drawn from
it and effective measures may be taken in the subsequent operations to avoid its
recurrence. After the recovery, you need to make a detailed emergency fault record carefully
and archive it.
Figure 2-2 shows the procedure of handling emergency faults.
When you run the crm_mon -1 command to query resource status, a TECS service resource is
in failed status. For example, the following message is displayed:
Note
This section uses the openstack-nova-api resource as an example to describe the fault symptom and
troubleshooting procedure.
Probable Cause
Action
1. Run the following command as the root user to disable the monitoring of the openstack-
nova-api resource in the HA:
pcs resource unmanage openstack-nova-api
2. Run the following command to start the resource service:
systemctl start openstack-nova-api.service
3. Run the following command to check the service status and logs, and handle possible
problems according to prompts.
crm_mon -1
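If the resource still fails to start, its status and recent logs can also be inspected with the systemd tools, for example (a sketch using the openstack-nova-api example of this section):
# systemctl status openstack-nova-api.service
# journalctl -u openstack-nova-api.service -n 50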
4. Run the following command to enable the monitoring of the openstack-nova-api
resource in the HA:
pcs resource meta openstack-nova-api is-managed=true
Check whether the fault is removed.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
When the cluster is started, after running the crm_mon command, it is found that the file
system resources (Filesystem) fail to be started. For example, the following information is
displayed:
Note
This section uses the mysql_fs resource as an example to describe the fault symptom and
troubleshooting procedure.
Action
1. Perform the following steps as the root user to check whether the disk to be mounted and
the mounting point exist.
a. Run the pcs resource show mysql_fs command. For the command result, see Figure
3-1.
b. Run the mount |grep mysql command. For the command result, see Figure 3-2.
c. Check the Attribute line in Figure 3-1. device indicates the path and name of the device
to be mounted, and directory indicates the directory where the device is mounted.
Compare the check result with the information displayed in Figure 3-2 to see whether
they are consistent.
Yes → it indicates that the disk to be mounted and the mounting point exist. Step 4.
No → it indicates that the disk to be mounted and the mounting point do not exist or
they have errors. Step 2.
2. Run the mount command to attempt to manually mount the disk.
Example: mount -t ext4 /dev/mapper/vg_db-lv_db /var/lib/mysql , where:
-t ext4 : file system type.
/dev/mapper/vg_db-lv_db : device to be mounted.
/var/lib/mysql : mounting point of the device.
3. Run the df |grep mysql command to check whether the name of the mounted device is the
same as the device parameter. For the command result, see Figure 3-3.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
The file system resources are started successfully, and the cluster is started successfully.
When the cluster is started, the active/standby switchover notification is repeatedly reported
from upper-layer services. Run the crm_mon -1 command for multiple times to check whether
the cluster is automatically switched over (the started host is changed). For example,
Stack: corosync
Version: 1.1.10.36791-1.el7-f2d0cbc
2 Nodes configured
41 Resources configured
Online: [ host4 host5 ] //host4 and host5 are the two servers that form the cluster.
//Run the command for multiple times to check whether the services are switched between
//two servers.
Resource Group: DB
Action
1. Run the following command as the root user to set the default failure handling policy for
resource operations:
pcs resource op defaults on-fail=ignore
In this command, on-fail=ignore means that the HA does not restart a service when the
service is in failed status. Thus, the active/standby switchover that occurs after the HA has
restarted the failed service a specific number of times can be avoided.
2. Run the crm_mon -1 command and check the output information.
If...                                         Then...
A service resource is in failed status.       Refer to 3.1 Service Resources in Failed Status in the Cluster.
A file system resource cannot be started.     Refer to 3.2 Failed to Start File System Resources in the Cluster.
Expected Result
After the cluster is started, some functions cannot be used, for example, alarms are not
reported. When you run the crm_mon -1 command, some resources are not displayed, for
example, opencos-alarmmanager.
Action
1. Run the following command as the root user to check whether the stonith property is
configured. If it is configured, "stonith-enabled: false" is returned.
pcs property show |grep stonith
Yes → Step 4.
No → Step 2.
2. Run the following command to set stonith-enabled to false:
pcs property set stonith-enabled=false
3. Run the following command to check whether all the configured resources can be displayed.
For example, the opencos-alarmmanager resource is displayed after configuration. In
addition, check whether no resource is in Offline, Stopped, or Failed status.
crm_mon -1
Yes → End.
No → Step 4.
4. Run the following command to check whether any resource is disabled by the HA:
crm_mon -! |grep opencos-alarmmanager |grep Stopped
For example, if a resource is disabled by the HA, the following message is displayed:
<nvpair id="opencos-alarmagent-meta_attributes-target-role"
name="target-role" value="Stopped"/>
Yes→Step 5.
No → Step 7.
5. Run the following command to enable the resource:
pcs resource enable opencos-alarmmanager
Note
Repeat this step to enable all the other resources that are not displayed.
6. Run the following command to check whether all the configured resources can be displayed
and no resource is in Offline, Stopped, or Failed status. For example, a resource was
disabled and in Stopped status before, and not it is in Started status.
crm_mon -1
Yes → End.
No → Step 7.
7. Contact ZTE technical support.
Expected Result
After the cluster is started, run the # crm_mon -! | grep disable_fence_reboot command on a
node, and check whether any result is returned. If yes, the node is suspended by the HA and
does not run any resources.
Example:
Note
When detecting that a node is repeatedly restarted within a specific period due to resource errors, the HA
suspends the node. If there is no manual intervention, the node will be automatically restored to normal
30 minutes later.
Action
1. Run the following command as the root user to manually restore the node to normal:
crmadmin -c host_name
Note
host_name is the node name, and there is no space between host_name and -c.
Expected Result
When you run the crm_mon -! | grep disable_fence_reboot command, no output information
is displayed.
The pacemaker is operating properly, but the cluster nodes fail to find each other on the pacemaker.
For example, run the crm_mon -1 command on host-2018-abcd-abcd-1234-4321-5678-8765-12aa,
and the execution result shows that host-2018-abcd-abcd-1234-4321-5678-8765-12cc is OFFLINE.
Run the command on host-2018-abcd-abcd-1234-4321-5678-8765-12cc, and the execution
result shows that host-2018-abcd-abcd-1234-4321-5678-8765-12aa is OFFLINE.
host-2018-abcd-abcd-1234-4321-5678-8765-12aa
Stack: corosync
2 Nodes configured
83 Resources configured
Online: [ host-2018-abcd-abcd-1234-4321-5678-8765-12aa ]
OFFLINE: [ host-2018-abcd-abcd-1234-4321-5678-8765-12cc ]
Action
1. Run the corosync-cmapctl | grep member command on both nodes that cannot find each
other, and perform the following operations in accordance with the output result.
If the following result is returned, this indicates that the corosync exits abnormally and
the pcsd monitoring fails to restart the corosync. In this case, the pacemaker cannot
operate normally. At this time, it is necessary to collect the coredump log and black box
data for further analysis, and go to Step 6.
If the following result is displayed, this indicates that the peer end is not found.
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
If the following result is returned, this indicates that the local end finds the peer end but
the peer end leaves.
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 2
2. If the two ends can find each other, the pacemaker is faulty, go to Step 6.
3. If the two ends cannot find each other, the network may be disconnected or the heartbeat IP
configuration may be incorrect. Run the corosync-cfgtool -s command to check whether
the heartbeat IP addresses are the same as the configurations.
Local node ID 1
RING ID 0
id = 128.0.0.15
RING ID 1
id = 129.0.0.15
RING ID 2
id = 130.0.0.15
Where, "ring 0 active with no faults" indicates that the heartbeat link is normal only when the
two sides can find each other.
4. Check the configuration file. The configurations are as follows:
nodelist {
    node {
        ring0_addr: 128.0.0.14
        ring1_addr: 129.0.0.14
        ring2_addr: 130.0.0.14
        name: test1
        nodeid: 1
    }
    node {
        ring0_addr: 128.0.0.15
        ring1_addr: 129.0.0.15
        ring2_addr: 130.0.0.15
        name: test2
        nodeid: 2
    }
}
Check that the contents displayed in the configuration file are consistent with the result returned
by the corosync-cfgtool -s command.
If the heartbeat IP addresses meet the configurations, but the two ends cannot find each
other, the heartbeat link may be broken. Run the ping and route commands to check
whether the heartbeat links are normal and whether the routes are correct.
5. If the heartbeat links are normal, shut down the firewall and check again. If the fault persists,
collect all information for further analysis.
6. Contact ZTE technical support.
Expected Result
Each node in the cluster on the pacemaker can find the peer end.
Run the crm_mon -1 command on all nodes. If the following result is displayed, all nodes in the
cluster are online.
[root@host-2018-abcd-abcd-1234-4321-5678-8765-12cc vtu]# crm_mon -1
Last updated: Sat May 16 14:27:06 2020
Last change: Sat May 16 11:03:58 2020 via cibadmin on host-2018-abcd-
abcd-1234-4321-5678-8765-12aa
Stack: corosync
Current DC: host-2018-abcd-abcd-1234-4321-5678-8765-12aa (1) - partition with
quorum-180408
2 Nodes configured
83 Resources configured
Online: [ host-2018-abcd-abcd-1234-4321-5678-8765-12aa host-2018-abcd-
abcd-1234-4321-5678-8765-12cc ]
Run the corosync-cmapctl | grep member command at both sides of the cluster. The
following result indicates that the corosync exits abnormally and the pcsd monitoring fails to
restart the corosync.
[root@cent6c corosync]# corosync-cmapctl | grep member
Action
1. Check whether the configuration file of Corosync is correct and whether the heartbeat IP
address is correctly configured for the network port. Run the corosync -t command to check
whether the configuration file is available.
If the following result is displayed, it indicates that the configuration file is damaged or
incorrectly configured. Collect the configuration file for troubleshooting.
specified
Jan 30 15:53:58 error [MAIN ] Corosync Cluster Engine exiting with status 8
at main.c:1416.
If the following result is displayed, it indicates that the configuration file is correct. The
network port may not be configured with an IP address.
2. Run the corosync -f command to manually run Corosync in the foreground to find the
causes. If the following information is displayed, it indicates that the network port is not
configured with an IP address. Configure a correct heartbeat IP address for the network port
and restart the two-server cluster. If other information is displayed, collect all related data for
further analysis.
Jan 30 16:38:12 notice [MAIN ] Corosync Cluster Engine ('2.3.4.3'): started and ready to provide service.
Jan 30 16:38:12 info [MAIN ] Corosync built-in features: pie relro bindnow
Jan 30 16:38:12 warning [TOTEM ] bind token socket failed: Cannot assign requested
address (99)
Jan 30 16:38:12 error [MAIN ] Corosync Cluster Engine exiting with status 15 at
totemudpu.c:1237.
Expected Result
Action
1. Check the faulty resources. For details, refer to 3.1 Service Resources in Failed Status in
the Cluster.
2. Check whether there are heartbeat faults between the nodes, especially heartbeat
disconnection. During normal operation of the nodes, both nodes may operate as active
nodes due to disconnection of a heartbeat line. After the fault is resolved, the HA restarts
the node that runs fewer resources.
3. Check whether the HA processes operate properly. During normal operation of the two
nodes, if the HA process of one node is faulty, the node without a fault restarts the faulty
one.
4. Collect log information. To collect memory log information, run the crm_mon -! | grep _reboot
command. File logs are saved in the /var/lib/pacemaker/pengine/crm_status_save.xml file.
5. Contact ZTE technical support.
Expected Result
Run the crm_mon -1 command on a node. If split-brain occurs on the node, the node status
is displayed as offline. If the number of split-brain nodes in the cluster is smaller than half of
the total number of nodes in the cluster, the single-instance service does not operate on the
split-brain nodes. The multi-instance service operates properly, and only the standby service is
operating.
Stack: corosync
Version: 1.1.10.40182-1.el7.centos-f2d0cbc
3 Nodes configured
16 Resources configured
Online: [ host-192-168-32--ab27 ]
Started: [ host-192-168-32--ab27 ]
Status: [ host-192-168-32--ab27,0 ]
Slaves: [ host-192-168-32--ab27 ]
host-192-168-32--ab28=null ]
Started: [ host-192-168-32--ab27 ]
Started: [ host-192-168-32--ab27 ]
Action
totem {
    crypto_hash: none
    token_retransmits_before_loss_const: 30
    netmtu: 1500
    crypto_cipher: none
    cluster_name: HA_Cluster
    token: 30000
    version: 2
    ip_version: ipv4
    transport: udpu
}
nodelist {
    node {
        ring0_addr: 2018:abcd:abcd:1234:4321:5678:8765:12aa
        name: host-2018-abcd-abcd-1234-4321-5678-8765-12aa
        nodeid: 1
    }
    node {
        ring0_addr: 2018:abcd:abcd:1234:4321:5678:8765:12cc
        name: host-2018-abcd-abcd-1234-4321-5678-8765-12cc
        nodeid: 2
    }
}
Yes → Step 2.
No → Step 3.
2. Make the heartbeat addresses connected, and check whether the split-brain alarm is
cleared.
Yes → End.
No → Step 3.
3. Check whether port 5405 is restricted by the firewall.
iptables -S | grep 5405
-A INPUT -p udp -m udp --dport 5405 -j ACCEPT
Yes → Step 4.
No → Step 5.
4. Run the iptables -A INPUT -p udp -m udp --dport 5405 -j ACCEPT command to open the
port, and check whether the split-brain alarm is cleared.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
The split-brain fault of the HA node is removed. After the recovery, the HA cluster software
restarts the nodes with fewer resources. You can configure the cluster so that these nodes are not
restarted after split-brain recovery:
pcs property set pcmk_option=0x08
Probable Cause
In most cases, the database enters split-brain status because the network between control
nodes is disconnected, or more than half of control nodes are down, or the database service
is stopped abnormally. The database does not provide any service. You need to wait for more
than half of the nodes to recover.
Action
1. Verify that more than half of the control nodes are in Online status:
# crm_mon -1
3. Run the following commands to restart MySQL and check whether the database is started
properly.
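The restart can be performed, for example, through the HA cluster tools. This is only a sketch; the resource name mysql is an assumption and may differ in the actual deployment:
# pcs resource restart mysql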
# crm_mon -1
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
Probable Cause
Action
1. Run the crm_mon -1 command to check whether the database service is normal.
Yes → Step 3.
No → Step 2.
2. Restart the MySQL by using the following method and check whether the database is
started properly.
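A possible restart method (a sketch, assuming MariaDB runs as a systemd service on the control node):
# systemctl restart mariadb.service
# systemctl status mariadb.service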
3. Use the following method to log in to the MySQL and check whether you can log in by using
the username and password.
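For example (a sketch; replace the user name with the actual database credentials):
# mysql -u<user name> -p
Enter the password when prompted and check whether the login succeeds.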
Yes → Step 5.
No → Step 4.
4. Set the correct username and password, and then check whether the fault is removed.
Yes → End.
No → Step 5.
5. Check whether you can access the database through the floating IP address.
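For example (a sketch; the floating IP address and credentials are placeholders):
# mysql -h <floating IP address> -u<user name> -p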
Yes → End.
No → Step 6.
6. Confirm the network status, make sure you can ping the floating IP from the local end, and
check whether the fault is fixed.
Yes → End.
No → Step 7.
7. Contact ZTE technical support.
Expected Result
Probable Cause
Action
# df -h
Yes → Step 2.
No → Step 3.
2. Clear the database space, and check whether the database is started successfully.
# rm -f /var/lib/mysql/mariadb-bin.*
Yes → End.
No → Step 3.
3. Check whether there is abnormal printing of damaged files in the database logs.
Expected Result
The database cannot be backed up. The provider raises an alarm about automatic database
backup failure.
Probable Cause
Action
# dbmanager job-list
4. Configure the automatic database backup correctly, and then check whether the fault is
fixed.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
An error occurs during the login to the TECS, indicating an authentication error and prompting
the user to try again later. The Keystone log prompts that there are too many MySQL
connections.
"Can not connect to MySQL server. Too many connections" (MySQL error 1040)
Probable Cause
The database configuration parameters do not meet the actual conditions of each component.
Action
1. Run the mysql command to enter the database. Run the show variables like '%conn%';
command to view the following variables:
max_connections: the maximum number of connections allowed by the server
max_user_connections: the maximum number of connections allowed for each database user
Check whether the following variables are 0. If the value is 0, the number of connections is
not limited.
Yes → Step 4.
No → Step 2.
2. Check the status of the following parameters:
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Aborted_connects | 0 |
| Connections | 5 |
| Max_used_connections | 1 |
| Threads_connected | 1 |
+----------------------+-------+
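If the limit is too low for the actual number of component connections, it can be raised. The following is only an illustrative sketch; the value 2000 is an example and the configuration file path may differ in the actual deployment:
MariaDB [(none)]> SET GLOBAL max_connections = 2000;
Alternatively, set max_connections under the [mysqld] section of /etc/my.cnf and restart the database service so that the change persists.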
Expected Result
Keystone authentication failure occurs when the following operations are performed.
1. The login webpage can be opened. After you enter the username and password, an
authentication error is displayed.
2. After the username and password in the source keystonerc are modified, the webpage
is normal, but an authorization error is reported when a command is executed in the
command line.
Probable Cause
Action
[token]
provider=fernet
Expected Result
When you log in, an authentication error occurs and it prompts you try again later.
When you use the openstack user list command in the keystone command line, the following
information is displayed:
Authorization Failed: An unexpected error prevented the server from fulfilling your request.
(HTTP 500)
Probable Cause
There is no sufficient disk space for proper operation of the mariadb service.
Keystone is not correctly installed.
Action
1. Check whether the mariadb service operates properly in the following way:
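The check is typically done with systemd, for example (a sketch; the output fragment below shows a failed state):
# systemctl status mariadb.service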
mariadb.service; enabled)
Active: failed (Result: exit-code) since Fri 2015-04-17 17:58:09 CST; 2 days ago
Yes → Step 3.
No → Step 2.
2. Troubleshoot and then restart the mariadb service.
Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn'
-- Logs begin at Thu 2015-04-09 13:48:14 CST, end at Mon 2015-04-20 09:20:15 CST. --Apr
self.stream.flush()
on device
You can turn off this feature to get a quicker startup with -A
Database changed
MariaDB [keystone]>
Yes → Step 5.
No → Step 4.
4. There is no table in the keystone database. The database table may be deleted by mistake,
resulting in data loss. Contact technical support to check whether database tables are
backed up and whether they can be restored.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
The keystone authorization fails, and it prompts that the server cannot be connected. The
following information is displayed:
Probable Cause
Action
Active: failed (Result: start-limit) since Mon 2015-04-20 14:12:35 CST; 29min ago
Yes → Step 3.
No → Step 2.
2. Set an executable permission for the /var/log/keystone/keystone.log file.
3. Run the systemctl restart openstack-keystone command to restart the Keystone service.
4. If the fault persists, contact ZTE technical support.
Expected Result
The system fails to create a user and "create openstack user failed" is displayed on the screen.
Probable Causes
Action
Expected Result
Probable Causes
A floating IP address that cannot be identified by the TECS is set when the cloud environment
is deployed.
Action
Add an association between the floating and actual IP addresses in the /etc/hosts
configuration file of the TECS host.
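For example, a line of the following form can be added to /etc/hosts (the address and host name are placeholders; use the values of the actual deployment):
<floating IP address>   <TECS host name>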
Expected Result
5.1.7 Virtual Resource Page Prompts That The Current User Needs to Be
Bound With a Project
Symptom
When you open the virtual resource page, a "Please Bind Project &User First!" message is
displayed on the screen.
Probable Causes
Action
1. In the project list, bind the user with the expected project, and check whether the fault is
removed.
Yes → End.
No → Step 2.
2. Check whether the cloud environment has been completely imported.
Yes → Step 4.
No → Step 3.
3. Wait until the cloud environment has been completely imported, and check whether the fault
is removed.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
NOVA fails to be connected to RabbitMQ because the heat process creates thousands of
queues.
Probable Cause
The heat process creates a message queue beginning with "heat" whenever it is restarted. The
new queue cannot be automatically deleted if it is not connected to a client. A random UUID is
appended to the queue name, so the queue name is different after each restart, and the old
queues are not deleted when the new queues are generated. For example, if the heat process
starts 16 processes, 16 heat queues are generated in the RabbitMQ server during each HA
node switchover. After several switchovers, the number of RabbitMQ queues reaches the
threshold, running out of memory and limiting the number of connections.
Action
1. Set automatic deletion by setting the message queue strategy. For example, if a queue
beginning with heat-engine-listener is not connected in more than one hour, it is
automatically deleted.
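The policy can be set with rabbitmqctl, in the same way as described in 5.4.4 Nova Cannot be Connected to Rabbitmq:
rabbitmqctl set_policy ha-all "." '{"ha-mode":"all", "ha-sync-mode":"automatic"}' --apply-to all --priority 0
rabbitmqctl set_policy heat_rpc_expire "^heat-engine-listener\\." '{"expires": 3600000, "ha-mode":"all", "ha-sync-mode":"automatic"}' --apply-to all --priority 1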
Note
The command can be executed during the installation of RabbitMQ. After RabbitMQ is installed,
execute the statement to set the strategy for the message queue to the RabbitMQ server.
Expected Result
When the neutron agent-list command is executed, the following information is displayed:
Action
1. Log in to the control node and check whether the status of the Neutron server is "active". If
no, an error occurs.
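The status can be checked with systemd, for example (a sketch; the output fragment below shows an example of the active state):
# systemctl status neutron-server.service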
Active: active (running) since Tue 2015-08-25 13:15:10 CST; 2min 35s ago
CGroup: /system.slice/neutron-server.service
Run the command to show error information about the configuration file and modify
following the prompt.
Note
There must not be any blank space left at the beginning of a line in the configuration file.
In normal cases, the result of the openstack-status command executed on both the control
node and the compute node shows that only one node is "active" and others are "disabled".
5. If the fault persists, contact ZTE technical support.
Expected Result
The execution result of the neutron agent-list command shows that neutron-openvswitch-
agent of the board concerned is XX.
Action
1. Log in to the service with the fault and check whether the service status is "active". If no, the
service does not operate properly. For example,
Active: active (running) since Tue 2015-08-25 10:49:38 CST; 2h 7min ago
Yes → End.
No → Step 2.
2. Run the date command to check whether the time of the control node is synchronous with
that of the compute node.
Yes → Step 5.
No → Step 3.
3. On the control node and compute node, run the systemctl status chronyd command to
check whether the service is Active: Active (running).
Yes → Step 5.
No → Step 4.
4. Run the systemctl restart chronyd command to restart the service, and check whether the service
status is running.
Yes → Step 5.
No → Step 7.
5. Check whether the time is synchronized.
Yes → Step 6.
No → Step 7.
6. Run the following command when no service is operating, and check whether the alarm is
cleared.
systemctl restart neutron-openvswitch-agent.service
Yes → End.
No → Step 7.
7. Contact ZTE technical support.
Expected Result
The network service fails to be started. The execution result of the systemctl status network
command shows that the status of the service is "fail".
Probable Cause
Action
1. Check whether the network adapters in the configuration files are consistent with those
shown in the execution result of the ifconfig command and perform the following operations
as required:
If too many network adapters are configured in the configuration files, delete the invalid
configuration files.
If the number of network adapters shown in the execution result of the ifconfig command is
more than that in the configuration files, check information about the network adapters with the
ip link | grep <network adapter name> command and contact ZTE technical support.
If the number of network adapters shown in the execution result of the ifconfig command is
equal to that in the configuration files, check whether there is DHCP configuration in the
configuration file. If yes, change the DHCP attribute to "static".
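For the last case, changing the DHCP attribute to "static" typically means editing the interface configuration file. The following is only an illustrative sketch; the interface name and addresses are placeholders:
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes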
Expected Result
The network service can be successfully started after the systemctl restart network command
is executed.
The rabbitmq-server cannot be started. The storage space of the /var/lib/ directory may be full.
Because large images are uploaded to the /var/lib/ directory, the space is full and the
rabbitmq files cannot be written.
Action
1. Run the systemctl start rabbitmq-server command to start the rabbitmq-server service.
2. Run the journalctl -xe command. It prompts some related errors.
3. Check the /var/log/rabbitmq/rabbit@<host name>.log file, and turn to the end to check whether
the following information is printed.
{not_a_dets_file,"/var/lib/rabbitmq/mnesia/rabbit@rabbitmq/msg_stores/vhosts/
628WB79CIFDYO9LJI6DKMI09L/recovery.dets"}}}
Yes → Step 4.
No → Step 6.
4. Delete the /var/lib/rabbitmq/mnesia/rabbit@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9
LJI6DKMI09L/recovery.dets file.
5. Run the systemctl restart rabbitmq-server command to restart the rabbitmq-server service.
6. Contact ZTE technical support.
Expected Result
The message server fails to be connected although the RabbitMQ service is already started.
Probable Cause
Action
Some nodes may be down and fail to be connected. You can run the rabbitmqctl
cluster_status command to view the cluster status.
[{nodes,[{disc,['rabbit@gltest-ctrl']}]},
{running_nodes,['rabbit@gltest-ctrl']},
{cluster_name,<<"rabbit@gltest-ctrl">>},
{partitions,[]},
{alarms,[{'rabbit@gltest-ctrl',[]}]}]
In normal cases, the "running_nodes" column contains all nodes. If all nodes are not
contained, this indicates that some nodes are not running. You can run the systemctl
restart rabbitmq-server command to start the service.
3. If the fault persists, contact ZTE technical support.
Expected Result
Probable Cause
This may be caused by a message backlog in the service queue. Run the rabbitmqctl
list_queues --local | grep <service queue name> command. The second column after the
service queue name is the number of consumers. If it is 0, no consumer is consuming the messages.
The service processing is slow. Check the system load (top) and the memory and CPU
usage.
Action
1. Check whether there is a backlog of messages. Run the rabbitmqctl list_queues --local|awk
'$2>50' command on all control nodes. If a value is returned, this indicates that there is a
backlog of messages. If the backlog persists for five minutes, contact ZTE technical support.
2. Check the service log. If the information such as "has taken %ss to process msg with"
is output before link disconnection, this indicates that the service processing is too slow.
Check whether the corresponding service is operating properly. If the fault persists, contact
ZTE technical support.
Expected Result
The message server can be connected properly, and there is no backlog of messages.
The heat process has created thousands of queues, so nova cannot be connected to rabbitmq.
Probable Cause
Each time the heat process restarts, a message queue starting with heat will be created, and
the queue will not be automatically deleted when there is no client connection. The name of
the queue generated every time the heat is restarted is followed by a random uuid. Therefore,
the queue names are different. When the heat process is restarted next time, a new queue is
generated and the previous queues are not deleted. For example, the heat process has created
16 queues. In this case, the 16 queues will be added to the rabbitmq server each time the HA
node is switched over. After multiple switchovers, the rabbitmq queues will quickly reach
the limit, exhausting the memory and restricting the connections.
Action
1. Configure a message queue policy to enable the automatic deletion function. For example,
when the message queue starting with heat-engine-listener is not connected for one hour, it
will be deleted automatically. The method is as follows:
rabbitmqctl set_policy ha-all "." '{"ha-mode":"all", "ha-sync-mode":"automatic"}' --apply-to all --priority 0
rabbitmqctl set_policy heat_rpc_expire "^heat-engine-listener\\." '{"expires": 3600000, "ha-mode":"all", "ha-sync-mode":"automatic"}' --apply-to all --priority 1
Note
These commands can be executed during the rabbitmq installation: after the rabbitmq software is
installed and the rabbitmq service is started, execute them to configure the message queue
policies on the rabbitmq server.
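To verify that the policies have taken effect, the configured policies can be listed; a minimal check sketch (list_policies is a standard rabbitmqctl subcommand, and the output format may differ between versions):
rabbitmqctl list_policies
# Both the ha-all and heat_rpc_expire policies should be displayed with their patterns and definitions.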
Expected Result
After a manual restart, the physical machine is automatically restarted every few minutes,
and there is systemd-logind information in the /var/log/messages log file.
Probable Cause
The physical blade is not inserted tightly or the ejectors are not in place.
Action
1. Check whether the physical blade is inserted tightly and the ejectors are in place.
Yes → Step 3.
No → Step 2.
2. Insert the physical blade tightly and put the ejectors in place. Check whether the fault is
resolved.
Yes → End.
No → Step 3.
3. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether a cloud drive with a mirror is
successfully created, the status of the cloud drive is "downloading" and then becomes "error".
When you check /var/log/cinder/volume.log on the control node, the log shows that
the cloud drive cannot be created, see Figure 6-1.
Probable Cause
Action
1. Check whether the mirror property description of the cloud drive contains Chinese
characters.
On the control node, run the glance image-show test command ("test" is the mirror name).
Figure 6-2 shows an example of the check result.
Figure 6-2 Checking Mirror Property Information (Chinese Characters Are Found)
In the result, the value of Property "description" contains Chinese characters, which
cannot be parsed.
Yes → Step 2.
No → Step 3.
2. Run the following command to modify the mirror property and re-create the cloud drive with
a mirror:
glance image-update --property description="test" test
In this command, description is the mirror property to be modified, "test" is the new property
value, and the final test is the mirror name.
3. Check whether the mirror property description of the cloud drive contains Chinese
characters.
On the control node, run the glance image-show test command (test is the mirror name).
Figure 6-3 shows an example of the check result.
Figure 6-3 Checking Mirror Property Information (No Chinese Characters Are Found)
Yes → Step 4.
No → End.
4. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether the cloud drive with a mirror is
successfully created, the status of the cloud drive is "available".
When you run the cinder list command to check whether a cloud drive is successfully created,
the status of the cloud drive is "creating" and then becomes "error".
When you check /var/log/cinder/volume.log on the control node, the following
information is displayed:
Probable Cause
Action
1. On the control node, run the following command to obtain the address of the disk array
management interface:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Figure 6-4 shows an example of the output of this command.
Figure 6-4 Checking the Address of the Disk Array Management Interface
In Figure 6-4, EternusIP is 10.43.230.23, that is, the address of the disk array
management interface.
2. Enter the address of the disk array management interface in the address bar of an IE
browser to log in to the disk array management page.
3. Select RAID GROUP. The RAID GROUP page is displayed.
4. Enter the corresponding RAID group. Figure 6-4 shows an example that the RAID group is
CG_raid_04.
5. Click the Volume Layout tab, see Figure 6-5.
6. Check whether the disk array has sufficient continuous space. If the size of the cloud drive is
greater than the maximum value of Free, the cloud drive cannot be created.
Yes → Step 9.
No → Step 7.
7. Check whether a cloud drive whose size is smaller than or equal to Free can satisfy the
requirements of the user.
Yes → Step 9.
No → Step 8.
8. Create a cloud drive whose size is smaller than or equal to Free, and check whether the
cloud drive is successfully created.
Yes → End.
No → Step 9.
9. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether the cloud drive is successfully
created, the status of the cloud drive is "available".
6.1.3 Cannot Create a Cloud Drive With a Mirror (Based on an IPSAN Disk
Array)
Symptom
When you run the cinder list command to check whether a cloud drive with a mirror is
successfully created, the status of the cloud drive is "downloading" and then becomes "error".
When you check /var/log/cinder/volume.log on the control node, the following
information is displayed:
Probable Cause
The link between the service interface of the disk array and the control node is abnormal.
Action
1. On the control node, ping the address of the service interface of the disk array and check
whether it can be pinged successfully. The service interface (for example, 162.161.1.208) is
stored in /var/log/cinder/volume.log.
Yes → Step 4.
No → Step 2.
2. Check whether the control node and the service interface of the disk array are properly
connected.
3. Check whether a cloud drive with a mirror can be successfully created.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether the cloud drive with a mirror is
successfully created, the status of the cloud drive is "available".
6.1.4 The Volume With Images Fails to be Created Due to "Failed to Copy
Image to Volume"
Symptom
The volume with images fails to be created and the status of the volume is "error". In this case,
the following information is displayed:
[req-b5e20b82-7755-4712-ad45-9141d1512945 d75bb947cc7e450b83d4b006fc7656ef
[req-b5e20b82-7755-4712-ad45-9141d1512945 d75bb947cc7e450b83d4b006fc7656ef
ae74753aa2034c4aba49400a514d110e - - -]
Run the qemu-img info command to view the image information in the /var/lib/glance/images
directory:
image: 43a17a9f-95d7-4152-ba5f-6476f917a534
cluster_size: 65536
compat: 1.1
Probable Cause
The virtual size of the image is 40 G, while the size of the volume is only 30 G, less than the
virtual size of the image.
Action
Recreate the volume with a size larger than 48 G (1.2 times the virtual size of the image).
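Before recreating the volume, the virtual size of the image can be checked directly. A minimal sketch, assuming the image file path under /var/lib/glance/images (the UUID is the one shown above and is illustrative):
qemu-img info /var/lib/glance/images/43a17a9f-95d7-4152-ba5f-6476f917a534 | grep "virtual size"
# The new volume should be at least 1.2 times the reported virtual size.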
Expected Result
The volumes with images fail to be created in batches and the following information is displayed
in the volume.log file of the cinder:
Probable Cause
If a single volume can be successfully created, while several volumes fail to be created in
batches, it indicates that the fault is caused by the residual path.
Action
If multipath is used, run the multipath -f <device name> command to remove the residual device.
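A minimal sketch of locating and flushing a residual multipath device (the device name spathk is illustrative):
multipath -ll        # list the current multipath devices and their states
multipath -f spathk  # flush the residual device identified above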
Expected Result
6.1.6 The Volume With Images Fails to Be Created on a Fujitsu Disk Array
Symptom
The volume with images fails to be created on a Fujitsu disk array and the log information is as
follows:
error: qemu-img: /dev/disk/by-path/ip-192.168.113.11:3260-iscsi-iqn.2000-09.com.fujitsu:
too small
Probable Cause
The size of the volume on the Fujitsu disk array is less than the virtual size of the image.
Action
1. Check whether the size of the volume is less than the virtual size of the image. For example,
view the image information (output of the qemu-img info command):
image: 5be510e2-09a8-47ff-8d56-8ecfdf7465d0
cluster_size: 65536
compat: 1.1
Yes → Step 2.
No → Step 3.
2. Modify the size of the volume to a value larger than or equal to the virtual size of the image.
Check whether the fault is resolved.
Yes → End.
No → Step 3.
3. Contact ZTE technical support.
Expected Result
The volume with images can be successfully created on the Fujitsu disk array.
6.1.7 The Volume Fails to Be Created and the Status of the Volume Is "
error, volume service is down or disabled"
Symptom
The volume fails to be created and the status of the volume is "error, volume service is down or
disabled". In this case, the following information is displayed in the cinder-scheduler log:
17109062\n']
[req-114bc608-a921-4fb7-af26-1880535f2c40 bd9ffbdac44d48f780343de74ebd5913
No valid host was found. Exceeded max scheduling attempts 3 for volume None
[req-f969943a-51b2-4a40-9aad-23910767a158 bd9ffbdac44d48f780343de74ebd5913
Probable Cause
Run the cinder service-list command to check the status of the cinder-volume service in
the back end. The old host (sbcr13) is disabled and the new host (cinder) configured in the
cinder.conf file is enabled, thus the cinder-volume service corresponding to sbcr13 goes
down. However, sbcr13 is scheduled during the creation of volume scheduling, resulting in the
failure of volume creation.
Action
Expected Result
The volume with images fails to be created due to "_is_valid_iscsi_ip,iscsi ip:()is invalid". In this
case, the log information is as follows:
Probable Cause
The log shows that an error is reported when the service ports of the disk array are pinged.
When you run the ping command manually, the execution result shows that all the service ports
are in good condition. This indicates that the execution permission of the ping command is
incorrect.
Action
1. Correct the execution permission of the ping command so that it can be executed by the
cinder service.
2. Recreate the volume. If the fault persists, contact ZTE technical support.
Expected Result
6.2.1 Cannot Delete a Cloud Drive, the Status of the Cloud Drive is "Error-
Deleting"
Symptom
When you attempt to delete a cloud drive, the status of the cloud drive is "error-deleting", and
the cloud drive cannot be deleted.
Probable Cause
Action
1. On the control node, run the following command to check whether the volume service status
of cinder is active:
systemctl status openstack-cinder-volume.service
Figure 6-6 shows an example of the output of this command. If the Active field is "active", it
indicates that the service is successfully started. Otherwise, it indicates that the service is
not successfully started.
Yes → Step 4.
No → Step 2.
2. Run the following command to reset the status of the cloud drive:
cinder reset-state test_reset, where test_reset is the cloud drive name.
3. Check whether the cloud drive can be successfully deleted.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
6.2.2 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "Failed to
signin.with ret code:1466"
Symptom
A volume fails to be deleted from a ZTE disk array due to "Failed to signin.with ret code:1466".
cinder/openstack/common/periodic_task.py:178
zte_ks.py:1225
volume/drivers/zte/zte_ks.py:99
2015-04-20 09:53:28.342 19975 ERROR cinder.volume.manager [-] Get storage info failed.
2015-04-20 09:53:28.342 19975 TRACE cinder.volume.manager Traceback (most recent call last):
Check the user name and password in the /etc/cinder/cinder_zte_conf.xml file in the
following way:
<config>
<Storage>
<ControllerIP0>10.43.16.21</ControllerIP0>
<ControllerIP1 />
<LocalIP>10.43.179.42</LocalIP>
</Storage>
<LUN>
<ChunkSize>4</ChunkSize>
<AheadReadSize>8</AheadReadSize>
<CachePolicy>1</CachePolicy>
<StorageVd>nas_vd</StorageVd>
<StorageVd>san_vd</StorageVd>
<SnapshotPercent>50</SnapshotPercent>
</LUN>
<iSCSI>
</iSCSI>
<VOLUME>
<Volume_Allocation_Ratio>20</Volume_Allocation_Ratio>
</VOLUME>
</config>
Probable Cause
The user name and password are incorrectly configured in the /etc/cinder/cinder_zte_
conf.xml file.
Action
1. Modify the <UserName> and <UserPassword> fields in the configuration file, for example:
<UserName>admin</UserName>
<UserPassword>admin</UserPassword>
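After the credentials are corrected, the cinder volume service generally needs to be restarted for the change to take effect; a minimal sketch using the service name used elsewhere in this guide:
systemctl restart openstack-cinder-volume.service
systemctl status openstack-cinder-volume.service
# The Active field should be "active".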
Expected Result
6.2.3 A Volume Fails to Be Deleted From a ZTE Disk Array Due to "error-
deleting"
Symptom
A volume fails to be deleted from a ZTE disk array due to "error-deleting". In the volume.log
file, error code 16917030 is returned after the cinder command is executed.
Probable Cause
The volume does not properly exit the mapping group after it is added to the group. This is
because the stored resources are not properly released when you delete the residual VMs by
manually modifying database tables.
Action
1. Log in to the disk array management interface and find the volume concerned.
2. Manually remove the mapping group from the volume.
3. Run the cinder reset-state --state error volume_id command to reset the status of the
volume.
4. Delete the volume.
5. If the fault persists, contact ZTE technical support.
Expected Result
The cinder service operates properly, but no response is returned and no log is generated when
a volume is mounted, deleted, or unmounted.
Probable Cause
The fault occurs if the host name or the host field in the cinder.conf file is modified,
because the volume still belongs to the original host.
If the name of the volume service is modified after a new host is created under the volume
service, or after the host field in the cinder.conf file is modified, operations on the volume
fail because messages sent to the original service cannot be processed.
Action
1. Run the cinder show volume_id command to find the os-vol-host-attr:host field, for
example, opencos263ed0ae9a0440eca446d6155b56b946@IPSAN.
2. Run the cinder service-list command to check whether the volume service corresponding
to the host goes down.
If yes and the new service is named after a new host name, the fault is caused by the
modification of the host name.
3. In the /etc/cinder/cinder.conf file, modify host=** back to the original value
(host=opencos263ed0ae9a0440eca446d6155b56b946); see the sketch after this procedure.
4. Run the systemctl restart openstack-cinder-volume command to restart the volume
service.
5. Delete or dismount the volume.
6. Repeat Step 3 to modify the host name to tecs1; otherwise, the volume corresponding to the
host named tecs1 cannot be operated.
7. If the fault persists, contact ZTE technical support.
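A minimal sketch of the cinder.conf setting described in Step 3, assuming the option is located in the [DEFAULT] section (the value is the example from this procedure):
[DEFAULT]
host = opencos263ed0ae9a0440eca446d6155b56b946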
Expected Result
When you run the cinder list command to check whether a cloud drive is successfully mounted
to the VM, the status of the cloud drive changes from "attaching" to "available".
When you check /var/log/cinder/volume.log on the control node, error information is
displayed, see Figure 6-7.
Figure 6-7 Cannot Mount a Cloud Drive When a Fujitsu Disk Array Is Used
Probable Cause
The affinity group of the service interface on the disk array is set to off.
Action
1. On the control node, run the following command to obtain the address of the disk array
management interface:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Figure 6-8 shows an example of the output of this command. EternusIP is 10.43.230.23,
that is, the address of the disk array management interface.
Figure 6-8 Obtaining the Address of the Disk Array Management Interface
2. Telnet to the disk array management interface by using the username root and password
root (enter a password according to the actual situation).
3. Run the following command to set the affinity group of the disk array to enable:
set iscsi-parameters -port all -host-affinity enable
4. Check whether the cloud drive can be successfully mounted.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether the cloud drive is successfully
mounted, the status of the cloud drive is "in-use".
6.3.2 Cannot Mount a Cloud Drive When IPSAN Back-End Storage Is Used
Symptom
When you run the cinder list command to check whether a cloud drive is successfully mounted
to the VM, the status of the cloud drive changes from "attaching" to "available".
When you check /var/log/nova/nova-compute.log on the host where the VM is located,
the following information is displayed:
Probable Cause
The link between the service interface of the disk array and the computing node is abnormal.
Action
1. Telnet to the control node, and run the following command to obtain the service interface
addresses of the disk array:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Note
The management interface address configuration files of different types of disk arrays have different
file names, file contents, and management page styles. Refer to the actual conditions.
Figure 6-9 shows an example of the output of this command. EternusISCSIIP indicates a
service interface address of the disk array.
Figure 6-9 Obtaining the Address of the Service Interface of the Disk Array
2. On the computing node, ping a service interface address of the disk array and check
whether it can be pinged successfully.
Yes → Step 5.
No → Step 3.
3. Restore the link between the computing node and the service interface of the disk array, for
example, by replacing faulty cables, so that their addresses can be pinged successfully from
each other.
4. Check whether the cloud drive can be successfully mounted.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether the cloud drive is successfully
mounted, the status of the cloud drive is "in-use".
When you run the cinder list command to check whether a cloud drive is successfully
unmounted from the VM, the status of the cloud drive is always "detaching".
When you check /var/log/cinder/volume.log on the control node, error information is
displayed, see Figure 6-10.
Probable Cause
An admin user logs in to the disk array management page and does not log out properly.
Action
1. On the control node, run the following command to obtain the address of the disk array
management interface:
cat /etc/cinder/cinder_fujitsu_eternus_dx.xml
Figure 6-11 shows an example of the output of this command. EternusIP is
10.43.230.23, that is, the address of the disk array management interface.
Figure 6-11 Obtaining the Address of the Disk Array Management Interface
2. Enter the address of the disk array management interface in the address bar of an IE
browser and check whether you can log in to the disk array management page as the
admin user.
Yes → Step 4.
No → Step 3.
3. Troubleshoot connection problems. After the problems are solved, you can log in to the disk
array management page as the admin user.
4. On the disk array management page, click logout.
5. Check whether the cloud drive can be successfully unmounted.
Yes → End.
No → Step 6.
6. Contact ZTE technical support.
Expected Result
When you run the cinder list command to check whether the cloud drive is successfully
unmounted from the VM, the status of the cloud drive is "available".
Probable Cause
Action
1. Run the following command to check whether the storage space of the mirror server (Avail
parameter in Figure 6-12) can meet the requirement for uploading the mirror file. Figure 6-12
shows an example of the output of this command.
df -h /var/lib/glance/images
Yes → Step 4.
No → Step 2.
2. Delete unwanted files from the /var/lib/glance/images directory.
3. Check whether the mirror can be successfully uploaded.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
Probable Cause
The authorized user of the /var/lib/glance/images directory is not glance, and thus you
cannot store the mirror as the glance user.
Action
1. On the control node, run the following command to check whether the user of the mirror
storage directory is glance:
ll /var/lib/glance/
Figure 6-13 shows an example of the output of this command. In this example, the user is
root, not glance.
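The remaining fix steps are not shown in this section. A likely remediation sketch, assuming the standard glance user and the image directory named above:
chown -R glance:glance /var/lib/glance/images
ll /var/lib/glance/
# The owner of the images directory should now be glance.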
Expected Result
Probable Cause
If the IP address and the MAC address in a VM are inconsistent with the external ones, security
groups may cause VM communication failures. Therefore, in some cases, security groups need
to be disabled.
Action
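The detailed steps are not given in this section. One possible approach, as a hedged sketch, is to disable security on the affected VM port with the neutron client (the port ID is illustrative; the exact attribute syntax may vary with the neutron client version):
neutron port-update 9ada8ab5-dcca-449b-be95-27122bcce840 --no-security-groups
neutron port-update 9ada8ab5-dcca-449b-be95-27122bcce840 --port-security-enabled=false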
Expected Result
The security groups are successfully disabled and the VM operates properly.
Probable Cause
Action
1. Check whether the state of each compute node is normal. Log in to the controller node in
SSH mode and view the status by using the nova service-list command. If the state is up,
this indicates the compute node is normal.
Yes → Step 2.
No → Refer to 7.1.2.1 Nova-compute Status of the Compute Node is Down.
2. The requested resources cannot meet the requirements, and the related destination nodes
are filtered by the filter. Check the filter logs to determine which filter fails.
Action
Expected Result
Action
Expected Result
Action
Expected Result
Action
Expected Result
This fault occurs when resources are allocated to VMs. Generally, the fault that no network
resources are allocated (mainly for macvtap and sriov VMs) is caused by configuration errors.
Action
1. Check whether the node on which VMs will be deployed supports the VM type (direct or
macvtap). To confirm it, you can view all the agents by running the neutron agent-list
command.
2. Check whether the host on which VMs will be deployed is in correct macvtap or sriov mode.
If not, modify it. You can check the NIC switch agents:
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| alive | True |
| binary | neutron-sriov-nic-agent |
| configurations | { |
| | "sriov_vnic_type": "direct", |
| | "devices": 0, |
| | "device_mappings": { |
| | "physnet3": "enp132s0f0", |
| | "physnet2": "enp2s0f0" |
| | } |
| | } |
| description | |
| host | control5 |
| id | c6934a55-9e36-4879-a55b-914d8387a795 |
| topic | N/A |
3. Check the configuration file. When the direct or macvtap VMs are deployed in a vlan, the
network plane configurations of nova and neutron must be the same. For example,
There are three physical planes in the nova configuration:
pci_passthrough_whitelist= [{ "address":"0000:81:00.1","physical_network":"physnet1" },
{ "address":" 0000:02:00.0","physical_network":"physnet2" },
{ "address":"0000:84:00.0","physical_network":"physnet3" }]:
There should also be three physical planes in the neutron configuration file. Otherwise,
when the nova requests resources from the neutron, no resources will be returned due to
mismatch of the planes.
vi /etc/neutron/plugins/ml2/ml2_conf.ini
network_vlan_ranges =physnet1:2001:2050,physnet2:2001:2050,physnet3:2001:2050
vi /etc/neutron/plugins/sriovnicagent/sriov_nic_plugin.ini
physical_device_mappings = physnet2:enp2s0f0,physnet3:enp132s0f0
a. Check the port information (for example, the output of the neutron port-show command):
+-----------------------+---------------------------------------+
| Field | Value |
+-----------------------+---------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| bandwidth | 0 |
| binding:host_id | |
| binding:profile | {} |
| binding:vif_details | {} |
| binding:vif_type | unbound |
| binding:vnic_type | direct |
| bond | 0 |
| device_id | |
| device_owner | |
| extra_dhcp_opts | |
| fixed_ips | |
| id | 9ada8ab5-dcca-449b-be95-27122bcce840 |
| mac_address | 00:d0:d0:6e:00:83 |
| name | ZTE-UMAC-83-UIPB1-S_vMAC_55_SIPI_port |
| network_id | 1ec95950-09a1-4e13-84b9-37d640fd05f4 |
| security_groups | |
| status | DOWN |
| tenant_id | 0d6a1d6602db4021899a29b1e98b3d89 |
+-----------------------+---------------------------------------+
b. Check the network information (for example, the output of the neutron net-show command):
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| admin_state_up | True |
| attached_port_num | 0 |
| bandwidth | 0 |
| id | 1ec95950-09a1-4e13-84b9-37d640fd05f4 |
| max_server_num | 50 |
| mtu | 1500 |
| name | vMAC_55_SIPI |
| provider:network_type | vlan |
| provider:physical_network | physnet3 |
| provider:segmentation_id | 2009 |
| router:external | False |
| shared | True |
| status | ACTIVE |
| subnets | |
| tenant_id | 0d6a1d6602db4021899a29b1e98b3d89 |
| vlan_transparent | False |
+---------------------------+--------------------------------------+
c. Confirm whether the physical network plane is correct in each configuration file. If not,
modify the configuration file as required.
4. If the fault persists, contact ZTE technical support.
Expected Result
A VM is successfully created.
Symptom
Probable Cause
The time of the compute node is different from that of the controller node, resulting in the
vif_type=binding_failed error.
The network service status is abnormal.
The configuration file is incorrect.
Action
1. Run the date command on the compute node and the controller node respectively to check
whether the time on the compute node and the controller node is the same.
Yes → Step 4.
No → Step 2.
2. Check whether the chronyd service is started on the controller node and the compute node
by running the systemctl status chronyd command. If Active: active (running) is displayed,
the service is started successfully, for example:
Active: active (running) since Thu 2019-12-05 16:52:11 CST; 23h ago
Docs: man:chronyd(8)
man:chrony.conf(5)
Tasks: 1
Memory: 348.0K
CGroup: /system.slice/chronyd.service
└─20126 /usr/sbin/chronyd
client/server...
chronyd version 3.2 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH
+SI...+DEBUG)
client/server.
Yes → Step 4.
No → Step 3.
3. If the chronyd service is not in running status, start it. Method:
a. Run the systemctl restart chronyd command to start the chronyd service.
b. Check the time of all nodes to ensure that the time is the same.
Yes → Next step.
No → Step 8.
c. Run the following command to start the neutron-openvswitch-agent service.
systemctl start neutron-openvswitch-agent.service
4. Telnet to the host where the VM is located. Perform the following operations in accordance
with VM types.
If… Then…
It is an OVS VM a. Run the following command to view the status of the neutron-
openvswitch-agent service.
systemctl status neutron-openvswitch-agent.service
b. If the service fails to be started, run the following command to
restart it.
systemctl restart neutron-openvswitch-agent.service
It is an SR-IOV VM a. Run the following command to view the status of the neutron-
sriov-nic-switch-agent service.
systemctl status neutron-sriov-nic-switch-agent
b. If the service fails to be started, run the following command to
restart it.
systemctl restart neutron-sriov-nic-switch-agent
Note
Check the service status. In the execution result, if Active is displayed, this means the service is
normally started. If other states are displayed, this means the service fails to be started.
Note
In the case of an SR-IOV VM, it is also necessary to check the /etc/nova/nova.conf file to see if
the network port bus_info specified in pci_passthrough_whitelist is the same as the actual NIC value.
Yes → Step 8.
No → Step 7.
7. Modify the configuration. Restart the openstack service. Check whether the fault is fixed.
Yes → End.
No → Step 8.
8. Contact ZTE technical support.
Expected Result
Symptom
A VM fails to be created. The execution result of the nova show command shows binding
failed.
Action
1. Run the neutron agent-list command to view the service status of the neutron agents.
If the alive field is not :-) but XX, the probable errors are as follows:
If some services are XX, run the date command to check whether the time of the
compute node and the controller node is the same. If not, manually configure the same
time or enable the NTP service.
If all services are XX, possibly the message service is abnormal. Check and restart the
qpid or rabbitmq-server service (or contact the message-related personnel to locate the
fault).
Active: active (running) since Thu 2015-10-08 15:52:01 CST; 2 days ago
CGroup: /system.slice/rabbitmq-server.service
Oct 08 15:51:54 tecs162 systemd[1]: Starting LSB: Enable AMQP service provided by
RabbitMQ broker...
Oct 08 15:52:01 tecs162 systemd[1]: Started LSB: Enable AMQP service provided by
RabbitMQ broker.
For this problem, view the logs of any agent. You can see related prompts of message
service failure.
again in 1 seconds.
again in 3 seconds.
again in 3 seconds.
again in 3 seconds.
again in 5 seconds.
again in 5 seconds.
again in 5 seconds.
If some services on some boards are XX, the services may be faulty. Check the service
status directly. If the status is not active, make sure the service is normal.
enabled)
Active: active (running) since Wed 2015-08-19 11:35:19 CST; 5 days ago
neutron-sriov-nic-switch-agent.service
Warning: Journal has been rotated since unit was started. Log output is incomplete
or unavailable.
2. It may be that you want to deploy an sriov VM, but you configure a macvtap VM. You can
use the following method to determine the VM type.
Run the ip link show command. If there are many enp interfaces, the VM type is
macvtap. Otherwise, the VM type is sriov.
143: enp8s17f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
144: enp8s17f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
145: enp8s17f5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
146: enp8s17f7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
147: enp8s18f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
148: enp8s18f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
149: enp16s16f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
150: enp8s18f5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
151: enp16s16f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
[DEFAULT]
sriov_vnic_type = direct
ixgbe_vf_num = 63
igb_vf_num = 7
ixgbe_num = 16
Expected Result
A VM is successfully created.
A VM using an SR-IOV NIC fails to be started, and the status of the VM displayed on the TECS
management portal is failed.
Probable Cause
The SR-IOV NIC uses the VT-d function. If the VT-d function is not enabled, this error will
occur.
Action
1. Log in to the compute node, and run the following command to check the log records of the
libvirt.
cat /var/log/libvirt/libvirtd.log
Check whether the following error information exists:
Yes → Step 2.
No → Step 4.
2. The VT-d option is not enabled in the BIOS configuration of the server. You need to enable
the corresponding option in the BIOS. For different servers, the path is different. For
example,
chipset > North Bridge > IOH Configuration > Intel VT for Directed I/O Configuration >
Intel VT-d
3. Restart the VM, and check whether the VM can be started properly. When the VM is started
normally, the status of the VM displayed on the TECS management portal is running.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Symptom
When the VM is started, the status of the VM displayed in the Provider GUI is failed. Check
/var/log/libvirt/libvirtd.log on the compute node. There is a record that shows hvm not
supported.
Probable Cause
Action
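The original action steps are not included in this section. As a hedged sketch, an "hvm not supported" record usually means that hardware virtualization is not available on the compute node, which can be checked before adjusting the BIOS settings:
egrep -c '(vmx|svm)' /proc/cpuinfo
# A result of 0 means hardware virtualization (Intel VT-x/AMD-V) is not exposed to the OS.
# In that case, enable the virtualization option in the BIOS of the compute node and restart it.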
Expected Result
Symptom
Probable Cause
In the openvswitch_agent.ini configuration file on the compute node, the network type
can only be VLAN.
Action
1. On the controller node, modify the /etc/neutron/plugins/ml2/ml2_conf.ini file as follows:
tenant_network_types = vlan,vxlan
//If vxlan is placed before vlan, when a network is created, the default network type
//is vxlan.
2. Run the following command to restart the neutron-server service on the controller
node.
systemctl restart neutron-server.service
3. On the compute node, modify the /etc/neutron/plugins/ml2/openvswitch_
agent.ini file as follows:
[OVS]
local_ip = 10.0.0.3 //The IP address here is the management port IP address of the local node.
[AGENT]
tunnel_types = vxlan
# ovs-vsctl show
7d875903-6472-49c4-9b66-d830cd740ecd
Port fabricright
Port br-fabric
Interface br-fabric
type: internal
Bridge br-tun
Port patch-int
Interface patch-int
………………
Expected Result
[req-c0cdccd2-3703-42f7-a471-614fee84ab53 3aee999c7bf4418d84881ba1f1fb4b3c
Probable Cause
A VM cannot migrate from a physical machine with higher CPU performance to another
physical machine with lower CPU performance.
Action
1. Confirm the CPU information of the source end and destination end. Run the following
command on the source end and the destination end respectively and view the flags field.
The CPU type is indicated by the field.
a. Log in to the host where the VM is located and the destination host in SSH mode.
b. Run the lscpu |grep Flags command.
c. Compare the flags fields. The flags field of the destination host must contain that of the
node where the VM is located.
2. A VM cannot migrate from a physical machine with higher CPU performance to another
physical machine with lower CPU performance. If you need to migrate the VM to a node with
a lower CPU type, you can perform cold migration.
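As a hedged sketch of the comparison in Step 1, the flags of both hosts can be collected into files and compared (the host names in the file names are illustrative):
# On each host, collect the CPU flags:
lscpu | grep Flags > /tmp/flags_$(hostname).txt
# Copy both files to one node and compare them; every flag of the source host
# should also appear in the flags of the destination host.
diff /tmp/flags_source-host.txt /tmp/flags_destination-host.txt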
Expected Result
If the CPU type meets the migration requirements, the live migration operation is successful.
Hot migration of a VM within an AZ succeeds, but hot migration of the VM to another AZ fails.
Probable Cause
An AZ is specified upon the deployment of the VM, and AvailabilityZoneFilter is enabled for hot
migration. Thus, physical devices of other AZs are filtered, and no host can be selected during
the hot migration of the VM.
Action
Expected Result
7.2.3 Destination Host Has Not Enough Resources (Not Referring to Disk
Space)
Symptom
During hot migration, it is detected that the destination host does not have enough resources,
and thus the migration fails. When you check the /var/log/nova/nova-conductor.log file on the
control node, the following information is displayed:
[req-f9bc9950-528d-49a2-8ad0-39d4ac8bf542 2ee45765eb0f49c2a1a0b47b8e5b2b71
ff17befb-2685-4de9-91fd-13a56a7109ea to NJopencos2:
Probable Cause
The destination host does not have enough resources. The log shows "lack of memory".
Action
1. On the control node, run the following command to obtain the destination hypervisor name.
In most cases, the destination hypervisor name is consistent with the destination host name.
nova hypervisor-list
2. Run the following command to check whether the destination host lacks resources. The
required resources can be learned from the ram and vcpus values in the flavor of the VM.
nova hypervisor-show <destination hypervisor name>
Yes → Step 3.
No → Step 5.
3. Delete redundant VMs from the destination host, or add more computing nodes to satisfy
migration requirements.
4. Perform hot migration again, and check whether the fault is removed.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
(Request-ID: req-0a9e7d2f-19ae-49a4-bb6c-8ce4deff7d58)
Probable Cause
Action
If... Then...
The value of "available on destination host" is a positive number.
If you perform the migration through CLI, add the --disk-over-commit parameter when you run
the nova live-migration command (see the sketch after this table). If you perform the migration
through GUI, select Disk Over Commit.
The value of "available on destination host" is a negative number.
Select another destination host, or clear the disk space of the current destination host until it is
sufficient for hot migration.
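A minimal CLI sketch of the first branch (the VM UUID and destination host name are illustrative placeholders):
nova live-migration --disk-over-commit <VM uuid> <destination host>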
Expected Result
Probable Cause
The nova-compute service of the computing node on the source host is down.
Action
1. On the source host, run the following command to restart the nova-compute service:
systemctl restart openstack-nova-compute
Expected Result
The nova-compute service of the source host is restored to normal, and hot migration
succeeds.
After the live migration, the status of the VM is changed to error. After the nova show
command is executed, the fault field displays the following error information:
File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\",
Probable Cause
Action
Expected Result
After the network problem is solved, the live migration operation is successful.
During cold migration or resizing, the following information is displayed in the /var/log/
nova/nova-compute.log log on the source host.
Probable Cause
To perform cold migration or resizing, SSH login as the nova user (without a password) is
required. The probable cause of this fault is that the nova user is not correctly configured.
Action
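The original action steps are not shown in this section. A hedged verification sketch for the passwordless SSH requirement described above (the destination host name is an illustrative placeholder):
su - nova -s /bin/bash -c "ssh <destination host> hostname"
# If a password prompt appears or the command fails, configure SSH keys for the nova user
# between the compute nodes.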
Expected Result
7.3.2 Error "No valid host was found" Reported During Migration
Symptom
There is only one computing node, or the computing node and control node are co-located.
During cold migration or resizing, the following information is displayed in the /var/log/
nova/nova-api.log file on the source host.
Probable Cause
It is not allowed to perform cold migration or resizing on the same computing node.
Action
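The original action steps are not shown in this section. As a hedged sketch, when there is only one computing node, resizing on the same node can be allowed with the standard nova.conf option below; restart the nova services afterwards:
# /etc/nova/nova.conf
[DEFAULT]
allow_resize_to_same_host = True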
Expected Result
Probable Cause
Action
1. Log in to the TECS portal. Select Cloud Mgmt. > Compute > Instance. The Instance page
is displayed, on which VM specifications and other information are displayed.
2. Check the Specification column to obtain the disk value currently used by the VM (root
disk, temporary disk).
3. During the resize operation, select the VM specifications whose disk value is larger than the
current disk value. Check whether the alarm is cleared.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
Resizing succeeds.
After cold migration or resizing is performed on a VM, the VM is always in "verify_resize" status.
Probable Cause
Action
If... Then...
You want to solve this problem for this time only.
a. On the control node, run the following command:
nova resize-confirm <VM's uuid or name>
b. Go to Step 4.
You want to solve this problem permanently.
a. On the computing node, set the resize_confirm_window parameter in the /etc/nova/nova.conf
file, for example, resize_confirm_window=10.
b. Go to Step 3.
Note
"resize_confirm_window=10" means the confirmation time is 10 seconds. You can modify the value as
required. The default value is 0, meaning no automatic confirmation.
3. On the computing node, run the following command to restart the nova-compute service:
systemctl restart openstack-nova-compute
4. Perform cold migration or resizing again. Check whether the fault is removed.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
When cold migration or resizing succeeds, the status of the VM is restored to "active".
Probable Cause
The mirror of the VM is damaged. Thus, cold migration or resizing cannot be performed on the
VM.
Action
Stop the cold migration or resizing on the VM, or use a new mirror to create a VM and then try
again.
Expected Result
After a VM is created with the new mirror, cold migration or resizing can be performed on the
VM.
Probable Cause
Check the nova-compute service on the compute node where the VM is located. It is possible
that the compute service is abnormal.
Action
1. Log in to the controller node and check whether the nova-compute service of the
corresponding node is up.
nova service-list
Yes → Step 3.
No → Step 2.
2. Log in to the compute node through SSH, and check whether the nova-compute service
status is active.
systemctl status openstack-nova-compute
Yes → Step 4.
No → Step 3.
3. Log in to the compute node in the SSH mode, restart the nova-compute service, and check
whether the VM is deleted.
systemctl restart openstack-nova-compute
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
The nova-compute service of the node where the VM is located operates properly and the VM
is deleted successfully.
Probable Cause
Action
1. On the control node, run the following command to restart the cinder-volume service:
systemctl restart openstack-cinder-volume.service
2. Run the following command to check whether the cinder-volume service is successfully
started. In the output of this command, if the Active field is "active", it indicates that the
service is successfully started. Otherwise, it indicates that the service is not successfully
started.
systemctl status openstack-cinder-volume.service
Yes → Step 3.
No → Step 4.
3. Try again and check whether the VM can be deleted.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
Probable Cause
The communication process is abnormal. Thus, the neutron service cannot be connected to
release network resources when you attempt to delete the VM.
Action
1. On the controller node, check the rabbit-related configuration in the /etc/neutron/neutron.
conf file, for example:
#rabbit_host = localhost
#rabbit_hosts = $rabbit_host:$rabbit_port
In the result, if a line is commented out with #, the configuration does not take effect;
otherwise, the configuration takes effect. The qpid-related configuration is commented out
with #, while the rabbit configuration is enabled. By default, rabbitmq communication is used.
2. In the case of RABBITMQ communication, run the following command to restart the
rabbitmq service.
systemctl restart rabbitmq-server
3. Delete the VM again, and check whether the VM is deleted successfully.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Expected Result
After a VM is started, some services are not started or fail to be started, for example, the
network fails to be started.
Probable Cause
Services are started in sequence. If some services are not started, the subsequent services fail
to be started. For example, when the network is started, if the corresponding network port name
is not ready, the network fails to be started.
Action
1. After the system is started, manually start or restart the service (such as the network
service) with the following commands.
Command for starting the service: systemctl start network.service
Command for restarting the service: systemctl restart network.service
2. Run the following command to check whether the service is started properly.
systemctl status network.service
If the Active field value is active, this indicates that the service is started properly.
If the following information is displayed, this indicates that the service is not started
successfully.
Active: failed (Result: exit-code) since Fri 2015-07-10 00:13:17 CST; 10s ago
//here, the Active field value is failed not active, indicating startup failure.
Yes → End.
No → Step 3.
3. Contact ZTE technical support.
Expected Result
The file system is damaged. As a result, the grub information is lost, the VM operating system
cannot be started, and the "No bootable device" information is displayed.
Action
Expected Result
The VM startup time is too long. The VM status in the TECS management portal is spawning.
After the normal startup time expires, the startup is not completed yet. In normal cases, the
VM is started within 10 minutes, and the status in the TECS management portal changes to
running.
Probable Cause
If the hard disk space is large, it may take a long time or even several hours to repair the file
system of the hard disk.
The system may be performing the FSCK operation, which takes a long time. After the FSCK is
completed, the system can be powered on normally.
Action
1. Connect the VM console and check whether the FSCK operation is being performed.
If the following information is displayed, this indicates that the FSCK operation is being
performed.
Yes → Step 2.
No → Step 3.
2. After the FSCK operation is completed, check whether the VM can be started normally.
If the TECS management portal shows that the VM status is running, the VM is started
properly.
Yes → End.
No → Step 3.
3. Contact ZTE technical support.
Expected Result
The services of TECS are operating properly, and the cloud management system is normal.
After a VM that is operating properly is powered off, it cannot be powered on again.
Probable Cause
The image of the VM is abnormal, for example, the image is damaged due to an abnormal
operation.
Action
1. Log in to the Provider GUI, find the corresponding VM, enter the VM management page,
and check the console log to determine whether the problem is inside the VM.
Yes → Step 2.
No → Step 3.
2. Perform one of the following operations as required.
If… Then…
The VM uses local storage (the incremental image is not damaged or the damage is not serious).
a. Use the local basic image and the incremental image to form a new image.
b. Redeploy a VM by using the new image.
This method can save the previous scenario where the VM operates properly to the maximum extent.
The VM uses local storage (the incremental image is damaged and cannot be used).
Import the original image through the rebuild method to generate a new duplicate VM.
This method can restore the scenario where the VM is just deployed.
Active: active (running) since Thu 2015-07-02 09:04:37 CST; 2 weeks 0 days ago
CGroup: /system.slice/openstack-nova-compute.service
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
Probable Cause
Action
5. Collect device information, and contact the technical support of the disk array manufacturer.
Check whether the fault is fixed.
Yes → End.
No → Step 6.
6. If there is no alarm for the disk and the connection indicator is normal, check whether
the VM can be started properly. When the VM is started normally, the status of the VM
displayed on the TECS management portal is running.
Yes → End.
No → Step 7.
7. Contact ZTE technical support.
Expected Result
During the operation, the VM suddenly goes to error status. Observed from the service layer,
the VM is down and its status is fault. The VM cannot be pinged or accessed. When entering
the Provider GUI or running the nova list --all-tenants command, the VM status is error.
Action
1. Log in to the Provider GUI, find the corresponding VM, and select soft restart or hard restart
of the cloud host. Check whether the fault is fixed.
Yes → End.
No → Step 2.
2. Log in to the controller node, and run the nova list --all-tenants command to view the VM
uuid.
3. Run the nova reboot command to restart the VM. For example:
nova reboot --hard <VM uuid>
Expected Result
After a manual restart, an abnormal restart, or a global restart of the VM, the VM status is
Running, but the service software in the VM cannot be started properly.
Action
1. Check the power-on print information. It shows that the VM is repeatedly restarted in the
simulated boot phase, or the VM print information is abnormal.
2. Log in to the Provider GUI, find the corresponding VM, and select soft restart or hard restart
of the cloud host.
3. If the fault persists, contact ZTE technical support.
Expected Result
The VM fails to be started, and the status of the VM displayed on the TECS management portal
is failed.
Probable Cause
Action
1. Log in to the compute node where the VM is located, and run the following command to
check the libvirt log.
cat /var/log/libvirt/libvirtd.log
Check whether the following error information exists:
Yes → Step 2.
No → Step 5.
2. Run the following command to check the huge page memory information of the physical
machine.
cat /proc/meminfo
An example of the output result is as follows:
AnonHugePages: 2359296 kB
HugePages_Total: 5120
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
3. Check the number of idle huge pages (HugePages_Free) and whether it meets the
requirements of the VM.
Yes → Step 6.
No → Step 4.
4. Perform the following steps to increase the huge page memory:
a. Run the vi /etc/grubtool.cfg command to open the grubtool.cfg file.
Note
In the grubtool.cfg file, hugepage_num refers to the number of huge pages, which is an
integer, and hugepage_size refers to the size of a huge page. In principle, 30 G memory needs to
be reserved for the OS on the compute node.
If the network of the disk array is suddenly interrupted during normal operation, after a period of
time, the VM that is started through the volume may be suspended and cannot be recovered.
The cluster controller node can be recovered automatically.
Action
1. After the network of the disk array is disconnected, the disks mounted from the disk array
to the host or VM are set to read-only by the OS. Run the mount command to view the disk
array. For example, in the following result, if the attribute is not rw, this indicates that the file
system cannot be written. Generally, the read-only mode is ro.
/dev/mapper/spathk on /var/lib/nova/instances/4aeabb8e-2735-427d-91bb-9cd8f41bca63/sysdisk
/dev/mapper/spathm on /var/lib/nova/instances/64ea09b1-c195-4957-900f-d731282523bd/sysdisk
2. On the cluster controller node, for the mariadbl and other devices mounted from the disk
array, the cluster can detect read-only and perform switchover. Generally, the VM can be
recovered automatically. If the VM cannot be recovered, restart the cluster controller node.
3. Perform the following operations as required.
If it is a ZTE disk array and ZXOPENCOS_V01.01.10.P5B1I192 or a later version developed
by ZTE is used, perform a hard restart on the Provider GUI, or use the nova reboot --hard
uuid command to restart the VM.
For the VM using a volume, if it is a ZTE disk array and a version earlier than
ZXOPENCOS_V01.01.10.P5B1I192 developed by ZTE is used, upload the script to the
compute node where the VM is located and then hard reboot the VM.
In the case of a Fujitsu disk array, you can hard reboot the VM from the Provider GUI or
use the nova reboot --hard uuid command to restart the VM.
In the case of Ceph storage, you can hard reboot the VM from the Provider GUI or use
the nova reboot --hard uuid command to restart the VM.
4. For the VMs that use cluster software, such as the EMS and MANO, if the VMs still cannot
be recovered after hard reboot, contact the corresponding product support personnel.
5. If the fault persists, contact ZTE technical support.
Expected Result
After the network of the disk array is recovered for a period of time, the VM that uses a volume
operates properly.
You can successfully ping the external debugging machine from the VM, but cannot ping the
VM from the external debugging machine.
Action
1. Perform the following steps to obtain the addresses of the tap and qvo ports.
a. On the control node, run the nova list command to find the VM concerned.
# nova list
(In the output, the Networks column of the VM in this example is vlannet=192.168.1.2; the
full table is omitted here.)
b. On the control node, run the nova interface-list VM ID command to view all the ports of
the VM.
(In the output, the port in the ACTIVE state has port ID 5e0c98c1-9db3-44b7-be12-9a0a3544bd23,
net ID 01a537f2-91fa-4740-bf26-328dae440884, and MAC address fa:16:3e:7a:ea:c2.)
Find the port ID corresponding to the MAC address of the unreachable port of the VM.
For example, if the MAC address of an unreachable port is fa:16:3e:7a:ea:c2, you can
see that the corresponding port ID is 5e0c98c1-9db3-44b7-be12-9a0a3544bd23. In that
case, the tap port is tap5e0c98c1-9d, and the qvo port is qvo5e0c98c1-9d. "5e0c98c1-9d"
carried by tap and qvo is the first 11 characters of the port ID.
2. Run the tcpdump command respectively on the tap, qvo, and physical ports, and check
whether ARP packets have responses.
Yes → Step 5.
No → Step 3.
3. Check whether the port has firewalls.
Note
The VM uses a Windows operating system, and the firewall function may be enabled on the VM.
Yes → Step 4.
No → Step 5.
4. Disable the firewall function on the VM. Check whether the fault is removed.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
The external debugging machine and VM can be successfully pinged from each other.
You can successfully ping the VM from the external debugging machine, but cannot ping the
external debugging machine from the VM.
Action
1. Verify that the switch, port mode, and VLAN configurations used between the VM and
debugging machine are correct.
2. Verify that the internal port of the VM is in up status.
3. Verify that the IP address acquisition mode is correct.
Perform the following operations as needed.
If... Then...
Addresses are allocated by DHCP. Verify that the IP address acquisition mode in the
configuration file of the internal port of the VM is DHCP.
Static IP addresses are used. Verify that correct IP addresses and gateway are configured.
Floating IP addresses are used. Verify that IP addresses in the network segment of the port
are available.
4. Ping the gateway from the VM, attempt to use tcpdump to capture packets on the physical
port, and check whether ARP packets can be captured.
Yes → Step 5.
No → Step 7.
5. Capture packets on the switch, and check whether packets are sent out.
Yes → Step 8.
No → Step 6.
6. Run the ethtool command to check whether the NIC type is supported by the TECS.
Yes → Step 8.
No → Step 9.
7. Capture packets respectively on the tap, qvb, and qvo ports, and check whether ARP
packets can be captured. For example, run the tcpdump -i qvoa1e55c9e-4f command to
capture packets on the qvo port.
Yes → Step 9.
No → Step 8.
8. Verify that the firewall has no problem, and check whether the external debugging machine
can be successfully pinged from the VM.
Yes → End.
No → Step 9.
9. Contact ZTE technical support.
Expected Result
The external debugging machine and VM can be successfully pinged from each other.
Probable Cause
The number of VLANs created in the network exceeds 64, which is the maximum number of
VLANs supported by an SR-IOV NIC. Therefore, some VLANs are invalid.
Action
Expected Result
The OVS VMs in the same subnetwork of the same network cannot be connected to each
other.
Action
1. Check whether the status of the VMs is active and whether the IP addresses of the VMs are
correctly configured. If there are only the IP addresses allocated by the TECS instead of the
IP address configured, refer to 8.4 DHCP Faults for troubleshooting.
2. Check whether the service status of network.service, openvswitch.service, and neutron-
openvswitch-agent.service is normal.
Yes → Step 4.
No → Step 3.
3. Enable network.service, openvswitch.service, and neutron-openvswitch-agent.service and
ensure that the service status of the services is normal.
4. Initiate a Ping operation on a VM and capture packets over the TAP port of the VM. The port
is TAP plus the first 11 digits of the port ID.
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
If there is no ARP packet, determine whether the VM has sent out packets. Some VM
images support TcpDump; that is to say, packet capture in a VM is supported.
5. Capture packets over the QVO port.
If there are DHCP packets on the TAP port but no DHCP packets on the QVO port,
it indicates that the packets are filtered by security groups. This is probably because the
MAC address of the port configured on the VM is inconsistent with that shown on the
TECS, or the IP address configured on the VM is inconsistent with that allocated by the
TECS (the subnetwork is configured and DHCP is enabled). In this case, add
security group rules, create a port without security groups, or disable security groups.
If there is an ICMP response from the QVB port and no ICMP response from the TAP
port, this is probably because the security groups of the tenant filter packets and the
packets of some types are not allowed to pass.
6. Capture packets over the physical ports of different blades of the two VMs.
a. If there are request packets on the QVO port but no request packets on the
physical port, check whether there is a tag for the QVO port (the tag is not always
consistent with the VLAN). If the tag does not exist, check whether neutron-openvswitch-
agent.service is in good condition.
Bridge br-int
Port "qvo8f301bf7-de"
tag: 4
Interface "qvo8f301bf7-de"
b. Check whether the VLAN in the packets is consistent with that in the network. If not,
check the configuration of the network firewall.
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
Expected Result
The OVS VMs in the same subnetwork of the same network can be connected to each other.
The floating network IP address bound to the VM port and the IP address of the external
network cannot be pinged.
Action
1. Correctly configure the router and the floating IP address. The router should bind the
subnetwork of the external network and the subnetwork of the internal network, and the
floating IP address should be bound to the port of the VM.
2. Check whether network.service, openvswitch.service, neutron-openvswitch-agent.service,
and neutron-l3-agent.service are all enabled. If not, manually enable the services.
3. Run the nova list command to check whether the VM binding the floating IP address is in
good condition.
Yes → Step 5.
No → Step 4.
4. Troubleshoot the VM and make sure that the VM binding the floating IP address is in good
condition.
5. Run the neutron port-show command to check whether the status of the port of the VM is
"Active".
Yes → Step 7.
No → Step 6.
6. Troubleshoot the port and make sure that the port of the VM is "Active".
7. Check whether the type of the network connecting to external networks is "external
network". Run the neutron net-show command the check whether the value of router:
external is "True".
Yes → Step 9.
No → Step 8.
8. Configure the network type and make sure that the value of router:external is "True".
9. Check whether multiple external networks are configured.
Yes → Step 10.
No → Step 11.
10. If multiple external networks are configured, delete the unused ones to ensure that there is
only one external network.
11. Run the ovs-vsctl show command on the network node to check information about the br-
ex bridge.
Bridge br-ex
Port "eth1"
Interface "eth1"
Port br-ex
Interface br-ex
type: internal
Port "qg-1c3627de-1b"
Interface "qg-1c3627de-1b"
type: internal
If there is no br-ex bridge, run the ovs-vsctl add-br br-ex command to create a br-ex
bridge.
If there is no port in the bridge, run the ovs-vsctl add-port br-ex eth1 command,
where eth1 indicates the name of the physical network adapter connected to the
external network.
12. On the VM, ping the IP address of the network connected to the external network, and
capture packets on the br-ex bridge (see the capture sketch after this procedure). In this
example, 10.43.166.1 is the IP address of the network connected to the external network.
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
If packets can be successfully captured, it indicates that the router and the floating IP
address are correctly configured. Otherwise, check the configuration of the router and the
floating IP address.
13. Capture packets on the external network adapter of the router. If packets are successfully
captured, it indicates that there is no fault in the TECS. Check the connection and
switching settings of the external network.
14. Troubleshoot the connection of the external network and correctly configure the switching
settings.
15. If the fault persists, contact ZTE technical support.
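For the capture in step 12, a minimal sketch, assuming 10.43.166.1 is the external address used in the example and br-ex is the external bridge checked in step 11 (adapt both to your environment):
tcpdump -i br-ex -nn icmp and host 10.43.166.1
If ICMP request packets appear on br-ex but no replies come back, continue with steps 13 and 14 to check the external connection and switching settings.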
Expected Result
The floating network IP address bound to the VM port and the IP address of the external
network can access each other.
8.2.6 The Service VM Media Plane Using the SRIOV Port Cannot Be
Connected
Symptom
The status of the VM is correct, while the media plane of the board cannot be connected on
the service layer. No obvious exception is found during service troubleshooting. The underlying
network may be congested.
Action
Id name status
----------------------------------------------------
2 instance-00000848 running
3 instance-00000849 running
4 instance-0000084e running
5 instance-0000084d running
-------------------------------------------------------
- hostdev - - 00:d8:03:50:70:12
- hostdev - - 00:d8:03:50:70:12
In this example, the service VM corresponds to two bonded media plane ports: the VLAN of
one VF is 1701, while the other VF has no VLAN. In this case, both packet receiving and
transmission are abnormal (see the VF check sketch after this procedure).
3. Perform the following operations as required:
If the VLAN is lost, then:
a. Run the nova reboot command to reboot the VM.
b. Check whether the VLAN exists and whether the service operates properly.
c. If the fault cannot be resolved after the VM is rebooted, run the reboot command to
reboot the compute node.
4. In a scenario supporting SRIOV port bonding, run the ovs-appctl bond/show command
on the compute node where the VM is located to check whether the status of the physical
network adapter is normal. In the following example, make sure that both ens2f0 and ens2f1
are enabled. If they are disabled, the network port or switching is faulty.
bond_mode: balance-tcp
bond-hash-basis: 0
updelay: 30000 ms
downdelay: 0 ms
lacp_status: negotiated
may_enable: true
hash 8: 0 kB load
active slave
may_enable: true
5. Run the ifconfig ens2f0 down/up command on the compute node and check whether the
corresponding network port can be recovered.
6. If the network port cannot be recovered, run the shutdown and then no shutdown
commands on the corresponding blade port on the switch side and try again.
7. If the network port cannot be recovered, restart the compute node.
8. If the fault persists, perform a switchover operation between the active and standby service
VMs and contact ZTE technical support.
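For the VLAN check described above, a minimal sketch, assuming ens2f0 is the SRIOV physical port of this example (replace it with the physical port that hosts the VFs):
ip link show ens2f0
Each vf line in the output lists the MAC address of the VF and, if one is set, its vlan value; a media plane VF without a vlan entry matches the fault described in this section.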
Expected Result
The media planes of the service VMs can properly communicate with each other.
The VMs in the same subnet in the same network cannot communicate with each other.
1. Run the dvs show-dpifstats command. The query result is shown in Figure 8-1.
2. Run the dvs dump-dpflow br-int command to query the flow table. The result is as follows:
recirc_id(0),in_port(vhu6f1-80),packet_type(ns=0,id=0),eth(src=fa:16:3e:bc:6e:a7,
dst=fa:00:00:12:30:10),eth_type(0x0800),ipv4(frag=no), packets:11462223,
bytes:1650560112, used:0.001s, actions:push_vlan(vid=401,pcp=0),enp33s0f0
recirc_id(0),in_port(enp33s0f0),packet_type(ns=0,id=0),eth(src=fa:00:00:12:30:10,
dst=fa:16:3e:bc:6e:a7),eth_type(0x8100),vlan(vid=402,pcp=0),encap(eth_type(0x0800),
ipv4(frag=no)), actions:pop_vlan,vhu0deea455-6f
In this example, there are two flow entries. Take the first entry as an example. It means
that the DVS has received packets from port vhu6f1-80, where the keyword information
is: eth (src=fa:16:3e:bc:6e:a7, dst=fa:00:00:12:30:10), eth_type (0x0800), ipv4 (frag=
no). The final processing mode is actions:push_vlan (vid=401, pcp=0), enp33s0f0, that is,
VLAN 401 is added, and packets are sent out from port enp33s0f0. Currently, 11462223
packets/1650560112 bytes in total hit this flow entry.
3. Run the dvs_tcpdump command to mirror packets for packet capture.
The dvs_tcpdump command is a debugging command encapsulated by the DVS to make
packet capture easier for users. It can be used to locate functional problems in a low-traffic
environment. Because packet capture has a great impact on performance, it is not
recommended in a high-traffic environment.
Syntax: dvs_tcpdump -i port name -w /home/tecs/bond1.pcap
Where,
-w /home/tecs/bond1.pcap is an optional parameter, which means that captured packets are
saved into the /home/tecs/bond1.pcap file. Select a location with sufficient disk space for the
file; otherwise, the disk may be filled up.
The port name can be the virtual port name or the bond port name (for a non-bond interface,
capture packets directly on the physical port). Run the ovs-appctl bond/show command to
query the name of the bond interface. For example, bond1 is the name of the bond interface
in the current environment (a usage example follows the sample output below).
bond_mode: balance-tcp
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: negotiated
lacp_fallback_ab: false
active slave
may_enable: true
may_enable: true
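Based on the syntax above, a minimal usage example, assuming bond1 is the bond port returned by ovs-appctl bond/show and that /home/tecs has enough free space for the capture file:
dvs_tcpdump -i bond1 -w /home/tecs/bond1.pcap
The resulting bond1.pcap file can then be copied off the node and opened in a standard packet analysis tool.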
VM Communication Scenarios
Figure 8-2 shows two common scenarios for OVS+DPDK (DVS) communication. The
communication of VMs on the same node refers to the communication between VM1 and VM2.
The communication of VMs on different nodes refers to the communication between VM1/VM2
and VM3. This rule will be followed in the following description.
As a virtual switch, the main task of the DVS is to send service packets from interface A to B.
The basic principle for locating the network connection fault is to check whether the packets
have passed through the path between the VMs. You can use the following methods: statistics,
flow table and packet capture.
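As a quick reference, the three methods correspond to the commands introduced above; a minimal sketch, assuming br-int is the integration bridge, vhu6f1-80 is the virtual port of the VM being checked, and fa:16:3e:bc:6e:a7 is its MAC address (all taken from the examples above; adapt them to your environment):
dvs show-dpifstats
dvs dump-dpflow br-int | grep fa:16:3e:bc:6e:a7
dvs_tcpdump -i vhu6f1-80
The statistics show whether a port is dropping packets, the flow table shows how matched packets are forwarded and with which VLAN, and the capture shows the actual packets on a given port.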
1. Check whether the states of the source and destination VMs are active and whether the
IP addresses of the VMs are configured correctly. If the TECS has allocated addresses but
no IP address is configured in the VM, refer to 8.4 DHCP Faults for troubleshooting.
2. Use the nova command to query the compute node name corresponding to the VM (that
is, hypervisor_hostname, in this example, the name of the compute node is tecs162) and
instance_name (in this example, instance-0000042d).
| OS-EXT-SRV-ATTR:hypervisor_hostname | tecs162
| OS-EXT-SRV-ATTR:instance_name | instance-0000042d
3. Log in to the above compute node, run the following commands as the root user to query
the network interface information corresponding to the VM, and check whether Type is
vhostuser.
-------------------------------------------------------
-------------------------------------------------------
Yes → Step 4.
No → Find the corresponding troubleshooting guide in accordance with the NIC type.
4. Check whether the openvswitch.service and neutron-openvswitch-agent.service are in
normal state.
Yes → Step 6.
No → Step 5.
5. Start the openvswitch.service and neutron-openvswitch-agent.service and ensure that they
are in normal status, and check whether the VM network is normal.
Yes → End.
No → Step 6.
6. Use the debugging function to query port information.
(Port statistics output, including tx_overrun counters for the VM ports, is displayed here.)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tapba6b6e42-67, link-type EN10MB (Ethernet), capture size 65535 bytes
17:36:11.101254 ARP,
1. Check whether the states of the source and destination VMs are active and whether the
IP addresses of the VMs are configured correctly. If the TECS has allocated addresses but
no IP address is configured in the VM, refer to 8.4 DHCP Faults for troubleshooting.
2. Use the nova command to query the compute node names corresponding to the VMs
(that is, hypervisor_hostname, in this example, the names of the compute nodes are
tecs162 and tecs163) and instance_name (in this example, instance-0000042d and
instance-00000433).
| OS-EXT-SRV-ATTR:hypervisor_hostname | tecs162
| OS-EXT-SRV-ATTR:instance_name | instance-0000042d
| OS-EXT-SRV-ATTR:hypervisor_hostname | tecs163
| OS-EXT-SRV-ATTR:instance_name | instance-00000433
3. Log in to the above compute nodes, run the following commands as the root user to query
the network interface information corresponding to the VM, and check whether Type is
vhostuser.
-------------------------------------------------------
-------------------------------------------------------
Yes → Step 4.
No → Find the corresponding troubleshooting guide in accordance with the NIC type.
4. Log in to the above two compute nodes, and check whether the openvswitch.service and
neutron-openvswitch-agent.service are in normal state.
Yes → Step 6.
No → Step 5.
5. Start the openvswitch.service and neutron-openvswitch-agent.service and ensure that they
are in normal status, and check whether the VM network is normal.
Yes → End.
No → Step 6.
6. Use the debugging function to query port information.
(Port statistics output, including tx_overrun counters, is displayed here for both compute nodes.)
Bridge br-int
fail_mode: secure
Port "vhu4b2-26"
tag: 2
Interface "vhu4b2-26"
type: dpdkvhostuserclient
options: {vhost-server-path="/var/run/openvswitch/vhu4b2-26"}
Bridge "br-bond1"
Port "bond1"
Interface "ens1f1"
type: dpdk
Interface "ens4f1"
type: dpdk
Bridge br-int
Port "vhu6f1-80"
tag: 33
Interface "vhu6f1-80"
type: dpdkvhostuserclient
options: {vhost-server-path="/var/run/openvswitch/vhu6f1-80"}
Bridge "br-bond1"
Port "bond1"
Interface "ens1f1"
type: dpdk
Interface "ens4f1"
type: dpdk
(Port statistics and packet capture output is displayed here.)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
No → Check whether packets are sent in VM1 (if the VM supports tcpdump, capture
packets in the VM).
13. In the above captured packets, check whether there are ARP and ICMP response packets.
Yes → Step 14.
No → Check whether packets are sent in VM1 (if the VM supports tcpdump, capture
packets in the VM).
14. On the compute node tecs162, check the flow table by filtering the source MAC addresses,
and check whether the sending ports in the flow table of ARP and ICMP response
packets sent through the virtual port of VM1 contain the physical port, and whether the
encapsulated VLAN is correct. For the specific method, refer to the previous introduction to
basic commands.
recirc_id(0),in_port(vhu6f1-80),packet_type(ns=0,id=0),eth(src=fa:16:3e:bc:6e:a7,
dst=fa:16:3e:44:3a:cf),eth_type(0x0800),ipv4(frag=no), packets:11462223,
bond_mode: balance-tcp
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: negotiated
lacp_fallback_ab: false
may_enable: true
active slave
may_enable: true
recirc_id(0),in_port(ens1f1),packet_type(ns=0,id=0),eth(src=fa:16:3e:bc:6e:a7,
dst=fa:16:3e:44:3a:cf),eth_type(0x8100),vlan(vid=402,pcp=0),encap(eth_type(0x0800),
actions:pop_vlan,vhu4b2-26
16. Verify that the physical interface receives ARP and ICMP request packets from the peer
end.
Yes → Step 17.
No → Contact the intermediate switch maintenance personnel of the compute node to
locate the fault.
17. Verify that the actions in the flow table of the above ARP and ICMP response packet are
correct.
Yes → Step 18.
No → Contact TECS/DVS technical support.
18. Filter the MAC address flow table of VM1, and check whether there are ARP or ICMP
response packets coming out of the virtual port of VM3.
Yes → Step 19.
No → Contact the VM service processing engineer to locate the fault.
19. Check whether the destination port of the above flow table is the physical port and whether
VLAN encapsulation is correct.
Yes → Step 20.
No → Handle the fault by referring to Steps 15 and 16.
20. On the compute node tecs162, refer to Steps 16 and 17.
21. On the compute node tecs162, check whether the destination port of the flow table is the
virtual port of VM1.
a. If the qvo port has request packets but the physical port does not, you need to check
the ovs bridge to see whether the qvo port has a tag (which is not necessarily the same
as the VLAN of the network). If there is no tag, check whether the neutron-
openvswitch-agent.service status is normal.
Port "qvo8f301bf7-de"
tag: 4
Interface "qvo8f301bf7-de"
b. Check whether the VLAN in the packet is the same as that in the network. If not, check
the network firewall.
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening
on int-br-data1,
Expected Result
OVS VMs in the same subnet in the same network can communicate with each other.
Action
1. Check whether the SDN network (SDN topology) is normal. If there is a problem in sending
or receiving packets, it is recommended that you check the SDN topology. The topology can
correctly reflect the status of tunnels.
Yes → Step 2.
No → Step 4.
2. Enter the VM and check whether the peer end has an IP address.
Yes → Step 3.
No → Refer to 8.3.2 Failed to Obtain an IP Address.
3. Set the ICMP rules so that the IP address is not filtered by the security group. Check
whether the fault is fixed.
Yes → End.
No → Step 4.
4. Contact ZTE technical support.
Action
1. Check whether the SDN network (SDN topology) is normal. If there is a problem in sending
or receiving packets, it is recommended that you check the SDN topology. The topology can
correctly reflect the status of tunnels.
Yes → Step 2.
No → Step 7.
2. Check whether the DHCP function is enabled for the subnet of the network where the VM is
created.
Method (a query sketch is given after this procedure): query the subnet and check the
enable_dhcp field in the output.
+-------------------+------------------------------------------+
| Field | Value |
+-------------------+------------------------------------------+
| cidr | 1.1.1.0/24 |
| created_at | 2020-06-03T08:26:49 |
| description | |
| dns_nameservers | |
| enable_dhcp | True |
| gateway_ip | 1.1.1.1 |
| host_routes | |
| id | f0b6510e-c782-45b4-9930-962800a2cb48 |
| ip_version | 4 |
| ipv6_address_mode | |
| ipv6_ra_mode | |
| name | test_subnet |
| network_id | cfa11c3c-915e-4670-8bed-e69a668ff440 |
| subnetpool_id | |
| tenant_id | 2492dad68d3e45798d98b0a2bd3a8300 |
| updated_at | 2020-06-03T08:55:27 |
+-------------------+------------------------------------------+
If enable_dhcp is True, this indicates that DHCP is enabled. If enable_dhcp is False, DHCP
is disabled.
Yes → Step 4.
No → Step 3.
3. In this case, IP addresses cannot be obtained automatically, so you need to manually
configure an IP address. After the IP address is configured, check whether the fault is removed.
Yes → End.
No → Step 5.
4. Use dhclient and check whether an IP address can be obtained.
Yes → End. The fault is fixed. The cause is that the image does not automatically obtain
an IP address when the VM is started.
No → Step 5.
5. Check whether DHCP packets are allowed by the security group.
Yes → Step 7.
No → Step 6.
6. Set the security group so that UDP packets (DHCP uses UDP) can pass through. Check
whether the fault is fixed.
Yes → End.
No → Step 7.
7. Contact ZTE technical support.
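For step 2, a minimal query sketch, assuming test_subnet is the subnet shown in the example output (replace it with your subnet name or ID):
neutron subnet-show test_subnet
The enable_dhcp field in the output corresponds to the value discussed in step 2.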
Probable Cause
Action
1. On the network node, run the following command to check whether the neutron-dhcp-
agent service is normal. In the result of this command, if the Active field is "active", it
indicates that the service is normal. Otherwise, it indicates that the service is abnormal.
systemctl status neutron-dhcp-agent
Yes → Step 4.
No → Step 2.
2. Run the following command to restart the neutron-dhcp-agent service:
systemctl restart neutron-dhcp-agent
3. Run the following command to check whether the neutron-dhcp-agent service is
normal.
systemctl status neutron-dhcp-agent
Yes → Step 4.
No → Step 9.
4. Check whether the connectivity between the network node and the computing node where
the VM is located is normal. You can manually configure an IP address for the VM, and
check whether the VM can be successfully pinged.
Yes → Step 7.
No → Step 5.
5. Check whether VLAN configuration or network connection is abnormal.
Yes → Step 6.
No → Step 7.
6. Modify VLAN configuration or network connection. Check whether the fault is removed.
Yes → End.
No → Step 7.
7. Check whether the number of VLANs exceeds 64.
Yes → Step 8.
No → Step 9.
8. Re-plan the network, so that the number of VLANs does not exceed 64. Check whether the
fault is removed.
Yes → End.
No → Step 9.
9. Contact ZTE technical support.
Expected Result
Probable Cause
The firewall extension checks IP addresses. To use addresses other than those allocated by
DHCP, you should disable the firewall extension.
Action
1. Plan the network properly, and check whether the DHCP function is needed.
Yes → Step 3.
No → Step 2.
2. Delete the subnets in the network, and check whether the fault is removed.
Yes → End.
No → Step 6.
3. Disable the firewall extension as follows (a verification sketch is given after this procedure):
Modification on the control node:
a. Change enable_security_group in the /etc/neutron/plugin.ini file to False as follows:
openstack-config --set /etc/neutron/plugin.ini securitygroup enable_security_group
False
b. If port_security exists in extension_drivers in the /etc/neutron/plugin.ini file, delete it.
c. When there are no other services, restart the service as follows:
openstack-service restart
Modification on the compute node:
a. Change enable_security_group in the /etc/neutron/plugins/ml2/openvswitch_agent.ini
file to False as follows:
openstack-config --set /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup
enable_security_group False
b. Modify firewall_driver as follows:
openstack-config --set /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup
firewall_driver neutron.agent.firewall.NoopFirewallDriver
c. When there are no other services, restart the service as follows:
openstack-service restart
4. Check whether the neutron service status is normal and whether the network is normal.
systemctl status neutron-server
Yes → End.
No → Step 5.
5. Run the following command to restart the iptables service on the compute node. Check
whether the alarm is cleared.
service iptables restart
Yes → End.
No → Step 6.
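For step 3, a minimal verification sketch, assuming the same file paths as above, to confirm the new values before restarting the services:
openstack-config --get /etc/neutron/plugin.ini securitygroup enable_security_group
openstack-config --get /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup enable_security_group
openstack-config --get /etc/neutron/plugins/ml2/openvswitch_agent.ini securitygroup firewall_driver
Each command should print the value set in step 3 (False or neutron.agent.firewall.NoopFirewallDriver); if it does not, repeat the corresponding --set command.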
Expected Result
Probable Cause
Action
1. Check whether the MAC address of the VM conflicts with an existing address.
Yes → Step 2.
No → Step 3.
2. Set the MAC address of the VM to be unique. Check whether the fault is removed.
Yes → End.
No → Step 3.
3. Check whether the tenant port used for creating the VM is under the same tenant as the
tenant port used by the VM.
Yes → Step 5.
No → Step 4.
4. Modify the two tenant ports so that they are under the same tenant. Check whether the fault
is removed.
Yes → End.
No → Step 5.
5. Contact ZTE technical support.
Expected Result
A VM is successfully created and enters "running" status, but the control console cannot
connect to the VM.
Probable Cause
Action
my_ip=1.2.3.4
...
novncproxy_host=1.2.3.4
novncproxy_port=6180
4. Modify the /etc/nova/nova.conf file of the compute node. Set the "vncserver_listen"
value in the "vnc" section to be the same as the "my_ip" value in the "default" section, and
set "novncproxy_base_url" in the "vnc" section to "https://public-zte.dns-252:6080/vnc_auto.
html". For example:
[default]
my_ip=1.2.3.4
[vnc]
vncserver_listen=1.2.3.4
novncproxy_base_url=https://public-zte.dns-252:6080/vnc_auto.html
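The same change can also be made with openstack-config, which is used elsewhere in this guide; a minimal sketch, assuming 1.2.3.4 is the my_ip value from the example above:
openstack-config --set /etc/nova/nova.conf vnc vncserver_listen 1.2.3.4
openstack-config --set /etc/nova/nova.conf vnc novncproxy_base_url https://public-zte.dns-252:6080/vnc_auto.html
Restart the nova services on the node afterwards so that the change takes effect.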
Expected Result
Probable Cause
Some processes (for example, qemu-system-x86) are abnormally restarted due to physical
memory exhaustion.
Action
1. Run the free -m command to check the memory usage of the node.
free -m
Swap: 0 0 0
Where, "total" means total memory, "used" means used memory, "free" means remaining
memory, and "buff/cache" means cache memory that can be released (all values are in MB).
When "free + buff/cache" is less than 4096 MB, the memory is considered close to
exhaustion (a one-line check is sketched after this procedure).
2. Run the cat /proc/cmdline command to check whether huge pages are configured.
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.21.1.el7.x86_64 root=/dev/mapper/vg_sys-lv_root
,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,29,30,31,
32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55
hugepagesz=1G hugepages=80 indicates that the huge page size is 1 GB and that 80 huge
pages are configured. Generally, a memory of 30 GB should be reserved for the system, that
is, the total memory minus the huge page memory should be larger than 30 GB. For example,
with 80 huge pages of 1 GB each (80 GB in total), the physical machine should have more
than 110 GB of memory. If the requirement cannot be met, modify the huge page
configuration and contact ZTE technical support.
3. Check whether the memory configured for the VMs is too large. Adjust the configuration in
accordance with the memory resources of the physical machine. Generally, non-huge-page
VMs are not used in a commercial environment. Check the specifications of the VMs to see
whether the NUMA feature is added. If not, contact ZTE technical support.
4. Compare the memory that should be in use with the actual usage. If a memory leak occurs,
contact ZTE technical support.
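For the check in step 1, a one-line sketch, assuming the procps-ng free output format shown above (columns: total, used, free, shared, buff/cache, available):
free -m | awk '/^Mem:/ {print $4 + $6}'
If the printed value (free + buff/cache, in MB) is below 4096, the node is close to memory exhaustion.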
Expected Result
The upload of an image is initiated on the TECS page. The image source is a local image file,
but the image status is queued for a long time.
Probable Cause
The image upload is interrupted because the browser is refreshed or closed and reopened
during the upload.
Action
Note
Chrome (version 49 or later) or Firefox (version 43 or later) is recommended.
Expected Result
9.1.2 Database Server Startup Failure Due to the Damaged Data File
Symptom
After the system is powered off and restarted, the TECS login page can be opened. But after
the username and password are entered, the "Error" information is displayed.
Probable Cause
The data file is damaged due to abnormal shutdown of the database server.
Action
Notice
This operation is highly risky and should be performed under the guidance of ZTE technical support.
1. Log in to the control node as the root user and run the docker-manage ps command to
query the container name, see Figure 9-1.
The container with STATUS being Up is the running container and the NAMES is provider.
2. Run the docker-manage enter provider command to enter the container. The container
prompt is -bash-4.2#, see Figure 9-2.
3. Run the ps -ef | grep mysql command to check whether the mysql process exists, see
Figure 9-3.
If no mysql process is displayed in Figure 9-3, the mysql process does not exist.
Yes → Step 12.
No → Step 4.
Yes → Step 5.
No → Step 12.
5. Run the /home/ztecms/mysql-5.6.19-x86_64/start_mysql.sh command to check
whether the database can be started properly, see Figure 9-5.
The content in Figure 9-6 indicates that the data file is damaged.
Yes → Step 7.
No → Step 12.
7. Run the ls -l /home/Data/backup/mysql-bak command to check the backup files with the
.gz suffix whose names start with mysqlbak. The backup files are generated at 00:00 every
day, and the last two files are stored (see the listing sketch after this procedure).
8. Select the most recent backup file and run the /usr/local/mysqlbackup/restore.sh
command with the backup file name as its parameter to restore the backup data file, see
Figure 9-7.
9. Run the exit command to exit the container and return to the control node.
10. Run the docker-manage restart provider command to restart the provider container.
11. Log in to the TECS page and check whether the fault is solved.
Yes → End.
No → Step 12.
12. Contact ZTE technical support.
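For steps 7 and 8, a minimal sketch, assuming the backup directory and restore script paths given above; mysqlbak-xxxx.gz is only a placeholder for the file selected from the listing:
ls -lt /home/Data/backup/mysql-bak | head
/usr/local/mysqlbackup/restore.sh mysqlbak-xxxx.gz
The first command lists the newest backup files first so that the most recent one can be chosen for the restore.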
Expected Result
The TECS page can be logged in to after the correct username and password are entered.
The TECS login page shows that a user is locked because the number of times that an
incorrect password is entered exceeds a threshold.
Probable Cause
If the number of times that a user enters an incorrect password exceeds three, the self-
protection mechanism of the system will forbid the user's login.
Action
Wait for 5 minutes. The system will automatically unlock the user.
Expected Result
The user can log in to the TECS page again after 5 minutes.
Probable Cause
Action
1. Run the following command to check whether the openstack-nova-compute service of the
physical node is successfully started. In the output of this command, if the Active field is "
active", it indicates that the service is successfully started. Otherwise, it indicates that the
service is not successfully started.
systemctl status openstack-nova-compute.service
Yes → Step 2.
No → Step 8.
2. Wait for about five minutes, and then check whether the performance indexes of the
physical machine can be obtained.
Yes → End.
No → Step 3.
3. Run the following command to check whether the openstack-nova-compute service of the
physical node is correctly configured:
cat /etc/nova/nova.conf
If the following information is displayed, it indicates that the configuration is correct.
# cat /etc/nova/nova.conf
compute_monitors=cpu.virt_driver
notification_driver = messagingv2
Yes → Step 8.
No → Step 4.
4. Modify the configuration in the /etc/nova/nova.conf file (a modification sketch is given after this procedure).
5. Run the following command to restart the openstack-nova-compute service of the physical
node:
systemctl restart openstack-nova-compute.service
6. Run the following command to check whether the openstack-nova-compute service of the
physical node is successfully started:
systemctl status openstack-nova-compute.service
Yes → Step 7.
No → Step 8.
7. Wait for about five minutes, and then check whether the performance indexes of the
physical machine can be obtained.
Yes → End.
No → Step 8.
8. Contact ZTE technical support.
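For step 4, a minimal modification sketch using openstack-config, assuming the two options belong to the DEFAULT section of /etc/nova/nova.conf (the section name is an assumption, because the sample output above does not show it):
openstack-config --set /etc/nova/nova.conf DEFAULT compute_monitors cpu.virt_driver
openstack-config --set /etc/nova/nova.conf DEFAULT notification_driver messagingv2
Then continue with steps 5 to 7 to restart the service and re-check the performance indexes.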
Expected Result
Ceilometer fails to record performance data and the openstack-ceilometer-collector service log
information is as follows:
*args, **kwds)
Probable Cause
The collector service reports database errors after receiving performance data, although the
status of the default database (mongodb) is normal. The logs of the collector and mongodb
services show that the collector service depends on the mongod service. The mongod service
is started before the ceilometer service; however, starting mongod takes some time, so
mongod is not always available when the ceilometer service starts. If the collector connects to
mongod at this moment, the fault occurs.
Action
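A minimal sketch of one possible remedy, consistent with the probable cause and assuming the standard mongod and collector service names (adapt them to your deployment):
systemctl status mongod
systemctl restart openstack-ceilometer-collector
Restarting the collector after mongod is fully up allows it to re-establish the database connection.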
Expected Result
Fault type:
Fault source:
Symptoms:
Solution:
Summary:
ARP
- Address Resolution Protocol
AZ
- Availability Zone
BIOS
- Basic Input/Output System
DHCP
- Dynamic Host Configuration Protocol
EMS
- Element Management System
FC
- Fiber Channel
FSCK
- File System Check
FTP
- File Transfer Protocol
HA
- High Availability
ICMP
- Internet Control Message Protocol
IP
- Internet Protocol
iSCSI
- Internet Small Computer System Interface
NE
- Network Element
NTP
- Network Time Protocol
OS
- Operating System
OVS
- Open VSwitch
RAID
- Redundant Array of Independent Disks
SSH
- Secure Shell
TECS
- Tulip Elastic Cloud System
VLAN
- Virtual Local Area Network
VT
- Virtual Tributary
ZTE
- ZTE Corporation