NCE V100R020C10 Troubleshooting
1. NCE Maintenance
2. NCE Troubleshooting
Wait until the check progress is 100% and view the check results.
Click Check Result to view the check results online, or click Download Report to download the check results in batches.
Possible cause: The firewall between the current PC and the server blocks the port number used for logging in to the server.
Locating method: On the current PC, telnet to the service server through port 31943 and to the OMP server through port 31945 (test both ports): telnet X.X.X.X portvalue
If the port between the current PC and the server is abnormal, the command output indicates that the connection failed, for example: telnet x.x.x.x 31943
Solution: If the server is not listening on the port, the ER service may be stopped. Start the ER service on the management plane. If the server is listening on the port, the service is running properly. You are advised to check the network between the browser and server for port masking rules.
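Where telnet is not installed on the current PC, the same reachability test can be approximated with a short script. The following is a minimal sketch, not part of the product tooling: the function names are hypothetical, and it only reports whether a TCP connection can be opened, which is what the telnet test shows.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=(31943, 31945)):
    """Map each port to True (reachable) or False (blocked or not listening)."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, check_ports("X.X.X.X") returns a dictionary keyed by port number. A False value does not distinguish a firewall block from a stopped service, so the follow-up checks above still apply. If the service server and the OMP server have different addresses, call check_ports once per server with the relevant port.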
Possible cause: If the disk space is full, network-wide alarms or alarms of some NEs cannot be displayed.
Locating method: On EulerOS, run the df -k command to check whether the opt directory is full.
Solution: If the directory usage reaches 100%, contact Huawei technical support for troubleshooting. In emergency situations, copy the .log files to another PC and delete these files from the server.

Possible cause: If an alarm masking rule is configured, NCE discards alarms although they are reported by NEs.
Locating method: Choose Monitor > Alarm > Alarm Settings from the main menu. Choose Masking Rules from the navigation pane to check whether the corresponding alarm masking rule is configured.
Solution: If the alarms that cannot be reported are displayed in the masked alarm list, disable the alarm masking rule.

Possible cause: The trap parameter is not configured or is incorrectly configured on the device.
Locating method:
MA5800T(config)#display snmp-agent trap enable
MA5800T(config)#display snmp-agent trap-source
Solution: Configure the trap parameter correctly.

Possible cause: NCE does not support the time format reported by the device.
Locating method:
MA5800T(config)#display time date-format
The current date display format is: YYYY-MM-DD
Solution: Configure the time format for the device correctly.

Possible cause: An alarm masking rule is configured on the device.
Locating method:
MA5800T(config)#display trap filter
Solution: Disable the alarm masking rule.

Possible cause: The network is faulty.
Locating method: Check whether a fault such as network instability, packet loss, or firewall screening occurs.
Solution: Rectify the network fault.
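The disk-space check described above (df -k on EulerOS) can be scripted when it has to be repeated on several nodes. The sketch below parses df output into a mapping of mount point to usage percentage. The sample text is hypothetical and only illustrates the usual six-column layout; the function name is also an assumption.

```python
# Hypothetical df -k output, used only to illustrate the expected layout.
SAMPLE_DF = (
    "Filesystem     1K-blocks      Used Available Use% Mounted on\n"
    "/dev/sda2       51474912  10000000  38837088  21% /\n"
    "/dev/sda5      103081248 103081248         0 100% /opt\n"
)


def parse_df(df_output):
    """Return {mount point: usage percent} from df -k style output."""
    usage = {}
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on
        if len(fields) >= 6 and fields[4].endswith("%"):
            usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


def full_mounts(df_output, threshold=100):
    """List mount points whose usage is at or above threshold percent."""
    return [m for m, pct in parse_df(df_output).items() if pct >= threshold]
```

With the sample above, full_mounts(SAMPLE_DF) reports /opt. If opt is not a separate file system on the server, the row to inspect is the one for the file system that contains it. Running df with the POSIX -P option in addition to -k keeps each file system on a single line, which this parser assumes.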
Copyright © Huawei Technologies Co., Ltd. All rights reserved. Page 11
2.3 Recovery from a Large Number of Stacked Tasks
Symptoms
1. A large number of task orders are in the pending list.
2. The message "too many tasks may cause process delay" appears many times in the logs.
Step 3 If the task frequency is too high, suspend related monitoring tasks.
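To judge how often the second symptom occurs, the occurrences of the message can be counted in a log file. This is a minimal sketch: the function name is hypothetical, and the log file to inspect is not named in this material, so the caller has to supply it.

```python
MESSAGE = "too many tasks may cause process delay"


def count_backlog_warnings(lines, message=MESSAGE):
    """Count the log lines that contain the task backlog message."""
    return sum(1 for line in lines if message in line)
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example: with open(path) as log: count_backlog_warnings(log). Here path stands for whichever log file is being examined.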
If the configurations are inconsistent, modify them on NCE or on the devices so that they match.
For example, double-click the record to modify the configuration on NCE.
Step 2 Check whether the interface IP address, mask, and configuration on the NCE server are correct. The following uses EulerOS as an example.
[ossadm@NMS_Server ~]$ ifconfig -a
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
inet 10.185.215.156 netmask 255.255.254.0 broadcast 10.185.215.255
inet6 fe80::2e97:b1ff:fe5a:51e4 prefixlen 64 scopeid 0x20<link>
ether 2c:97:b1:5a:51:e4 txqueuelen 1000 (Ethernet)
RX packets 26095140 bytes 1814230816 (1.6 GiB)
RX errors 0 dropped 394013 overruns 0 frame 0
TX packets 10857723 bytes 3347765107 (3.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
If the IP address and the mask are incorrect, reconfigure them on the NCE server.
If the interface status is DOWN, check whether the switch directly connected to the NCE server is normal. If the switch is normal, run the following command to
change the interface status to UP (replace bge0 with the actual interface name, such as bond0 in the preceding output): ifconfig bge0 up
If the interface status is normal but the ping operation fails, go to the next step.
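The address and mask shown by ifconfig can be cross-checked against the expected gateway with Python's ipaddress module. The sketch below uses the bond0 values from the output above. The function name is hypothetical, and it handles a single interface block, not the full output of ifconfig -a.

```python
import ipaddress
import re

# The first two lines of the bond0 block shown above.
SAMPLE = (
    "bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500\n"
    "inet 10.185.215.156 netmask 255.255.254.0 broadcast 10.185.215.255\n"
)


def parse_interface(block):
    """Extract name, UP state, address, and network from one ifconfig block."""
    flags = re.search(r"flags=\d+<([^>]*)>", block)
    inet = re.search(r"inet (\S+)\s+netmask (\S+)", block)  # IPv4 line only
    if not flags or not inet:
        return None
    iface = ipaddress.ip_interface(f"{inet.group(1)}/{inet.group(2)}")
    return {
        "name": block.split(":", 1)[0].strip(),
        "up": "UP" in flags.group(1).split(","),
        "address": iface.ip,
        "network": iface.network,
    }
```

For the sample, the derived network is 10.185.214.0/23 and its broadcast address is 10.185.215.255, which agrees with the ifconfig output. A gateway or device address can then be tested with the in operator, for example ipaddress.ip_address("10.185.214.1") in info["network"], where info is the dictionary returned by parse_interface.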
Run the netstat -rn command to check whether the routing table contains a route to the devices:
[ossadm@NMS_Server ~]$ netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.185.214.1 0.0.0.0 UG 0 0 0 bond0
10.185.214.0 0.0.0.0 255.255.254.0 U 0 0 0 bond0
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth2
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth3
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth6
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth7
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth8
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth9
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth10
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 eth11
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 bond0
192.168.10.1 0.0.0.0 255.255.0.0 U 0 0 0 bond1
192.168.5.0 0.0.0.0 255.255.255.0 U 0 0 0 bond1
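If the route is missing, run the following command to add a static route (net-tools syntax on EulerOS; replace each item in angle brackets with the actual value):
# route add -net <target network IP address> netmask <target network subnet mask> gw <gateway IP address>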
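Whether the routing table covers a given device address can also be checked programmatically. The following sketch parses netstat -rn output and returns the most specific matching route (longest prefix). The function names are hypothetical, and only the Destination, Gateway, Genmask, and Iface columns are used.

```python
import ipaddress


def parse_routes(netstat_output):
    """Return a list of (network, gateway, interface) from netstat -rn output."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            # strict=False tolerates destinations that have host bits set.
            network = ipaddress.ip_network(f"{fields[0]}/{fields[2]}", strict=False)
        except ValueError:
            continue  # title and header lines are not routes
        routes.append((network, fields[1], fields[-1]))
    return routes


def find_route(routes, device_ip):
    """Return the longest-prefix route that contains device_ip, or None."""
    address = ipaddress.ip_address(device_ip)
    matches = [route for route in routes if address in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen, default=None)
```

If find_route returns only the default route (0.0.0.0/0) or None for a device that should be reached through a dedicated network, the static route for that network is probably missing.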
2.6 Most NEs Are Offline
Step 4 Check the system logs for MAC address conflict records.
In the /var/log/ directory, view the log files whose names start with messages and check whether a conflict record similar to the following exists:
10:58:15 NCEPrimary ip: [ID 876157 kern.warning] WARNING: node 00:0d:60:8c:a4:0e is using our IP address 172.029.220.006 on e1000g0
If the preceding conflict exists, correct the IP address of the server where the configuration conflict occurs.
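Searching large log files for this record by eye is error-prone, so a pattern match can help. The sketch below assumes the record has the same wording as the sample above. Other operating system versions may log address conflicts differently, in which case the pattern needs to be adapted. The names CONFLICT and find_conflicts are hypothetical.

```python
import re

# Matches records of the form shown in the sample log line.
CONFLICT = re.compile(
    r"node (?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}) "
    r"is using our IP address\s+(?P<ip>\S+) on (?P<iface>\S+)"
)


def find_conflicts(log_text):
    """Return one dict (mac, ip, iface) per conflict record found in log_text."""
    return [match.groupdict() for match in CONFLICT.finditer(log_text)]
```

Each returned dictionary gives the MAC address of the conflicting node, the contested IP address, and the local interface, which is the information needed to locate the server whose IP address must be corrected.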
Step 5 Run the arp -a command to check the MAC address entries. Check whether the MAC addresses of the egress
gateways are those of the gateways on the live network. (The MAC address check is used to locate IP address conflicts.)
[ossadm@NMS_Server ~]$ arp -a
_gateway (10.185.214.1) at 84:5b:12:76:ae:0f [ether] on bond0
? (10.185.215.114) at 18:35:d4:01:77:4f [ether] on bond0
? (10.185.214.154) at 78:1d:ba:db:96:99 [ether] on bond0
? (10.185.215.74) at 00:18:82:2f:fc:46 [ether] on bond0
? (10.185.215.221) at 00:18:82:a0:aa:a2 [ether] on bond0
? (10.185.215.213) at 70:c7:f2:78:66:b2 [ether] on bond0
? (10.185.215.45) at 00:50:56:8e:2d:18 [ether] on bond0
? (10.185.214.250) at e4:c2:d1:e7:21:41 [ether] on bond0
If the MAC addresses are not the same, run the following commands to delete the incorrect ARP entry and then ping the address
to trigger MAC address learning. Then run the arp -a command again to check whether the learned MAC address is correct.
arp -d 10.112.166.139
ping 10.112.166.139
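The gateway MAC comparison in Step 5 can be scripted once the expected MAC address is known from the live network. This is a minimal sketch that parses arp -a output in the format shown above. The function names are hypothetical, and the expected MAC address must be supplied by the operator.

```python
import re

# One entry per line, for example:
# _gateway (10.185.214.1) at 84:5b:12:76:ae:0f [ether] on bond0
ARP_ENTRY = re.compile(
    r"^(?P<host>\S+) \((?P<ip>[\d.]+)\) at (?P<mac>[0-9a-fA-F:]{17}) "
    r"\[\w+\] on (?P<iface>\S+)"
)


def parse_arp(arp_output):
    """Return {ip: mac} for every complete entry in arp -a output."""
    table = {}
    for line in arp_output.splitlines():
        match = ARP_ENTRY.match(line.strip())
        if match:
            table[match.group("ip")] = match.group("mac").lower()
    return table


def gateway_mac_matches(arp_output, gateway_ip, expected_mac):
    """True if the learned MAC address for gateway_ip equals expected_mac."""
    return parse_arp(arp_output).get(gateway_ip) == expected_mac.lower()
```

A False result means either that the entry is missing or that a different MAC address was learned. The second case is the one that points to an IP address conflict.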
Confirmation:
Log in to the database node where the instance is abnormal as the sopuser user. Switch to the dbuser user and run the ps -ef|grep zengine
command to check whether the process of the abnormal database instance exists.
The command output shows that the process for the instance cloudsopdbsvr-0-1000 does not exist, which indicates that the
instance is not running properly and that the process failed to start. View the instance logs in the
/opt/zenith/data/cloudsopdbsvr-0-1000/log directory.
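The process check can be expressed as a small filter over the ps -ef output. The sketch below rests on an assumption that is not confirmed by this material: that the instance name appears somewhere on the command line of its zengine process (for example, as part of a data directory path). The function name is hypothetical.

```python
def instance_processes(ps_output, instance, process_name="zengine"):
    """Return the ps -ef lines that belong to the given database instance.

    Assumes (unverified) that the instance name is visible on the process
    command line. Lines produced by the grep command itself are ignored.
    """
    return [
        line
        for line in ps_output.splitlines()
        if process_name in line
        and instance in line
        and "grep" not in line.split()
    ]
```

An empty list for cloudsopdbsvr-0-1000 corresponds to the situation described above, where the process of the instance does not exist.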
Check the startup logs whose names start with zctl-xxx. If the following error information is displayed, the database failed to read the port
number, so the database instance failed to start.
2.7 GaussDB T V3 Database Instance Is Abnormal Due to Incorrect File Permission
The port number is configured in the database configuration file. Go to the /opt/zenith/data/cloudsopdbsvr-0-1000/cfg directory,
and check the content and permission of the zengine.ini configuration file. In this fault, the content of the zengine.ini file cannot be viewed,
and the file permission is incorrect.
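The ownership and mode of the configuration file can be inspected from a script as well as with ls -l. The following sketch reports the owner, group, and mode string of a file and compares them with the expected owner. The function names are hypothetical, and the pwd and grp modules are available on Unix-like systems only.

```python
import grp
import os
import pwd
import stat


def describe_file(path):
    """Return (owner, group, mode string) for path, for example '-rw-------'."""
    info = os.stat(path)
    try:
        owner = pwd.getpwuid(info.st_uid).pw_name
    except KeyError:
        owner = str(info.st_uid)  # no passwd entry, fall back to the numeric ID
    try:
        group = grp.getgrgid(info.st_gid).gr_name
    except KeyError:
        group = str(info.st_gid)
    return owner, group, stat.filemode(info.st_mode)


def owned_by(path, user, group):
    """True if path is owned by the given user and group."""
    owner, file_group, _ = describe_file(path)
    return (owner, file_group) == (user, group)
```

For this fault, owned_by(path, "dbuser", "dbgroup") would be expected to return False for the zengine.ini file before the fix and True afterwards, where path is the full path of the file in the cfg directory.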
Solution
Step 1 Log in to the node where the instance is abnormal as the sopuser user. Switch to the root user, and run the chown
dbuser:dbgroup zengine.ini command to correct the owner and group of the configuration file.
Step 2 Wait for three to five minutes, and check the database node status on the management plane. If the status is normal, the fault is
rectified, which confirms that it was caused by incorrect file permission, as shown in the following figure.
Confirmation
Log in to the database node where the instance is abnormal as the sopuser user. Switch to the dbuser user and run the ps -ef|grep zenith
command to check whether the process of the abnormal database instance exists.
The command output shows that the process of the instance nmsdbsvr-2-50 does not exist, indicating that the instance on the
node is abnormal. View the logs of the instance. The log directory is /opt/zenith/data/nmsdbsvr-2-50/log.
Check the startup logs whose names start with zctl-xxx. If the following error information is displayed, the configuration file cannot be
found, which causes the instance startup failure.
Solution
Copy the zengine.ini file of the instance in the slave database to the corresponding instance in the master database
(/opt/zenith/data/instance name/cfg). Then modify the copied file as follows:
Set LSNR_ADDR, and LOCAL_HOST in LOG_ARCHIVE_DEST_n, to the IP address in use.
Set SERVICE in LOG_ARCHIVE_DEST_n to the IP address of the node where the slave database resides.
Set the instance name in CONTROL_FILES to the corresponding instance name.