
STRATA ASC TROUBLESHOOTING PLAYBOOK:

COMMIT

© 2022 Palo Alto Networks, Inc. All rights reserved. Sharing restricted, for internal and Authorized Support Center use only.
Strata ASC Troubleshooting Playbook: Commit
Palo Alto Networks Restricted Info – do not share or forward this document outside
of Palo Alto Networks or Authorized Support Center (ASC) partners.

The purpose of this document is to provide advanced level troubleshooting, debugging,
data collection, and technical support escalation guidance for commit related issues.

Palo Alto Networks recommends running the latest version of PAN-OS. As a general
reminder, always check the release notes to see whether the device is experiencing an
issue that has already been resolved.

Contents
Triage
Troubleshooting
Symptom 1 - Phase 0 Failures
Symptom 2 – Phase 1 failure in a client daemon
Symptom 3 – Slow Commit
Symptom 4 – Panorama commit-all Does Not Reach FW, or Does Not Send Properly
Symptom 5 - Commit Fails on DP Due To Memory Allocation Failure
Symptom 6 - Commit Stuck on FW or the commit-all from Panorama Is Stuck Even Though Local FW commits Succeed
Issue Replication Tips
How To Troubleshoot Which Daemon Is Causing a Commit Failure
Scenario-1: Commit failed with error "Failed to modify obj for phase1 push: TIMEOUT"
Scenario-2: Commit fails on panorama after a downgrade from 10.1.4-h2 to 10.0.9
Scenario-3: Autocommit failure due to "Certificate <Cert-name>" failed to load: failed to parse key
Escalation
Palo Alto Networks TAC Escalation Template


Triage
When it comes to troubleshooting commit issues, the most important thing is to narrow
down the issue to the specific phase of the commit or system process.

Start by ensuring the configs referenced in the failure are valid. Once that is done,
additional debugs can be enabled on processes to help identify root cause.

The commit workflow consists of the following stages:

• Pre-commit processing
o Generates commit candidate files
o Plugin pre-commit, validation and plugin file generation
• Schema Verification
o Validates the generated xml config to be committed against schema
• Validity checks (xsl transform)
o Does additional validations based on platforms/modes etc. by running an XSL
transform
• EDL checks
o Additional validations for edl configuration
• Phase 0
o ID population from devsrvr
• Phase 1
o Mgmtsrvr/Configd sends respective transformed configuration to each client
and plugin for validation
• Phase 2
o Final phase of client commit where config is applied
• Post-commit operations
o After every client reports success in phase2, this is where the configuration
takes effect to reflect the newly committed changes

Pre-commit processing and verifications are done by mgmtsrvr on firewalls, and on
Panorama by configd. Starting from PAN-OS 10.1, configd is also present on firewalls, and
is responsible for commit processing.


Mgmtsrvr/configd are the main processes for the rest of the commit phases, so
collecting commit debugs for those processes is usually necessary to troubleshoot
commit issues:

debug management-server set commit all

debug management-server on debug

Troubleshooting
Symptom 1 - Phase 0 Failures
Phase 0 failures are usually related to idmgr.

Example in the ms.log would be:

2020-12-18 06:40:31.426 +0000 client device reported error: Error: Error populating id for 'vsys1+TestURL' (4294967295)

(Module: device)

2020-12-18 06:40:31.429 +0000 client device reported Phase 0 FAILED

This can occur if base-id idmgr capacity is exceeded, or in case of idmgr corruption.

PAN-157730 and PAN-155807 fix an issue where ids are not released during the idmgr
reset, so even after the capacity gets back within the limits, commits would still fail.
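When many objects are affected, the Phase 0 error lines in ms.log can be collected programmatically. The following is a minimal Python sketch, assuming the quoted key always has the form scope+name (as in 'vsys1+TestURL' in the example); the function name and the returned dictionary layout are illustrative and not part of PAN-OS.

```python
import re

# Matches the ms.log error text, e.g.:
#   Error populating id for 'vsys1+TestURL' (4294967295)
ID_ERROR_RE = re.compile(r"Error populating id for '(?P<key>[^']+)' \((?P<code>\d+)\)")

def parse_id_errors(log_text):
    """Return one dict per 'Error populating id' entry found in the text."""
    results = []
    for m in ID_ERROR_RE.finditer(log_text):
        # Assumption: the key is "<scope>+<object name>"; keys without '+' keep an empty scope.
        scope, sep, name = m["key"].partition("+")
        if not sep:
            scope, name = "", m["key"]
        results.append({"scope": scope, "name": name, "code": int(m["code"])})
    return results

SAMPLE = (
    "2020-12-18 06:40:31.426 +0000 client device reported error: "
    "Error: Error populating id for 'vsys1+TestURL' (4294967295)\n"
    "2020-12-18 06:40:31.429 +0000 client device reported Phase 0 FAILED\n"
)

if __name__ == "__main__":
    for err in parse_id_errors(SAMPLE):
        # 4294967295 is 2**32 - 1, the largest unsigned 32-bit value.
        print(err["scope"], err["name"], err["code"])
```

The list of affected object names can then be compared against the idmgr dump collected with the commands that follow.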

To troubleshoot base idmgr commit issues, enable device-server debugs:

debug device-server set base id

debug device-server on debug

Collect idmgr data:

debug device-server dump idmgr type type_name all

From PAN-OS 10.0 we use redis to store idmgr data in memory:

debug device-server dump idmgr redis type type_name all

If corruption is suspected, idmgr can be reset:



debug device-server reset id-manager type type_name

After resetting idmgr, always issue “commit force”.

In case of HA, after resetting idmgr on active device, reset it also on the
passive/secondary device, followed by commit force on both devices.

Symptom 2 – Phase 1 failure in a client daemon


During phase 1, management-server puts the config for every process on sysd; the
daemons take it and try to apply it internally.

Phase 1 is where most commit failures occur.

Before a commit can even start, all the daemons must be registered (up and running):

*opcmds only - this process does not participate in commit.

Once the commit starts the phase1 and phase2 progress can be checked using the same
command (show management-clients):


In the example above, the command "show management-clients" was executed
during the commit. It can be seen that P1 finished OK on all other processes, but failed
on the device client (device-server).

In ms.log, you can check the clients for which phase1 was successful, and for which it was
unsuccessful:

2021-06-01 12:17:18.616 -0700 client device reported Phase 0 was SUCCESSFUL

2021-06-01 12:17:20.884 -0700 client routed reported Phase 1 was SUCCESSFUL

2021-06-01 12:17:20.920 -0700 client ha_agent reported Phase 1 was SUCCESSFUL

2021-06-01 12:17:22.430 -0700 client ikemgr reported Phase 1 was SUCCESSFUL

2021-06-01 12:17:22.584 -0700 client dhcpd reported Phase 1 was SUCCESSFUL

2021-06-01 12:17:22.612 -0700 client varrcvr reported Phase 1 was SUCCESSFUL

2021-06-01 12:17:22.696 -0700 client rasmgr reported Phase 1 was SUCCESSFUL

...

2021-06-01 12:17:34.519 -0700 client device reported Phase 1 FAILED
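On a busy system the per-client phase reports are scattered through ms.log. The following Python sketch groups them by phase and lists the clients that reported a failure. It assumes only the two line shapes shown in the excerpt above ("Phase N was SUCCESSFUL" and "Phase N FAILED"); the function names are illustrative.

```python
import re
from collections import defaultdict

# Matches ms.log lines such as:
#   "... client routed reported Phase 1 was SUCCESSFUL"
#   "... client device reported Phase 1 FAILED"
PHASE_RE = re.compile(
    r"client (?P<client>\S+) reported Phase (?P<phase>\d) "
    r"(?:was )?(?P<result>SUCCESSFUL|FAILED)"
)

def summarize_phases(log_text):
    """Return {phase: {client: result}} for every phase report in the text."""
    summary = defaultdict(dict)
    for m in PHASE_RE.finditer(log_text):
        summary[int(m["phase"])][m["client"]] = m["result"]
    return dict(summary)

def failed_clients(log_text):
    """Return sorted (phase, client) pairs for every client that reported FAILED."""
    return sorted(
        (phase, client)
        for phase, clients in summarize_phases(log_text).items()
        for client, result in clients.items()
        if result == "FAILED"
    )

SAMPLE = """\
2021-06-01 12:17:18.616 -0700 client device reported Phase 0 was SUCCESSFUL
2021-06-01 12:17:20.884 -0700 client routed reported Phase 1 was SUCCESSFUL
2021-06-01 12:17:22.430 -0700 client ikemgr reported Phase 1 was SUCCESSFUL
2021-06-01 12:17:34.519 -0700 client device reported Phase 1 FAILED
"""

if __name__ == "__main__":
    print(failed_clients(SAMPLE))
```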


Once you determine the client (process) that is failing, check the logs (if needed enable
debugs for that process).

In our example it was device-server – devsrvr.log (the name of the zone used was
incorrect):

2021-06-01 12:17:33.833 -0700 Error: pan_zones_find_by_name(pan_zone.c:449): pan_objmap_find_by_name 'ret' failed

2021-06-01 12:17:33.833 -0700 Error: pan_app_policy_from_obj(pan_config_parser.c:12998): pan_policy_parse_core_columns('test1') failed

...

2021-06-01 12:17:33.880 -0700 Error: pan_ctrl_save_config(pan_config_handler_sysd.c:2206): Error compiling config

<<vsys1>>

Error: Rulebase 'security'

'ret' zone is invalid from rule 'test1'

Error: Failed to parse security policy

<</vsys1>>

2021-06-01 12:17:33.925 -0700 Config commit phase1 failed

If the management-server debug level is set to dump for the commit feature, a
transformed xml file is created in the /tmp folder for every process participating in
the commit. Each file contains the xml config sent to that process:

[root@Lab80-143-PA-3250 tmp]# ls -al | grep transf


-rw-r--r-- 1 root root 7651 Jul 8 09:18 authd_transformed.xml
-rw-r--r-- 1 root root 4500 Jul 8 09:18 cord_transformed.xml
-rw-r--r-- 1 root root 385 Jul 8 09:18 cryptod_transformed.xml
-rw-r--r-- 1 root root 12337467 Jul 8 09:18 device_transformed.xml
-rw-r--r-- 1 root root 246 Jul 8 09:18 dhcpd_transformed.xml
-rw-r--r-- 1 root root 361 Jul 8 09:18 distributord_transformed.xml
-rw-r--r-- 1 root root 3361 Jul 8 09:18 dnsproxyd_transformed.xml
-rw-r--r-- 1 root root 57676 Jul 8 09:18 ha_agent_transformed.xml
-rw-r--r-- 1 root root 8310 Jul 8 09:18 ikemgr_transformed.xml
-rw-r--r-- 1 root root 275 Jul 8 09:18 iotd_transformed.xml
-rw-r--r-- 1 root root 787 Jul 8 09:18 l2ctrld_transformed.xml
-rw-r--r-- 1 root root 21836 Jul 8 09:18 logrcvr_transformed.xml
-rw-r--r-- 1 root root 109 Jul 8 09:18 pppoed_transformed.xml
-rw-r--r-- 1 root root 1428 Jul 8 09:18 rasmgr_transformed.xml
-rw-r--r-- 1 root root 3389 Jul 8 09:18 routed_transformed.xml
-rw-r--r-- 1 root root 248859 Jul 8 09:18 satd_transformed.xml
-rw-r--r-- 1 root root 407 Jul 8 09:18 sslmgr_transformed.xml
-rw-r--r-- 1 root root 369671 Jul 8 09:18 sslvpn_transformed.xml
-rw-r--r-- 1 root root 45898 Jul 8 09:18 useridd_transformed.xml
-rw-r--r-- 1 root root 1610 Jul 8 09:18 varrcvr_transformed.xml
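In the listing above, device_transformed.xml is by far the largest file. To rank the per-daemon configs by size in a longer listing, a sketch along these lines can be used. It assumes the standard nine-column "ls -al" row layout shown above; the function name is illustrative.

```python
SUFFIX = "_transformed.xml"

def transformed_sizes(ls_output):
    """Return [(daemon, size_in_bytes)] for *_transformed.xml rows, largest first."""
    sizes = []
    for line in ls_output.splitlines():
        fields = line.split()
        # An "ls -al" row has 9 fields: the 5th is the size, the last is the file name.
        if len(fields) >= 9 and fields[-1].endswith(SUFFIX):
            daemon = fields[-1][: -len(SUFFIX)]
            sizes.append((daemon, int(fields[4])))
    return sorted(sizes, key=lambda item: item[1], reverse=True)

SAMPLE_LS = """\
[root@Lab80-143-PA-3250 tmp]# ls -al | grep transf
-rw-r--r-- 1 root root 7651 Jul 8 09:18 authd_transformed.xml
-rw-r--r-- 1 root root 12337467 Jul 8 09:18 device_transformed.xml
-rw-r--r-- 1 root root 369671 Jul 8 09:18 sslvpn_transformed.xml
"""

if __name__ == "__main__":
    for daemon, size in transformed_sizes(SAMPLE_LS):
        print(f"{daemon:10s} {size:>10d} bytes")
```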


The commit can also fail during phase 1 if there is a communication error between
devsrvr on the MP side and the pan_comm process on the DP side, because devsrvr needs
to send the generated binary config to the pan_comm process on the DP:

Overall status: P1-abort. Progress: 0


Warnings:
Errors:
device: Error: A communication error happened during configuration commit
to dataplane, please try again.
device: (Module: device)

In the devsrvr log, there should be a "communication error" message:

2021-09-08 14:51:25.078 -0700 Error: pan_config_push_phase1(pan_config_handler_sysd.c:2427): A communication error happened during configuration commit to dataplane, please try again. sysd error description: TIMEOUT

In this case, check the netstat output to confirm that the connection between devsrvr
and pan_comm on port 2001 exists:

tcp 0 0 127.1.1.1:43864 127.1.1.2:2001 ESTABLISHED 0 38032 5616/devsrvr
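The same check can be scripted against saved netstat output. This is a sketch only: it assumes the column order of the sample line (protocol, queues, local address, foreign address, state, and a trailing PID/program column), and the default address 127.1.1.2:2001 is taken from that sample and may differ on other platforms.

```python
def has_established_connection(netstat_output, remote="127.1.1.2:2001", program="devsrvr"):
    """Return True if `program` owns an ESTABLISHED TCP connection to `remote`."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        owner = fields[-1]  # "PID/program" column, present when netstat reports processes
        if foreign == remote and state == "ESTABLISHED" and owner.endswith("/" + program):
            return True
    return False

SAMPLE_OK = "tcp 0 0 127.1.1.1:43864 127.1.1.2:2001 ESTABLISHED 0 38032 5616/devsrvr"
SAMPLE_BAD = "tcp 0 0 127.1.1.1:43864 127.1.1.2:2001 TIME_WAIT 0 0 -"

if __name__ == "__main__":
    print(has_established_connection(SAMPLE_OK))
    print(has_established_connection(SAMPLE_BAD))
```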

The status of the dataplane and the corresponding processes should also be checked
using the command "show system software status". For both processes (devsrvr and
pan_comm), the debug level can be increased (for pan_comm, the command "debug
dataplane process comm on debug" can be used) and a core of each process taken. After
that, the processes can be restarted to try to re-establish a working connection
between them.

Symptom 3 – Slow Commit


Prolonged commit times are usually expected when the config is large and/or the CPU
on the system is over-utilized.

There can also be other reasons, so when troubleshooting a slow commit, the main goal
is to isolate the issue to the specific commit phase that contributes most to the
extended duration.

The exact duration of the commit can be checked from the output of “show jobs”
command or from the ms log. In the ms.log, there will be an entry when the commit job
starts and ends:


2021-06-02 16:22:52.902 -0700 Commit job started processing. Dequeue time=2021/06/02 16:22:52. JobId=15.User: admin

2021-06-02 16:23:34.435 -0700 Commit job 15 completed and system log generated
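The duration is the difference between the two leading timestamps. A small Python sketch, assuming every log line starts with a "YYYY-MM-DD HH:MM:SS.mmm +ZZZZ" timestamp as in the example (helper names are illustrative):

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f %z"

def parse_ts(line):
    """Parse the leading date, time and UTC offset of an ms.log line."""
    return datetime.strptime(" ".join(line.split()[:3]), TS_FORMAT)

def commit_duration_seconds(start_line, end_line):
    """Seconds elapsed between the job-start and job-completed log lines."""
    return (parse_ts(end_line) - parse_ts(start_line)).total_seconds()

START = ("2021-06-02 16:22:52.902 -0700 Commit job started processing. "
         "Dequeue time=2021/06/02 16:22:52. JobId=15.User: admin")
END = "2021-06-02 16:23:34.435 -0700 Commit job 15 completed and system log generated"

if __name__ == "__main__":
    print(f"{commit_duration_seconds(START, END):.3f} s")  # 41.533 s for the example above
```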

If management server is set to debug for the commit all feature, during the commit the
ms log should contain the exact duration in seconds for all phases and checks during the
process:

debug management-server set commit all

debug management-server on debug

admin@LabPA-3250> grep pattern "secs\|seconds" mp-log ms.log

2021-06-02 10:07:40.675 -0700 debug: pan_cfg_save_commit_candidate(pan_cfg_users.c:3537): COMMIT commit: Time to pan_cfg_tpl_generate_candidate: 0 secs

2021-06-02 10:07:41.491 -0700 debug: pan_cfg_save_commit_candidate(pan_cfg_users.c:3547): COMMIT commit: Time to pan_cfg_sp_generate_candidate: 1 secs

2021-06-02 10:07:41.491 -0700 debug: pan_cfg_save_commit_candidate(pan_cfg_users.c:3550): Takes 1 seconds to generate shared policy/ template import.

2021-06-02 10:07:41.963 -0700 debug: pan_cfg_save_commit_candidate(pan_cfg_users.c:3634): COMMIT commit: Time to generate .revertible-merged-running-config.xml: 0 secs

2021-06-02 10:07:41.963 -0700 debug: pan_cfg_save_candidate_config(pan_cfg_users.c:3150): COMMIT SaveCandidate: Time to unlink devices node: 0 secs

2021-06-02 10:07:42.388 -0700 debug: pan_cfg_save_candidate_config(pan_cfg_users.c:3185): COMMIT SaveCandidate: Time to merge global and custom global: 1 secs

2021-06-02 10:07:42.388 -0700 debug: pan_cfg_save_candidate_config(pan_cfg_users.c:3217): COMMIT SaveCandidate: Time to make other node copies: 0 secs

2021-06-02 10:07:42.388 -0700 debug: pan_cfg_save_candidate_config(pan_cfg_users.c:3304): COMMIT SaveCandidate: Time to add nodes back: 0 secs

2021-06-02 10:07:43.018 -0700 debug: pan_cfg_save_candidate_config(pan_cfg_users.c:3384): COMMIT SaveCandidate: Time to commit transform: 1 secs

...

2021-06-02 10:07:43.758 -0700 Takes 4 seconds to generate commit candidate in cfg_by_cookie.

2021-06-02 10:07:43.877 -0700 Schema validation including uuid check for job 12 takes 0 seconds

...

2021-06-02 10:07:55.416 -0700 debug: pan_mgmt_client_table_do_commit(pan_cfg_commit_jobs.c:3953): COMMIT CLIENT_TABLE_DO_COMMIT: Time to pan_mgmt_client_table_do_phase1: 5 secs


2021-06-02 10:08:06.943 -0700 debug: pan_mgmt_client_table_do_commit(pan_cfg_commit_jobs.c:4020): COMMIT CLIENT_TABLE_DO_COMMIT: Time to Phase1 completed.: 11 secs

2021-06-02 10:08:14.822 -0700 debug: pan_mgmt_client_table_do_commit(pan_cfg_commit_jobs.c:4238): COMMIT CLIENT_TABLE_DO_COMMIT: Time to Phase2 completed: 3 secs
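With debug enabled, the "Time to ...: N secs" entries can be ranked to see which step dominates. The sketch below only handles that one message shape (the "Takes N seconds ..." variants are ignored), and the function name is illustrative.

```python
import re

# Matches e.g. "Time to Phase1 completed.: 11 secs" and captures the step and its duration.
STEP_RE = re.compile(r"Time to (?P<step>.+?): (?P<secs>\d+) secs")

def slowest_steps(log_text, top=3):
    """Return the `top` (step, seconds) pairs with the longest reported duration."""
    steps = [(m["step"], int(m["secs"])) for m in STEP_RE.finditer(log_text)]
    return sorted(steps, key=lambda step: step[1], reverse=True)[:top]

SAMPLE = """\
COMMIT commit: Time to pan_cfg_sp_generate_candidate: 1 secs
COMMIT SaveCandidate: Time to commit transform: 1 secs
COMMIT CLIENT_TABLE_DO_COMMIT: Time to pan_mgmt_client_table_do_phase1: 5 secs
COMMIT CLIENT_TABLE_DO_COMMIT: Time to Phase1 completed.: 11 secs
COMMIT CLIENT_TABLE_DO_COMMIT: Time to Phase2 completed: 3 secs
"""

if __name__ == "__main__":
    for step, secs in slowest_steps(SAMPLE):
        print(f"{secs:4d} s  {step}")
```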

If a specific phase takes a long time, the client responsible can be identified either
with the command “show management-clients” during the commit, or in the ms log,
where for both phase1 and phase2 there is a log message for every client when that
phase is done (example for useridd):

Line 893: 2021-06-02 10:08:05.867 -0700 client useridd reported Phase 1 was SUCCESSFUL

Line 917: 2021-06-02 10:08:11.481 -0700 client useridd reported Phase 2 was SUCCESSFUL
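To see which client reports last in a given phase, the per-client report lines can be ordered by time. A Python sketch, using lines from the earlier ms.log excerpt as sample input; the function name is illustrative and the regex assumes the "client X reported Phase N" wording shown above.

```python
import re
from datetime import datetime

REPORT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ [+-]\d{4}) "
    r"client (?P<client>\S+) reported Phase (?P<phase>\d)"
)

def report_offsets(log_text, phase):
    """Return [(client, seconds after the first report)] for one phase, in time order."""
    reports = []
    for m in REPORT_RE.finditer(log_text):
        if int(m["phase"]) == phase:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f %z")
            reports.append((ts, m["client"]))
    reports.sort()
    if not reports:
        return []
    first = reports[0][0]
    return [(client, round((ts - first).total_seconds(), 3)) for ts, client in reports]

SAMPLE = """\
2021-06-01 12:17:20.884 -0700 client routed reported Phase 1 was SUCCESSFUL
2021-06-01 12:17:20.920 -0700 client ha_agent reported Phase 1 was SUCCESSFUL
2021-06-01 12:17:22.430 -0700 client ikemgr reported Phase 1 was SUCCESSFUL
2021-06-01 12:17:34.519 -0700 client device reported Phase 1 FAILED
"""

if __name__ == "__main__":
    for client, offset in report_offsets(SAMPLE, phase=1):
        print(f"{offset:8.3f} s  {client}")
```

The client with the largest offset is the one to look at first; a late report does not by itself prove that the daemon is at fault.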

Once the problematic client is isolated, check the logs for that daemon (debugs can
also be taken) for more details, and check its activity using “show system resources
follow”. If the daemon is very high on CPU or unresponsive, a core and multiple traces
can be taken for further troubleshooting.

Sometimes the commit can be slow due to the amount of time needed for the config to
be saved on the disk - debug ms log example:

2021-07-22 14:00:13.975 -0400 debug: _do_save_configs(pan_cfg_commit_handler.c:2079): COMMIT _Do_Save_Configs: Time to pan_cfg_save_last_candidatecfg_version: 264 secs

In that case, the diskstat output in mp-monitor can be checked for the time of the
commit, and the system disk performance can also be checked from root using the Linux
dd command.

From PAN-OS 10.0 on, the command "show system last-commit-info" can be used to get
the breakdown of the last commit including the duration per phase/process:


Symptom 4 – Panorama commit-all Does Not Reach FW, or Does Not Send Properly
During a commit-all, Panorama sends either the template or the shared policy (SP)
config, depending on the type of commit.

That config is merged with FW’s candidate config and commit is done on the merged
configuration.

Panorama generates the panorama-config.xml file for a device-group commit-all
(/opt/pancfg/mgmt/groups/name_of_the_dg). For a template, it generates
template-config.xml under /opt/pancfg/mgmt/templates/name_of_the_template. The file
panorama-config.xml is present in the TS file.

Once this data is pushed to the FW, the FW saves it to the folders /opt/pancfg/mgmt/sp
and /opt/pancfg/mgmt/template (there is a folder per vsys, and also a file for shared
data).

Every folder contains two files: a pre-trans file (exactly what was pushed from
Panorama) and a transformed xml file, which is the one used for merging.

Once the push has been sent to the FW, the FW should start the commit job.

If the job itself is not queued (show jobs all), the first thing to check should be the
connectivity between FW and Panorama (system log, ms.log).

Panorama sends the config files to the FW through evtmgr, so it is possible to hit the
message size limit (approximately 150MB).

Check the ms.log on panorama for a possible communication error.


If the config pushed from the panorama does not contain some data (causing the
deletion of part of the merged config on the FW), the files panorama-config.xml and
template-config.xml should be checked and compared with the sp and template files on
the FW.
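For large configs the comparison is easier to script than to do by eye. The sketch below is generic: it collects every `<entry name="...">` element together with its parent tag from two XML documents and reports the ones present in the first but not in the second. It assumes both files are well-formed XML and that comparing by (parent tag, entry name) is meaningful for the files at hand, which may not hold when the two files nest the data differently. The sample XML is hypothetical.

```python
import xml.etree.ElementTree as ET

def named_entries(xml_text):
    """Return the set of (parent tag, entry name) pairs found in the document."""
    root = ET.fromstring(xml_text)
    found = set()
    for parent in root.iter():
        for child in parent:
            if child.tag == "entry" and "name" in child.attrib:
                found.add((parent.tag, child.attrib["name"]))
    return found

def missing_entries(pushed_xml, received_xml):
    """Entries present in the pushed config but absent from the received one."""
    return sorted(named_entries(pushed_xml) - named_entries(received_xml))

# Hypothetical sample data, for illustration only.
PUSHED = """<config><shared><address>
  <entry name="srv-a"/><entry name="srv-b"/>
</address></shared></config>"""
RECEIVED = """<config><shared><address>
  <entry name="srv-a"/>
</address></shared></config>"""

if __name__ == "__main__":
    for parent, name in missing_entries(PUSHED, RECEIVED):
        print(f"missing on FW: {parent}/{name}")
```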

A very common problem is that, when the setting “Share Unused Address and Service
Objects with Devices” is disabled, not all objects that are used on the FW are
pushed.

Check also the references based on the DAG membership.

If the references are not pushed, the workaround could be to rebuild the xml cache on
the Panorama side:

• Load config from running-config.xml
• Enqueues a BuildXMLCache job → ensure this passes - check the Task Manager
• debug md5sum_cache clear → to ensure all DGs/TPL configurations are regenerated
• commit force

When troubleshooting missing configs pushed from Panorama, the files to investigate
are those saved in the folder "/opt/pancfg/mgmt/tmp/panorama-pushed" on the FW (they
are included in the TS file). By investigating these files, it can be determined in
which phase the config goes missing. The following files are present in that location
(for some files to be generated, the management-server debug level should be
temporarily set to dump):

sp-push-request.xml - the request from Panorama for dg push

pushsp.xml - version transform, shared policy is expanded to each vsys config

lastsp.xml - last pushed dg

newsp.xml - getting generated by transposing the new sp node over the old one (Go
through shared and each vsys and replace nodes with the ones from pushsp.xml)

mergesp.xml - merge the sp node with candidate cfg (replace /config/panorama with
newsp.xml, apply rename-map, and invoke the sp-importer transform)

tpl-push-request.xml - the request from panorama for tpl push

pushtpl.xml - version transform, vsys name/id change, modify the references to vsys
display name in the template configuration to the vsys name, based on the device vsys
configuration

newtpl.xml - compared to the dg, there is no transpose for the template, so the newtpl
will be either the newly pushed template or the last pushed one depending on the type
of the push

mergetpl.xml - merge the template with the candidate config

If management-server is set to dump there will be before-merged and after-merged sp
and tpl files generated:

before-sp-imported.xml - before the sp-importer transform

after-sp-imported.xml - after the sp-importer transform

before-tpl-merged.xml - before the template merge with candidate config

after-tpl-merged.xml - after the template merge with candidate config

Symptom 5 - Commit Fails on DP Due To Memory Allocation Failure
During the commit phase 1, device-server needs to generate the binary config used for
DP and send it to pan_comm.

It can happen that due to no available memory, pan_comm fails the commit. Usually in
the pan_comm logs there will be entries like:

2019-03-08 13:24:10 2019-03-08 13:24:10.666 -0500 Error: pan_alloc_nofree_chunk(pan_alloc.c:1033): malloc 131072 failed

2019-03-08 13:24:10 2019-03-08 13:24:10.667 -0500 Error: pan_cfg_readin(pan_cfg.c:3037): pan_cfg_construct_cfg_allocator() failed

2019-03-08 13:24:10 2019-03-08 13:24:10.667 -0500 Error: _handle_CONFIG_UPDATE_START(pan_host.c:597): pan_cfg_readin() failed

2019-03-08 13:24:10 2019-03-08 13:24:10.667 -0500 Error: pan_host_handle_msg(pan_host.c:1072): failed to handle CONFIG_UPDATE_START

The commit job will be failing at the device module with the reason
CONFIG_UPDATE_START.

The first thing to check is the memory utilization in the output of the following command:

debug dataplane show cfg-memstat statistics

During the commit, check the “config allocator usage” and the “policy cache
allocator usage”.

Example of the config allocator usage (pan_comm keeps two copies of the config in
memory: the previous and the current running config):

VSYS Config Allocator Usage : 16000KB (10% of 150656KB)

Current config memory usage

Misc : 1408 KB (Actual 1227 KB)



Custom URL : 128 KB (Actual 0 KB)

Global : 5632 KB (Actual 5538 KB)

vsys1 : 384 KB (Actual 188 KB)

Last config memory usage

Misc : 1408 KB (Actual 1227 KB)

Global : 5632 KB (Actual 5538 KB)

vsys1 : 384 KB (Actual 188 KB)
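The headline usage line can be turned into numbers for trending across commits. A Python sketch, assuming allocator lines follow the "NAME Allocator Usage : usedKB (P% of totalKB)" shape of the sample; the 80% threshold is an arbitrary illustrative value, not a documented limit.

```python
import re

USAGE_RE = re.compile(
    r"(?P<name>[\w ]+?) Allocator Usage\s*:\s*(?P<used>\d+)\s*KB "
    r"\((?P<pct>\d+)% of (?P<total>\d+)\s*KB\)",
    re.IGNORECASE,
)

def allocator_usage(output):
    """Return {allocator name: (used_kb, total_kb, percent_used)}."""
    usage = {}
    for m in USAGE_RE.finditer(output):
        used, total = int(m["used"]), int(m["total"])
        usage[m["name"].strip()] = (used, total, round(100 * used / total, 1))
    return usage

def near_limit(output, threshold_pct=80.0):
    """Names of allocators at or above the (assumed) warning threshold."""
    return [name for name, (_, _, pct) in allocator_usage(output).items()
            if pct >= threshold_pct]

SAMPLE = """\
VSYS Config Allocator Usage : 16000KB (10% of 150656KB)
Current config memory usage
Misc : 1408 KB (Actual 1227 KB)
Global : 5632 KB (Actual 5538 KB)
"""

if __name__ == "__main__":
    print(allocator_usage(SAMPLE))
    print(near_limit(SAMPLE))
```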

If the usage is close to the limit, parts of the config need to be reduced in size
(check the platform support limit for different config options). The following table
shows which feature belongs to which config memory area:

Configuration/Feature                                        cfg-memory used
IP-object/EDL-IP                                             vsys
custom-url/EDL-url                                           Custom URL & Global
url-profile                                                  Global
Security profiles (other than url-profile)                   Misc
Policies                                                     vsys
Network configuration and all other global configurations    Global

The second thing in the output of the aforementioned command to be checked is the
policy cache usage.

Check the usage of the EDLs in the policies, especially as a destination inside the security
policies.

Once downloaded, EDLs are saved on the disk and can be found in the TS file:

\opt\pancfg\mgmt\devices\localhost.localdomain

Symptom 6 - Commit Stuck on FW or the commit-all from Panorama Is Stuck Even Though Local FW commits Succeed

If the commit on the FW seems to be stuck, the first objective is to find the offending
process.

For this, the command "show management-clients" can be used.

Once the offending process is determined, the debugs for that process can be taken,
together with the core and trace files of the process.

A core and trace of the management-server (configd) process should also be taken.

There could be a situation, where during the push from panorama, commit on the FW
side is successful (the job is OK), but the corresponding commit-all job on the panorama
side does not finish.

This could be due to communication issues or a malformed response. The best way to
isolate the issue (whether it is on the FW or the panorama side) is to set
management-server to the debug level on both sides and check for the following logs:

On the FW side (ms.log):

2021-04-08 18:40:59.630 -0700 debug: pan_cfg_mgr_send_push_done_message(pan_cfg_push_handler.c:571): sending the following push-done message:

On the Panorama side (configd.log):

2021-04-08 18:40:59.801 +0000 PUSH: received push-data : .....

Issue Replication Tips


When troubleshooting commit failures, if the issue cannot be solved based only on
observations from the logs and config changes, or when the case needs to be escalated
or a bug opened, it is necessary to recreate the setup in the lab.

Even though the issue may not be reproduced, this facilitates easier troubleshooting and
finding the root cause.

During the replication, use the same PAN-OS version and if possible use the same model
of the FW/Panorama.

Ideally, the customer can provide the device state from the FW and the running-config
from the panorama.

The file \opt\pancfg\mgmt\audit\cfg-audit.xml,v can be taken from the TS file and used
on the reproduction setup if the customer's recent changes need to be loaded.


The EDL files from the TS file can also be hosted on a lab web server to replicate EDL
issues, even when the customer's EDLs are not publicly reachable.

From PAN-OS 9.1 on, on the FW, the previous versions of the merged config are saved
(containing also the panorama config) – they can be seen from the CLI using the
command:

debug management-server last-candidatecfg-audit

In the folder /opt/pancfg/mgmt/audit there is an audit file for merged config,
last-candidatecfg-audit.xml,v (present in the TS file).

This can be used in case we need to restore the config from some point in time (including
the panorama’s pushed configuration).

How To Troubleshoot Which Daemon Is Causing a Commit Failure
• Once a commit is issued on panorama or pushed from panorama to firewall
o Check ms.log on firewalls running pre-10.1 version, otherwise, check
configd.logs
o If the commit failure is on panorama, check configd.logs
• If the commit failed during phase1, check which daemon reported the failure first
• As an example:

2022-03-18 20:39:16.301 -0700 client device reported error: <>


.....
client device reported Phase 1 FAILED


o As you can see, devsrvr has reported a failure. This can indicate an issue in
the device server process; the team that handles these commit-related issues is
Layer-7

Scenario-1: Commit failed with error "Failed to modify obj for phase1 push: TIMEOUT"
Log analysis:

2022-02-10 12:40:57.837 -0500 client device reported error: Warning: No Valid DNS
Security License
<>
Warning: cannot find complete certificate chain for certificate FRTrusted
Warning: vsys2 decryption: forward decrypt untrust cert is not configured, forward
decrypt trust cert will be used instead.
<>
<>
Warning: cannot find complete certificate chain for certificate ipac_cer_ft
<>


Error: failed to modify obj for phase1 push: TIMEOUT


response from cfgpush.s1.comm.cfg: Error: failed to modify obj for phase1 push:
TIMEOUT
(Module: device)
2022-02-10 12:40:57.837 -0500 client device reported error: Warning: cannot find
complete certificate chain for certificate FRTrusted
Warning: vsys2 decryption: forward decrypt untrust cert is not configured, forward
decrypt trust cert will be used instead.
Warning: cannot find complete certificate chain for certificate ipac_cer_ft
(Module: device)
2022-02-10 12:40:58.423 -0500 client useridd reported Phase 1 was SUCCESSFUL
2022-02-10 12:40:59.839 -0500 client device reported Phase 1 FAILED
2022-02-10 12:40:59.839 -0500
Error: pan_mgmt_client_table_do_commit(pan_cfg_commit_jobs.c:4091): phase 1 failed

• As seen in ms.log, the device server process reported that phase 1 failed
• Looking further in devsrvr.log around the time of the commit failure:

2022-02-10 12:40:57.755 -0500 push config takes 950 sec


2022-02-10 12:40:57.756 -0500 check cfgpush.s4.comm.cfg object
2022-02-10 12:40:57.756 -0500
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s4.comm.cfg: Error: failed to modify obj for phase1 push: TIMEOUT
2022-02-10 12:40:57.756 -0500 check cfgpush.s1.comm.cfg object
2022-02-10 12:40:57.756 -0500
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s1.comm.cfg: Error: failed to modify obj for phase1 push: TIMEOUT
2022-02-10 12:40:57.756 -0500 check cfgpush.s2.comm.cfg object
2022-02-10 12:40:57.756 -0500
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s2.comm.cfg: Error: failed to modify obj for phase1 push: TIMEOUT
2022-02-10 12:40:57.756 -0500 check cfgpush.s3.comm.cfg object
2022-02-10 12:40:57.756 -0500
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s3.comm.cfg: Error: failed to modify obj for phase1 push: TIMEOUT
2022-02-10 12:40:57.756 -0500
Error: pan_config_handler_sysd(pan_config_handler_sysd.c:2897):
pan_config_push_phase1() failed
2022-02-10 12:40:57.756 -0500
Error: pan_ctrl_parse_config(pan_controller_proc.c:590): pan_config_handler_sysd()
failed

• This indicates a push issue between devsrvr and pan_comm
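
The devsrvr.log pattern above (a long "push config takes N sec" followed by TIMEOUT responses from several cfgpush objects) can be summarized with a short script. This is a minimal sketch, assuming the message wording shown in the excerpt; the function name and regular expressions are illustrative, and whitespace is flattened first because the messages wrap across lines in the excerpt.

```python
import re

# Sample lines copied from the devsrvr.log excerpt above (two of the four objects).
DEVSRVR_LOG = """\
2022-02-10 12:40:57.755 -0500 push config takes 950 sec
2022-02-10 12:40:57.756 -0500 check cfgpush.s4.comm.cfg object
2022-02-10 12:40:57.756 -0500
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s4.comm.cfg: Error: failed to modify obj for phase1 push: TIMEOUT
2022-02-10 12:40:57.756 -0500 check cfgpush.s1.comm.cfg object
2022-02-10 12:40:57.756 -0500
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s1.comm.cfg: Error: failed to modify obj for phase1 push: TIMEOUT
"""

PUSH_TIME_RE = re.compile(r"push config takes (\d+) sec")
TIMEOUT_RE = re.compile(
    r"got error response from (cfgpush\.\S+?): "
    r"Error: failed to modify obj for phase1 push: TIMEOUT"
)


def summarize_push(log_text):
    """Return (push durations in seconds, sorted cfgpush objects that timed out)."""
    # Collapse line wraps so a message split over two lines matches as one.
    flat = " ".join(log_text.split())
    durations = [int(sec) for sec in PUSH_TIME_RE.findall(flat)]
    timed_out = sorted(set(TIMEOUT_RE.findall(flat)))
    return durations, timed_out


if __name__ == "__main__":
    durations, timed_out = summarize_push(DEVSRVR_LOG)
    print("push durations (sec):", durations)
    print("objects that timed out:", timed_out)
```

A timeout on every cfgpush object after a single long push, as in this scenario, is consistent with the devsrvr to pan_comm push issue noted above.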

Scenario-2: Commit fails on Panorama after a downgrade from 10.1.4-h2 to 10.0.9
Log Analysis:

• The commit failure on this Panorama indicates an issue with sdwan-link-settings on
interfaces: validation flags these nodes as "unexpected here"

sd_wan plugin validation: Config valid
Validation Error:
devices -> localhost.localdomain -> device-group -> AU-MELBN2-BR-FW-EXT-DG-
MDF -> post-rulebase -> authentication -> rules -> externalauth -> hip-
profiles unexpected here
devices -> localhost.localdomain -> device-group -> AU-MELBN2-BR-FW-EXT-DG-
MDF -> post-rulebase -> authentication -> rules is invalid

devices -> localhost.localdomain -> template -> DE-MUNCH1-HB-FW-EXT-T-FBMD ->
config -> devices -> localhost.localdomain -> network -> interface ->
ethernet -> ethernet1/19 -> layer3 -> units -> ethernet1/19.3 -> sdwan-link-
settings unexpected here
devices -> localhost.localdomain -> template -> DE-MUNCH1-HB-FW-EXT-T-FBMD ->
config -> devices -> localhost.localdomain -> network -> interface ->
ethernet -> ethernet1/19 -> layer3 -> units is invalid
devices -> localhost.localdomain -> template -> US-CENTN1-DC-FW-INT-02T-A06 ->
config -> devices -> localhost.localdomain -> network -> interface ->
ethernet -> ethernet1/2 -> layer3 -> units -> ethernet1/2.9 -> sdwan-link-
settings unexpected here
devices -> localhost.localdomain -> template -> US-CENTN1-DC-FW-INT-02T-A06 ->
config -> devices -> localhost.localdomain -> network -> interface ->
ethernet -> ethernet1/2 -> layer3 -> units is invalid
devices -> localhost.localdomain -> template -> MXNO-SO-FW-EXT-WH_NEW_T ->
config -> devices -> localhost.localdomain -> network -> interface ->
aggregate-ethernet -> ae3 -> layer3 -> sdwan-link-settings unexpected here
devices -> localhost.localdomain -> template -> MXNO-SO-FW-EXT-WH_NEW_T ->
config -> devices -> localhost.localdomain -> network -> interface ->
aggregate-ethernet -> ae3 -> layer3 is invalid

• This should be filed as a networking issue, not a management issue
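
With many templates, the Validation Error output above gets long. A short script can group the rejected nodes by template or device group. This is a minimal sketch, assuming each validation entry has been rejoined onto a single line (the excerpt wraps them) and ends in "unexpected here"; the function name and the grouping are illustrative only.

```python
from collections import defaultdict

# Entries taken from the Validation Error output above, rejoined onto one line each.
VALIDATION_OUTPUT = [
    "devices -> localhost.localdomain -> template -> DE-MUNCH1-HB-FW-EXT-T-FBMD -> "
    "config -> devices -> localhost.localdomain -> network -> interface -> "
    "ethernet -> ethernet1/19 -> layer3 -> units -> ethernet1/19.3 -> "
    "sdwan-link-settings unexpected here",
    "devices -> localhost.localdomain -> template -> DE-MUNCH1-HB-FW-EXT-T-FBMD -> "
    "config -> devices -> localhost.localdomain -> network -> interface -> "
    "ethernet -> ethernet1/19 -> layer3 -> units is invalid",
    "devices -> localhost.localdomain -> template -> MXNO-SO-FW-EXT-WH_NEW_T -> "
    "config -> devices -> localhost.localdomain -> network -> interface -> "
    "aggregate-ethernet -> ae3 -> layer3 -> sdwan-link-settings unexpected here",
]

SUFFIX = " unexpected here"


def unexpected_nodes(lines):
    """Map (scope kind, scope name) to a list of (parent node, rejected node)."""
    found = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.endswith(SUFFIX):
            continue  # skip the "... is invalid" follow-up entries
        path = line[: -len(SUFFIX)].split(" -> ")
        if len(path) < 5:
            continue  # too short to hold a scope and a rejected node
        kind, name = path[2], path[3]  # e.g. "template", "<template name>"
        found[(kind, name)].append((path[-2], path[-1]))
    return dict(found)


if __name__ == "__main__":
    for (kind, name), nodes in unexpected_nodes(VALIDATION_OUTPUT).items():
        for parent, node in nodes:
            print(f"{kind} {name}: '{node}' rejected under '{parent}'")
```

Grouping by the rejected node name makes it easier to see that most entries in this scenario share the same sdwan-link-settings cause.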

Scenario-3: Autocommit failure due to "Certificate <Cert-name>" failed to load: failed to parse key
Log Analysis:

• The auto-commit fails because a certificate's private key cannot be parsed when the
certificate is loaded, which is related to sslmgr
• ms.log

2022-03-07 20:31:53.215 -0800 client device reported error: <>
Warning: cannot find complete certificate chain for certificate EHDDVPN2
<>
Error: Certificate 'DecryptTraffic' failed to load: failed to parse key
Error preparing global objects
failed to handle CONFIG_UPDATE_START
(Module: device)
2022-03-07 20:31:53.216 -0800 client device reported error: Warning: cannot find
complete certificate chain for certificate EHDDVPN2
(Module: device)
2022-03-07 20:31:54.102 -0800 client useridd reported Phase 1 FAILED

• devsrvr.log

2022-03-07 20:31:51.150 -0800 phase1: modifying cfgpush.*.*.cfg
2022-03-07 20:31:52.987 -0800 push config takes 1 sec
2022-03-07 20:31:52.988 -0800 check cfgpush.s1.comm.cfg object
2022-03-07 20:31:52.988 -0800
Warning: pan_cfg_sysd_parse_response_msg(pan_cfg_sysd.c:952): got error response from
cfgpush.s1.comm.cfg: Certificate 'DecryptTraffic' failed to load: failed to parse key
Error preparing global objects
failed to handle CONFIG_UPDATE_START
2022-03-07 20:31:52.988 -0800
Error: pan_config_handler_sysd(pan_config_handler_sysd.c:2897):
pan_config_push_phase1() failed
2022-03-07 20:31:52.988 -0800

Error: pan_ctrl_parse_config(pan_controller_proc.c:590): pan_config_handler_sysd()
failed

• pan_comm.log

2022-03-07 20:31:52.983 -0800 Error: pan_pki_parse_privkey(pan_pki.c:426):
pan_pki_load_key_buf() failed
2022-03-07 20:31:52.983 -0800 Error: pan_ssl_load_certkey_i(pan_ssl.c:587):
pan_pki_parse_privkey() failed
2022-03-07 20:31:52.983 -0800 Error: pan_ssl_load_cfg_cert(pan_ssl.c:2675):
pan_ssl_load_certkey(DecryptTraffic) failed: Certificate 'DecryptTraffic' failed to
load: failed to parse key
2022-03-07 20:31:52.983 -0800 Error: pan_ssl_load_decrypt_cfg(pan_ssl.c:2802):
pan_ssl_load_cfg_cert(DecryptTraffic) failed: Certificate 'DecryptTraffic' failed to
load: failed to parse key
2022-03-07 20:31:52.983 -0800 Error: pan_proxy_load_cfg(pan_proxy.c:7335):
pan_ssl_load_decrypt_cfg() failed
2022-03-07 20:31:52.983 -0800 Error: pan_cfg_prepare_global(pan_cfg_dp.c:423):
pan_proxy_load_cfg() failed
2022-03-07 20:31:52.983 -0800 Error: _handle_CONFIG_UPDATE_START(pan_host.c:627):
pan_cfg_readin() failed
2022-03-07 20:31:52.983 -0800 Error: pan_host_handle_msg(pan_host.c:1271):
Certificate 'DecryptTraffic' failed to load: failed to parse key

• Based on the above analysis, this is a Layer 7 issue
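
The pan_comm.log errors above read as a call chain, from the key parser up to the config update handler. A short script can list that chain and pull out the affected certificate. This is a minimal sketch, assuming the "Error: function(file:line): message" layout seen in the excerpt with each entry on one line; the function names and regular expressions are illustrative only.

```python
import re

# First three entries of the pan_comm.log excerpt above, rejoined onto one line each.
PAN_COMM_LOG = """\
2022-03-07 20:31:52.983 -0800 Error: pan_pki_parse_privkey(pan_pki.c:426): pan_pki_load_key_buf() failed
2022-03-07 20:31:52.983 -0800 Error: pan_ssl_load_certkey_i(pan_ssl.c:587): pan_pki_parse_privkey() failed
2022-03-07 20:31:52.983 -0800 Error: pan_ssl_load_cfg_cert(pan_ssl.c:2675): pan_ssl_load_certkey(DecryptTraffic) failed: Certificate 'DecryptTraffic' failed to load: failed to parse key
"""

# Assumed entry shape: "Error: <function>(<source file>:<line>): <message>".
FRAME_RE = re.compile(r"Error: (?P<func>\w+)\((?P<file>[\w.]+):(?P<line>\d+)\): ")
CERT_RE = re.compile(r"Certificate '(?P<name>[^']+)' failed to load: (?P<reason>.+)")


def error_chain(log_text):
    """Return (function, source file, line number) per error, innermost call first."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((m["func"], m["file"], int(m["line"])))
    return frames


def failing_certs(log_text):
    """Return the distinct (certificate name, reason) pairs found in the log."""
    return sorted(
        {(m["name"], m["reason"].strip()) for m in CERT_RE.finditer(log_text)}
    )


if __name__ == "__main__":
    for func, src, lineno in error_chain(PAN_COMM_LOG):
        print(f"{func} ({src}:{lineno})")
    print("certificates:", failing_certs(PAN_COMM_LOG))
```

The first entry in the chain is the innermost failure, here the private key parser, and the certificate name identifies which object to inspect in the configuration.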

Escalation
If, after completing the steps in this playbook and conducting your own analysis, you are
not able to solve the issue for your customer, it may be necessary to engage the Palo Alto
Networks Global Customer Support team for additional assistance. Filing a support case
with the correct team and a full set of actionable data will substantially accelerate
identification and resolution of issues. Always include tech-support files from affected
devices whenever possible, and note when specific information cannot be shared.

Palo Alto Networks Support Escalation Template

Problem Description
-----
Should contain a brief description of the problem

Date/Time(s) of Issue
-----
Specify the exact time when the commit was started and when it failed.

Expected/Observed Behavior
-----
commit should succeed / commit failing

Troubleshooting Steps Taken
-----
Explanation of the steps taken, and also the collected data.

Include the #playbook-used tag in this section.

What Has Changed?
-----
List any action besides the config changes that could have caused the issue (software
update, dynamic updates...)

Steps To Reproduce
-----
For commit failures, this is a mandatory field. It should contain copies of the problematic
configuration files whenever possible.

“show system info” Output
-----
