VMware Scenario Based
Answer: I am part of a VMware team of 5 to 10 members, working at L2 level in support of a global
customer account. Tell the interviewer that customer details are confidential and cannot be disclosed.
Technical details: We have 6 vCenter servers configured across the customer's two data center locations:
3 for production and 3 for DR. Each production vCenter server has multiple clusters (say 5), and each
cluster has 10+ ESXi servers. We support 350+ ESXi servers running 2500+ Windows virtual machines
and 6000 Linux VMs. Clusters are configured as “All automated”.
We run ESXi 5.x and 6.x. The vCenter servers are at version 5.5 U2.
Daily tasks:
1. VMware health check report: run the script and send the report to management
2. Check the vCenter Server console for alarms and alerts
3. Schedule changes required for VMware tasks
4. Attend meetings with architects
5. Work on VMware-related incidents such as backup failures, VMs not pinging, ESXi host down,
vCenter service failure, etc.
6. Monitor datastore usage
7. Resource capacity planning for CPU, RAM and disk
8. Network-related issues
9. VM-related issues
10. Hardware-related issues for HP, Cisco and IBM servers
11. Coordinate with vendors such as VMware, Microsoft, HP and Cisco
Question: A virtual machine is not reachable by ping or RDP, and it appears hung in the vCenter console.
All options in the VM menu are grayed out. How do you recover this virtual machine? Hint: The interviewer
wants to know your ESXi command-line skills for troubleshooting this scenario.
Answer: From the symptoms it is clear that nothing can be done from vCenter Server, except reviewing
Tasks & Events to see whether any action was performed before the VM hung.
For example: a backup job taking a snapshot, or an L1/L2 admin firing multiple
shutdown/restart/power-off tasks at the VM.
With these symptoms in mind, identify the ESXi host on which the VM is running and get the root password
from your team lead or from the tool that stores shared ID passwords.
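Step 4: Power off one of the virtual machines from the list (the output of esxcli vm process list,
which shows the World ID of each running VM) using this command:
esxcli vm process kill --type=[soft,hard,force] --world-id=WorldNumber
Three power-off methods are available: soft is the most graceful, hard performs an immediate
shutdown, and force should be used only as a last resort.
Step 5: Check the virtual machine process list again to make sure the VM no longer appears: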
esxcli vm process list
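As a minimal sketch of how the World ID lookup could be scripted from an admin workstation, the Python below parses text in the layout printed by esxcli vm process list and builds the matching kill command. The sample output, the VM names and the helper names parse_vm_processes and kill_command are illustrative assumptions, not part of any VMware tool.

```python
# Illustrative sample only: the VM names, IDs and paths below are made up.
SAMPLE_OUTPUT = """\
web01
   World ID: 1234
   Process ID: 0
   VMX Cartel ID: 1233
   Display Name: web01
   Config File: /vmfs/volumes/datastore1/web01/web01.vmx

db02
   World ID: 5678
   Process ID: 0
   VMX Cartel ID: 5677
   Display Name: db02
   Config File: /vmfs/volumes/datastore1/db02/db02.vmx
"""


def parse_vm_processes(esxcli_output):
    """Map each VM display name to its World ID (hypothetical helper)."""
    vms = {}
    world_id = None
    for line in esxcli_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # header line with just the VM name, or a blank line
        key, value = key.strip(), value.strip()
        if key == "World ID":
            world_id = int(value)
        elif key == "Display Name" and world_id is not None:
            vms[value] = world_id
            world_id = None
    return vms


def kill_command(world_id, kill_type="soft"):
    """Build the esxcli command string; start with soft, escalate only if needed."""
    if kill_type not in ("soft", "hard", "force"):
        raise ValueError("kill_type must be soft, hard or force")
    return f"esxcli vm process kill --type={kill_type} --world-id={world_id}"


if __name__ == "__main__":
    vms = parse_vm_processes(SAMPLE_OUTPUT)
    print(kill_command(vms["web01"]))
```

The script only builds the command string; running it on the host remains a deliberate manual step, which is usually what you want during an incident.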
Question: What steps do you take when the vCenter service fails to start in your
infrastructure? It is running version 5.1 or 5.5. Hint: The interviewer wants to assess your
troubleshooting skills and confirm that you have worked on this issue at least once.
Answer: Validate whether each troubleshooting step below is true for your environment. Each step provides
instructions or a link to a document that helps eliminate possible causes and take corrective action as
necessary.
Note: If you perform a corrective action in any of the following steps, attempt to restart the VMware
VirtualCenter Server service.
1. Verify that the VMware VirtualCenter Server service cannot be restarted. Try to restart the service
once more and check the logs for error messages.
2. Verify that the configuration of the ODBC Data Source (DSN) used for the connection to the vCenter
Server database is correct.
Depending on your infrastructure, the SQL database may run on the vCenter server itself or on a
separate production SQL cluster.
3. Verify that there is enough free disk space on the vCenter Server. Also check the disk space on the
SQL database server, whether the database is configured with dynamic sizing, whether the database logs
have grown, etc.
Sometimes you need to contact the SQL team, who can perform advanced troubleshooting steps.
4. Verify that ports 902, 80, 8080, 8443 and 443 are not being used by any other application.
If another application, such as Microsoft Internet Information Services (IIS) (also known as Web Server
(IIS) on Windows 2008 Enterprise), Routing and Remote Access Service (RRAS), World Wide Web
Publishing Service (W3SVC), Windows Remote Management service (WS-Management) or the Citrix
Licensing Support service, is using any of these ports, vCenter Server cannot start.
If you see an error similar to one of the following when reviewing the logs, another application may be
using the ports:
Failed to create http proxy: Resource is already in use: Listen socket: :<port>
Failed to create http proxy: An attempt was made to access a socket in a way forbidden by its access permissions.
proxy failed on port <port>: Only one usage of each socket address (protocol/network address/port) is normally permitted
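A quick way to confirm a port conflict before digging through logs is to try binding the port yourself. The sketch below uses only the Python standard library; the function names port_is_free and busy_ports are hypothetical, and it assumes you run it on the vCenter Server while the vCenter service is stopped. On Linux, binding ports below 1024 needs elevated privileges, so an unprivileged run reports those ports as unavailable.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Address already in use, or access forbidden by permissions.
            return False
        return True


def busy_ports(ports, host="0.0.0.0"):
    """Return the subset of ports that cannot be bound."""
    return [port for port in ports if not port_is_free(port, host)]


if __name__ == "__main__":
    # Pass the ports from step 4; 80, 443 and 902 are shown as an example.
    print("Ports already in use:", busy_ports([80, 443, 902]))
```

Any port this reports as busy can then be matched to its owning process with the operating system's own tools.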
5. Verify the health of the database server that is being used for vCenter Server. If the hard drives are
out of space, the database transaction logs are full, or the database is heavily fragmented, vCenter
Server may not start.
Here too, involve the SQL team when advanced troubleshooting is needed.
6. Verify that the VMware VirtualCenter Server service is running with the proper credentials.
The vpxd.exe utility lets you update the database credentials (KB 1006482).
7. Verify that critical folders exist on the vCenter Server host.
8. Verify that no hardware or software changes have been made to the vCenter server that may have
caused the failure. If you have recently made any changes to the vCenter server, undo these changes
temporarily for testing purposes.
9. Before launching vCenter Server, ensure that the VMwareVCMSDS service is running.
10. Verify that vpxd.exe is present at C:\Program Files\VMware\Infrastructure\VirtualCenter
Server\vpxd.exe. If this file is not present, reinstall vCenter Server.
Your troubleshooting skills help you identify the error messages, and a web search usually turns up the
nearest solution. Logical thinking is always required.
Question: How do you perform ESXi patching in your infrastructure? Hint: The interviewer wants to
understand your ITIL skills, such as change management, along with the technical answer.
Answer: Your answer will vary with the size of your infrastructure and its Internet connectivity. Let us
cover the generic process first, followed by each case.
ITIL Process:
Get outage approval from the customer.
Submit the change record based on the outage window.
Discuss with management and the customer in meetings to get approval.
Once approved, apply the patches in the approved outage window.
Submit the artifacts showing successful completion of patching and close the change record.
Patching is performed once every 3 months [OR] once every 6 months. Before we start patching the
hosts, we need to configure Update Manager.
Open the vCenter Server and, on the Home page, click the Update Manager icon, then select Network
Connectivity.
Network Connectivity: In this section you can change the ports on which clients and ESX/ESXi servers
communicate with the Update Manager server.
Download Settings:
Direct connection to the Internet: If the Update Manager server has an Internet
connection, choose this option to download patches from the VMware repository.
Use a shared repository: This is for environments where the Update Manager server has no Internet
connection and an internal web server publishes the VMware patches.
Use proxy: Use this only if your Update Manager server must pass through a proxy server to
reach the Internet.
When you are done with the configuration, click the Apply button to save the changes. To start
downloading, click the Download Now button. Note that this downloads only an index of the patches,
not the patches themselves.
When the process is done, you can see all the available updates on the Patch Repository tab.
The next step is to create a baseline, which tells Update Manager which updates to download and
which type of updates to use for patching. The default baselines are usually sufficient, but they can be
customized as required. Go to the Baselines and Groups tab and click the Create link.
Give the baseline a name and leave the default baseline type, which is Host Patch.
If you go with the first option (Fixed, as opposed to Dynamic), future updates will not be included in
this baseline, and you will need to create a new baseline or edit this one to include them.
Choose the patch type to include in this baseline based on your ESX/ESXi hosts, then click the Finish
button to create the baseline.
Click the datacenter object in the Inventory pane. If you want to patch only one server, click that server
object. Patching all hosts in the datacenter at once is not recommended, especially in
a production environment, because your VMs will stop and the customer will be unhappy.
In the Attach Baseline window, select the baseline created earlier, then click the Attach button.
Remember that the ESX/ESXi hosts must be in maintenance mode before the actual patching begins, so
any VMs left on a host will be stopped. In a production environment, move the VMs to another host first.
At the Ready to Complete screen, click the Finish button to start the patching process.
This takes a while, because the patches must be downloaded from the VMware repository.
Your ESXi hosts may reboot a couple of times, depending on the updates.
Question: How do you troubleshoot virtual machine backup failures in your infrastructure?
Hint: The interviewer wants to understand your experience with the various backup solutions used with vCenter.
Answer: Many vendors provide backup solutions, such as Symantec NetBackup, Veeam, IBM
Tivoli, VMware Data Protector, Avamar, Commvault, etc. Vendors have also started releasing backup
solutions built specifically for hypervisors such as VMware vSphere, Microsoft Hyper-V and Citrix XenServer.
Depending on your customer's infrastructure architecture, you will fall into one of the cases below:
1) Run backup clients from within a virtual machine, performing file-level or image-level backups.
2) Run backup clients from the ESX service console, backing up virtual machines in their entirety as
files residing in the VMFS file system.
3) Back up virtual machine data by running a backup server within a virtual machine that is connected
to a tape drive or other SCSI-based backup media attached to the physical system.
4) When virtual machine files reside on shared storage, use storage-based imaging on storage such as
SAN, NAS, or iSCSI, or an independent backup server (a proxy backup server or NDMP) to back up virtual
machine files.
Troubleshooting a backup failure depends entirely on the error message received, for example
“Cannot create a quiesced snapshot”.
Separate the failure scenarios logically: if all backups are failing, the vCenter Server service is the
first suspect.
If only one VM is failing with a specific error, check the status of the services below inside the guest:
VSS: Volume Shadow Copy
SWPRV: Microsoft Software Shadow Copy Provider
Check the output of the command “VSSADMIN LIST WRITERS” and make sure no errors are reported.
Take a manual snapshot from vCenter and check for errors (if any).
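When many guests need the same check, scanning the writer list by eye gets tedious. The following is a minimal sketch that parses saved VSSADMIN LIST WRITERS output and reports writers that are not stable or that carry an error. The sample text, including the writer names and the elided IDs, is illustrative, and unhealthy_vss_writers is a hypothetical helper name.

```python
# Illustrative sample only: writer IDs are elided and the failure is made up.
SAMPLE_OUTPUT = """\
Writer name: 'System Writer'
   Writer Id: {...}
   Writer Instance Id: {...}
   State: [1] Stable
   Last error: No error

Writer name: 'SqlServerWriter'
   Writer Id: {...}
   Writer Instance Id: {...}
   State: [8] Failed
   Last error: Non-retryable error
"""


def unhealthy_vss_writers(vssadmin_output):
    """Return (name, state, last_error) for each writer that needs attention."""
    problems = []
    name = state = None
    for raw_line in vssadmin_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Writer name:"):
            name = line.split(":", 1)[1].strip().strip("'")
            state = None
        elif line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Last error:") and name is not None:
            last_error = line.split(":", 1)[1].strip()
            if "Stable" not in (state or "") or last_error != "No error":
                problems.append((name, state, last_error))
            name = None
    return problems


if __name__ == "__main__":
    for writer, state, error in unhealthy_vss_writers(SAMPLE_OUTPUT):
        print(f"{writer}: state={state}, last error={error}")
```

A writer flagged here is a reasonable starting point for explaining a quiesced snapshot failure on that one VM.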
Question: How do you troubleshoot an ESXi/ESX host that is disconnected or not responding in your
infrastructure? Hint: The interviewer wants to understand your skills with common ESX/ESXi issues
in VMware.
Answer: This is a common problem that every VMware administrator faces. Most get stuck at network
troubleshooting because they assume the disconnect was caused by connectivity problems; that approach
is correct only for some scenarios.
Let us go through the various reasons for a host disconnect that the interviewer expects to hear:
1) Verify that the ESXi host is powered on. It sounds silly, but most people forget to
check the server status through remote management cards such as iLO/DRAC/RIB, etc.
2) Verify that network connectivity exists from vCenter Server to the ESXi host by both IP and FQDN.
This is the VMware administrator's usual first suspect.
3) Verify whether the ESXi host can be reconnected and whether reconnecting resolves the issue.
Simple, and sometimes it fixes the problem.
4) Verify that the ESXi host is able to respond to vCenter Server at the correct IP address. If
vCenter Server does not receive heartbeats from the ESXi host, the host goes into a not responding state.
To verify that the correct Managed IP Address is set, see the VMware KB articles "Verifying the vCenter
Server Managed IP Address" and "ESXi 5.0 hosts are marked as Not Responding 60 seconds after being
added to vCenter Server" (a known issue).
5) Check the known issue "ESXi/ESX host disconnects from vCenter Server after adding or connecting it
to the inventory" (VMware KB2040630).
6) Verify that you can connect from vCenter Server to the ESXi host on TCP/UDP port 902. If the host
was upgraded from version 2.x and you cannot connect on port 902, then verify that you can connect on
port 905.
7) Verify whether restarting the ESXi management agents resolves the issue. You have probably run this
command many times:
services.sh restart
8) ESXi hosts can disconnect from vCenter Server because of underlying storage issues. This one is
complex to explain, but to fix these issues you should know the pain points along the path from the HBA
card of the ESXi server to the LUN/disk on the storage array.
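The port 902 check in step 6 can be scripted from the vCenter Server side. This is a minimal sketch using the Python standard library; can_connect is a hypothetical helper name, the host name in the usage line is a placeholder, and the check covers TCP only, so UDP heartbeat traffic on 902 still needs a separate look (for example, a packet capture).

```python
import socket


def can_connect(host, port=902, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    esxi_host = "esxi01.example.com"  # placeholder, replace with your host
    for port in (902, 905):
        print(port, "reachable" if can_connect(esxi_host, port) else "unreachable")
```

Testing both the FQDN and the IP address separately also exercises the name resolution check from step 2.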
Symptoms:
Cannot power on the virtual machine
Powering on the virtual machine fails
You see errors similar to:
Failed to power on VM
Could not power on VM: Admission check failed for memory resource See the VMware ESX Resource Management Guide for information on resource management settings.
Group vm.3582: Invalid memory allocation parameters for VM vmm0:New_Virtual_Machine. (min: 524288, max: -1, minLimit: -1, shares: -1, units: pages)
Group vm.13327: Cannot admit VM: Memory admission check failed. Requested reservation: 311199 pages
You see the error: Unsupported and/or invalid disk type
In the Events tab on the ESXi hosts or vCenter Server, you see the error:
Module DevicePowerOn power on failed.
Unable to create virtual SCSI device for scsi0:0, ‘/vmfs/volumes/datastorename/VirtualMachineHome/VirtualMachineDisk.vmdk’
Failed to open disk scsi0:0: Unsupported and/or invalid disk type 7. Did you forget to import the disk first?
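The memory admission messages above report sizes in pages, which is hard to compare with a reservation configured in MB. As a small worked conversion, assuming 4 KB pages (an assumption about the units, not something the messages state), the figures translate as follows; pages_to_mib is a hypothetical helper name.

```python
PAGE_SIZE_BYTES = 4096  # assumption: the "pages" in the messages are 4 KB each


def pages_to_mib(pages):
    """Convert a page count from an admission-check message to MiB."""
    return pages * PAGE_SIZE_BYTES / (1024 * 1024)


if __name__ == "__main__":
    for label, pages in (("min", 524288), ("requested reservation", 311199)):
        print(f"{label}: {pages} pages = {pages_to_mib(pages):.1f} MiB")
```

Under that assumption, min: 524288 pages corresponds to a 2048 MiB reservation, which can then be compared with the free memory on the host or the limit on the resource pool.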
Troubleshooting Steps:
A few direct answers for the symptoms above:
1. The "Unsupported and/or invalid disk type" error occurs if a virtual machine meant for VMware hosted
products such as VMware Workstation, VMware Player or VMware Fusion is powered on on an ESX/ESXi
host. The underlying format used to store virtual machines on hosted products differs from the format
used on ESX/ESXi hosts. The memory admission errors have other causes: a user or Distributed Resource
Scheduler (DRS) has assigned limited resources to a resource pool, the virtual machine's host does not
have enough memory for the required reservation, or the virtual machine's resource usage does not
match its resource settings.
2. Creating a new power-on task may fail if another task for the virtual machine or other
component is already in progress, and multiple concurrent tasks on the object are not
permitted.
3. A virtual machine may fail to power on if licensing requirements are not met. In that case you see
the error message “Cannot Power on Virtual Machines, not enough licenses installed to perform the
operation”. The fix is simple: add the proper licenses for ESXi and vCenter Server.
4. The virtual machine may be configured to reserve physical memory on the host, but the host
memory is over-committed and the required memory is unavailable
5. The virtual machine may be starting in a VMware High Availability cluster with strict admission
control enabled, and there are insufficient resources to guarantee failover for all virtual
machines.
6. A file required for starting the virtual machine, such as a virtual disk or swap file, may be
unavailable or missing.
7. The virtual machine may have been previously suspended and making use of CPU features
which are unavailable or incompatible with the CPU features available on this host. The virtual
machine cannot be started without the required features.
8. The virtual machine may require both a VT-capable CPU and the VT feature to be enabled in the
host system’s BIOS. This is true for all 64-bit virtual machines. If the VT feature is unavailable,
the virtual machine may produce the message msg.cpuid.noLongmode.
9. The virtual machine may require another CPU feature which is unavailable on this host. The
virtual machine may produce a message similar to msg.cpuid.<FeatureName>, identifying the
specific feature it has been configured to require. Move the virtual machine back to the host
which has the required CPU features, or edit the virtual machine’s configuration to remove the
requirement.
10. The virtual machine may start, but quickly fail with an error during startup. Review the contents
of the log file in the virtual machine’s directory for any errors or warnings, and search the
Knowledge Base for the error or warning. Base your troubleshooting on the specific messages
seen in the logs – this is the best method to fix this problem
11. If the virtual machine does successfully power on, but the guest OS doesn’t start correctly, there
may be an incompatibility between the virtual hardware and drivers within the guest OS. For
example, a missing SCSI driver may be required for booting.
12. If the guest OS, or a driver or application within the virtual machine experiences a problem
during startup, the guest OS may become unresponsive.
Question: How do you troubleshoot vMotion failures for a virtual machine?
Hint: The interviewer wants to understand your skills with common virtual machine issues in VMware.
Answer: How many times have you seen the messages below in your vCenter Server? vMotion failing at 14%,
10%, 82%, or 90 to 95%.
You are attempting a vMotion migration between two ESX/ESXi hosts, and the vMotion task reaches
14%, then times out with this error message:
The vMotion migrations failed because the ESX hosts were not able to connect over the vMotion
network. Check the vMotion network settings and physical network configuration.
vMotion fails at 90-95%: you cannot perform a vMotion, the vMotion times out, and in vCenter Server
you see the error:
Operation timed out
vMotion stops at 90%, then fails with the error:
a general system error occurred: failed to resume on destination
vMotion fails at 10%: the vMotion times out, and VirtualCenter/vCenter Server reports these errors:
A general system error occurred: Failed waiting for data. Error 16. Invalid argument
A general system error occurred: failed to look up VMotion destination resource pool object
Enough error messages; let us see how to answer this question in the interview.
The rule of thumb for vMotion failures is that if the operation fails below 15%, you can assume
a network/configuration issue.
Tell the interviewer that the settings below should be verified for any vMotion failure:
a) Ensure that vMotion is enabled on all ESX/ESXi hosts.
b) Determine if resetting the Migrate.Enabled setting on both the source and destination ESX or
ESXi hosts addresses the vMotion failure
c) Verify that VMkernel network connectivity exists using vmkping
d) Verify that VMkernel networking configuration is valid
e) Verify that the virtual machine is not configured to use a device that is not valid on the target
host
f) If Jumbo Frames are enabled (MTU of 9000), run vmkping with a payload of 8972 bytes (9000 minus
the 20-byte IP header and the 8-byte ICMP header): vmkping -d -s 8972 <destinationIPaddress>.
You may experience problems if the trunk between two physical switches has been
misconfigured to an MTU of 1500
g) Verify that Name Resolution is valid on the host
h) Verify that Console OS network connectivity exists
i) Verify if the ESXi/ESX host can be reconnected or if reconnecting the ESX/ESXi host resolves the
issue
j) Verify that the required disk space is available
k) Verify that time is synchronized across environment
l) Verify that valid limits are set for the virtual machine being vMotioned
m) Verify that hostd is not spiking the console
n) This issue may be caused by SAN configuration. Specifically, this issue may occur if zoning is set
up differently on different servers in the same cluster
o) Verify and ensure that the log.rotateSize parameter in the virtual machine’s configuration file is
not set to a very low value
p) If the virtual machine being vMotioned is a 64-bit virtual machine, verify that the VT option is
enabled on both of your ESX hosts.
q) Restart the host management agents
r) Verify that host management agents are not spiking the Service Console (ESX only)
s) Verify that there are no issues with the shared storage.
Also check the vMotion network itself:
a) Check for IP address conflicts on the vMotion network. Each host in the cluster should have a
vMotion vmknic with a unique IP address
b) Check for packet loss over the vMotion network. Have the source host ping (vmkping) the
destination host's vMotion vmknic IP address for the duration of the vMotion.
c) Check that each vMotion VMkernel port group has the same security settings. A security
mismatch causes a vMotion operation to fail. For example, a failure occurs if the source VMkernel
port group is set to allow promiscuous mode and the destination VMkernel port group is set to
disallow promiscuous mode.
d) Check for connectivity between the two hosts. Have the source host ping (vmkping) the
destination host's vMotion vmknic IP address
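The jumbo-frame payload arithmetic from the checklist (MTU minus the IP and ICMP headers) generalises to any MTU. A minimal sketch follows; the helper names vmkping_payload and vmkping_command are hypothetical, the IP address in the usage line is a placeholder, and the script only prints the command to run on the source host.

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8  # ICMP echo header


def vmkping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented frame for this MTU."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(destination_ip, mtu=9000):
    """Build the vmkping command with the don't-fragment flag (-d) set."""
    return f"vmkping -d -s {vmkping_payload(mtu)} {destination_ip}"


if __name__ == "__main__":
    print(vmkping_command("10.0.0.2"))            # jumbo frames, MTU 9000
    print(vmkping_command("10.0.0.2", mtu=1500))  # standard MTU
```

If the 8972-byte ping fails while the 1472-byte ping succeeds, something in the path (often a switch trunk) is not passing jumbo frames.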
Question: How do you troubleshoot ESXi host PSOD problems? Most Windows administrators
are familiar with the Blue Screen of Death; it is time to learn a new term, PSOD: Purple Screen of Death
(VMware seems to be trying to be unique by not using the color blue).
I have ESXi 5.5 installed on an HP ProLiant DL180 G6 with 8 x Intel(R) Xeon(R) CPU
E5540 @ 2.53GHz and 24 GB RAM. Recently the server has crashed four times, showing the Purple Screen of
Death. Each time this happens, all of the virtual machines on the server stop and stay down until I
restart the server.
Answer: Start with the definition of a PSOD to impress the interviewer, followed by the
troubleshooting steps.
“A Purple Screen of Death (PSOD) is a diagnostic screen with white type on a purple background that is
displayed when the VMkernel of an ESX/ESXi host experiences a critical error, becomes inoperative and
terminates any virtual machines that are running”
You need to highlight the important step of capturing the log file information after the PSOD has occurred.
To resolve this issue, extract the log file from a vmkernel-zdump file using a command-line utility on the
ESX or ESXi host. This utility differs for different versions of ESX or ESXi.
For ESXi 3.5, ESXi/ESX 4.x and ESXi 5.x, use the esxcfg-dumppart utility to extract the log file from a
vmkernel-zdump file:
# esxcfg-dumppart -L vmkernel-zdump-filename
Note: The file name created for the log in this example is vmkernel-log.1. If another file with the same
name already exists, the new file is created with the number suffix incremented.
Most of the time it is a hardware issue, and you need to open a case with the hardware vendor, in this
case HP. Based on the findings, replace the hardware components or upgrade the firmware as
suggested by the vendor, following the ITIL change management process.
In some cases the problem lies with software installed on the ESXi server, such as additional agents for
software and hardware monitoring, or additional VIBs added for storage, etc.
Finally, if you want to become expert enough to analyze the logs on your own, there is a good KB article
from VMware: KB1004250. Interviewers rarely ask you to debug this issue; they want to check your
understanding of the procedure followed in case of a PSOD.
Question: How do you troubleshoot P2V failure issues in your infrastructure? (P2V = Physical to Virtual)
Answer: There is a lot of discussion about which physical servers (Exchange, SQL, clusters, etc.) are good
candidates for a VMware infrastructure.
The interviewer will also be interested in hearing how you judge whether a physical server is a good
candidate for virtualization.
The answer: first analyze 3 months of data from a performance reporting tool. If server utilization is
80% or higher for CPU and memory, that physical server is most likely not well suited to a VMware
infrastructure.
If server utilization is below 70%, you can recommend it for a VMware infrastructure. Suppose the
server is selected for P2V, you start the process (hopefully with pre- and post-P2V checklists) and you run
into an issue. Here is a good checklist for fixing P2V problems.
✔ To eliminate permission issues, always use the local administrator account instead of a domain
account.
✔ Note: Disable UAC for Windows Vista, Windows 7, or Windows 8 prior to converting.
✔ If you are unable to convert directly to an ESX host in vCenter Server 5.0, see vCenter
Converter Standalone 5.0 errors when an ESXi 5.0 host is selected as a destination. Check
KB2012310
✔ VMware vCenter Converter Standalone has many more options available to customize your
conversion. If you are having issues using the Converter Plug-in inside vCenter Server, consider
trying the Standalone version.
✔ If a conversion fails using the exact size of hard disks, decrease the size of the disks by at least
1MB. This forces VMware Converter to do a file level copy instead of a block level copy, which
can be more successful if there are errors with the volume or if there are file-locking issues.
✔ Make sure there is at least 500MB of free space on the machine being converted. VMware
Converter requires this space to copy data.
✔ Shut down any unnecessary services, such as SQL, antivirus programs, and firewalls. These
services can cause issues during conversion.
✔ Run a check disk on the volume before running a conversion as errors on disk volumes can cause
VMware Converter to fail.
✔ Do not install VMware Tools during the conversion. Install VMware Tools after you confirm that
the conversion was successful.
✔ Do not customize the new virtual machine before conversion.
✔ Ensure that these services are enabled:
✔ Workstation Service
✔ Server Service
✔ TCP/IP NetBIOS Helper Service
✔ Volume Shadow Copy Service
✔ Your answer should also cover the logs, which demonstrates your real-world experience
VMware Converter logs: There are also several ways to diagnose issues by viewing the VMware
Converter logs. The logs can contain information that is not apparent from error messages. In newer
versions of VMware Converter, you can use the Export Log Data button. Otherwise, logs are typically
stored in these directories:
Windows:
Note: In order to access this location in Windows Vista, 7, or 2008, you may need to go into the folder
options and ensure that Show Hidden Files is enabled and that Hide Protected Operating System Files is
disabled.
C:\WINDOWS\Temp\vmware-converter
C:\WINDOWS\Temp\vmware-temp
Linux:
$HOME/.vmware/VMware vCenter Converter Standalone/Logs
/var/log/vmware-vcenter-converter-standalone
Question: As part of a data center network device upgrade/change, someone changed the vCenter IP
address. How do you tackle this scenario as a VMware administrator? What technical plan will
you follow for this change record? (ITIL Process) Hint: The interviewer is looking at your technical
direction/plan along with ITIL change management procedures.
Answer: We may think that changing the IP address is an easy job: go to the vCenter VM console (in most
cases) [OR] the remote console for a physical server and modify the network adapter settings. But what
happens to your ESXi servers, NSX VMs and Update Manager? Will they communicate with your vCenter
server on the new IP directly, without any modification? Here is the detailed technical plan to answer this
question.
1. Create backups of the vCenter Server and the underlying SQL database as the backout plan
2. Set DRS to manual mode to prevent VMs from moving around
3. Identify the ESXi host running the vCenter VM and connect directly to that host with the
vSphere Client. Do not forget that vCenter is going to disconnect, and you will no longer be able
to manage the VM through it
4. Close any sessions you have open to the vCenter Server (Web Client, vSphere Client sessions)
5. Open a console window to the vCenter Server by way of the ESXi host.
6. Stop all VMware-related services
7. Change the IPv4 address and IPv4 gateway as per the new network configuration
8. Put DRS back to fully automated (optional, based on your setup)
9. Uninstall the Update Manager software from the VM (sometimes it is installed on a server other
than vCenter)
10. Install Update Manager and point it to the new vCenter Server IP address
a. NOTE: There is an easier method to update the vCenter IP address in Update Manager via the
command line (we will discuss it in future posts)
11. Update the vCenter Managed IP Address to the new address
12. NSX requires your attention, as vCenter re-registration is a complex procedure. Leave this to the
network specialists to provide a technical plan
13. Disconnect each host from vCenter to flush out the database entry
14. Reconnect each ESXi host so that it uses the new vCenter IP address for communication and agent
installation
Question: Can you describe a major issue that you fixed in a VMware infrastructure? How did you
handle the situation? Can you list the steps you followed to fix the problem?
Answer: It is always good to pick a storage-related issue, since these have high impact on both physical
and VMware infrastructures. I am going to take a scenario in which several VMs are not responding,
including the vCenter Server, which also runs as a virtual machine.
You can start the narration like this: you got a call from the helpdesk saying that some VMs were
inaccessible in the vCenter server. When you connected to the vCenter server, you could not use the
consoles of those VMs because they were stuck at a black screen. From the screenshot it was clear that
CPU and memory usage was high on a couple of hosts in the cluster. To understand the reason, you ran the
“esxtop” command on the ESXi hosts to see which VM or process was driving the CPU and memory
utilization. If you are new to esxtop, learn the usage of this command first. Because the issue was
limited to a few virtual machines, you downloaded the vmware.log file to understand the issue from the
VM's point of view, and you noticed latency while downloading the log files.
While this was going on, the vCenter VM became unresponsive, which made the situation worse.
esxtop and vmware.log did not yield much information, and management wanted you to fix the vCenter VM
first. In addition, the SQL database server holding the vCenter and Update Manager databases also ran as
a virtual machine, and you had installed PernixData (acquired by Nutanix) to improve storage I/O
performance for the VMs. These factors make the situation complex, and the diagram below helps you
understand the scenario better.
Now let us talk through the resolution step by step and how you traced the issue to storage.
The helpdesk reported that another set of VMs was unreachable on the network and hung at a
black screen when the VM console was opened through a direct connection to the ESXi server. This meant
the issue was growing and impacting more business functions such as SharePoint, Exchange, Citrix, finance
applications, etc. Management declared a Major Incident (MI) and asked you to join the
bridge call along with other technical teams such as storage, network and the incident escalation
managers. This was going to be a collaborative effort rather than something the VMware team would fix
alone. The latency while downloading the log file and the black screen on the VM consoles drew your
attention to storage. esxtop has specific views for storage-related problems. For this escalation you
need to know the KAVG and DAVG values and their thresholds, as listed below.
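As a rough aid for reading those two counters, the sketch below flags values above commonly cited rule-of-thumb limits: roughly 25 ms for DAVG (latency at the device, that is the path from the HBA to the array) and roughly 2 ms for KAVG (time spent in the VMkernel). These limits are an assumption based on general community guidance, not values taken from this text, and latency_suspects is a hypothetical helper name.

```python
# Assumption: rule-of-thumb limits in milliseconds; tune them for your array.
DAVG_LIMIT_MS = 25.0
KAVG_LIMIT_MS = 2.0


def latency_suspects(davg_ms, kavg_ms):
    """Return the layers worth investigating for one esxtop sample."""
    suspects = []
    if davg_ms > DAVG_LIMIT_MS:
        suspects.append("device side: HBA, fabric or storage array (DAVG)")
    if kavg_ms > KAVG_LIMIT_MS:
        suspects.append("host side: VMkernel queuing (KAVG)")
    return suspects


if __name__ == "__main__":
    # Hypothetical readings taken from the esxtop disk views.
    for davg, kavg in ((4.0, 0.1), (60.0, 0.3), (8.0, 15.0)):
        print(davg, kavg, latency_suspects(davg, kavg) or ["within limits"])
```

High KAVG with normal DAVG points at the host rather than the array, which matters in a case like this one where the array team reports healthy statistics.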
Your troubleshooting made it clear that there was an ongoing storage problem in the
infrastructure: it showed in esxtop, and the VM consoles were at a black screen. You asked the storage
team to validate the VNX configuration and performance and to check whether they saw any alerts or
issues on their side. They came back quickly with an update that all statistics looked good for read and
write operations. Management asked what the problem was; you were confident that there was a
storage issue, even though it was not visible on the VNX side. So you asked the storage team for
recommendations, and they came back with a plan to switch the storage
processor from A to B. They executed this plan with management approval, but nothing changed.
Next they wanted to reboot the storage processors one by one, which carries a major risk if a
processor does not come back online after the reboot. Management agreed, since all the production VMs
were down at that point. The storage team finished rebooting both processors successfully within 1 hour,
but that did not change the status of the VMs. At this point the storage team had performed all of its
troubleshooting, including rebooting SPA and SPB, and management wanted to know the next action plan
from the VMware side.
You were confident that there was a storage problem, yet the VNX looked good, so the problem had to be
in the other storage layer: PernixData. On investigating the PernixData process further, you
found a lot of pending I/O jobs waiting on one of the ESXi hosts. You killed the pernix
process on that specific host, which brought the services back to a normal state. Once the VMs were back
to normal, you recommended that management approve a clean reboot of the ESXi hosts to avoid any
immediate risk of another production outage. They agreed, a clean reboot was performed on each ESXi
host, and the conference call was closed after the end users validated the applications (UAT, User
Acceptance Test).
Resolution: PernixData improves storage I/O performance by offloading that service to the ESXi
host's local cache; in this situation, however, it was holding a large number of jobs, which caused the
outage. Killing the pernix process from the ESXi host console, followed by a clean ESXi reboot, brought
the environment back to normal. Hope this helps you answer interview questions better; be social and
share the knowledge.