Today's topic of the VCP6-DCV Study Guide touches on troubleshooting. When something goes wrong and you lose connectivity to your application, you will most likely troubleshoot the underlying VM first and the network second, but don't forget the storage. When storage is under pressure, the whole infrastructure slows down and you might experience disconnections at the VM/application level. VCP6-DCV Objective 7.2 – Troubleshoot vSphere Storage and Network Issues is today's lesson.
You can also check the vSphere 6 page, where you'll find how-tos, news and videos concerning vSphere 6.x, and last but not least my Free Tools page with the most popular tools for VMware and Microsoft. Daily updates of the blog take time, but the goal is to provide a guide that is helpful for the community and for folks studying toward the VCP6-DCV certification exam. If you find one of these posts useful for your preparation, just share it. :-)
vSphere Knowledge
- Verify network configuration
- Verify storage configuration
- Troubleshoot common storage issues
- Troubleshoot common network issues
- Verify a given virtual machine is configured with the correct network resources
- Troubleshoot virtual switch and port group configuration issues
- Troubleshoot physical network adapter configuration issues
- Troubleshoot VMFS metadata consistency
- Identify Storage I/O constraints
- Monitor/Troubleshoot Storage Distributed Resource Scheduler (SDRS) issues
—————————————————————————————————–
Verify network configuration
Start from one end and work your way through the chain: physical switch > host uplinks > vSwitch/vDS > port groups > VM (or go in the opposite direction).
- Check the vNIC status – connected/disconnected
- Check the networking config inside the Guest OS – yes, a bad network configuration inside the VM might also be the issue.
- Verify physical switch config
- Check the vSwitch or vDS config
- ESXi host network (uplinks)
- Guest OS config
Check for disabled/inactive adapters or other unused hardware (especially if the Guest OS has been P2V'd).
In a Windows VM do this:
Click Start > Run > devmgmt.msc > click the + next to Network adapters > check that the adapter is not disabled or missing.
You can also check the network config – IP address, netmask, default gateway and DNS servers – and make sure this information is correct.
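A quick way to dump all of this at once from inside a Windows guest is shown below; compare the output with what the port group and your IP plan expect.
rem show adapter state, IP address, netmask, gateway and DNS servers for all adapters
ipconfig /all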
- If a VM was P2V'd – check that there are no “ghosted adapters”. To check that:
On your VM go to Start > Run > cmd > Enter, and type:
set devmgr_show_nonpresent_devices=1
While still in the command prompt window type:
devmgmt.msc
which opens Device Manager. In the menu go to View > Show Hidden Devices.
You should then see the devices that are marked as ghosted devices – they are grayed out. Those devices can be safely removed from Device Manager.
- Check the IP stack – It has happened to me several times that the IP stack of a VM was corrupted. The VM had intermittent network connectivity; everything seemed to be OK, but wasn't. You can renew the IP configuration by entering this:
ipconfig /renew
For Linux:
dhclient -r
dhclient eth0
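On the Windows side, if renewing the lease does not help, the usual next step is to flush DNS and rebuild the TCP/IP and Winsock stacks. This is a generic sketch, not specific to vSphere, and the two reset commands require a reboot afterwards:
rem clear the local DNS resolver cache
ipconfig /flushdns
rem reset the TCP/IP stack to defaults (reboot required)
netsh int ip reset
rem reset the Winsock catalog (reboot required)
netsh winsock reset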
Verify storage configuration
Check the vSphere Storage documentation for the basic concepts, iSCSI, etc.
I've done a few posts on configuring iSCSI with vSphere (not particularly related to vSphere 6, but they are step-by-steps):
- How to configure FreeNAS 8 for iSCSI and connect to ESX(i)
- How to configure ESXi 5 for iSCSI connection to Drobo
- Configuring iSCSI port binding with multiple NICs in one vSwitch for VMware ESXi 5.x and 6.0.x
Also check the Teaming and Failover Policy section in the vSphere Networking guide.
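To verify what the host actually sees from the storage side, a few esxcli commands from the ESXi shell are handy. This is just a minimal set of checks; adapt the adapter and device names to your environment:
# list storage adapters (FC HBAs, software iSCSI, etc.)
esxcli storage core adapter list
# list iSCSI adapters and their state
esxcli iscsi adapter list
# list all storage paths and their state (active/dead)
esxcli storage core path list
# show the multipathing (NMP) policy per device
esxcli storage nmp device list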
Troubleshoot common storage issues
Storage issues – Check that the virtual machine has no underlying storage issues and that it is not experiencing resource contention, as this might also result in networking issues with the virtual machine. You can do this by logging in to ESX/ESXi or vCenter Server using the VI/vSphere Client and opening the virtual machine console.
A good doc is the Troubleshooting Storage guide (p.55 – p.70), which talks about:
- Resolving SAN Storage Display Problems – page 56
- Resolving SAN Performance Problems on page 57
- Virtual Machines with RDMs Need to Ignore SCSI INQUIRY Cache on page 62
- Software iSCSI Adapter Is Enabled When Not Needed on page 62
- Failure to Mount NFS Datastores on page 63
- VMkernel Log Files Contain SCSI Sense Codes on page 63
- Troubleshooting Storage Adapters on page 64
- Checking Metadata Consistency with VOMA on page 64
- Troubleshooting Flash Devices on page 66
- Troubleshooting Virtual SAN on page 69
- Troubleshooting Virtual Volumes on page 70
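Two quick checks from the ESXi shell that help with several of the items above (assuming the default log location):
# list NFS datastores and whether they are mounted/accessible
esxcli storage nfs list
# look for SCSI sense codes and other storage errors in the VMkernel log
grep -i scsi /var/log/vmkernel.log | tail -n 50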
Troubleshoot common network issues
Again, networking can be tricky to troubleshoot, but choosing one end to start with should help. Another tip is to check the load balancing policies when more than one NIC is involved.
Verify that the virtual machine is configured with two vNICs to eliminate a NIC or a physical configuration issue. To isolate a possible issue:
- If the load balancing policy is set to Default Virtual Port ID at the vSwitch or vDS level:
- Leave one vNIC connected with one uplink on the vSwitch or vDS, then try different vNIC and pNIC combinations until you determine with which combination the virtual machine loses connectivity.
- If the load balancing policy is set to IP Hash:
- Ensure the physical switch ports are configured as port-channel. For more information on verifying the configuration on the physical switch, see Sample configuration of EtherChannel / Link aggregation with ESX/ESXi and Cisco/HP switches (1004048).
- Shut down all but one of the physical ports the NICs are connected to, and toggle this between all the ports by keeping only one port connected at a time. Take note of the port/NIC combination where the virtual machines lose network connectivity.
- Load balancing and failover policies – configure the VM with 2 vNICs to eliminate physical NIC problems. Check esxtop using the n option (for networking) to see which pNIC the virtual machine is using. Try shutting down the ports on the physical switch one at a time to determine where the virtual machine loses network connectivity.
- Check the vNIC's connection – check the status of the vNIC (connected/disconnected) at the VM level AND also the NIC inside the Guest OS (enabled/disabled).
Check more in this KB: Troubleshooting virtual machine network connection issues (1003893)
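A few host-side commands can also help with the isolation steps above; the vmk0 interface and the IP address below are just placeholders:
# list standard vSwitches with their uplinks and port groups
esxcli network vswitch standard list
# test connectivity through a specific VMkernel interface
vmkping -I vmk0 192.168.1.1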
Verify a given virtual machine is configured with the correct network resources
I've already touched on a few of these areas above. All or most of the possible problems are covered in this KB – KB 1003893.
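From the ESXi shell you can also verify which port group and uplink a running VM is actually using. The world ID in the second command is a placeholder taken from the output of the first one:
# list running VMs that have active network ports, with their world IDs
esxcli network vm list
# show port group, vSwitch and active uplink for that VM (replace 12345 with the world ID)
esxcli network vm port list -w 12345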
Troubleshoot virtual switch and port group configuration issues
- Same name for port groups – Make sure that the port group name(s) associated with the virtual machine's network adapter(s) exist in your vSwitch or vSphere Distributed Switch and are spelled correctly. If this isn't right for a port group, you will usually have connectivity problems.
- VLANs – check the VLAN IDs on each standard switch.
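To compare port group names and VLAN IDs across hosts on standard switches, the following command works (distributed switch port groups are checked from the vSphere Web Client instead):
# list all port groups on the standard vSwitches, including their VLAN ID
esxcli network vswitch standard portgroup list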
Troubleshoot physical network adapter configuration issues
Physical switch configuration is usually simple if “trunk” ports are used. Some issues can appear if the uplinks (pNICs) are not set to auto-negotiate (the default) but to a fixed speed/duplex that does not match the physical switch port settings, although this is fairly rare.
If beacon probing is used, make sure that you have more than two pNICs in the team.
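To check link state, speed, duplex and auto-negotiation of the uplinks from the ESXi shell (vmnic0 is just an example):
# link status, configured speed and duplex of all physical NICs
esxcli network nic list
# detailed settings of a single physical NIC, including auto-negotiation
esxcli network nic get -n vmnic0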
VMware KBs:
- 1005577 – What is beacon probing?
- 1004048 – Sample configuration of EtherChannel / Link Aggregation Control Protocol (LACP) with ESXi/ESX and Cisco/HP switches
- 1001938 – Host requirements for link aggregation for ESXi and ESX
Troubleshoot VMFS metadata consistency
There is a VMware KB which explains what to do if:
- You have problems accessing certain files on a VMFS datastore.
- You cannot modify or erase files on a VMFS datastore.
- Attempting to read files on a VMFS datastore may fail with the error:
invalid argument
You can run a file system metadata check by using VOMA.
Check it out – Using vSphere On-disk Metadata Analyzer (VOMA) to check VMFS metadata consistency (2036767)
Quote:
To perform a VOMA check on a VMFS datastore and send the results to a specific log file, the command syntax is:
voma -m vmfs -d /vmfs/devices/disks/naa.00000000000000000000000000:1 -s /tmp/analysis.txt
where naa.00000000000000000000000000:1 is replaced with the LUN NAA ID and partition to be checked. Note the “:1” at the end. This is the partition number containing the datastore and must be specified. See the note below. As an advisory, if you run voma more than once, add the NAA ID and a time stamp to the output log file name, e.g.: -s /tmp/naa.00000000000000000000000000:1_analysis_<<hhmm>>.txt
Note: VOMA must be run against the partition and not the device.
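To find the NAA ID and the partition number that the voma command expects, the following commands can be used (the NAA ID below is just the placeholder from the quote above):
# map VMFS datastores to their backing device and partition number
esxcli storage vmfs extent list
# show the partition table of a device to confirm the partition number
partedUtil getptbl /vmfs/devices/disks/naa.00000000000000000000000000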
Identify Storage I/O constraints
Again, a good KB article to check is VMware KB 1008205.
Per LUN basis – To monitor storage performance on a per-LUN basis:
- Start esxtop > Press u to switch to disk view (LUN mode).
- Press f to modify the fields that are displayed.
- Press b, c, f, and h to toggle the fields and press Enter.
- Press s and then 2 to alter the update time to every 2 seconds and press Enter.
Per HBA – To monitor storage performance on a per-HBA basis:
- Start esxtop by typing esxtop > Press d to switch to disk view (HBA mode).
- To view the entire Device name, press SHIFT + L and enter 36 in Change the name field size.
- Press f to modify the fields that are displayed.
- Press b, c, d, e, h, and j to toggle the fields and press Enter.
- Press s and then 2 to alter the update time to every 2 seconds and press Enter.
Then the metrics to check out:
GAVG, DAVG, KAVG – latency stats.
You should check this community thread, from which I quote the main part because I think it's very good work done by the community:
Latency values are reported for all IOs, read IOs and all write IOs. All values are averages over the measurement interval.
All IOs: KAVG/cmd, DAVG/cmd, GAVG/cmd, QAVG/cmd
Read IOs: KAVG/rd, DAVG/rd, GAVG/rd, QAVG/rd
Write IOs: KAVG/wr, DAVG/wr, GAVG/wr, QAVG/wr
GAVG – This is the round-trip latency that the guest sees for all IO requests sent to the virtual storage device. GAVG should be close to the R metric in the figure.
Q: What is the relationship between GAVG, KAVG and DAVG?
A: GAVG = KAVG + DAVG
KAVG – These counters track the latencies due to the ESX Kernel's command.
The KAVG value should be very small in comparison to the DAVG value and should be close to zero. When there is a lot of queuing in ESX, KAVG can be as high as, or even higher than, DAVG. If this happens, please check the queue statistics, which will be discussed next.
DAVG – This is the latency seen at the device driver level. It includes the roundtrip time between the HBA and the storage.
DAVG is a good indicator of performance of the backend storage. If IO latencies are suspected to be causing performance problems, DAVG should be examined. Compare IO latencies with corresponding data from the storage array. If they are close, check the array for misconfiguration or faults. If not, compare DAVG with corresponding data from points in between the array and the ESX Server, e.g., FC switches. If this intermediate data also matches DAVG values, it is likely that the storage is under-configured for the application. Adding disk spindles or changing the RAID level may help in such cases.
QAVG – The average queue latency. QAVG is part of KAVG.
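If you prefer to capture these counters for offline analysis instead of watching them live, esxtop can also run in batch mode; the interval, number of samples and output path below are only examples:
# capture 30 samples at 2-second intervals into a CSV file for later analysis
esxtop -b -d 2 -n 30 > /tmp/esxtop-capture.csv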
Monitor/Troubleshoot Storage Distributed Resource Scheduler (SDRS) issues
Even when Storage DRS is enabled for a datastore cluster, it might be disabled on some virtual disks in the datastore cluster.
Check the vSphere ESXi and vCenter Server troubleshooting guide, p.47 and p.52.
Scenarios like the one below are described there:
Storage DRS generates an alarm to indicate that it cannot operate on the datastore.
Problem – Storage DRS generates an event and an alarm and Storage DRS cannot operate.
Cause – The following scenarios can cause vCenter Server to disable Storage DRS for a datastore.
- The datastore is shared across multiple data centers – Storage DRS is not supported on datastores that are shared across multiple data centers. This configuration can occur when a host in one data center mounts a datastore in another data center, or when a host using the datastore is moved to a different data center. When a datastore is shared across multiple data centers, Storage DRS I/O load balancing is disabled for the entire datastore cluster. However, Storage DRS space balancing remains active for all datastores in the datastore cluster that are not shared across data centers.
- The datastore is connected to an unsupported host – Storage DRS is not supported on ESX/ESXi 4.1 and earlier hosts.
- The datastore is connected to a host that is not running Storage I/O Control.
Solution – The datastore must be visible in only one data center. Move the hosts to the same data center or unmount the datastore from hosts that reside in other data centers.
- Ensure that all hosts associated with the datastore cluster are ESXi 5.0 or later.
- Ensure that all hosts associated with the datastore cluster have Storage I/O Control enabled.
Tools
- vSphere Networking Guide
- vSphere Storage Guide
- vSphere Troubleshooting Guide
- vSphere Server and Host Management Guide
- vSphere Client / vSphere Web Client