Monitoring ESXi Hosts - A deeper look at the what and the why

Monitoring ESXi hosts is something most IT units have to do these days. Monitoring hosts for the number and details of running guests, and for the usual CPU, memory and NIC usage counters, is well understood, but it can be hard to look deeper at performance and to spot possible issues. I think most people viewing this forum are familiar with Orion's virtualisation monitoring, but if not, have a look at the Virtualization page in the online demo for an example of the guest monitoring and the information returned. With SAM component monitors it is possible to get an even deeper look at host operation.

ESXi provides a large number of performance counters out of the box to help with getting an in-depth look at how it is operating. The number of counters is a bit of a blessing and a curse: you know the information is there and available, but how do you know what to monitor? The provided counters are spread across a wide range of categories. There are the well-known CPU and memory counters, and while CPU and memory usage percentages are important, ballooning and ready times also have to be monitored. Storage performance has become an area of IT in its own right, and when you add in virtualisation it becomes much more complex. Storage performance can determine a company's experience of virtualisation. In all, there is quite a bit to monitoring ESXi performance and a lot to gain in getting it right.

There is a lot of discussion on ESXi performance counters, and while particular counters are well described, it is hard to know which are the most effective to monitor in order to watch for problems and to try to predict trends. One of the best guides on ESXi performance is the SQL Server best practices guide. It focuses on SQL Server but gives some good detail on specific performance counters. It also has a great quote:

"as with all in-guest measurement tools, time-based performance measurements are subject to error. The degree to which the measurements are inaccurate depends on the total load of the VMware ESX host."

This means that in virtual environments we have to monitor the virtual host as a whole to get real insight into the performance of guests. Fortunately, the same guide lists the counters to look at. Again, partially quoting the guide:

Subsystem   vCenter Counter
CPU         Ready (milliseconds in a 20,000 ms window)
            Usage
Memory      Active
            Swapin Rate
            Swapout Rate
Network     packetsRx
            packetsTx
Storage     Commands
            deviceWriteLatency & deviceReadLatency
            kernelWriteLatency & kernelReadLatency

These counters can be monitored in SAM, and I've attached a basic SAM template below with the CPU and memory counters, with a few exceptions and additions to what the chart lists. I've added memory ballooning; for most people this is an obvious counter to monitor.

The memory swap counters are excellent: if you see these go over 1, you are hitting a memory limit and have oversubscribed guest VMs on the host.

I've not added the network counters to the monitor below. Monitoring your VMware hosts via SNMP as well as with the VMware API in SAM will bring these back automatically, along with interface errors and discards.

Storage counters are tricky. I've added one very high-level counter, "Datastore.Highest latency.latest", which is an aggregate across all datastores. Read and write latencies are going to be some of the most valuable counters, but adding them at the right level is the trick. VMware has some very detailed information on monitoring and troubleshooting storage performance, starting here and with some more here, and it is worth reviewing and trying to get your head around (the linked blog posts look at storage in a lot of depth). Adding the right counters for your environment to your SAM templates would be the best way to build a fuller ESXi template. You can do this by going to the Orion web console and:

1. Going to Settings

2. SAM Settings

3. Browse for Component Monitor

4. Here you will be asked to select the component monitor type

5. If you choose VMware Systems - VMware Performance Counter Monitor you will be able to browse to one of your ESXi hosts and see all the performance counters on it.

6. Select Server IP Address (this will be the ESXi host whose counters you want to browse) and enter your vSphere credentials.

7. Select Components – Here you can choose the counters that you want to monitor

8. Edit Properties – Here you will be able to choose the thresholds for the counters

9. Add to Application Monitor or Template – Here you will be able to create a new Application Monitor so that you can assign it to other nodes, or else assign it to an existing Monitor

10. Assign to Nodes – assign to other nodes if you need other nodes monitored for the same counters

11. Confirm that you want to monitor the components.

12. Go back to the SAM summary page and confirm that you are seeing the new monitor

Processes/Services:

Process and service monitoring in ESXi is tricky, as it is hard to find official documentation on the processes/services and their function. In the template below there is an SNMP monitor for the hostd service. This service is important for normal ESXi operation: it acts as a go-between for vSphere and all actions. A few other services to monitor are vpxa (used for communicating with vCenter) and DCUI (the console UI), but whether you need to monitor them depends on your VMware licensing and how you use VMware. There is a VMware community post that is a good read when looking at the services, what they do and what would be best for you to monitor.

You can monitor ESXi services in SAM by going to the Orion web console and

1. Going to Settings

2. SAM Settings

3. Browse for Component Monitor

4. Here you will be asked to select the component monitor type

5. If you choose Process Monitor - SNMP (from under the Linux - Unix Systems listing) you will be able to browse to one of your ESXi hosts and see all the processes running on it.

6. Select Server IP Address (this will be the server where your service resides)

7. Select Components – Here you can choose the processes that you want to monitor

8. Edit Properties – Here you will be able to choose the thresholds for monitoring the service

9. Add to Application Monitor or Template – Here you will be able to create a new Application Monitor so that you can assign it to other nodes, or else assign it to an existing Monitor

10. Assign to Nodes – assign to other nodes if you need other nodes monitored for the same services

11. Confirm that you want to monitor the components.

12. Go back to the SAM summary page and confirm that you are seeing the new monitor

Hardware Status

SAM's hardware health module will pick up on an ESXi host's health if the CIM service/protocol is enabled and accessible on the host. This will pick up on any hardware issue with the host and doesn't need any further configuration in SAM to be monitored.

Importing the Template

Download and save the below file

Open your SAM web console and browse to Settings->SAM Settings->Manage Templates and then select Import and load the file.

Threshold Choices

I've set the memory swap rates to 1 for critical, but it might be best to set lower warnings. CPU usage is set to 80 for critical. I'm not sure a link backing this choice up is needed, but Cisco (this is a good document and worth a read) found usage to be an issue once over 80. That same link also mentions a CPU ready value of 3% as being an issue; I've used the more generous values reported here. I find memory is more of an issue for me. The memory active threshold values are going to need tweaking for each host; VMware explain this. Memory balloon is again going to depend on the host, and there is no one-size-fits-all threshold we can use.
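
As a rough illustration of how the critical thresholds described above behave, here is a minimal Python sketch. The counter names and the `breaches` helper are hypothetical and exist only for this example; they are not SAM or VMware identifiers, and only the values 80 and 1 come from the template.

```python
# Critical thresholds as described above. The dictionary keys are
# hypothetical labels chosen for this sketch, not SAM component names.
CRITICAL_THRESHOLDS = {
    "cpu_usage_percent": 80,  # CPU usage is critical once over 80
    "swap_in_rate": 1,        # swap rates are critical once over 1
    "swap_out_rate": 1,
}


def breaches(readings):
    """Return the names of the counters whose reading exceeds its critical threshold."""
    return sorted(
        name
        for name, value in readings.items()
        if name in CRITICAL_THRESHOLDS and value > CRITICAL_THRESHOLDS[name]
    )


if __name__ == "__main__":
    sample = {"cpu_usage_percent": 85, "swap_in_rate": 0, "swap_out_rate": 3}
    print(breaches(sample))
```

Memory active and memory balloon are left out on purpose, since their thresholds depend on the individual host.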

Further Reading

General:

Counters and their meaning

http://www.vmware.com/pdf/vsphere4/r40/vsp_40_resource_mgmt.pdf

http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50-monitoring-performance-guide.pdf

http://communities.vmware.com/docs/DOC-5600

http://www.yellow-bricks.com/2011/04/29/which-metric-to-use-for-monitoring-memory/ (see the comments also)

CPU Ready Conversion (gives some insight into understanding the counters)

VMware Storage Performance:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008205

http://blogs.vmware.com/vsphere/2012/05/troubleshooting-storage-performance-in-vsphere-part-1-the-basics.html

http://blogs.vmware.com/vsphere/2012/06/troubleshooting-storage-performance-in-vsphere-part-2.html

http://blogs.vmware.com/vsphere/2012/06/troubleshooting-storage-performance-in-vsphere-part-3-ssd-performance.html

http://blogs.vmware.com/vsphere/2012/07/troubleshooting-storage-performance-in-vsphere-part-5-storage-queues.html

SQL Performance Guide

http://www.vmware.com/files/pdf/sql_server_best_practices_guide.pdf

Services

If you are interested in looking into the services more, /sbin/services.sh and /etc/chkconfig might give a further route for investigating.

Notes:

1) Virtualisation/Virtualization is spelt correctly.

2) I have no idea where part 4 of the storage blog is. If you can find it please post a link below.

3) I haven't uploaded the template to the Content Exchange. I think it is a little light on storage counters, as I'm finding it hard to be generic. I'm also not happy with the thresholds. If anyone can improve on these then please post a more complete template to the Content Exchange.


Template updated by: Erica Gill

ESXi 5 Service and Performance Monitor (V2).apm-template
  • Hello Erica,

    There is something wrong with the CPU summation value formula, or maybe VMware changed something since you created this template.

    The value returned doesn't match the average value returned by PowerCLI. Also, the PowerCLI command returned 180 sample values in total.

    It looks like the SolarWinds VMware Performance Counter monitor just sums them. As soon as you take an average, you get really close to the value.

    The right formula should be Round(${Statistic}/180/200,2)

    Olivier

  • Hi Olivier,

    The monitor for CPU ready summation is converting the value deliberately (the description of the component explains a little bit more and links to http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2002181). I was hoping to use a % value to make it easier to read and match with the links where the % value is discussed. The VMware KB points to a difference between esxtop and the API and most documents seem to reference the esxtop format. Having two component monitors for this, one with and one without the formula might make sense?

    When it comes to the formula itself, the one there is really just dividing by 200 as per the KB "A realtime CPU summation value of 1000 is divided by 200 to give a CPU ready % of 5."

    The Round(${Statistic}/180/200,2) looks to be dividing by 36000. This looks like it is matching the timing for a fortnightly update interval from what I can tell.

    "   Realtime: CPU summation value / 200

        Past Day: CPU summation value / 3000

        Past Week: CPU summation value / 18000

        Past Month: CPU summation value / 72000

        Past Year: CPU summation value / 864000"

    I think that might account for a difference in values/formulas?

    All the best,

    Erica
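
    The divisors quoted from the KB can be put into a small lookup to see how the conversion works. This is only a sketch in Python: the function and key names are made up for illustration, and the sample interval lengths in the comments are the vCenter default statistics intervals, which is what each divisor corresponds to (interval in ms divided by 100).

```python
# Divisors as quoted from the VMware KB. Each one is the sample interval in
# milliseconds divided by 100, so the result comes out as a percentage.
# The dictionary keys are hypothetical labels for this sketch.
READY_DIVISORS = {
    "realtime": 200,       # 20 s samples
    "past_day": 3000,      # 5 min samples
    "past_week": 18000,    # 30 min samples
    "past_month": 72000,   # 2 h samples
    "past_year": 864000,   # 1 day samples
}


def cpu_ready_percent(summation_ms, interval="realtime"):
    """Convert a CPU ready summation value (ms) into a CPU ready percentage."""
    return summation_ms / READY_DIVISORS[interval]


if __name__ == "__main__":
    # The KB's own worked example: a realtime summation of 1000 gives 5%.
    print(cpu_ready_percent(1000))
```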

  • Hello Erica,

    I found what was wrong with the CPU Ready item.

    The template uses the HostSystem entity and, according to the vSphere Documentation Center, the value is an aggregate. This means the number depends on the number of CPUs in the machine.

    I have an average of 1452 ms on a 16 CPU host, which is about 91 ms per CPU. Now I can match the value retrieved via PowerCLI.

    Now we need some ${CPUCount} variable to be added to the formula. Sadly there is no such thing right now.

    Or it could be a bug in the VMware Performance plugin, which does not take this into account.

    Olivier
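
    The per-CPU arithmetic above is simple enough to check by hand, but as a sketch in Python (the function name is hypothetical, and the CPU count has to be supplied by hand since no ${CPUCount} variable exists in the template):

```python
def ready_ms_per_cpu(host_ready_ms, cpu_count):
    """Spread a host-level (aggregate) CPU ready value evenly across its CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be a positive integer")
    return host_ready_ms / cpu_count


if __name__ == "__main__":
    # Figures from the post above: 1452 ms aggregate on a 16 CPU host.
    print(ready_ms_per_cpu(1452, 16))
```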

  • Hi,

    Just to update this for anyone looking into the CPU Ready monitoring: Olivier opened support case 439521, where we looked into this.

    What we found was that the 20 second real-time value was being returned 15 times, once for each 20 second time slot in the 300 second polling interval. These values were being summed, as Olivier spotted.

    I've updated the template above with the correct formula:

    Round(${Statistic}/200/15,2)

    All the best,

    Erica
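
    For anyone who wants to see where the 200 and the 15 come from, here is a small Python sketch of the same calculation. The function and parameter names are hypothetical, and making the polling and sample intervals into parameters is my own generalisation; the template itself simply hardcodes Round(${Statistic}/200/15,2).

```python
def cpu_ready_percent_per_poll(statistic_ms, polling_interval_s=300, sample_s=20):
    """Average the summed real-time samples and convert the result to a percentage.

    statistic_ms is the value the monitor returns: the sum of every
    real-time sample collected during one polling interval.
    """
    samples = polling_interval_s // sample_s   # 300 / 20 = 15 samples per poll
    ms_per_percent = sample_s * 1000 / 100     # 200 ms is 1% of a 20 s sample
    return round(statistic_ms / ms_per_percent / samples, 2)


if __name__ == "__main__":
    # 15 samples of 1000 ms each sum to 15000 ms, i.e. 5% CPU ready.
    print(cpu_ready_percent_per_poll(15000))
```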