Monitoring ESXi hosts is something most IT units have to do these days. Monitoring hosts for the number and details of running guests, and for the usual CPU, memory and NIC usage counters, is well understood, but it can be hard to look deeper at performance and to spot possible issues. I think most people viewing this forum are familiar with Orion's virtualisation monitoring, but if not, have a look at the Virtualization page in the online demo for an example of the guest monitoring and the information returned. With SAM component monitors it is possible to get an even deeper look at host operation.
ESXi provides a large number of performance counters out of the box to help with getting an in-depth look at how it is operating. The number of counters is a bit of a blessing and a curse: you know the information is there and available, but how do you know what to monitor? The counters are spread across a wide range of categories. There are the well-known CPU and memory counters, and while CPU and memory usage percentages are important, ballooning and ready times also have to be monitored. Storage performance has become an area of IT in its own right, and when you add in virtualisation it becomes much more complex. Storage performance can determine a company's experience of virtualisation. In all, there is quite a bit to monitoring ESXi performance and a lot to gain in getting it right.
There is a lot of discussion on ESXi performance counters, and while particular counters are well described, it is hard to know which are the most effective to monitor when watching for problems and trying to predict trends. One of the best guides on ESXi performance is the SQL Server best practices guide. It focuses on SQL Server but gives some good detail on specific performance counters. It also has a great quote:
"as with all in-guest measurement tools, time-based performance measurements are subject to error. The degree to which the measurements are inaccurate depends on the total load of the VMware ESX host."
This means that in virtual environments we have to monitor the host as a whole to get real insight into the performance of guests. Fortunately, the same guide lists the counters to look at. Again, partially quoting the guide:
Subsystem | vCenter Counter |
---|---|
CPU | Ready (milliseconds in a 20,000 ms window) |
CPU | Usage |
Memory | Active |
Memory | Swapin Rate |
Memory | Swapout Rate |
Network | packetsRx |
Network | packetsTx |
Storage | Commands |
Storage | deviceWriteLatency & deviceReadLatency |
Storage | kernelWriteLatency & kernelReadLatency |
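The CPU Ready counter listed above is reported as milliseconds of ready time within a sampling window, which is awkward to compare with a percentage threshold. Below is a minimal Python sketch of the usual conversion, assuming the 20,000 ms (20 second) window quoted from the guide. The function name and the `interval_s` parameter are illustrative only and are not part of SAM or vSphere.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready value in milliseconds into a percentage of the window.

    ready_ms   -- ready time reported for one sampling window, in milliseconds
    interval_s -- length of the sampling window in seconds (assumed 20 s here)
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    window_ms = interval_s * 1000.0
    # Share of the window spent waiting for a physical CPU, as a percentage.
    return ready_ms * 100.0 / window_ms


if __name__ == "__main__":
    # 600 ms of ready time in a 20,000 ms window
    print(cpu_ready_percent(600))  # 3.0
```

So 600 ms of ready time in a 20 second window works out at 3%. If a counter is sampled over a different interval, pass that interval in seconds instead of relying on the default.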
These counters can be monitored in SAM, and I've attached a basic SAM template below with the CPU and memory counters, with a few exceptions and additions to what the table lists. I've added memory ballooning; for most people this is an obvious counter to monitor.
The memory swap counters are excellent: if you see these go over 1, you are hitting a memory limit and have oversubscribed guest VMs on the host.
I've not added the network counters to the template. Monitoring your VMware hosts via SNMP as well as with the VMware API in SAM will bring these back automatically, along with interface errors and discards.
Storage counters are tricky. I've added a very high-level counter, "Datastore.Highest latency.latest", which is an aggregate across all datastores. Read and write latencies are going to be some of the most valuable counters, but adding them at the right level is the trick. VMware have some very detailed information on monitoring and troubleshooting storage performance, starting here and with some more here, and it is worth reviewing and trying to get your head around (the linked blog posts look at storage in a lot of depth). Adding the right counters for your environment to your SAM templates would be the best way to build a fuller ESXi template. You can do this from the Orion web console:
1. Go to Settings
2. Select SAM Settings
3. Select Browse for Component Monitor
4. Here you will be asked to select the component monitor type
5. If you choose VMWare Systems - VMWare Performance Counter Monitor you will be able to browse to one of your ESXi hosts and see all the performance counters on it
6. Select Server IP Address (this will be the ESXi host you want to browse) and enter your vSphere credentials
7. Select Components – Here you can choose the performance counters that you want to monitor
8. Edit Properties – Here you will be able to choose the thresholds for the counters
9. Add to Application Monitor or Template – Here you will be able to create a new Application Monitor so that you can assign it to other nodes, or else assign it to an existing Monitor
10. Assign to Nodes – Assign to other nodes if you need other nodes monitored for the same counters
11. Confirm that you want to monitor the components
12. Go back to the SAM summary page and confirm that you are seeing the new monitor
Processes/Services:
Process and service monitoring in ESXi is tricky, as it is hard to find official documentation on the processes/services and their function. In the template below there is an SNMP monitor for the hostd service. This service is important for normal ESXi operation; it acts as a go-between for vSphere and all actions. A few other services to monitor are vpxa (used for communicating with vCenter) and DCUI (the console UI), but whether you need to monitor them depends on your VMware licensing and how you use VMware. There is a VMware community post that is a good read when looking at the services, what they do and what would be best for you to monitor.
You can monitor ESXi services in SAM from the Orion web console:
1. Go to Settings
2. Select SAM Settings
3. Select Browse for Component Monitor
4. Here you will be asked to select the component monitor type
5. If you choose Process Monitor - SNMP (from under the Linux - Unix Systems listing) you will be able to browse to one of your ESXi hosts and see all the processes running on it
6. Select Server IP Address (this will be the server where your service resides)
7. Select Components – Here you can choose the processes that you want to monitor
8. Edit Properties – Here you will be able to choose the thresholds for monitoring the service
9. Add to Application Monitor or Template – Here you will be able to create a new Application Monitor so that you can assign it to other nodes, or else assign it to an existing Monitor
10. Assign to Nodes – Assign to other nodes if you need other nodes monitored for the same services
11. Confirm that you want to monitor the components
12. Go back to the SAM summary page and confirm that you are seeing the new monitor
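SAM does the actual polling for you, but the check the process monitor performs is easy to picture. Below is a small Python sketch, assuming you already have a list of process names from the host (for example from an SNMP walk, which is not shown here). The service names come from the section above. The function name, the case-insensitive exact matching and the sample process list are my own assumptions, not how SAM implements it.

```python
from typing import Iterable, List

# Services named in this post; adjust to suit your licensing and usage.
EXPECTED_PROCESSES = ("hostd", "vpxa", "dcui")


def missing_processes(running: Iterable[str],
                      expected: Iterable[str] = EXPECTED_PROCESSES) -> List[str]:
    """Return the expected process names that do not appear in `running`.

    Matching is a case-insensitive exact match on the process name, which is
    an assumption made for this sketch.
    """
    seen = {name.strip().lower() for name in running}
    return [name for name in expected if name.lower() not in seen]


if __name__ == "__main__":
    # Hypothetical process list, for illustration only.
    sample = ["hostd", "vpxa", "ntpd"]
    print(missing_processes(sample))  # ['dcui']
```

An empty list means every expected process was found. Anything returned is a candidate for a down or critical status on that component.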
Hardware Status
SAM's hardware health module will pick up on an ESXi host's health if the CIM service/protocol is enabled and accessible on the host. This will pick up on any hardware issue with the host and doesn't need any configuration in SAM to be monitored.
Importing the Template
Download and save the file below.
Open your SAM web console, browse to Settings->SAM Settings->Manage Templates, then select Import and load the file.
Threshold Choices
I've set the memory swap rates to 1 for critical, but it might be best to set lower warning thresholds. CPU usage is set to 80 for critical. I'm not sure a link backing this choice up is needed, but Cisco (this is a good document and worth a read) have found CPU usage to be an issue once over 80. That same link also mentions a CPU ready value of 3% as being an issue; I've used the more generous values reported here. I find memory is more of an issue for me. The memory active threshold values are going to need tweaking for each host; VMware explain this. Memory balloon is again going to depend on the host, and there is no one-size-fits-all threshold we can use.
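To make the threshold logic concrete, here is a minimal Python sketch of how a polled value maps to a status. The critical values (80 for CPU usage, 1 for the swap rates) are the ones described above. The counter keys, the status names and the optional warning level are hypothetical and only there for illustration; SAM evaluates thresholds itself once they are set on the component.

```python
from typing import Optional

# Critical thresholds as described in this post. The keys are illustrative
# names, not SAM or vSphere identifiers.
CRITICAL_THRESHOLDS = {
    "cpu_usage": 80.0,
    "mem_swap_in_rate": 1.0,
    "mem_swap_out_rate": 1.0,
}


def evaluate(counter: str, value: float, warning: Optional[float] = None) -> str:
    """Classify a polled value as 'ok', 'warning' or 'critical'.

    A value is critical only once it goes over the critical threshold.
    The warning level is optional and must be chosen per environment.
    """
    critical = CRITICAL_THRESHOLDS[counter]
    if value > critical:
        return "critical"
    if warning is not None and value > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(evaluate("cpu_usage", 85.0))                      # critical
    print(evaluate("mem_swap_in_rate", 0.5, warning=0.25))  # warning
    print(evaluate("mem_swap_out_rate", 0.0))               # ok
```

Memory active and memory balloon are left out of the table on purpose, since their thresholds depend on the individual host.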
Further Reading
General:
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_resource_mgmt.pdf
http://communities.vmware.com/docs/DOC-5600
http://www.yellow-bricks.com/2011/04/29/which-metric-to-use-for-monitoring-memory/ (see the comments also)
CPU Ready Conversion (gives some insight into understanding the counters)
VMware Storage Performance:
http://blogs.vmware.com/vsphere/2012/06/troubleshooting-storage-performance-in-vsphere-part-2.html
SQL Performance Guide
http://www.vmware.com/files/pdf/sql_server_best_practices_guide.pdf
Services
If you are interested in looking into the services more, /sbin/services.sh and /etc/chkconfig might give a further route for investigating.
Notes:
1) Virtualisation/Virtualization is spelt correctly.
2) I have no idea where part 4 of the storage blog is. If you can find it please post a link below.
3) I haven't uploaded the template to the Content Exchange, as I think it is a little light on storage counters and I'm finding it hard to be generic. I'm also not happy with the thresholds. If anyone can improve on these then please post a more complete template to the Content Exchange.
Template updated by: Erica Gill