Data gaps. They are the bane of every monitoring engineer's life. No data means no alerts (unless you are alerting on no data, which is another topic entirely), and no alerts means that when something goes wrong, customers know before the engineers who can fix the problem do, and that is bad for business. It is also bad for trust, and trust is all that we have. (https://www.linkedin.com/pulse/20140901182532-13406765-honesty-is-the-best-policy)
With that preamble, let me set the stage for the problem I am having.
All of our VMware hosts are SNMP enabled. That is a relatively recent thing for us so not all of them are actually collecting data via SNMP. Fortunately we have VMAN which collects data using the vCenter API. Another engineer reached out to me indicating that a VMware host had gaps (actually, *has* gaps) in some of the metrics charts. His specific concern was memory data.
I checked the VIM_HostStatistics_Detail table and found no gaps in the data. (We poll that particular data in VMAN every 15 minutes because the data set is *huge*.)
I checked the CPULoad_Detail table for the same host and found gaps in the data of up to 1 hour. I would have expected to see data at 15-minute intervals, assuming that VMAN is populating that data instead of SNMP.
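For anyone who wants to repeat this kind of gap check outside the database, here is a minimal sketch in Python. It assumes the sample timestamps for one host have already been exported from the table into a list; the function name `find_gaps`, the one-minute tolerance, and the sample times are all made up for illustration and are not part of VMAN or NPM.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(minutes=15),
              tolerance=timedelta(minutes=1)):
    """Return (previous_sample, next_sample, spacing) for every pair of
    consecutive samples that are further apart than expected + tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        spacing = curr - prev
        if spacing > expected + tolerance:
            gaps.append((prev, curr, spacing))
    return gaps


# Hypothetical sample times: 15-minute polling with one missing stretch.
base = datetime(2024, 1, 1, 0, 0)
samples = [base + timedelta(minutes=m) for m in (0, 15, 30, 90, 105)]

gaps = find_gaps(samples)
for start, end, spacing in gaps:
    print(f"No samples between {start} and {end} (spacing {spacing})")
```

Running the same function against timestamps from both tables for the same host and time window would show whether the gaps appear in only one of them.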
Yes, the node is SNMP-enabled. Oddly enough, CPU & memory data are NOT being collected via SNMP on this node (which I will fix in a minute), which leads me to believe even more strongly that the data from the VIM_HostStatistics_Detail table is being parsed into the CPULoad_Detail table (and others) for display in NPM.
My questions are:
1) Is my assumption about the flow of data from VMAN to NPM correct?
2) If a host is polled via SNMP for CPU and memory data, does the SNMP or the VMAN data take precedence, assuming that my assumption in question 1 is correct?
3) What process(es) are involved in the ingestion of that data, and why, if the data is present in the VIM_HostStatistics_Detail table, would it not be reflected in the CPULoad_Detail table?
Screenshots:
Here is the view of the data from the CPULoad_Detail table:
And here is the same time interval from the VIM_HostStatistics_Detail table: