Shortly after a server move and an upgrade of APM from 4.0.2 to 4.2 I started experiencing problems, so I opened a case on Oct 9th (277505). It started with a fairly basic question: did the way APM handles polling problems change between 4.0.2 and 4.2, so that components that previously would have been listed as unknown are now listed as down?
The support rep told me he didn't think the behavior had changed. After a polling incident that sent out about 3000 emails for over 750 "down" applications, I pushed a bit and got this reply:
Below is the reason that they gave for why the monitor is listed as down.
In APM 4.0.2 all monitors had by default Unknown state. Up, Warning, Critical, Down states were handled. All monitors, which didn't have such status, remained into Unknown state.
In APM 4.2 handling of status was changed. Now, by default, monitor status set as Down. Up, Warning, Critical, Unknown states were handled.
Monitor will be in Unknown state, when some from the following errors appear: "ManagementException" (Wmi Error), "WmiServerNotAvailableException", "WmiAccessDeniedException" and "ApplicationJobNoTimeToRunException". (found in the log files)
In all other cases, monitor will be in Down state.
Basically if the monitor does not meet the criteria for being unknown it goes into a down state. They are looking into making changes to that but currently there is no workaround. I will let you know when I hear back from development on the issue. |
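To make the difference concrete, here is a rough Python sketch of the two behaviors as support described them. This is not SolarWinds code: the function names, the status strings as plain text, and the `RpcTimeout` error used in the usage lines are all made up for illustration. Only the four exception names come from the reply above.

```python
from typing import Optional

# Exception names quoted in the support reply; per that reply, these are the
# only errors that leave a monitor in Unknown under APM 4.2.
UNKNOWN_ERRORS_42 = {
    "ManagementException",
    "WmiServerNotAvailableException",
    "WmiAccessDeniedException",
    "ApplicationJobNoTimeToRunException",
}


def status_apm_402(evaluated: Optional[str]) -> str:
    """4.0.2 as described: Unknown by default, other states set explicitly."""
    if evaluated in {"Up", "Warning", "Critical", "Down"}:
        return evaluated
    return "Unknown"


def status_apm_42(evaluated: Optional[str], error: Optional[str]) -> str:
    """4.2 as described: Down by default, Unknown only for listed errors."""
    if error in UNKNOWN_ERRORS_42:
        return "Unknown"
    if evaluated in {"Up", "Warning", "Critical"}:
        return evaluated
    return "Down"


# A poll that fails with an error outside the list ("RpcTimeout" is a
# hypothetical name) produces no evaluated status at all.
old_status = status_apm_402(None)
new_status = status_apm_42(None, "RpcTimeout")
```

If the description is accurate, the same failed poll gives `old_status == "Unknown"` and `new_status == "Down"`, and any alert that triggers on Down fires in 4.2 even though nothing was actually measured.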
The first thing I see as wrong here is that this wasn't in the Release Notes. I looked, couldn't find it anywhere. How in the hell can a change that can potentially impact things so much go completely unmentioned in the release notes? The fact that support didn't even know about it is telling. It gets worse from there though.
I was told development was working on a buddy drop and that the rep would let me know when it was ready.
I did some looking around Thwack and found this post: .
|
Dev has added this result in 4.2 to a list of error conditions to ignore so upgrading should rid you of it causing a down status / alert. The root issue would of course still be there so you may still have inaccurate data if you ever need to mine into it. If you are NOT on 4.2 and won't be for some time support can help you add this error code to your config to correct your issue. |
I asked for info on adding the errors I was getting to this config so they wouldn't display as down, referencing that Thwack post. I got this reply, and then never received anything else on that particular line of inquiry:
| I will check with development to see if it is possible for you to edit the unknown list to add the entries for your errors. I will get back with you once I hear from them. |
So I was a bit frustrated that I never heard anything back on this, but I was getting a buddy drop to fix my problem, so I had nothing to worry about, right?
I was still waiting to hear about a fix, so I emailed support to stress the importance of this and to describe one situation where I was seeing it. I provided a screenshot of one of the errors that was showing a component as down. It was for an event log monitor. These monitors are up if the event hasn't shown up and down if it has, so the only way to configure alerts is to say "when state = down, send alert". I had 269 of these at the time, and any time one of them failed to poll, an email went out to the team responsible for the server involved, saying that the component was down and the event ID had been found in the logs. At least one or two of these were failing each day, so I was getting daily calls asking why Orion was sending out false alerts. This wastes manpower and reduces confidence in the Orion toolset's ability to do its job.
I started requesting info on the scope of what the buddy drop would correct, as I was getting concerned that they might change one or two error conditions from down to unknown and I'd be right back in the same boat if a different error condition occurred.
I got the link for the buddy drop on Oct 25th and didn't receive info on the scope of what the buddy drop fixed until Oct 27th. Here was the reply:
| This buddy drop fixes false down statuses of Windows Event Log monitor components. There was an unwanted change of behavior when error occurred during polling. In that case component went to Unknown status in APM 4.0.2, but in APM 4.2 it went to Down status. This BD corrects this to same behavior as was in 4.0.2. It should be also integrated in future APM releases. |
So, as I'd feared, the buddy drop ended up covering only the one error I had shown in the screenshot I sent in, not the underlying problem: polling errors in general should not mark components as down.
I then received an email saying that there was a service pack prerelease that should fix my error also, but the release notes say:
| The problem of the Windows Event Log Count Monitor going into a ‘down’ status instead of ‘Unknown’ when incorrect credentials are used has been resolved. |
Besides the fact that my event log monitors weren't using incorrect credentials, this still only deals with one specific condition. So I've emailed support again, re-explaining the situation and asking to be contacted by management.
| I would appreciate being contacted by an APM product manager so I can properly explain the situation. |
I was told my email was forwarded to development, which is not what I requested. At this point, I need personal assurances from management that this issue is being looked at as a whole and with high priority. My contact details are in the ticket.
-Luke