Frustrations with Solarwinds

Shortly after a server move and an upgrade of APM from 4.0.2 to 4.2 I started experiencing problems, so I opened a case on Oct 9th (277505). It started out with a fairly basic question: did the way APM handles polling problems change between 4.0.2 and 4.2, so that components that previously would have been listed as unknown are now listed as down?

I was told by the support rep that he didn't think the behavior had changed. After a polling incident which sent out about 3000 emails for over 750 "down" applications I pushed a bit and got this reply:

Below is the reason that they gave for why the monitor is listed as down.

In APM 4.0.2 all monitors had the Unknown state by default. The Up, Warning, Critical, and Down states were handled. All monitors which didn't have such a status remained in the Unknown state.

In APM 4.2 the handling of status was changed. Now, by default, monitor status is set to Down. The Up, Warning, Critical, and Unknown states are handled.

A monitor will be in the Unknown state when any of the following errors appears: "ManagementException" (WMI error), "WmiServerNotAvailableException", "WmiAccessDeniedException" and "ApplicationJobNoTimeToRunException". (found in the log files)

In all other cases, the monitor will be in the Down state.

Basically, if the monitor does not meet the criteria for being unknown, it goes into a down state. They are looking into making changes to that, but currently there is no workaround. I will let you know when I hear back from development on the issue.
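To make the change described in that reply concrete, here is a minimal sketch of the two behaviors as support explained them. This is not APM's actual code: the function names, the `Status` enum, and the `"RpcTimeout"` error name are hypothetical. Only the four exception names come from the reply above.

```python
from enum import Enum


class Status(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"


# The four error names quoted in the support reply.
UNKNOWN_ERRORS_4_2 = frozenset({
    "ManagementException",
    "WmiServerNotAvailableException",
    "WmiAccessDeniedException",
    "ApplicationJobNoTimeToRunException",
})


def status_on_poll_error_4_0_2(error_name: str) -> Status:
    """4.0.2 as described: a polling error leaves the monitor Unknown."""
    return Status.UNKNOWN


def status_on_poll_error_4_2(error_name: str) -> Status:
    """4.2 as described: only listed errors map to Unknown; all others fall through to Down."""
    if error_name in UNKNOWN_ERRORS_4_2:
        return Status.UNKNOWN
    return Status.DOWN


if __name__ == "__main__":
    # "RpcTimeout" is a made-up error name standing in for any unlisted polling failure.
    for error in ("WmiAccessDeniedException", "RpcTimeout"):
        print(error, status_on_poll_error_4_0_2(error).value, status_on_poll_error_4_2(error).value)
```

Under these assumptions, any polling failure that is not on the short list reports Down in 4.2, where it would have reported Unknown in 4.0.2.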

The first thing I see as wrong here is that this wasn't in the Release Notes. I looked, couldn't find it anywhere. How in the hell can a change that can potentially impact things so much go completely unmentioned in the release notes? The fact that support didn't even know about it is telling. It gets worse from there though.

I was told development was working on a buddy drop and he would let me know when it was ready.

I did some looking around Thwack and found this post: .

Dev has added this result in 4.2 to a list of error conditions to ignore so upgrading should rid you of it causing a down status / alert. The root issue would of course still be there so you may still have inaccurate data if you ever need to mine into it.
If you are NOT on 4.2 and won't be for some time support can help you add this error code to your config to correct your issue.

I asked for info on adding the errors I got to this config so they won't display as down, referencing that Thwack post. I got this reply, and then never received anything else about that particular line:

I will check with development to see if it is possible for you to edit the unknown list to add the entries for your errors. I will get back with you once I hear from them.

So I was a bit frustrated that I never heard anything back on this, but I was getting a buddy drop to fix my problem, so I had nothing to worry about, right?

I was still waiting to hear about a fix, so I emailed support to try to express the importance of this and to describe one situation where I was seeing it. I provided support with a screenshot of one of the errors I was getting, which showed the component as down. It was for an event log monitor. These monitors are up if the event hasn't shown up and down if it has, so the only way to configure alerts is to say: when state = down, send alert. I had 269 of these at that time, and any time one of them failed to poll, an email was sent to the team responsible for the server involved saying that it was down and the event ID had been identified in the logs. At least one or two of these were failing each day, so I was getting daily calls asking why Orion was sending out false alerts. This wastes manpower and reduces confidence in the Orion toolset's ability to do its job.
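A rough sketch of why this alert rule produces false alerts: since "Down" is the only signal that the event was found, a poll failure that also reports Down is indistinguishable from a real hit. The function names and the poll result labels below are hypothetical and are not part of Orion or APM. The counts (269 monitors, two failing polls) mirror the figures above.

```python
def should_alert(component_status: str) -> bool:
    # The only alert rule available for these monitors: fire when the state is Down.
    return component_status == "Down"


def count_alerts(poll_results, status_for_poll_error):
    """Count alerts sent for a list of poll results.

    poll_results holds 'event_found', 'event_absent', or 'poll_error'.
    status_for_poll_error is the status a failed poll is given ('Down' or 'Unknown').
    """
    alerts = 0
    for result in poll_results:
        if result == "event_found":
            status = "Down"
        elif result == "event_absent":
            status = "Up"
        else:
            status = status_for_poll_error
        if should_alert(status):
            alerts += 1
    return alerts


# 269 event log monitors, none of which actually saw the event; two fail to poll.
polls = ["event_absent"] * 267 + ["poll_error"] * 2

print(count_alerts(polls, "Down"))     # failed polls treated as Down
print(count_alerts(polls, "Unknown"))  # failed polls treated as Unknown
```

With failed polls mapped to Down, every poll failure sends an email even though no event was logged. With failed polls mapped to Unknown, the same day produces none.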

I started requesting info on the scope of what the buddy drop would correct, as I was getting concerned that they might change one or two error conditions from down to unknown and I'd be right back in the same boat if a different error condition occurred.

I got the link for the buddy drop on Oct 25th and didn't receive info on the scope of what the buddy drop fixed until Oct 27th. Here was the reply:

This buddy drop fixes false down statuses of Windows Event Log monitor components. There was an unwanted change of behavior when error occurred during polling. In that case component went to Unknown status in APM 4.0.2, but in APM 4.2 it went to Down status. This BD corrects this to same behavior as was in 4.0.2. It should be also integrated in future APM releases.

So as I'd feared the scope of the buddy drop ended up being for the one single error I had shown on the screenshot I sent in, not for the underlying problem that polling errors in general should not mark components as down.

I then received an email saying that there was a service pack prerelease that should fix my error also, but the release notes say:

The problem of the Windows Event Log Count Monitor going into a ‘down’ status instead of ‘Unknown’ when incorrect credentials are used has been resolved.

Besides the fact that my event log monitors weren't using incorrect credentials, this still only deals with one specific condition. So I've emailed support again, re-explaining the situation and asking to be contacted by management.

I would appreciate being contacted by an APM product manager so I can properly explain the situation.

I was told my email was forwarded to development, which is not what I requested. At this point, I need personal assurances from management that this issue is being looked at as a whole and with high priority. My contact details are in the ticket.

-Luke

danielleh

Luke--

I have notified and forwarded this information to Support Management. We should be in contact with you shortly. Please let me know if you have any other questions.

Thanks,
DH

aLTeReGo

lhorstma, let me start by apologizing for the difficulties you've been experiencing since your upgrade to APM 4.2. I can tell you that an inadvertent change (i.e., a bug) was introduced in APM 4.2 related to how we handle error states for the Windows Event Log monitor. This has since been corrected in the forthcoming SP1 for APM 4.2. This service pack is currently only available from support, but should be made available publicly very shortly. No other bugs regarding component monitor status changes in the APM 4.2 release have been identified or reported by customers.

My understanding thus far is that you have a GoToMeeting scheduled tomorrow at 8:30am with Kate, who is one of our most senior APM support engineers, as well as members of the APM development team. I'm absolutely certain that they will be able to identify, isolate, and most likely resolve the issues you're currently encountering while on the call. In the event you've uncovered a previously unknown bug, we will work diligently to provide a solution or workaround in a timely manner.

I will also be reaching out to you so that I can provide you with my contact information. Should you encounter issues like this again in the future, I welcome and encourage you to contact me directly so I can see to it personally that the matter is being addressed.

I sincerely appreciate your patience and look forward to talking with you.

aLTeReGo

Quick Status Update

After working with our Support and Development teams we were able to identify the issue and provide a solution the very next day. The fix for this problem is included in the forthcoming SP1 for APM 4.2, which should be available for download through the customer portal on Tuesday, November 15th for all customers under active maintenance.

bobross

Great news alterego. Sometimes wish there was a like button on Thwack.