
Last Friday I finally found the time to upgrade our production APM server to SAM 5.0 RC1. I've been very excited to get the hardware health monitoring capability. I had no showstopping issues with the upgrade and Orion is functional, but I did run into some glitches:
Anybody else experience these?
Thanks for all the effort that's gone into this new update. The hardware monitoring capabilities are a huge benefit to us!
I performed an upgrade as I usually would with any other APM update. The install process correctly identified the existing SolarWindsOrion database. However, after the installation was complete, I found it had installed SQLExpress on the system and started all its services with the SOLARWINDS_ORION instance name. It did not create a new SolarWindsOrion database there either -- it upgraded the already existing one which is located on another server. After I noticed this, I disabled all the SQLExpress services and everything seems to be fine. Is this a known problem?
I can't explain this behavior. If the file you downloaded came through the customer portal and did not include the word "eval" in its name, then SQL Express wasn't bundled with the installer. Only files that contain "eval" in the name include SQL Express. It sounds likely that at some point in the history of this machine an evaluation was installed on this host.
SAM reports being in Evaluation with 50 days left, and I did activate my license during the upgrade. I'm guessing this is expected as part of being a Release Candidate. Am I correct in this assumption?
This is normal RC behavior. RC keys are essentially extended evaluation keys. Once SAM 5.0 goes GA, a commercial license will be available through your customer portal. RC keys are designed to extend through and beyond the scheduled RC period, so there should be no concern of the license unexpectedly expiring before you receive your commercial license key.
I'm able to use the Real-Time Process Explorer through the web interface of SAM, but alerts generated using the new "High xxx Utilization with Top 10 Processes" templates fail to include the process list. Instead, they usually show a time-out error or are simply blank.
For Windows SNMP nodes there is a known limitation with the RTPE. Microsoft only updates SNMP statistics every two minutes and CPU can only be calculated after two counter updates. This means that you may need to increase the default wait in the alert to 4-5 minutes for Windows nodes being managed/monitored via SNMP. The easiest/best way to rectify this is to change these Windows nodes to WMI. Alternatively you can ensure you have at least one working/up WMI managed component monitor assigned to the host. Due to this Microsoft SNMP limitation SAM 5.0 will use WMI whenever possible, even when the node is managed via SNMP. The RTPE will search for any working/up WMI component monitors assigned to the host and use those credentials to connect and collect the necessary data.
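To make the timing in the reply above concrete, here is a minimal sketch of why two counter refreshes are needed before a CPU percentage can be computed, and where the 4-minute lower bound comes from. The function names, the 100 ticks-per-second unit, and the sample values are illustrative assumptions, not SAM's actual implementation.

```python
# Assumption: the Windows SNMP agent refreshes its counters every 120 s,
# as described in the reply above.
SNMP_REFRESH_SECONDS = 120


def min_alert_wait_seconds(refresh_seconds=SNMP_REFRESH_SECONDS, updates_needed=2):
    """Lower bound on the wait so that `updates_needed` refreshes can occur."""
    return refresh_seconds * updates_needed


def cpu_percent(prev_ticks, cur_ticks, elapsed_seconds, cores, ticks_per_second=100):
    """CPU share of one process between two samples of a cumulative counter.

    Assumption: the counter is in hundredths of a second (100 ticks/s) and
    accumulates across all cores. A single sample gives no rate, which is
    why two refreshes are required.
    """
    used = cur_ticks - prev_ticks
    capacity = elapsed_seconds * ticks_per_second * cores
    return 100.0 * used / capacity


print(min_alert_wait_seconds() / 60)      # minutes to wait at minimum
print(cpu_percent(1000, 7000, 120, 4))    # hypothetical samples on a 4-core host
```

Padding the wait to 5 minutes, as suggested above, leaves some slack in case the alert fires just after a refresh.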
Thanks again!
To reply to each point:
- I have Windows machines initially discovered and monitored through SNMP, but I also have the "Windows 2003-2008 Services and Counters" template assigned to each, which uses WMI for all (most?) components. I still, however, get a blank process list from these alerts with these servers. Are you saying that this shouldn't be the case?
Are the Windows Service Monitors in the "Windows 2003-2008 Services and Counters" template in an Up/Warning/Critical state? If so, then their credentials should be used for the Real-Time Process Explorer.
Could you please open a support ticket for this issue and reference this thread to make sure that it gets escalated to development quickly?
Thank you
To answer your question, yes. They are currently up. I just saw that RC2 came out today -- should I hold off on the update until I've created this support ticket? Could the update possibly fix the issue?
Thanks.
If at all possible, please upgrade to SAM 5.0 RC2 and open a support ticket. We don't have reason to believe RC2 will resolve the issue but it would be best if you were on the latest build.
If you don't have an opportunity to upgrade to RC2 tomorrow then open a support ticket. We're quickly approaching GA so if this is a systemic issue we'd like to identify and resolve it quickly while we're still in the release candidate phase.
Okay, I will install the upgrade this afternoon and then open a ticket if the problem still exists. Thanks!
Opened case 313312.
I just saw something a little different. One of my alert emails that just came through said this:
The Physical Memory on <hostname> is currently running at 91 %. The top 10 processes running at the time of this poll are listed below:
Unable to get list of processes - Value was either too large or too small for a UInt64.
For more information click the link below.
http://<orionservername>:80/Orion/View.aspx?NetObject=N:19
Out of curiosity, do you receive this same error when using the Real-Time Process Explorer on this node from within the WebUI?
Could you please confirm whether this is a Windows machine monitored as an SNMP node? If so, could you please let me know the uptime of this box and its number of CPUs?
This could be a known issue: the Real-Time Process Explorer can have problems with Windows boxes that have been running for a very long time.
Thank you
One of the machines that gave me the "Unable to get list of processes - Value was either too large or too small for a UInt64." error is a Dell R910 with 40 CPU cores, running Windows 2008 R2, monitored with SNMP, and using the Windows 2003-2008 WMI template. It was last rebooted on 2/12/2012. Another one that gave me the same error is a Hyper-V VM, Win2008R2, SNMP/Windows WMI template, 4 vCPUs, and up since 12/4/2011.
I am not just getting these errors in email alerts. I can see them in the Alerts tab of SAM's web interface as well.
I can use the Real-time Process Explorer through the SAM web interface without any issue.
It is a combination of both idle uptime and the number of cores. On a 40-core machine, it can reach this state after only about 6.5 days; after another 6.5 days it switches back to the correct state. This period may get longer as the machine's CPU load increases.
The difference in behavior between the alerts and the RTPE GUI may be caused by inconsistent polling methods (alerts may use SNMP, while RTPE can use WMI).
Can you try unmanaging all applications assigned to those nodes for a while to confirm that you receive the same error in RTPE as well (when it is forced to use SNMP)?
Changing the node type from SNMP to WMI should resolve these issues.
Also, some improvements were made in RC3 for Windows SNMP nodes, so it's worth giving it a try when it becomes available in the next few days.
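As a rough illustration of why uptime and core count combine this way, here is a back-of-the-envelope sketch. It rests on an assumption the thread does not confirm: that the value involved is the idle process's CPU time as exposed by the HOST-RESOURCES-MIB object hrSWRunPerfCPU, a signed 32-bit counter in centiseconds that accumulates across all cores. The function name and the `idle_fraction` parameter are hypothetical.

```python
# Assumption: signed 32-bit counter, so 2**31 ticks until the sign flips.
INT32_POSITIVE_RANGE = 2**31
# Assumption: the counter is in centiseconds (100 ticks per second per core).
TICKS_PER_SECOND = 100
SECONDS_PER_DAY = 86400


def days_until_sign_flip(cores, idle_fraction=1.0):
    """Days for the idle process's counter to pass the signed 32-bit limit.

    idle_fraction is the share of total CPU time spent idle; a busier
    machine feeds the idle counter more slowly, so the period gets longer.
    """
    ticks_per_second = cores * TICKS_PER_SECOND * idle_fraction
    return INT32_POSITIVE_RANGE / ticks_per_second / SECONDS_PER_DAY


print(round(days_until_sign_flip(40), 1))       # fully idle 40-core host
print(round(days_until_sign_flip(40, 0.5), 1))  # same host, 50% busy
print(round(days_until_sign_flip(4), 1))        # fully idle 4-vCPU VM
```

Under these assumptions a fully idle 40-core host crosses the limit after roughly 6.2 days, in the same ballpark as the figure of about 6.5 days quoted above, and the period lengthens as load increases.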
Thanks for all your help so far!
I have tried the first suggestion of unmanaging one of the affected nodes. I unmanaged one for 15 minutes and then did a test fire of the "High Physical Memory with Top 10 Processes" alert against it, after it came back to the Managed state. The generated alert still contained the time-out error.
I could try switching over to WMI, but would the switch cause me to lose historical data in the node's applications?
RC3 should be available later this week. Once you receive the notification, please upgrade your installation and let us know whether this issue is in fact resolved in RC3. We'd appreciate the confirmation. Thanks!
I have good news -- this issue appears to be completely fixed with RC3. All of the alerts I have received since upgrading to RC3 yesterday have included the process list. Thanks!