We're still working on our implementation of Orion, and SRM seems to be the one throwing us for a loop the most. It took us a while, but we eventually got SRM to read our Dell Compellent environment.
Now that we've done that, we have added our 7 arrays to the Orion environment, all of which are polled through our single Unisphere appliance. We're still working out strange "Access denied" errors, and we have a ticket open with Dell to investigate them.
My big question is - how long does it take your storage pollers to complete? On just one of our DR arrays, we are seeing the following:
Performance/hardware health polling takes an average of 40 minutes (the default polling interval is 15 minutes)
"Default" (which corresponds to the Capacity polling as far as I can tell) takes an average of 2 hours 15 minutes (default polls every 6 hours)
Topology polling takes about 4-6 minutes (again this one is every 6 hours).
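To put those numbers side by side, here is a rough sketch (Python, purely illustrative) that compares each reported poll duration with its polling interval. The 5-minute topology figure is just the midpoint of the 4-6 minutes above, and the function name is made up for this example.

```python
# Poll durations and intervals in minutes, taken from the figures in this post.
# The topology duration (5) is an assumed midpoint of the reported 4-6 minutes.
polls = {
    "performance/hardware health": (40, 15),
    "capacity (Default)": (135, 6 * 60),
    "topology": (5, 6 * 60),
}


def utilization(duration_min, interval_min):
    """Return the fraction of the polling interval used by one poll run."""
    return duration_min / interval_min


for name, (duration, interval) in polls.items():
    ratio = utilization(duration, interval)
    status = "overruns its interval" if ratio > 1 else "fits within its interval"
    print(f"{name}: {ratio:.0%} of interval, {status}")
```

Anything over 100% means a poll is still running when the next one is due, which is the case for the performance/hardware health job here.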
I have to assume other people aren't seeing their arrays take this long to poll. Heck, I'd be curious how other storage arrays stack up to the Compellent. These are brand new environments, having been set up in January. The Unisphere seems to have no problems talking to it. But our Orion instance just likes to say "array unavailable" all the time.
Yay, month and a half later and still no progress, really!
We've done the following:
Even with all this, our numbers are still seemingly super slow.
(Note: The topology data is empty because there are no assigned volumes, and I discovered that after making this screenshot.)
Is anyone else out there using the Compellent monitoring in their production environment? If so, how many disks/volumes/arrays are you monitoring, and how long is it taking? At this point, I really need a point of comparison with someone who has an environment comparable to ours. The testing equipment we're using has 1 array, 2 pools and 298 disks. Our entire environment has around 1300 disks and 7-8 arrays.
So good to hear someone with the exact same issue that we have!
We've been working with Dell for quite some time regarding DSM issues involving the tiebreaker for LVs. We've only recently migrated this to the virtual appliance version of the DSM 2019.1.
For SRM polling, our 10 SC arrays constantly time out with inconsistent polling returns (topology, hardware, statistics, etc.). We're currently running Orion Platform 2018.4 HF3 with SRM 6.8.0. Like you, I've also set up a standalone 2020 instance with the latest SRM and experienced the same results (I even installed it on the SRM polling engine to eliminate network latency). SC SCOS version currently used is 220.127.116.11.
Your post led me to look into old requests and naturally located this one -> https://support.solarwinds.com/SuccessCenter/s/article/Dell-Compellent-polling-issue?language=en_US However it looks like the new DSM doesn't use 'Pegasus' anymore.
I have noticed that when I poll only 1 array, the results are successful (hardware health, topology, statistics). From a previous case with SolarWinds support, the analysis was that the DSM could not handle the polling requests from SRM.
Please let me know how you get on with further testing or analysis from Dell or SolarWinds. I'd be forever grateful!
I am so glad I'm not going crazy! 😄 We had asked our account rep to find someone with a similar environment so we could do some comparisons, but so far nothing has come up.
If you don't mind saying, how many disks are in your environment? We have found that smaller arrays do complete in a reasonable time, but naturally our main arrays are not small. 🙂
Did you ever try adjusting the number of polling threads? We found a setting that controls the # of simultaneous queries, but adjusting it didn't seem to help much.
Right now, we've been exploring the database looking for expensive queries. However, our DBAs so far have not found anything out of the ordinary. The isolated environment is helping us track down things... and this may provide the hint to go to our Dell representatives with. If the problem is inside the load balancing and how it communicates, that would explain a lot. Though I do wonder why the Microsoft SCVMM doesn't have this same issue.
Definitely not crazy. I'd all but given up on my hopes and dreams of having a consolidated view for performance and capacity management of my storage arrays. However, I recently decided to give it one more shot, and although I've run into the same issues, I think I may have found a workaround.
My environment comprises 20 arrays (a mixed bag of Dell SCs, PowerVaults and some new flashy (no pun, I swear) PureArrays). We have approximately 1280 disks being monitored at present, with a few more arrays yet to be added.
The Dell SCs are the only ones that rely on the DSM for polling. We've never had an issue with the PowerVaults (which actually use the main Orion polling engine as the provider). Pure has SMI-S built into the array itself, and those arrays have not missed a beat since we added them.
Like you, I painfully studied the SRM logs for glimpses of where the root cause of my issues might be. What I noticed was that whenever I added SCs and then inspected the corresponding logs, I'd see polling timeouts.
Over the weekend I had some success. This is the process I followed:
I'm only really concerned about hardware health and capacity, as I can monitor performance via the alternate DSM (virtual appliance), CloudIQ, etc. My SCs are for backup and file storage (clustered file server roles), which makes performance a lesser priority for me. That is why the polling frequency is set the way it is.
I can confirm my largest array takes about 20 minutes to finish polling for performance/hardware health, and all my other arrays are polling fine (no timeouts, etc.).
It definitely looks like the bottleneck is the time the DSM takes to poll the monitored arrays themselves, rather than SRM having a resourcing issue. I've had issues with the tiebreaker service (which uses ports 443/3033) being on the same DSM as the one SRM monitors, hence my decision to add another DSM. Dell has also mentioned that the max latency between the DSM and its arrays should be less than 10ms.
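For anyone wanting to sanity-check that 10ms figure, here is a minimal sketch in Python that times TCP connections and compares the median with a budget. It is only an approximation of network latency (connect time is roughly one round trip), it would need to be run from the DSM host itself, and the host name and port in the usage comment are placeholders, not values from Dell's documentation.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the time in milliseconds taken to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def within_budget(samples_ms, budget_ms=10.0):
    """Return True if the median sample is below the latency budget."""
    return statistics.median(samples_ms) < budget_ms


# Example usage (hypothetical array management address and port):
# samples = [tcp_connect_ms("array-mgmt.example.local", 443) for _ in range(10)]
# print(f"median {statistics.median(samples):.1f} ms, ok={within_budget(samples)}")
```

Using the median rather than the mean keeps a single slow connection from skewing the result.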
Thanks and good luck!
@viveashean one other thing - you mention adding arrays 1 by 1. Does the SRM "add arrays" page let you do that? When I select and scan the DSM, it lists all the arrays assigned to it, and it gives bars/highlights as if you can choose which arrays to monitor through the DSM... but even when I select a single array, all of them get added. If I want to limit things, I have to add them all, then delete those I don't want, which is a pain as it artificially bumps up our SRM node counter, and I don't know if anything gets loaded in the database or not.
You are correct, SRM (well 6.8.0 at least) doesn't let you only select certain arrays for monitoring when it scans a provider. To get around this I only added arrays to the Dell DSM one at a time.
I also found that when using this process, I couldn't use the same provider and 'rescan' for more arrays. I now have duplicate entries of the same provider 😞
The error message when doing this is:
"Required field 'Provider IP Address or Hostname' cannot be empty"
I'm planning the upgrade from 2018.4 HF3 -> 2020.2 next month, hopefully we're still polling after that.
Hey @viveashean !
I wanted to give you a quick update.
Okay, after a few delays we finally got on a call with SolarWinds support/development and Dell Compellent support.
Long story short: Dell admitted the problem is 100% in their software. It just can't handle the kinds of requests that Orion makes using SMI-S -- at least not without significant restructuring or the like. And because SolarWinds is not officially supported, they aren't going to make those changes.
Both teams had the same suggestion: Assign a single collector to an array. This can be a small instance off to the side, and management can still be done by a large, centralized collector, but any more than 3 arrays on a given collector will cause problems. And there's also the question of the # of disks - if that is too high, we'll see problems too.
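As a rough illustration of that suggestion, the sketch below (Python) splits a list of arrays across collectors with a cap on arrays per collector. The array and collector names are placeholders, and the cap of 3 is simply the threshold mentioned on the call. It does not account for disk counts, which were also flagged as a factor.

```python
def assign_collectors(arrays, max_per_collector=3):
    """Split array names into groups of at most max_per_collector each."""
    if max_per_collector < 1:
        raise ValueError("max_per_collector must be at least 1")
    return [
        arrays[i:i + max_per_collector]
        for i in range(0, len(arrays), max_per_collector)
    ]


# Placeholder names standing in for 7 arrays.
arrays = [f"array-{n}" for n in range(1, 8)]

for idx, group in enumerate(assign_collectors(arrays), start=1):
    print(f"collector-{idx}: {', '.join(group)}")
```

Setting max_per_collector=1 gives the strict "single collector per array" layout.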
We're going to try experimenting with that and report back to the support ticket. But in the meantime, we're also going to be reaching out to our account executive and other resources to let them know this is an issue, and will continue to be an issue going forward. Our hope is that SolarWinds and Dell can partner together officially to make a better product for everyone involved. But that doesn't help us short term, so we're going to try the small collector idea.
I'm still digging into this...
The more I dig, the more I wonder if the indexes on the Compellent's database aren't optimized. When trying to browse with the CIM browser, we're seeing anywhere from 30 seconds to 10 minutes to open a resource, depending on what I click. I do have a ticket open with SolarWinds support and with Dell support on this...
I think you are on the right track having a ticket open on both sides. Seems like something isn't quite right there but not certain exactly where. Feel free to ping the SolarWinds ticket number back here or in PM and I'll keep an eye on it.