We do have customers running far beyond 100k elements, and the limiting factor is almost always SQL server performance. Specifically disk I/O. With a beefy enough SQL backend, and perhaps a tuning of polling intervals, this number could be well extended. adatole may have some further experience to share regarding scalability.
rob.hock is absolutely correct about the 100k element limit being a function of your database, not of any internal "lock" from the software.
With regard to your original question - when you say "multiple cores" do you just mean a server that has multiple CPUs (or a virtual server with multiple CPUs/cores assigned to it) or something else?
The reasonable limit for elements per poller is ABOUT 10,000 - the SolarWinds specs say 8,000, but I've pushed it as high as 12,000. The devil is in the details - what kinds of elements, how often they are polled, what kinds of machines those elements are on (an old Cisco 6509 is not going to give up its data as fast as a Nexus 2000), etc.
I know that adds up to a big pile of "it depends" but unfortunately... it depends.
It also depends on what ELSE you are running - it is just NPM, or do you also have SAM, or NTA (netflow)? Are you using (or planning to use) the new #DPI features in NPM 11 (hint: YES! YES YOU ARE!!). Do you have 800 alert triggers? Those are each queries, and will have an impact on your primary poller. Do you have 2,000 users logged into the web console, each of which is sitting on their favorite 3,000 row report? That's gonna leave a mark.
You get the point.
At the end of the day, here are the limits and options you have:
- About 8,000 to 10,000 elements per poller
- More CPUs are good. More RAM is gooder
- Figure 8 CPUs and 12 GB RAM per polling engine. Maybe more for the primary.
- Faster database is goodest of all
- Hardware, not virtualized
- For 100,000 elements, you should think around 12-24 effective CPUs (that can be 4 procs with 6 cores)
- For that same 100K elements, think around 96-128 GB RAM
- RAID 10. Say it with me again: RAID 10.
- Not RAID 5. No, really.
- Watch your pollers. When they get too busy, add more. Yeah, it's an expense. Not as expensive as missing a critical alert.
Hope that helps.
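To make the rules of thumb above concrete, here is a minimal back-of-the-envelope sizing sketch. The constants are this thread's guidance (about 10,000 elements per poller, 8 CPUs and 12 GB RAM per polling engine), not official SolarWinds figures, and real requirements will vary with element types and polling intervals.

```python
import math

# Rules of thumb from the discussion above - assumptions, not official specs.
ELEMENTS_PER_POLLER = 10_000   # "about 8,000 to 10,000 elements per poller"
CPU_PER_POLLER = 8
RAM_GB_PER_POLLER = 12

def size_deployment(total_elements: int) -> dict:
    """Return a rough poller count and aggregate resource estimate."""
    pollers = math.ceil(total_elements / ELEMENTS_PER_POLLER)
    return {
        "pollers": pollers,
        "cpus": pollers * CPU_PER_POLLER,
        "ram_gb": pollers * RAM_GB_PER_POLLER,
    }

# 100,000 elements -> 10 polling engines, 80 CPUs, 120 GB RAM in total
print(size_deployment(100_000))
```

This only sizes the polling tier; as the post notes, the database server is sized separately and is usually the real bottleneck.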
Leon is right, for sure.
I would push for a DBA who can help you with the SQL side.
As for the SQL hardware, if money were no object, look at that kind of IO card.
The element count limit per polling engine has been raised to 12K; the old limit was 10K. The old overall limit of 100K was for a multi-tenant deployment consisting of 10 polling engines (10,000 x 10 = 100,000). The limit now is 120,000 elements in total. Once the element count is exceeded, polling intervals will automatically be throttled and you'll likely end up with a polling interval of 200 seconds or more instead of the default 120.
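The throttling behaviour described above can be sketched with simple arithmetic. This assumes the interval is stretched linearly with the overload factor once the per-engine limit is exceeded - a simplification for illustration, not the actual job engine algorithm - but it reproduces the observed jump from 120 to about 200 seconds:

```python
DEFAULT_INTERVAL_S = 120   # default node polling interval
ENGINE_LIMIT = 12_000      # current per-engine element limit

def effective_interval(elements: int) -> float:
    """Estimate the throttled polling interval for an overloaded engine.
    Assumes linear scaling past the limit (an illustrative assumption)."""
    if elements <= ENGINE_LIMIT:
        return DEFAULT_INTERVAL_S
    return DEFAULT_INTERVAL_S * elements / ENGINE_LIMIT

# 20,000 elements on a 12,000-element engine -> 200 s instead of 120 s
print(effective_interval(20_000))
```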
In my experience, the solarwinds job engine has a hard time keeping up with anything above 10k elements. That is however my case and we use all the modules except for IPAM. Your SQL server should be built to take the load.
Deltona - so just for clarification: once the 120K limit is reached for a primary polling engine, will polling automatically be throttled, or will it only be throttled if polling rates can't keep up with the element load?
No, it's still only 12,000 elements per Primary Polling Engine (the main engine). The PPE cannot handle more than 12,000 elements.
As for the second question, you are right in both cases. It depends on the polling load and/or the number of monitored elements. If the polling load is too high or the element count is too high, it will throttle the polling interval.
We are NOT going to use the actual primary polling engine for node polling, meaning we are not going to assign nodes to the primary polling engine. All our polling will be assigned to additional polling engines of which we will have 8 of them. Our main concern is when will we hit the wall and have to deploy another primary polling engine instance to support the total number of elements we are monitoring (i.e. the 120K limit per primary). I know from past experience that sometime these "limits" are more like recommendations and the software can perform and scale much farther than the recommended limits. I was just wondering if other large implementations have seen these limits be realistic or if they have been able to go beyond the limits. I know its a delicate balancing act that you have to perform to make sure that you aren't affecting the overall health of your monitoring environment. Unfortunately there are so many variables that play into this discussion that there isn't going to be a "hard and fast" rule.
Just for the record, the primary polling engine alone can poll as many elements as an additional polling engine, i.e. 12,000 elements. Therefore, if you're only going to be using 8 of the 9 polling engines, then you'll end up with a total polling capacity of 96,000 elements, not 120,000. You will only be able to reach 120,000 or more if you:
1. Have Additional Polling Engines (more than 8, or 9 if you use the Primary Polling Engine to poll as well)
2. Tweak your polling intervals to allow polling of more than 96,000 elements
Remember that the Primary Polling Engine, a.k.a. the PPE, a.k.a. the main Orion server, cannot poll more than 12,000 elements with default polling settings.
There are ways of quantifying how many active polling engines you'll need to do the job but in general, and like with everything else IT related, prepare for more and build to scale before you begin.
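One way of quantifying it, following the arithmetic in the posts above: total capacity is engines times the 12,000-element limit, and stretching polling intervals buys proportionally more headroom. The linear interval scaling here is an illustrative assumption, not a guarantee from SolarWinds:

```python
PER_ENGINE_LIMIT = 12_000
DEFAULT_INTERVAL_S = 120

def capacity(engines: int, interval_s: int = DEFAULT_INTERVAL_S) -> int:
    """Total elements a deployment can poll, assuming capacity scales
    linearly with the polling interval (a simplifying assumption)."""
    return int(engines * PER_ENGINE_LIMIT * interval_s / DEFAULT_INTERVAL_S)

print(capacity(8))        # 8 APEs at default intervals -> 96,000
print(capacity(10))       # PPE + 9 APEs -> the 120,000 total limit
print(capacity(8, 150))   # 8 APEs with intervals stretched to 150 s
```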
Great information guys...we are in the process of implementing as we speak. When I was talking about 'cores', I was referring to primary polling engines. We are planning on deploying all modules (NPM, SAM, WPM, NCM, VNQM, NTA, UDT). I have successfully pushed a polling engine beyond the 10-12K limit without adverse effects. My previous engagement was a much smaller infrastructure, so I am going from a 2,000+ node installation to a 6,500+ node installation. As it stands, I don't think we are going to need more than one primary polling engine since we are currently only licensed for 8 additional polling engines - unless we can push these polling engines beyond the 12K limit. Our plan is to virtualize all the Orion VMs, including the database instance. We have some pretty robust infrastructure to back our VMware environment, so I have confidence that we can support the load. As stated, it does sound like it "just depends" on a lot of factors. We are replacing an existing toolset in parallel, so we can see how the health of the SolarWinds environment plays out as we deploy our nodes, and we will adjust polling engines and primary polling engines as necessary. Thanks everyone.
Save yourself the trouble and go for a physical SQL server. No virtual infrastructure, no matter how beefed up, will be able to deliver the latency and IOPS the SQL server requires.
I think we are going to have to prove that the virtualized database instance isn't going to cut it before we will be able to justify to management that we need to purchase an expensive physical host. I completely agree with you...my years of monitoring experience tell me that we need to go physical on the database instance.
Well, here is a little bit to help you, perhaps. Keeping in mind what adatole points out - Polling load is highly dependent upon What you are polling, and how.
These are all values from one of the currently running SolarWinds installs. Lots of networking gear (route switch firewall) with application monitors,
Medium loaded poller - 8k IOPS
Heavier loaded poller - 15k IOPS.
I consider the polling load on the heavier one to be a touch high, really. I will be adding another poller shortly. The rest of the pollers vary in between those two. So, if you are going to load up a poller with the max elements, you will see around 15k IOPS. If you are doing two medium-loaded pollers: 8k x 2 = 16k IOPS.
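The per-poller figures above can be rolled up into a rough estimate of the IOPS your database tier needs to absorb. These are the numbers observed in that one install (8k IOPS for a medium-loaded poller, 15k for a heavy one); your workload will differ with element types and polling intervals:

```python
# Observed per-poller IOPS from the install described above - a single
# data point, not a benchmark.
IOPS = {"medium": 8_000, "heavy": 15_000}

def total_db_iops(pollers: list[str]) -> int:
    """Sum the observed per-poller IOPS figures for a set of pollers."""
    return sum(IOPS[p] for p in pollers)

print(total_db_iops(["heavy"]))             # one maxed-out poller
print(total_db_iops(["medium", "medium"]))  # two medium-loaded pollers
```

Comparing that total against what your SAN or local array can actually sustain is the quickest sanity check before committing to a virtualized database.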
Reach out to jbiggley and see if he will do a reference call for you and your management team. (Josh: make sure you ask for Thwack points for that!)
While you can certainly do a virtual SQL server, it's going to cave under the pressure at some point. And unfortunately the failure will not be "oh look the server crashed". It's going to be failed maintenance jobs, slow to load screens, missed polls on random servers (which you won't notice until you need THAT 15 minutes of data on THAT node), missed alerts, and more.
I'm telling you now that, especially in a 6500 node installation, a virtualized SQL box will spell your eventual doom.
I both agree and disagree at the same time.
I disagree that you cannot virtualize a high-performance SQL server. It can most certainly be done. (This is from back in the VMware 5.0 days and they were touting IOPS > 1 million, 32 vCPUs and 1TB of memory)
I agree that you should not virtualize high-performance SQL servers. Why? The time, effort and expense to build, tune and maintain a VMware environment that would support a virtualized SQL server just doesn't seem like it is worth the effort to me. By all means, attach your physical server to your crazy-fast SAN. Get your DB team involved to do all sorts of fancy SQL clustering, etc. but do it on physical hardware. Unless you have a very robust VMware environment you shouldn't even attempt to virtualize a large SQL instance for your NPM environment.
Mr J is 100% right.
A VM host where one guest takes 100% of the resources (CPU/IO/RAM) doesn't make sense - at that point there's no reason to run it as a VM.
Yes, you can run a very big SQL server as a VM, and the VM layer is usually not the problem.
It's usually the SAN, i.e. the IO.
If your SAN can take the load, no problem :-)
If not, have the SAN people run the DB on a RAM disk or SSD.
Would it help the IO if you span the DB on multiple disks if you run a VM environment?
IOPS are always an issue in high-performance DBs, whether physical or virtualized. You most definitely need to run your DB on multiple spindles (disks) and always, always run the DB on a RAID 10 array. SolarWinds published a document and, though it was published two years ago, the Best Practices for Managing the Orion Platform Database is still the go-to guide for building your DB.
From pg 2 of that guide:
At a minimum you should make sure that the LUN is on a RAID 10 array. One common mistake is to place the storage on a Storage Area Network (SAN) that is RAID 5 or RAID 6 configured or has performance lower than a DAS subsystem. RAID 5 and RAID 6 configurations are not supported and should never be used for Orion MS SQL storage.
Note the NEVER use RAID 5 or RAID 6 configurations for MS SQL storage.
Page 3 also lays out how the DB server should be configured with a separation of application, DB, log and temp files for a high performance solution.
This seems to be a theme this week, but read the guide. It really does an excellent job of detailing what needs to be done in a large DB environment. There will be those within your org who believe SolarWinds is simply throwing resources at the environment to protect bad code. I can assure you that this is not the case. Follow the recommendations and you will never have to ask yourself "Is it the DB?" (Of course, if you did, you could always run SAM and use AppInsight for SQL to root out the problems!)
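The RAID 10 insistence comes down to write penalty, which matters because a monitoring database is write-heavy. The sketch below uses the standard RAID write-penalty factors (2 for RAID 10, 4 for RAID 5, 6 for RAID 6) - general RAID arithmetic, not figures from the SolarWinds guide:

```python
# Standard RAID write-penalty factors: each logical write costs this
# many physical IOs on the array.
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def effective_write_iops(raw_iops: int, raid: str) -> int:
    """Usable write IOPS an array can deliver from its raw disk IOPS."""
    return raw_iops // WRITE_PENALTY[raid]

raw = 12 * 200  # e.g. 12 spindles at ~200 IOPS each = 2,400 raw IOPS
print(effective_write_iops(raw, "raid10"))  # twice the write throughput...
print(effective_write_iops(raw, "raid5"))   # ...of RAID 5 on the same disks
print(effective_write_iops(raw, "raid6"))
```

Same spindle count, half (or a third) of the usable write throughput on RAID 5/6 - which is why the guide rules them out for Orion's SQL storage.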
We have deployed two primary polling engines, each with an additional poller, and one with an additional web front end as well. The primary reason was total segmentation and control. We have a mixed breed of physical servers and virtuals; however, the DB is preferred as physical.
We currently monitor just over the 30K elements in total, so we had to expand. The next step is for us to tie the two primary instances together with EOC to get a consolidated view across all our client environments. We also might be looking at several additional polling engines in the next year or two!
Your design will depend on a few things:
1) Total number of elements you monitor (regardless of license level) - each polling engine can handle around 10K - 11K elements (element is a node, interface or volume) - depending of course on your polling interval and CPU and memory requirements
2) Each primary polling engine can have up to 9 additional polling engines below it with a max element count of around 100K
3) Your design should take into consideration the geography of your devices and polling details - for instance, if you want latency stats between your Head Office and remote branches, the polling engine should reside in your Head Office so that the latency reflects the local path(s).
4) It is not recommended to have an additional polling engine on a remote site with your DB at a different site - rather opt then for two primary pollers and use EOC to merge the two together. Your stats/reports will be more accurate and your design will be certified/supported by Solarwinds. They do not recommend the additional remote poller solution.
5) If it is a hosted (MSP) environment, additional network issues such as NAT/Overlapping IP addresses are a real factor. Having separate primary engines overcomes the need for firewalls or NAT which could get very complicated when troubleshooting.
6) With EOC you should cater for at least 1 Mbps bandwidth between the EOC instance and the Primary polling engines
Hope this info helps you!
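Pulling the checklist's numbers together, here is a hedged sketch of how many polling engines and primary Orion instances a given element count implies. The constants are the rules of thumb from this thread (10K-11K elements per engine, up to 9 additional engines per primary, ~100K elements per primary), not hard product limits:

```python
import math

# Rules of thumb from the checklist above - assumptions, not hard limits.
ELEMENTS_PER_ENGINE = 10_000    # "around 10K - 11K elements" per engine
ENGINES_PER_PRIMARY = 10        # 1 primary + up to 9 additional engines
ELEMENTS_PER_PRIMARY = 100_000  # "max element count of around 100K"

def instances_needed(total_elements: int) -> tuple[int, int]:
    """Return (polling engines, primary Orion instances) required."""
    engines = math.ceil(total_elements / ELEMENTS_PER_ENGINE)
    primaries = max(
        math.ceil(engines / ENGINES_PER_PRIMARY),
        math.ceil(total_elements / ELEMENTS_PER_PRIMARY),
    )
    return engines, primaries

print(instances_needed(30_000))   # the ~30K elements mentioned above
print(instances_needed(250_000))  # multiple primaries, merged via EOC
```

Anything above one primary instance is where EOC comes in to give the consolidated view across instances.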
You might find the following links interesting: