28 Replies Latest reply on Feb 4, 2016 1:17 PM by rschroeder

    Ideal Solarwinds Deployment

    rspotswood

      I would like some advice on the ideal deployment for my company. We currently have 7 datacenters located throughout the US, but primarily in the NE. Our main DC monitors itself and several of the smaller centers, while 2 of the larger ones have their own independent servers that monitor everything within their networks. We currently use EOC for some of our colo customers so they can log in and see a universal interface.

       

      Our main DC currently monitors 1200 nodes, with 2100 volumes and 3000 interfaces, while the 2 separate instances are a little lower than those totals. Would we benefit, in performance or cost, from moving to a single overall instance of Orion and just deploying remote pollers in the 2 larger DCs?

       

      I know this is a vague question, but I'm trying to get opinions on the matter before I start investing huge amounts of time and energy when others might have useful insight. Thank you!

        • Re: Ideal Solarwinds Deployment
          CourtesyIT

          rspotswood,  This is kind of tricky I know.  The Admin Guide does give information about what the server requirements should be, but unfortunately there is no "Magic" Build for a single Orion Server.  I have worked in several environments over the years and given what you have provided us it sounds like that standard build would work just fine for you.  If you have the ability to "grow" your resources if needed then I would start with the base build specs and go from there.

           

          Since this is a key topic for those out there, if you could keep us informed on your progress we would appreciate it.  Please include Orion Server, SQL Server, and Polling Engine Server spec, along with node counts on all and database sizing and architecture. 

           

          All the Best.

          • Re: Ideal Solarwinds Deployment
            jeff.stewart

            How much bandwidth and what is the latency between the remote locations?  While the Additional Polling Engines support remote-ability there are some factors to consider.  Also, which modules would be running in these remote locations?

            • Re: Ideal Solarwinds Deployment
              agusst

              It really depends on what components you're going to use...

               

              My current environment isn't nearly as large as yours...we have about 600 nodes, 2400 volumes, 1000 app monitors.  The majority of the systems are in 2 main DC's with a handful of systems at remote sites.  We only use Orion and SAM in this environment...we have one Poller and one SQL server.

               

              At my last gig we had about 7000-ish nodes....I can't remember the volumes or app monitors...50% of the systems were in one data center...the remaining 50% were split between about 6-7 other DC's and several remote sites.  We used SAM, NPM, IPAM and a few other components......In this environment we had 3-4 Pollers and two SQL servers in our main data center.

               

              Not sure if this is helpful or not...but it gives you an idea of what I'm doing / have done.

              • Re: Ideal Solarwinds Deployment
                ecklerwr1

                You can't really put an additional poller at a remote location without anything else... it has to talk to the database, and talking to the database over the WAN isn't really practical.  You really need to either stack pollers in one main site and monitor everything remotely or use multiple instances.

                • Re: Ideal Solarwinds Deployment
                  rschroeder

                  I think jeff.stewart may have hit the most important question:  What's the available bandwidth between the locations, and what's the latency?  If you have fat enough pipes, I'd consider building a main NPM instance and using local pollers at each DC that report to the main NPM.

                   

                  I had three separate instances of NPM, each doing two data centers, each pair of DC's about 150 miles from the next pair--the result of several mergers.

                   

                  Local polling, local dedicated databases, all monitored by EOC.  That was NOT a good way to go.

                   

                  EOC degraded our team's ability to respond properly and in a timely manner due to its slow ability to display changes from the remote pollers, and also due to regional boundaries and "turf" responsibilities.  Instead of being one team managing all regions, we ended up split in two--the bigger East region doing East's work, and West doing their own work plus that of the Central region.

                   

                  Our WAN between these sites is a resilient MPLS solution with dual 1G pops at each region.  Central and West are about 7 and 11 ms away from East.

                   

                  We saved money by dropping EOC and converting the NPM instances in West and East into pollers that report to a new main NPM instance in East.  Logging, NTA, NPM, and NCM all are done at the local regional pollers, which update East's NPM.

                   

                  We were able to remove the EOC license for a nice savings, and that simplified it to look like one organization instead of three.  Our service response times and customer satisfaction improved. 

                    • Re: Ideal Solarwinds Deployment
                      ecklerwr1

                      You are reading and writing to the SQL server over the WAN?  On most MPLS WANs I've seen, that wouldn't be possible at anything more than a short distance... there's no way to cheat distance no matter how fat your pipes are; it's physics.  If your sites are just a few miles down the road, maybe, but I have sites across the US and in other places around the world.  There's zero way a poller could write to the SQL server across the WAN at most of these sites.  If you have SQL Server at each site... that's different.  I've yet to see anyone get away with something like that, but maybe you're a first?

                       

                      What I have seen people do is this: convert from multiple instances of NPM reporting up to EOC to one massive instance with stacked pollers.  From my experience, much of your performance comes from fast reads/writes to the database more than anything else.

                       

                      Perhaps you were blessed with close to theoretical latency... most aren't in the real world although things are getting better.

                        • Re: Ideal Solarwinds Deployment
                          RichardLetts

                          I've been tasked with relocating one half of our FOE install  from a location that is rtt min/avg/max/mdev = 0.380/0.496/0.671/0.060 ms

                          to a location that is rtt min/avg/max/mdev = 8.392/8.488/8.670/0.125 ms

                           

                          i.e. about 8 milliseconds further away -- this site would only be active during disasters (when I would also be running the database there) or application maintenance on the main cluster (when the database would remain here)

                          [i.e. not in earthquake range -- the round-trip distance is ~560 miles, which accounts for ~5ms of the delay and the additional network hops for the other 3ms]
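As a rough sanity check on the distance figure above, here is a minimal sketch of the propagation-delay arithmetic. It assumes a signal travels through fiber at about two-thirds of the vacuum speed of light (roughly 200,000 km/s) and treats the ~560 miles as the full round-trip path; the function name and constants are illustrative only and do not come from any SolarWinds tool.

```python
# Rough propagation delay over fiber.
# Assumption: signal speed in fiber is about 2/3 of c, i.e. ~200,000 km/s.
KM_PER_MILE = 1.609344
FIBER_SPEED_KM_PER_S = 200_000.0  # assumed, approximate


def propagation_delay_ms(path_miles: float) -> float:
    """Return the time in milliseconds for a signal to cover path_miles of fiber."""
    path_km = path_miles * KM_PER_MILE
    return path_km / FIBER_SPEED_KM_PER_S * 1000.0


if __name__ == "__main__":
    round_trip_miles = 560  # round-trip distance quoted in the post
    delay = propagation_delay_ms(round_trip_miles)
    print(f"about {delay:.1f} ms of the RTT is due to distance alone")
```

Under those assumptions the distance accounts for roughly 4.5 ms, in line with the ~5 ms estimate; the rest of the measured difference would come from network hops and equipment.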

                           

                          Do you think that is too much delay?

                          • Re: Ideal Solarwinds Deployment
                            stevenstadel

                            We are reading and writing SQL over a WAN across the country. We operate two datacenters 3,500 km apart with a single SQL instance. The link between the two DCs averages 40-45 ms latency on a 660 Mbps MPLS link. All our inter-datacenter traffic also shares this link.

                            Getting rid of EOC in favor of a real single pane of glass was the best thing we could have done administratively.

                             

                            We have 2 additional polling engines in our eastern DC monitoring 750 Nodes / ~8000 elements. The eastern DC connects back to our primary installation in the western DC, which runs 5 additional polling engines monitoring 3000 Nodes / ~30000 elements.

                             

                            We are running SAM, IPAM, SRM, NTA, NCM, NPM.

                             

                            Take a look at "Centralized Deployment with Remote Polling Engines" in

                            http://www.solarwinds.com/documentation/Orion/docs/ScalabilityEngineGuidelines.pdf

                             

                            Latency should be below 300 ms.

                            • Re: Ideal Solarwinds Deployment
                              rschroeder

                              There's a local SQL server in each region--no WAN traffic except for reporting stats to the main NPM instance.  My apologies for not making this clear previously, and thanks for calling it out.

                                • Re: Ideal Solarwinds Deployment
                                  rschroeder

                                  Disregard my comment here, ecklerwr1; I was mistaken.  All my regional pollers report SQL and everything else back to my main East data centers.

                                   

                                  I don't know why it works well here, or why others have had some issues with distance.  My West and Central sites run about 7 ms and 11 ms in round trip latency from/to my East NPM instance.

                                   

                                  It's very satisfactory--even for my team members that sit 300 miles away.

                                   

                                  It wasn't satisfactory when we used EOC.

                                • Re: Ideal Solarwinds Deployment
                                  humejo

                                  I will have to respectfully disagree with you, ecklerwr1.  As a consultant I've worked with over a hundred different Orion environments all across the country, dozens of which have pollers at remote locations writing back to their SQL servers over all kinds of WAN links (MPLS, P2P, VPN, etc.), and have never had any problems related to writing/reading to/from the SQL server over these links.  I've even had a dozen or two clients with worldwide operations that had additional pollers overseas in Europe, Asia, and Australia.  No problems with those either.  No weird missing data issues or anything.  While it is less than ideal, it does work just fine provided the WAN links are stable and the latency stays under the 300 to 400 ms range (Solarwinds says 200 to 300 ms, but provided the links are stable and there is plenty of bandwidth, I've not seen problems with stable 300ms links whatsoever).  Plus, their MSMQ implementation used by the polling engines provides a nice buffer for sustained outages up to roughly one hour.

                                   

                                  Per page 23 of SolarWinds' own Scalability guide, they even recommend a distributed poller environment reporting back to a SQL server at the main site where the Primary server and SQL server reside:

                                  A reliable static connection must be available between each region.

                                  o This connection will be continually transmitting MS SQL Data to the Orion Database Server; it will also communicate with the Primary Orion Server.

                                  o The latency (RTT) between each additional polling engine and the database server should be below 300ms. Degradation may begin around 200ms, depending on your utilization. In general, the remote polling engine is designed to handle connection outages, rather than high latency. The ability to tolerate connection latency is also a function of load. Additional polling engines polling a large number of elements may be potentially less tolerant of latency conditions.
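The thresholds in the excerpt above can be turned into a quick check. This is only a sketch: the 200 ms and 300 ms values are the ones quoted from the guide, while the function name, labels, and sample RTTs (taken from figures mentioned in this thread) are illustrative. It says nothing about poller load, which the guide notes also affects tolerance.

```python
# Thresholds as quoted from the Scalability guide excerpt above.
DEGRADATION_START_MS = 200.0  # "Degradation may begin around 200ms"
MAX_RECOMMENDED_MS = 300.0    # "should be below 300ms"


def classify_rtt(rtt_ms: float) -> str:
    """Classify a poller-to-database round-trip time against the quoted guideline."""
    if rtt_ms < 0:
        raise ValueError("RTT cannot be negative")
    if rtt_ms < DEGRADATION_START_MS:
        return "ok"
    if rtt_ms < MAX_RECOMMENDED_MS:
        return "possible degradation"
    return "above guideline"


if __name__ == "__main__":
    # Sample RTTs (ms) reported by posters in this thread.
    samples = [("7 ms regional link", 7.0),
               ("11 ms regional link", 11.0),
               ("40-45 ms cross-country link", 45.0)]
    for label, rtt in samples:
        print(f"{label}: {classify_rtt(rtt)}")
```

All of the latencies reported in this thread fall well inside the "ok" band by this measure.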

                                    • Re: Ideal Solarwinds Deployment
                                      ecklerwr1

                                      This is all a recent development.  Message Queuing (MSMQ) wasn't even part of the product until fairly recently (well, recently to me, since I've been using it almost since the Orion product was acquired by SW)... I didn't say it couldn't be done, just that until recently it wasn't even an option.  I'm sure many of us here would like to hear more about these 100 people using one SQL server with pollers across WAN links over long distances.  Only a year ago there were issues with MSMQ blowing up our pollers and being way behind... In version 10.x.x it was even worse.  Also, we all know what the guide says... but that's not always how things work.

                                        • Re: Ideal Solarwinds Deployment
                                          humejo

                                          No, I didn't say there were 100 people using one sql server...  I said I've worked with over 100 different Orion environments, dozens of which had remote polling engines pointing back to a single SQL server.  These are all different clients.  Different businesses completely unrelated who hired our company to configure/setup/train on their Solarwinds Orion environment.  Sorry if that wasn't clear.

                                  • Re: Ideal Solarwinds Deployment
                                    CourtesyIT

                                    Does anyone have some diagrams to depict what they have done?  This is a really good conversation and I would like to get some artwork on it.

                                    • Re: Ideal Solarwinds Deployment
                                      CourtesyIT

                                      rschroeder and ecklerwr1,  I am trying to calculate the bandwidth savings of having stacked pollers as opposed to deployed pollers.  Can you describe/define them a little better?  I am sitting on the fence about a solution for a global architecture and trying to come up with some solid pros and cons for each.

                                        • Re: Ideal Solarwinds Deployment
                                          ecklerwr1

                                          Redundancy for some things could be one pro, although I'm not sure it matters that much anymore.  With virtualization of everything (including SQL Server) I don't have many reliability issues any more.  I have been monitoring WAN sites from multiple locations, i.e. with overlap between pollers.  I also still have an EOC license, although I rarely use the old one (I'm hoping the new version based on NPM will be better), but... like you, I am considering bringing all my remotes back here to my main site.  I currently have two stacked pollers at my main site.  (I'd increase this if I scrap the remotes.)

                                           

                                          Let me get this straight is anyone here actually running an additional poller without NPM at a remote site miles away?  I hear Richard is thinking of moving from sub-ms to around 8 ms latency to the SQL server... For as long as I can remember, until very recently, we didn't even want to have anything but a dedicated physical SQL Server right next to NPM.  I've been virtualizing it since before it was supposed to be OK and it's worked fine, but... nothing in NPM or its modules functions well at all without primo SQL server connectivity, i.e. it will never run any better than its access to the database, period.  Also, initial attempts to virtualize SQL were a disaster for some.  This has always been the weak point.  I think moving NTA to its own NoSQL database was another step in improving this.

                                           

                                          I've always had sub ms access to SQL...  I've never considered pushing this because I always assumed it would greatly hurt performance but... now I want to know if anyone has successfully even tried this?  Everything the main NPM server does is all about the database so...

                                            • Re: Ideal Solarwinds Deployment
                                              jeff.stewart

                                              "Let me get this straight is anyone here actually running an additional poller without NPM at a remote site miles away?" Yes, we have customers doing this today and have added some functionality to the APEs to help support this model.  Additionally, if you look at the WWWO for NPM you'll see a "Remote APE" listed in the ongoing section.  This should continue to help the support of APEs that are not in the same location as the DB.

                                            • Re: Ideal Solarwinds Deployment
                                              rschroeder

                                              As someone previously caught, I have local SQL servers in each region.  Our regional pollers simply report stats & make beautiful graphs with the main NPM solution, along with sending fast alerts.

                                              • Re: Ideal Solarwinds Deployment
                                                rschroeder

                                                CourtesyIT

                                                 

                                                I was mistaken about the SQL servers--all my remote pollers send their SQL over the WAN to my East NPM deployment--and it works quite well.

                                                 

                                                Orion pollers aren't killing our WAN with bandwidth or alerts or Netflow--not in any way.

                                                 

                                                 

                                                There is no problem with the amount of WAN utilization by Orion--the three regions are connected by dual-path resilient 1G fiber, for an effective 2G in each direction.  Hitless failovers, too, I'm happy to say, so the WAN provider or my team can take down one of the two routers at each site for replacement or upgrade maintenance and all we've done is temporarily decreased our resilience and throughput.

                                                • Re: Ideal Solarwinds Deployment
                                                  rschroeder

                                                  For what it's worth, the bulk of my monitored elements are in my East region, and I have two SLX (10,000 element) licenses active there on two pollers.  Each one runs somewhat over 10,000 elements, and many are being polled more frequently than 120 seconds.

                                                   

                                                  My Central poller is doing about a third of that:

                                                   

                                                  And my West poller is doing about 3/4 the number of elements of one of the East pollers:

                                                   

                                                  My main NPM instance lives with hundreds of other servers inside a UCS chassis; the UCS port can handle 10G and it's only doing 26M.  A tiny drop in the bucket of a 1 Gb WAN pipe, for displaying the monitoring of nearly 35,000 nodes, including everything across the WAN.
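To put the "drop in the bucket" in numbers, here is a small worked calculation. It assumes "26M" means roughly 26 Mbps and that the links are 1,000 Mbps (1 Gb) and 10,000 Mbps (10G); the function name is illustrative.

```python
def utilization_percent(observed_mbps: float, capacity_mbps: float) -> float:
    """Return observed traffic as a percentage of link capacity."""
    if capacity_mbps <= 0:
        raise ValueError("capacity must be positive")
    return observed_mbps / capacity_mbps * 100.0


if __name__ == "__main__":
    observed = 26.0  # assumed: "26M" read as 26 Mbps
    for name, capacity in [("1 Gb WAN pipe", 1_000.0), ("10G UCS port", 10_000.0)]:
        pct = utilization_percent(observed, capacity)
        print(f"{name}: {pct:.2f}% utilized")
```

On those assumptions the monitoring traffic comes to about 2.6% of a single 1 Gb WAN path and about 0.26% of the 10G port.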

                                                    • Re: Ideal Solarwinds Deployment
                                                      CourtesyIT

                                                      So rschroeder, I am guessing the amount of Management traffic is not affecting your production traffic over your WAN links?

                                                        • Re: Ideal Solarwinds Deployment
                                                          humejo

                                                          In my 6 years of working with Orion I have never seen Orion monitoring traffic have any significant effect on a network.  I've had plenty of people claim it at some point or another, but aside from the rare SNMP bug on a Cisco switch/router/firewall here and there causing issues (and that is a Cisco problem, not an Orion problem), I've never seen anyone prove it.  I have had a couple of occasions where clients' managers have sworn up and down that Orion was slowing down the network.  So, I would pop an eval version of NTA into their environment (unless they already had it), set up the core switches and the edge routers at the suspected sites with Netflow collection back to the Orion server, filter the view by SNMP and/or WMI traffic, and then have to go turn the Top Talker 95th % setting up to 100% because the SNMP/WMI traffic was so low that it wouldn't even show up in NTA at all; it was in the bottom 5%, so it was getting discarded.  Yeah, people always think that monitoring huge amounts of nodes/interfaces/volumes/applications/flows/syslogs/traps is going to cause problems, but they don't understand how little these monitoring technologies transmit.  All of these technologies have been around for a very long time, back when network speeds and computer speeds were considerably lower, and even then they caused no issues.

                                                          • Re: Ideal Solarwinds Deployment
                                                            rschroeder

                                                            That is true:  our Orion Management traffic is not impacting our WAN.  Neither is the Orion SQL traffic, which is all sent to my region's data centers.

                                                             

                                                            Perhaps because our WAN is comprised of dual-1G MPLS circuits, and the vast majority of our access traffic is Citrix.  However, there are many very large files passing over that WAN (e.g.: digital medical images like MRI's and PACS and Radiology and Cardiology--perhaps as many as 4000 files may comprise a single study, and those files are a few Gigabytes each), and we're not yet experiencing problems with throughput or latency.  We'll probably up the WAN to dual 10G pipes in the next few years as we aggregate more of our legacy L2 cloud into the MPLS cloud, and that might bring a need for more throughput.

                                                              • Re: Ideal Solarwinds Deployment
                                                                ecklerwr1

                                                                I used to have a customer called Radiologix years ago that did the same thing... those NMRI images are HUGE!  You can never get a big enough pipe for that stuff... back in the day the doctors all had 2B+D ISDN  at their homes to try and deal with them... boy times have changed... they can probably have fiber at the house to do it now.

                                                            • Re: Ideal Solarwinds Deployment
                                                              ecklerwr1

                                                              That's actually pretty impressive Ricky... also you should be able to see the actual monitoring traffic from any of your pollers as each is a user experience monitor in its own right... at least I think it is if you have NTA.  Just look at the SNMP and whatever else you use across the WAN... If it's SAM and a lot of WMI, that's been a little more difficult for me to wrap my head around... At least SAM has some agents now which can reduce all of those additional ports WMI wants to use if you're using it across your WAN.  That's a good question: how do you handle configuring your WMI for SAM?  Defaults or custom configuration?  Anyone using the agents at remote sites?

                                                                • Re: Ideal Solarwinds Deployment
                                                                  rschroeder

                                                                  QoE for our servers is done via WMI using the defaults.  I get some beautiful QoE pages.

                                                                   

                                                                  We've drastically reduced the number of servers in the Central and West regions' four data centers, and moved most services into our two new resilient data centers in East.

                                                                   

                                                                  That puts most of the servers in the same room as the UCS chassis holding my NPM solution, so distance and bandwidth aren't the factor that they would be if servers on the East Coast and West Coast were being monitored with QoE by the same NPM poller, over a thousand miles away.

                                                                   

                                                                  While WMI and QoE are in play, it happens that I have an NPM poller in every data center.  So even the distant data centers in Central and West have a local NPM poller monitoring them.  But the pollers all report back to East.  And my team accesses the East NPM via web from every region.  Distance and location don't matter for us, except for the difference between sub-millisecond latency to NPM while in East and up to 11 ms latency from other regions.

                                                            • Re: Ideal Solarwinds Deployment
                                                              CourtesyIT

                                                              Ok folks, here is one for you.

                                                               

                                                              I like rschroeder's idea and case study.  I have engineers all over the planet.  5 people / 50 nodes in Asia, 5 people / 100 nodes in Hawaii, 15 people / 450 nodes on the East Coast, and 5 people / 125 nodes in Europe.

                                                               

                                                              I wanted to ask about additional webservers for each location.