20 Replies Latest reply on Aug 6, 2016 7:21 AM by HerrDoktor

    FoE - Help us Help You

    aLTeReGo

      We're plugging away at some improvements we have planned for FoE and would love to get a better understanding of the environment in which you're running FoE today. If you own SolarWinds Failover Engine, we would really appreciate it if you'd complete the following survey telling us a little about your Orion and FoE deployment. There's a total of 17 questions, all of which should be super simple and easy to answer. It shouldn't take much longer than 5 minutes from beginning to end and it's perfectly painless. This will help us build better products, which in turn means happier customers like you.

       

        • Re: FoE - Help us Help You
          michael stump

          I didn't even know what FoE was, so I think you have my answer already.

          • Re: FoE - Help us Help You
            contactjt

            I was hoping this survey would have more feedback questions, because I can tell you my feedback is very negative.

              • Re: FoE - Help us Help You
                aLTeReGo

                I have recently assumed product management responsibilities for FoE, and this survey and others like it are designed to help us determine ways in which we can improve the product. I have been reaching out, and will continue to reach out, to the community to solicit feedback on what we're working on and to better understand your existing pain points with the product. We're currently in the design phase for the next release, so nothing is yet set in stone or imminently on the horizon. We do however know that FoE has a healthy amount of room for improvement.

              • Re: FoE - Help us Help You
                RichardLetts

                Survey answered -- I would be more than happy to talk with you about how we have this deployed.

                 

                Here are my thoughts on the whole failover engine using NeverFail: this is a sledgehammer to crack a grape.

                It's really good for file-based replication, e.g. voicemail systems where each message is stored in the filesystem. It's not so good for application failover.

                 

                So, taking a deep breath, here are my thoughts, which go deeper into product architecture than messing around trying to get NeverFail to be the solution:

                Where I am using the failover engine to provide failover for an additional poller, this could instead be provided by a passive polling engine. Each node could then have a primary and a secondary polling engine. If the primary polling engine doesn't poll a node within a certain time, the secondary poller does it automatically. You might even think about providing N+1 redundancy schemes for those people with a single site who only need to cope with a single server failure.
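To make the primary/secondary idea above concrete, here is a minimal Python sketch of the takeover rule. Everything in it is hypothetical: `NodeAssignment`, `should_poll`, and the 120-second default are illustrative names and values, not part of Orion or NeverFail.

```python
from dataclasses import dataclass


@dataclass
class NodeAssignment:
    """Hypothetical record of which engines may poll a monitored node."""
    primary: str
    secondary: str
    last_polled: float  # epoch seconds of the last successful poll by any engine


def should_poll(assignment: NodeAssignment, engine: str, now: float,
                takeover_after: float = 120.0) -> bool:
    """Return True if `engine` should poll the node at time `now`."""
    if engine == assignment.primary:
        # The primary always attempts its own nodes.
        return True
    if engine == assignment.secondary:
        # The secondary steps in only once the node has gone unpolled too long.
        return now - assignment.last_polled > takeover_after
    # Engines that are not assigned to this node never poll it.
    return False
```

With this rule the secondary needs no heartbeat from the primary; it only needs to see the shared "last polled" timestamp, which in practice would presumably live in the database.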

                 

                Eliminate the difference between the application server and the additional poller so all polling engines regardless of where they are running are equivalent.

                So, a legacy application server = 1 web server + 1 polling engine.

                 

                If an install needs more web users or redundancy of the web UI then allow N additional web servers.

                 

                Provide poller upgrade packages that can be deployed from a central location so we can upgrade the whole infrastructure from one point, rather than like now, where I have to do at least two installs per package (admittedly I am now so fast at this that I can completely reinstall the NPM app server and four additional pollers on all primary and secondary servers before I get off hold with tech support).

                 

                Load all of the configuration files (except, obviously, the database connection information) into the database so there are no files that need to be replicated anywhere -- when a node makes contact with the database on initialization it exports the config blobs so the components can start up.
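A rough Python sketch of that "config lives in the database" idea follows. It uses SQLite purely as a stand-in for the Orion database, and the table name `config_blobs` and both function names are assumptions made up for this illustration.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List


def store_config(db: sqlite3.Connection, name: str, content: bytes) -> None:
    """Save one configuration file as a blob, replacing any older copy."""
    db.execute("CREATE TABLE IF NOT EXISTS config_blobs "
               "(name TEXT PRIMARY KEY, content BLOB)")
    db.execute("INSERT OR REPLACE INTO config_blobs (name, content) VALUES (?, ?)",
               (name, content))
    db.commit()


def export_configs(db: sqlite3.Connection, target_dir: Path) -> List[str]:
    """On node start-up, write every stored blob out as a local file."""
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    rows = db.execute("SELECT name, content FROM config_blobs ORDER BY name")
    for name, content in rows:
        (target_dir / name).write_bytes(content)
        written.append(name)
    return written


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    store_config(db, "poller.cfg", b"interval=60\n")
    with tempfile.TemporaryDirectory() as tmp:
        print(export_configs(db, Path(tmp)))
```

The only file a node would still need locally is the one holding the database connection information, which matches the exception noted above.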

                 

                Provide a Kiwi Syslog-like node that centralizes all of the logfiles from all of the nodes in the Orion cluster into one place instead of on every poller.

                 

                This would give me an install that is significantly easier to manage, and the complexity of the install would scale at less than O(2n).

                 

                /RjL

                • Re: FoE - Help us Help You
                  byrona

                  We received FoE licensing in replacement for our old... very old SolarWinds standby engine product.  I never used FoE and let it expire because it was an overly complicated product that ultimately caused the environment to be more fragile, not more resistant as designed.  I found that the same task could more easily be accomplished in other ways.  Considering that the pollers are just app servers, restoring them from a backup isn't difficult so long as you don't have a really restrictive RTO.

                  • Re: FoE - Help us Help You
                    jay.perry

                    OK, I'll take it!

                    • Re: FoE - Help us Help You
                      Hari Pala

                      Waiting for informative Q&A

                      • Re: FoE - Help us Help You
                        rtharp@snl.com

                        Yes, allowing some feedback would have been helpful.  I answered yes just so I could complete the survey, but we abandoned FoE for some of the same reasons others have specified (instability and an incomplete solution).  We went with Hyper-V replication.  No shared storage or clustering requirements.  Hyper-V automatically inserts alternate IP addresses that match the remote site's network.  The fail-over process is manual right now but could be automated with PowerShell.  Fail-over time is about 5-7 minutes to have everything polling again.

                          • Re: FoE - Help us Help You
                            aLTeReGo

                            Any and all additional feedback about FoE you would like to provide beyond the questions in the survey is more than welcome here in this thread. The only way the product can improve is by soliciting feedback from those who are currently using the product or have used it in the past. We find that feedback invaluable in helping us shape the future of the product. So by all means, please tell us what you like and what you don't like about the product, and please be completely open and honest.

                              • Re: FoE - Help us Help You
                                contactjt

                                Well if you wanted details...

                                 

                                I'm sure my deployment is one of the larger ones. We aim to provide a 24x7 service; FOE was used as recommended for extra resiliency.

                                 

                                 

                                First a quick description of my architecture:

                                3 different types of networks (with pollers in each) with private subnets

                                 

                                Solarwinds Applications:

                                Orion Platform 2013.2.0, IPAM 3.1.1, NCM 7.1.1, NPM 10.6, NTA 3.10.0, IVIM 1.8.1, VNQM 4.0.1

                                 

                                Dual additional Webservers

                                 

                                 

                                ===========================

                                Concerns with FOE:

                                Application upgrades have to be done while the server is in the 'Active' state. Having one FOE server isn't a big deal. Having 9 pollers and NPM with FOE and Additional webservers makes upgrading a horrible task. You'll notice many of my applications are out of date because of the burden to update.

                                 

                                Upgrading FOE itself isn't straightforward. FOE releases are also slow to catch up with new versions of Solarwinds Applications.

                                 

                                NeverFail IP Filters often have trouble. Perhaps this is a necessary evil; however, it would seem like the filtered state could be easier to see, and easier to remove than by hunting down some command-line tools.

                                 

                                NeverFail failure conditions seem to cause split-brain often. It would seem like I should be able to choose Primary to be the default. The system should realize it's in split-brain and let me know.

                                 

                                Configuration is cumbersome. Okay, say we look again at the 9-poller-with-NPM scenario. I have to go check and uncheck various alarms and configure switchover settings on each system. It's easy to mess them up. Maybe provide a method to apply the alarming settings and options to all pollers.

                                 

                                Speaking of alarms, the secondary box, since it's off the network, can't send email notifications, even though it can have issues too.

                                 

                                Fail-over is SLOW. On an additional poller the only reason we need FOE is the IP and service changeovers. Why not have a fast-switching mode for light duty like a poller? Maybe still sync, but don't let it stop you from switching, or do a file check nightly instead of constant sync.
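As an illustration of the "nightly file check instead of constant sync" suggestion, here is a minimal Python sketch that fingerprints two directory trees and reports what has drifted. It is not how NeverFail works; `tree_digests` and `drifted_files` are hypothetical helpers, and it assumes each server's install directory can be read and hashed independently.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def drifted_files(primary: Dict[str, str], secondary: Dict[str, str]) -> List[str]:
    """Files that exist on only one side or whose contents differ."""
    names = primary.keys() | secondary.keys()
    return sorted(n for n in names if primary.get(n) != secondary.get(n))
```

A scheduled job could run `tree_digests` on each server, compare the two results with `drifted_files`, and copy or alert on only the files listed, so a switchover never has to wait for a sync to finish.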

                                 

                                File Filters are confusing. It appears that hot fixes and patches don't sync? Why not?

                                 

                                Known FOE issues lists are a mile long. Any chance these will be solved?

                                 

                                NeverFail APIs and command lists are nearly impossible to find.

                                 

                                That's just some of the issues.

                                 

                                ===========================

                                Quick fixes:

                                The whole upgrading process is so fubar. At least allow parallel Solarwinds upgrades. The ability to upgrade the Solarwinds applications while in standby would be even better.

                                 

                                Perhaps a dashboard app on the PC to see the local PC status: filters, replication, channel, and quick buttons. Something more for status than the GUI.

                                 

                                Make Solarwinds connect to and display NeverFail stats. I know SAM could do it, but assuming you don't have SAM, Solarwinds should at least be able to tell you how its own FOE systems are working.

                                 

                                Make the common FOE configurations easier. Like a "configure all" or "apply to all", or even just make it clear how to copy these config files manually.

                                 

                                Secondary should be able to send notifications via the channel through the primary.

                                 

                                Improve fail-over options for pollers. These almost have static file systems and don't need replication.

                                 

                                Make Neverfail sync hotfixes and patches so they don't need to be applied to each poller.

                                 

                                Fix the known issues... Licensing issues seem to be a big one.

                                 

                                Provide more information on the APIs or allow us access to NeverFail systems through the OrionSDK API.

                                 

                                Additional Web servers should provide some usage stats.

                                 

                                 

                                ===========================

                                Dreams:

                                NPM redundancy should not be active/standby. It should be active/active. Allow me to have two working NPM cores. The web servers and pollers should randomly pick from the NPM cores. Then if one fails they switch over so everything runs on the primary NPM. Obviously the two NPMs might need to do heavy syncing here.

                                 

                                Additional Pollers should be able to pull down the required polling elements (NCM/Syslog/NTA/etc.) as needed when they connect to the NPM system. Include extras like MIB files and other configurations. Allow me to push an update from the NPM to each poller or just make it happen on a reboot.

                                 

                                See RichardLetts' post. The idea of N+1 redundancy would be great. Or even better, let me build a poller group, so that when I assign a node I assign it to a specific poller group. The group shares the work and keeps stats on load, allowing me to determine how many pollers I need for my level of comfort on redundancy.
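The poller-group sizing question can be sketched with a little arithmetic. The Python below is only an illustration: the function names are made up, and "load" and "capacity" are in whatever unit you measure polling work in (for example, polled elements per poller), which is an assumption rather than anything Orion reports today.

```python
import math
from typing import List


def pollers_needed(total_load: float, capacity_per_poller: float,
                   spares: int = 1) -> int:
    """Pollers required to carry total_load and still tolerate `spares` failures."""
    if capacity_per_poller <= 0:
        raise ValueError("capacity_per_poller must be positive")
    return math.ceil(total_load / capacity_per_poller) + spares


def survives_failures(loads: List[float], capacity_per_poller: float,
                      failures: int = 1) -> bool:
    """True if the group's combined load still fits after losing `failures` pollers."""
    remaining = len(loads) - failures
    return remaining > 0 and sum(loads) <= remaining * capacity_per_poller
```

For example, a group carrying 45,000 elements on pollers rated for 10,000 each would need `pollers_needed(45000, 10000)` pollers for N+1, and `survives_failures` could back a simple "redundancy OK / at risk" indicator for the group.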

                                 

                                Also Solarwinds should self monitor better. If an additional poller or webserver had a failure it should be easier to see from Solarwinds itself. Like a System health page.

                                  • Re: FoE - Help us Help You
                                    aLTeReGo

                                    Excellent feedback contactjt!  Thank you for taking the time to compile this list together in such a comprehensive fashion. I'll try and take a stab at your questions below.

                                    File Filters are confusing. It appears that hot fixes and patches don't sync? Why not?

                                     

                                    Known FOE issues lists are a mile long. Any chance these will be solved?

                                     

                                    Patches should sync between the active/passive cluster members but will not if the file is locked (in use) at the time. All Orion services must be stopped on both active/passive members simultaneously for the patches to replicate successfully. This however is not a recommended method of deployment as all patches are not created equal. Some patches may be simple files on the file system, while others may make modifications to the registry, database, or other types of metabases that are not replicated via FoE. For this reason it is recommended that hotfixes, service packs, and buddy drops be installed on both active and passive members separately.

                                     

                                    The known FoE issues list is certainly on our radar. Are there any particular issues on that list that stand out to you as the most limiting? If so, can you point out those you feel are having the most impact in your environment?

                                • Re: FoE - Help us Help You
                                  Al Ma

                                  I am evaluating VMware SRM and would like to ask whether you have it as DR between 2 sites or just as an HA solution within the same site.


                                • Re: FoE - Help us Help You
                                  ZibaK

                                  I completed the survey.

                                  • Re: FoE - Help us Help You
                                    donpepe

                                    I think my most glaring issues are that the Version 11 Engineers Toolset is not compatible with FoE and that the upgrade process is so cumbersome. Also, having a way to know what is installed on which server, and at what version, across the entire redundant stack would be very useful.
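As a sketch of what such an inventory check could look like, the Python below takes a per-server map of installed products and reports only those whose versions disagree. How the inventory is collected is left open; the `Inventory` shape and `version_mismatches` are hypothetical and not an Orion or FoE feature.

```python
from typing import Dict

# Hypothetical shape: server name -> {product name: installed version}
Inventory = Dict[str, Dict[str, str]]


def version_mismatches(inventory: Inventory) -> Dict[str, Dict[str, str]]:
    """Return {product: {server: version}} for products that differ across servers."""
    products = sorted({p for installed in inventory.values() for p in installed})
    mismatches = {}
    for product in products:
        # A product absent from a server is reported as "missing".
        per_server = {server: installed.get(product, "missing")
                      for server, installed in inventory.items()}
        if len(set(per_server.values())) > 1:
            mismatches[product] = per_server
    return mismatches
```

An empty result means every server in the stack reports the same version of every product.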

                                    • Re: FoE - Help us Help You
                                      sean76

                                      Not much update on this, is there? I have to pretend to my management that this works for the most part. It's a shame, because NeverFail for SQL works great, but FoE has led to many a late night.

                                      • Re: FoE - Help us Help You
                                        HerrDoktor

                                        Just wanted to share my 5 cents on this.

                                         

                                        My customer bought the FoE and we tried to implement it. It was a very big pain, because every other day the whole FoE system crashed and we had to set it up again. The reason was strict GPOs. We talked to the server guys, and after they found out that Solarwinds uses the "Neverfail-Engine" they told us: "this won't work in our environment, we tried that with another product before." So we kicked the FoE.

                                         

                                        I am really looking forward to the HA Solution in the "What we are working on"-Section.

                                         

                                        Cheers,

                                        Holger