We have recently been plagued by a few known application bugs that took down our platform. One of them involved the alerthistory table growing to 1,000,000 objects. Long story short, this ate all the RAM and took the primary app server down. Very recently we had an issue where the app server was unable to keep up with the number of polling jobs running, and our SAM module ended up failing. After an outage and a long troubleshooting session with SW, a reboot and a few registry key modifications brought us back online.
Despite all that negative information, my company and I are fully invested in making this our go-to NOC and troubleshooting platform. We have come to the conclusion that FOE is not for us. Based on my research, it's not a product that can provide the HA we need.
Just as a bit of background, we have about 9000 devices and run NPM, NCM, SAM, UDT, VOIP, and IPAM. The servers are appropriately sized and do not need attention.
When our platform went down, the only teams drastically impacted were our monitoring teams, on both the network and application sides. I have proposed an active-active solution in which our current platform, with all of its data, will be the primary platform at all times unless it's being patched or has some type of outage. The secondary active platform will only house NPM and SAM. In addition to running fewer modules, it will only monitor devices that our NOC and APP monitoring teams monitor. The idea is that this will be a DR build: only critical network devices and applications will be in the secondary database.
I'm hoping that this will be a rock-solid HA design. Can anyone see any issues with this? I know we will have to poll certain devices from both app servers. I'm thinking I will increase the polling intervals on the secondary platform (so it polls less often) during times of normalcy. When the secondary takes over, I can crank the polling back up to emulate the defaults.
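
To make the polling idea concrete, here is a rough sketch of the arithmetic in Python. Everything in it is an assumption for illustration: the 120-second default interval, the 5x standby multiplier, and the 1500 critical devices are made-up numbers, and none of the names correspond to real SolarWinds settings or APIs. It only shows how the polling load on the secondary would change between standby and takeover.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; not SolarWinds defaults.
DEFAULT_INTERVAL_S = 120   # assumed "default" polling interval, in seconds
STANDBY_MULTIPLIER = 5     # assumed stretch factor while the secondary is on standby


@dataclass(frozen=True)
class Platform:
    name: str
    device_count: int   # devices this platform polls
    is_active: bool     # True when this platform is the one the NOC relies on


def polling_interval_s(platform: Platform) -> int:
    """Return the polling interval: default when active, stretched when on standby."""
    if platform.is_active:
        return DEFAULT_INTERVAL_S
    return DEFAULT_INTERVAL_S * STANDBY_MULTIPLIER


def polls_per_minute(platform: Platform) -> float:
    """Average number of device polls per minute at the current interval."""
    return platform.device_count * 60 / polling_interval_s(platform)


# Secondary platform holding only the critical subset (1500 is a made-up count).
secondary_standby = Platform("secondary", device_count=1500, is_active=False)
secondary_active = Platform("secondary", device_count=1500, is_active=True)

print(polls_per_minute(secondary_standby))  # load during normal operation
print(polls_per_minute(secondary_active))   # load after takeover
```

The point of the sketch is that devices polled from both app servers see extra load only in proportion to the standby interval, while the secondary still has to be sized for the full rate it will run at after a takeover.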