I recently started a new gig and their current Orion NPM server is in BAD shape. It's running version 10.1.2 and I constantly have to restart the service because it stops polling data after random periods of time. They started to stand up a new NPM server before I began working with them, but it is still unfinished and I have some reservations.
Currently, the new NPM server has its own physical server to itself, including the SQL Server database (local). I'd like to get the SQL database onto our SQL cluster so that it's better protected on our storage array. In doing that, I don't really see the necessity of having a pretty nice physical server just for NPM/NCM, so I was thinking about P2V. The only issue with that is that we have a modem attached to the older Orion NPM server that sends out SMS messages for alerts. I'm sure we could configure the new NPM server to send alerts by e-mail to our SMS carrier; however, what happens if e-mail, Orion, or our entire virtual environment goes down? How do we get notifications if our Orion server cannot communicate with us? What are you guys doing to get around this? More than 1 server?
We had similar concerns, triggered mostly because a network outage prevented our email -> SMS alerts from making it to the internets.
We went to a service like Knowledge Front Network Monitoring. It's an offsite monitoring solution, and can monitor hosts & email availability.
For a few bucks a month ($10 per probe), we poked a hole through the firewall and have them probe the SolarWinds host. If it goes down, they can alert you by email, text, etc. Similarly, they run an email test of their own, and you get an alert when email goes down.
I think we pay less than $500/year and we monitor the monitor pretty well that way.
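For anyone who wants to roll a bare-bones version of that outside-in check from a colo box or a cheap VPS, here is a minimal sketch, not what the hosted service actually runs. It assumes the Orion web console port is reachable through the firewall hole; the function names and the three-failure threshold are made up for illustration.

```python
import socket
from typing import Sequence


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or DNS failure: all count as "down".
        return False


def should_alert(results: Sequence[bool], threshold: int = 3) -> bool:
    """Alert only when the most recent `threshold` probes all failed.

    Requiring several consecutive failures avoids paging on a single blip.
    """
    return len(results) >= threshold and not any(results[-threshold:])
```

Run probe_tcp on a schedule from the outside box, append each result to a list, and send the text from there (not through your own mail server) once should_alert returns True.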
Always have at least one method/protocol other than SMTP to send notifications out, if you can, to alleviate issues with email (latency, email maintenance or email system down). Usually just one additional method like SNPP/SMS or anything non-SMTP will work "well enough" as a backup method for "critical" notifications. It can be "easier" to do this by buying a separate notification system/product. Some notification systems support voice dialing so they can call telephones directly as a third notification option if email and SMS/SNPP are down or experiencing latency. If you can physically separate the notification system from the monitoring system(s) then you can spend time/$ beefing up the notification system to be highly available (they are usually designed for this anyway), and then you can put some simple secondary monitoring (whatever you need) of the monitoring system(s) on the notification system servers themselves (if locally installed). Wherever your highly available system is, if you can add redundancy to its connection to the network (like a modem, or, if the system supports dialing the phone, a hook into your phone system if that's a separate path out), all the more safety you have. If you can put components in remote sites as a backup, that may also leverage a secondary connection "out". Orion/NPM has a "hot failover?" product (ice mentions it above) for the application (not the database, I believe) that I've never tried, as the docs made it look a little too complicated for something I wanted to support (and it may mess with our backup system?), but that may also work for you as a vendor-supported option.
On a side note: If you can buy a separate notification system/service, most of them support all kinds of protocols other than those mentioned above. Many of the notification systems are "cloud" so you just pay for the service as a whole, but if you go for a local install you will probably need to pay (and sign contracts) for each carrier (AT&T, Verizon, etc.) for an "enterprise messaging" feature so you can connect to them directly. You may also find, as in the case of NotePager Pro support for Orion (ice mentions it above via Enabling SMS/Text alerts in Orion), that there is an intermediate service/plugin/etc. for the monitoring->notification system connection. For all of these options, I don't really have a recommendation, as there are trade-offs for price vs. time spent managing the product vs. how well the solution fits the number of notifications/users you are supporting.
In addition: There is one notification + monitoring system I would recommend that's cheap and a "feel good" backup plan if you have one: having any entity that is working 24x7 "watch" the monitoring system and know who to call/contact when it goes down. The Help Desk usually suffices. Simply having them keep a console up (just one, not every console you have) or just check it regularly/hourly, and having a printed-out document indicating who to call and every known method to contact them, up through a management chain (try not to go higher than a supervisor/director), would be the minimum. They should be trained enough to feel comfortable escalating through this list when there are any problems with the console. When I say "watch the console" I literally mean making sure it can be launched, can be logged into, and *maybe* has some simple check or "monitoring statistics" page that indicates the health of the monitoring system itself, NOT "look at all the alerts" or "determine what problems exist in the environment" after they log in. When they call you at 3 AM and there is no actual monitoring issue, please be nice to them.
Also: There are many reasons to use multiple protocols to deliver messages. One of the major ones (focused on above) is to have multiple ways to get notifications to users/support should one notification method fail to deliver in a timely manner, but the other reason people appreciate is (if you can do this with your setup/corporate culture) being able to send non-critical notifications via email and critical notifications through "anything other than just email". That way, if the user receives a notification from anything other than email they take more notice, especially if they have special buzzes/ringtones/etc. set up to make them notice. It also gets around the problem of people saying "I didn't get/hear/see the notification". Well, if you tried email, and you tried SMS, and you tried calling all their available numbers (work cell, desk phone, and...gasp...personal phone) and got no response, then how far were you (the monitoring admin/system or the Help Desk) supposed to go? If you can prove they just "weren't available" then that's a people/process issue, but if you can show all protocols failed to work (and not due to configuration issues like old phone numbers/email addresses) then you can justify adding *more* notification protocols. Personally I've never seen email + SNPP/SMS + voice all fail at the same time for any reason other than users not keeping up their contact info or simply not being available (car wreck, deep sleep, ignored due to unplanned sickness, etc.)
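To make the "try email, then SMS, then voice" idea concrete, here is a rough sketch of a fallback chain that also keeps a record of what was tried, which is handy when someone says they never got the page. Everything here is hypothetical: the channel names are placeholders and each sender is just a callable you would wire to your own transport, not any product's API.

```python
from typing import Callable, Iterable, List, Tuple

# A channel is a (name, send) pair; send returns True when delivery succeeded.
Channel = Tuple[str, Callable[[str], bool]]


def notify_with_fallback(message: str, channels: Iterable[Channel]) -> List[Tuple[str, bool]]:
    """Try each channel in order and stop at the first successful delivery.

    Returns the list of (channel name, delivered) attempts in the order tried.
    """
    attempts: List[Tuple[str, bool]] = []
    for name, send in channels:
        try:
            delivered = bool(send(message))
        except Exception:
            # A transport that blows up counts as a failed delivery.
            delivered = False
        attempts.append((name, delivered))
        if delivered:
            break
    return attempts
```

For critical alerts you would list the non-email channels first (or send on all of them); for non-critical ones the list can just be email.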
Really great post!
We aren't staffed 24x7 so the help desk idea is out but I agree that they would be instrumental in helping to monitor during off hours.
As I mentioned in one of my replies above, today, we are using Notepager Pro on our old physical deployment of NPM. However, if we decide to go virtual, this would no longer be an option for us.
Also, we have looked into a "cloud" product called PagerDuty that seems to be very well done and seems to do everything we need it to do, plus more, like handling escalations and our on-call rotation; that way, only the people who need to get notified get notified. SolarWinds' product called AlertCentral appears to do similar things, albeit not as polished or refined, but we'd have to host the product ourselves, as opposed to PagerDuty, which has its own data centers and redundancy.
Thanks for a really thought provoking response.
If it's worth anything, I dug up an *incomplete* list of companies/products that seem to do "notification management" as a core of the product (or a specific product they have) that might be useful. I'm not recommending/commenting on anything in particular with the order or "additional" information added to each. Sorry, but I don't have links to them. If there are any I missed or one that's "not so much a notification product", speak up!
- AlertCentral (from SolarWinds) -> non-cloud
- pandorafms -> open source version, with paid options?
- zabbix -> open source version, with paid options?
- MIR3 ("TelAlert") -> non-cloud
- MIR3 (SAAS) -> cloud
- pagerduty -> cloud
- SendWordNow -> cloud
There is also a Gartner doc from 2014 with many of these in it (if you/your management are into Gartner) on the web but you have to register to get it. Some of the companies in it are posting/hosting the same doc and you usually have to register with them to get to it. Search the net for:
gartner "Magic Quadrant for U.S. Emergency/Mass Notification Services"
Click around for a while and you may get lucky and find a large document (I estimate about 15-20 pages long) with charts and analysis OR just "register" with one of them and then get their link to the doc.
We migrated SolarWinds to a virtual environment. Then I took the physical box that used to host SQL and now use it as a SpiceWorks server whose only job is to monitor my SW environment.
1. Self-monitoring for Orion: there are several ways to do it, but the best approach is from outside Orion. If you have another tool linked to Orion, use that tool to create scripts, monitors, log file parsers or event loggers that watch Orion and its services (this is the approach I would go with).
2. If you have to use SolarWinds Orion to monitor itself and you have multiple pollers, then, assuming the primary poller is always working fine (it has to be, since it does the alerting), you can create templates in SAM to monitor the Orion services on all the pollers you have (you can create log monitors as well). You should already have the SolarWinds self-monitoring service template in SAM; if you don't have it, you can most likely get it on THWACK, and if it's not available on THWACK you can easily create it yourself.
3. If you don't have SAM in your environment and, say, you only have NPM, create a custom SQL alert that queries the ENGINE table in your SolarWinds DB and checks the SYNC field in it (this is a time-based field that gives you the latest timestamp of the handshake between the primary and additional poller). You can create an alert based on the time delay: if the SYNC field hasn't been updated in the last 5 minutes, then something is wrong with your additional poller and you can fire an alert.
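The time-delay check in option 3 boils down to comparing each poller's last sync timestamp against a cutoff. Here is a small sketch of just that comparison, assuming you have already pulled the timestamps out of the database yourself. The dictionary of poller names, the function name, and the sample values are all made up for illustration, and I haven't verified the table or field names against any particular Orion version.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_pollers(last_sync: Dict[str, datetime], now: datetime,
                  max_delay: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return the sorted names of pollers whose last sync is older than max_delay."""
    return sorted(name for name, ts in last_sync.items() if now - ts > max_delay)


# Example with made-up values: poller-b last synced 9 minutes ago.
now = datetime(2015, 2, 10, 12, 0, 0)
last_sync = {
    "poller-a": now - timedelta(minutes=2),
    "poller-b": now - timedelta(minutes=9),
}
late = stale_pollers(last_sync, now)  # only poller-b is past the 5-minute cutoff
```

If the list comes back non-empty, that is your trigger to fire the alert. Make sure the database timestamps and `now` are in the same time zone before comparing them.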
Hope it helps
It is really recommended to separate the SQL server from the SolarWinds application to prevent performance issues. If your email server goes down, you can set up SMS text alerts for notification:
If the whole virtual environment goes down or there is a network issue, that is a different problem. Of course Orion will not be able to monitor your network then, unless you have a failover setup for the Orion NPM/NCM server using EOC or FoE.
EOC - check page 9
We are currently using Notepager Pro to trigger SMS/Text alerts from our old NPM server. My concern is if we go virtual, we lose this ability.
I've heard about the Failover Engine (FoE), however it's a separate license in addition to another set of licenses for a secondary Orion server and I'm not quite sure if all of this will fit in the budget. We are a very small organization.
Thanks for the information, I'm sure it will help someone.
I also went through some of this pain. We had our NPM/NCM/SAM servers in our production cluster environment, and the SQL database was a separate (virtual) server. We had some minor speed issues, mainly related to the virtual SQL server, but in the end we pulled the Orion environment out after the SAN crashed and we got no notifications.
Now we have 2 physical servers: one is an ESXi host that runs my server VMs, the 2nd is a dedicated SQL server (RAID 10).
As for using a modem: I bought an IP-2-Serial device from Dialogic. It has a built-in modem, but it's basically an IP-addressed modem. Once you configure the OS, NPM sees the modem. Pretty slick. We only use this if our Exchange servers crap out or we lose our internet circuit.
We ran into some problems late last week. Apparently, Verizon discontinued their TAP service on February 2nd so we can no longer use the modem to send texts to our Verizon Wireless cell phones. We now have to rely on our Exchange servers and the Internet. With that being said, having a physical instance of NPM is not so important anymore.
What you had previously sounds exactly like where I'm trying to direct this ship.
I'd like to have NPM/NCM/SAM on the same VM in our production cluster, while the SQL cluster is currently virtual. We have lots of redundancy built in from the network-compute-storage. Nexus 5Ks with vPC as the core DC switches with at least 4 links to each of our 2 UCS B-series chassis that host our 4 ESX hosts and everything is on our IBM XIV storage array.
I would think this would be enough to give us good performance and keep our NMS server up, but we also have to prepare for the worst case. I'm not sure what we are currently using as the modem, but I'll have to check out the Dialogic device in case we stay physical.
I too inherited a similar configuration. I am the one that stood up a new server, and I debated about going virtual rather than physical. In the end, I made the decision to stay physical. I agree that moving the SQL to your cluster is a better design unless you have issues connecting to the cluster. Just my thoughts....
Thanks for the quick response. What made you decide to stay physical? Was your virtual environment already fully utilized? Storage space an issue?
What happens if your physical server goes down? What do you have monitoring it?
Decided physical was best after reviewing the capacity of the SAN, the number of hosts, the switching fabric, and the same concern you listed: what happens if the virtual environment fails. As far as monitoring the NPM box, I have nothing configured at this time, although I am considering some sort of watchdog hardware to tie into the environmental monitoring system (a separate system with its own power subsystem and SMS connections). But it is not a high priority at this time.
Makes sense, although the company I work for is really small and virtualization has been an extreme plus for them. I doubt we have a 2nd physical server available to become the "watchdog", but we are standing up some equipment in a colo soon that might give us the flexibility of monitoring from the outside looking in with a VM there.
Thanks for your insight.