We are having an issue where NetPath stops working once or twice a week. When it happens, all of the NetPath probes show a red circle with an exclamation point.
I've learned that by stopping and starting all of the Orion services on the main polling engine, NetPath starts working again.
1. Which Orion service do you think is causing this issue? I could make a scheduled task to restart the offending service every day.
2. Can you think of a way to create an alert that will tell people when NetPath can no longer connect to sites?
NetPath doesn't necessarily relate directly to a given interface. For example, if you're using NetPath to watch a bunch of cloud services, the path goes out your external connection(s) and then takes quite a few hops through devices where you probably can't see whether an interface is up or down. Yes, you can alert if your external connection goes down, but that's just a single step in a NetPath.
Whether you can alert on this or not depends on what is actually happening. First off, this sounds like an issue with your server, and you should probably open a support case to see if you can get it fixed. Many people have been running NetPath for much longer than a week at a time without issue!
But if you go into SWQL Studio, I believe all the NetPath tables are in the Orion.Netpath group. From just glancing around a bit, Orion.Netpath.Tests seems to be a good table to start with. There is a field "CompletionRatio" that might give an indication of what is happening, but without looking at your actual data from a time when you're having an issue, it's difficult to say what. I believe this table gets populated each time a test fires off, but if the issue is that for some reason your NetPath tests aren't even starting, you might not be getting new rows in this table. Of course, that can be a clue too: if you normally have tests firing every 10 minutes and the last "ExecutedAt" is maybe 20 minutes ago, that might tell you that NetPath is no longer functioning. However, if the tests are firing off and your "CompletionRatio" is 0%, that might be your clue instead. And if all your tests are getting a few steps out before getting stopped, then your "CompletionRatio" values might not be 0%, but might be under a certain value.
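To make those three clues a bit more concrete, here is a rough Python sketch of the decision logic. It only uses the two field names mentioned above ("ExecutedAt" and "CompletionRatio"); everything else is an assumption on my part: the function name, the 50% threshold, the idea that the ratio is stored as 0 to 100 rather than 0 to 1, and the "two missed intervals means stale" rule. The rows would come from whatever query you end up running against Orion.Netpath.Tests, which is not shown here, so check your real data before trusting any of it.

```python
from datetime import datetime, timedelta


def classify_netpath_health(rows, now, expected_interval=timedelta(minutes=10),
                            min_ratio=50.0):
    """Classify NetPath test rows (hypothetical helper, not a SolarWinds API).

    rows: list of dicts with 'ExecutedAt' (datetime) and 'CompletionRatio'
          (assumed here to be a percentage from 0 to 100).
    """
    if not rows:
        return "no-data"

    # Assumption: two missed intervals means tests have stopped firing.
    stale_after = 2 * expected_interval

    latest = max(r["ExecutedAt"] for r in rows)
    if now - latest >= stale_after:
        return "stale"

    # Only judge completion on tests that ran recently.
    recent = [r for r in rows if now - r["ExecutedAt"] < stale_after]
    ratios = [r["CompletionRatio"] for r in recent]

    if all(ratio == 0 for ratio in ratios):
        return "zero-completion"
    if all(ratio < min_ratio for ratio in ratios):
        return "low-completion"
    return "ok"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    sample = [{"ExecutedAt": now - timedelta(minutes=20), "CompletionRatio": 100}]
    print(classify_netpath_health(sample, now))
```

Once you know which of these states matches your outage, the same condition can be rewritten as the "Custom SWQL" alert trigger.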
The best thing to do is go look at these tables, either while an issue is happening, or after working out the timeframes in which you had issues so you can examine the data from those periods. Once you find some consistent pattern that you can write a query for, it's just a matter of creating an alert using "Custom SWQL". You could potentially have the alert fire off a script to restart the service(s), or try to restart the service directly from the alert, though I'm not sure that would work!
As for which service I think it uses: most of the work seems to be done by the "SolarWinds Information Service V3", so I'd try cycling just that service the next time it happens. If that doesn't do it, maybe the Job Engine or Collector Service. This is also something you could ask support!
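If you do go the script route, something like the following Python sketch could be a starting point. It shells out to the PowerShell `Restart-Service` cmdlet by display name. The only service name taken from this thread is "SolarWinds Information Service V3"; the function names are made up for illustration, and you would need to look up the exact display names of the Job Engine and Collector services on your own server before adding them. It needs to run elevated on the polling engine, and it defaults to a dry run so nothing is restarted by accident.

```python
import subprocess


def build_restart_command(display_name):
    """Build a PowerShell command that restarts one Windows service.

    Hypothetical helper: looks the service up by its display name.
    """
    # Double any single quotes so the name is safe inside a PowerShell string.
    safe_name = display_name.replace("'", "''")
    return [
        "powershell.exe",
        "-NoProfile",
        "-Command",
        f"Restart-Service -DisplayName '{safe_name}' -Force",
    ]


def restart_services(display_names, dry_run=True):
    """Restart each service in order; with dry_run=True only build the commands."""
    commands = [build_restart_command(name) for name in display_names]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if PowerShell reports failure.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Display name as quoted in this thread; verify it on your own server.
    for cmd in restart_services(["SolarWinds Information Service V3"]):
        print(" ".join(cmd))
```

Restarting only one service first, as suggested above, also helps narrow down which one is actually at fault.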