In my opinion, multiple tools will always be necessary. Try as we might, in many cases a vendor's tool is much better at diagnosing what is wrong with its product than a third-party tool. In our environment, SolarWinds is our Manager of Managers: all alerts from every system flow through it. This gives the NOC a one-stop shop for all systems while still allowing for more granularity and the ability to delve deeper into problems with a vendor tool when needed.
The "Manager of Managers" idea is interesting. I've used something similar previously and had mixed results; some things I just ended up using another tool for. I agree that it is going to be difficult (if not impossible) for one tool to rule them all, due to the varying degrees of difficulty in capturing data (not every solution has a great SDK/API suite available for tapping into). It sounds like your use of SolarWinds as the "capstone" product has been successful.
What I've found meets most environments' needs is a centralized, unified dashboard that provides multi-vendor, cross-platform monitoring, reporting, and alerting across the entire breadth of the organization. There will always be additional monitoring requirements in any environment that go beyond what a general-purpose monitoring solution can provide. In the case of your DBA, you augment the capabilities of your general-purpose monitoring solution with specialized DBA tools where necessary. Your general-purpose solution may only tell you there's an issue in your environment, but it's these specialized tools that will aid in troubleshooting, diagnosing, and resolving the issue.
So in the end it's a blended approach. Your IT group should live out of the general-purpose console, and only those with the necessary training and skill should need access to specialized tools. It's also important for all of IT and management to have a holistic view of how the infrastructure is performing without accessing half a dozen different point solutions. You'll find this also eases the management of alerting and reporting dramatically.
General-purpose monitoring solutions are constantly evolving and improving, so you should be able to significantly reduce the number of specialized tools you need to manage your environment. Also, don't be afraid to ask the tough questions: why are you keeping a specific point product around, who's using it, and why? As members of your IT organization embrace the single pane of glass a general-purpose monitoring solution should provide, you may find that something people once could not live without is now disposable.
"Also, don't be afraid to ask the tough questions, like why are you keeping a specific point product around, who's using it, and why? As members of your IT organization embrace the single pane of glass a general purpose monitoring solution should provide, you may find that something people once could not live without is now disposable."
This is a great statement. Typically, I find the answer is that they have only ever used the existing tool and/or are comfortable with it, and change can introduce a bit of a learning curve. As long as the replacement tool makes their life easier or more manageable, the resistance typically goes away. One trick I'm fond of is to "beta" the tool with a specific user and really get their honest feedback and input; once they get a buzz going about the product, they become your biggest evangelist for the change.
I completely agree with aLTeReGo, and this is exactly how we have designed things in our environment for both our internal systems and our hosted services. Orion provides our single pane of glass across the entire organization: the dashboards and alerts for our NOC, limited views for our customers, in-depth performance data for our admins, and the reports our executives use for making important business decisions.
When problems occur, we typically use the vendor tools, such as NetApp Data Fabric Manager and VMware vCenter, to really dig in.
I think vendor tools will always be necessary for the deep dive; however, the single pane of glass that spans your entire infrastructure is also necessary.
So it seems like multiple tools wins the vote, with a top-down approach where a global tool is selected as the front end and more granular tools are used to fight fires or dig deep into hot spots?
The short answer is: you will always need multiple tools.
The question becomes: which tools, and who should pick and use them?
To answer those questions, you need to decide how deep you want to dig into your network and what kind of information you are interested in. Oh, and I almost forgot the magic word: budget!
I will give some examples:
I found Orion to be a life saver, and a career builder to say the least. I was handed the responsibility of building a NOC and leading it, and choosing Orion was the best thing I did.
The level of visibility we got out of it was so granular that one glimpse at the Top 10 page was enough to know whether our Internet slowness, for instance, was caused by our own network elements or by the ISP.
I was then faced with the challenge of tracing packet paths within a large core, so we got Packet Design Route Explorer to map our Layer 3 (routing) topology and do visual traces, which made the job easy.
Then some devices started having high CPU utilization issues, hence the need for a good protocol analyzer to see what was hitting the control plane. After crashing a few laptops, we got some Shark appliances plugged in and managed them using the Pilot console.
That gave an excellent view into the traffic for live troubleshooting purposes. But what about historical information? You only have so much storage space to keep a capture job running.
Then the dreadful copyright-infringement e-mails started flowing, so we had to identify which user was announcing a specific hash over BitTorrent.
The only solution that worked was LANGuardian from NetFort, which builds its reports as it captures traffic and then discards the data (unless you need to keep it).
LANGuardian was easy to use and, to our surprise, integrated nicely with Orion to give a comprehensive view of activity on the network.
None of the above packages could replace another; they complement each other in the task of proactive and reactive fault management.
So the idea is to centralize the responsibility for network monitoring in a single group, which then notifies the different operating groups when action is required.
This is a really cool "deep dive" into the inner workings of tool selection.
It sounds like Orion is your front-end tool, with a variety of other tools used for more specific tasks and use cases? I'm seeing a trend in this discussion that this is the more common path to take.
I would agree. So the next step is integrating the deep-dive tools with your front-end tool; that would be awesome!
Very interesting discussion going on around here!
What I try to discuss with customers when we start talking about monitoring operations is the process, not so much any one tool. As bkattan said earlier, there isn't going to be one tool; even more, in some cases the tool won't exist at all, and there will only be a person with a defined process and an outcome to review.
And if you think about it, any tool requires exactly the same thing, even a dashboard. When an alarm is triggered, do we have an automated script that executes everything we need to mitigate that alarm?
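That alarm-to-action mapping can be sketched in a few lines. This is a minimal, hypothetical example (the alarm names, fields, and mitigation functions are all invented for illustration, not taken from any particular monitoring product): a "runbook" dictionary maps each alarm type to an automated action, with a fallback that escalates to a human when no automation exists.

```python
def restart_service(alarm):
    # Hypothetical mitigation: in practice this might call an
    # orchestration or configuration-management API.
    return f"restarted service on {alarm['source']}"

def page_on_call(alarm):
    # Fallback when no automated mitigation is defined:
    # hand the alarm to a person with a defined process.
    return f"paged on-call for {alarm['name']} on {alarm['source']}"

# The "runbook": which automated action (if any) handles each alarm type.
RUNBOOK = {
    "service_down": restart_service,
}

def handle_alarm(alarm):
    """Run the automated mitigation for an alarm, or escalate to a human."""
    action = RUNBOOK.get(alarm["name"], page_on_call)
    return action(alarm)
```

The point of the sketch is the process, not the code: every alarm either has a defined automated response or a defined escalation path, which is exactly the question worth asking of any dashboard.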
Some of the examples I use are related to the Microsoft Operations Framework (MOF) "Reliability Workbooks" in the MOF Technology Library. There you can find fairly extensive examples of monitoring operations defined for Microsoft technologies.