My biggest challenge has always been trying to figure out what planet my predecessor was on when he / she set up monitoring in the first place. It has always been someone who doesn't have any idea of what is important and has never applied any logic to what they have been doing. I've usually just scrapped what they have done and started again. The customer has always been happy with the end results and surprised at what the products can do when managed by someone with a brain.
My biggest challenge to date has been, and will most probably continue to be, improving the inherited Custom Properties for the environment being monitored.
That means making sure that all of the Nodes actually have a Custom Property value assigned, even when it may not be applicable.
I personally try to avoid blanks by creating drop-down lists in which every item is a positive value and none is duplicated from another Custom Property.
In addition, I also try to avoid using just the Yes or No / True or False options.
Where the Custom Property header is not applicable to a Node, I believe it is better to say ‘Not Applicable’ from a Drop Down List as opposed to leaving it Blank.
That value may also be included within an Alert or Event Notification and in Reports, where a filled-in value means something and a Blank does not.
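As a rough illustration of the "no Blanks" idea above, here is a minimal Python sketch. It assumes the node list has already been exported into plain dictionaries; it does not call any SolarWinds API, and the names `caption`, `Site` and `Support_Tier` are hypothetical examples rather than real Orion properties.

```python
# Hypothetical sketch: find and fill blank Custom Property values
# in a node list that has already been exported to plain dictionaries.
# Field names are illustrative only, not real Orion custom properties.

NOT_APPLICABLE = "Not Applicable"


def is_blank(value):
    """A value counts as blank if it is None or only whitespace."""
    return value is None or not str(value).strip()


def find_blank_properties(nodes, property_names):
    """Return {node caption: [property names that are missing or blank]}."""
    blanks = {}
    for node in nodes:
        missing = [name for name in property_names if is_blank(node.get(name))]
        if missing:
            blanks[node["caption"]] = missing
    return blanks


def fill_blanks(nodes, property_names, default=NOT_APPLICABLE):
    """Return copies of the nodes with blank values replaced by `default`."""
    filled = []
    for node in nodes:
        copy = dict(node)  # leave the original export untouched
        for name in property_names:
            if is_blank(copy.get(name)):
                copy[name] = default
        filled.append(copy)
    return filled


example_nodes = [
    {"caption": "core-sw-01", "Site": "London", "Support_Tier": "Gold"},
    {"caption": "lab-rtr-07", "Site": "", "Support_Tier": None},
]
example_properties = ["Site", "Support_Tier"]
```

Running `find_blank_properties(example_nodes, example_properties)` lists the nodes that still need attention, and `fill_blanks` shows what the same data looks like once every Blank has become ‘Not Applicable’.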
Please see my new Feature Request for: Custom Properties Custom Order List.
Thank you in advance for Voting this Feature Request up.
Justifying resource is always a headache.
The SW docs say to add loads of memory & loads of cores, especially for large installations, but when you inspect each server's utilisation, with Orion, with DPA or directly on the servers, it's really difficult to find any evidence that will support the request for additional resource. Either the published resource requirements are wrong or the explanation of the effect of under-resourcing is wrong. I suspect the latter, which should be easier to fix.
Creating meaningful, usable alerts was one of the initial challenges. And prior to that, understanding SNMP and SNMPv3, creating standards-based documentation and implementing those standards for every node (including naming conventions, SNMP strings, etc.).
Later, upgrades were hard. And they took a lot of time. Lately they've been easy, if more time-consuming than I'd like (but they're NOTHING like they used to be--so much quicker now than back in 2005!).
Obtaining funding for monitoring everything that ought to be monitored was a challenge. Then obtaining funding for the additional modules, since NPM couldn't do it all, became the challenge.
Getting training for it all--impossible. It was all DIY and OTJ, working hard to get it right the first time and not worrying about trying to CMA.
Finally, justifying the ongoing annual support/licensing expenses when my manager was a fan of home-built or best of breed products (especially those created using Open Source resources at no $ cost to us).
Trying to understand the logic (or lack thereof) used by previous staff in setting up alerts. In one case, each team set up their own alerts, so some alerts were node-specific (as in "alert me when node xyz = Down"), others were very general. Some used variables; most did not. Repeat x 400-500 alerts 🙂
I am converting the Atlas component monitors to the new dashboards, and that is definitely a challenge. What I also struggled with was moving the polling engine and additional web server to new systems. We renamed the systems but kept the old IP addresses. There is documentation for "new name new IP" and "existing name existing IP", but not really anything for "new name existing IP".
We use the Atlas dashboards to display the status for all the components for one of our major applications. It changes between three screens of information. I need to move that to modern dashboards, and I am hoping that Lab 89 (that was a hint in a question earlier in this month's mission) has the answers for me.
Just a quick aside - I went around and around with support on terminology. He said an issue we had was with our AWS. I said we don't have AWS, and he insisted that we did. That took several minutes back and forth. Then I finally figured out that he was not talking about Amazon Web Services but Additional Web Server. D'oh!
My biggest issue is user experience. What I mean by that is my users will ask for something to be monitored without really fully understanding what they are asking for. Whether that means an email at 2 am or something else, managing a user's expectations is definitely a part of the job that most admins overlook. Sometimes you have to coach your user(s) into understanding the different tools available within Orion. Not everything has to be an alert. Certain things can be done using a dashboard view or a report that runs on a specific date and time. Of course, the same can be said for any application in or outside of monitoring, but as monitoring engineers we are sometimes called upon to be that voice of reason so that users fully understand what they are asking for and what the results of that could be.
@the_ben_keen yeah I have learned the same thing. We used to offer 4 tiers of alerting schedules, but we eventually reduced it to 2: 24/7 (p1) and business hours only (p2). Usually everyone says they want p1 monitoring, and then we say, "OK, so that's going to wake you up at 2 AM if it goes down... that's what you want, right?" About 1/3 of the time, they say "Wait, maybe p2 is fine". 🙂
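To make the two-tier idea concrete, here is a minimal Python sketch of the decision "should this alert notify someone right now?". It is not how Orion implements alert schedules; the function name `should_notify` is hypothetical, and the 08:00 to 18:00, Monday to Friday window is an assumption, since the post does not say which hours count as business hours.

```python
from datetime import datetime, time

# Assumed business hours; the post does not state the real ones.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def should_notify(tier, when):
    """Decide whether an alert should notify someone at the moment `when`.

    tier: "p1" (24/7) or "p2" (business hours only).
    when: a datetime in the on-call team's local time.
    """
    tier = tier.lower()
    if tier == "p1":
        return True  # 24/7: always notify, including 2 AM
    if tier == "p2":
        is_weekday = when.weekday() < 5  # Monday=0 ... Friday=4
        in_hours = BUSINESS_START <= when.time() < BUSINESS_END
        return is_weekday and in_hours
    raise ValueError(f"unknown tier: {tier!r}")
```

With these assumptions, a p1 alert at 2 AM on a Wednesday notifies and a p2 alert at the same moment does not, which is exactly the trade-off worth spelling out to the requester.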
As with any new monitoring tool, learning the layout of the landscape and understanding how it works in order to better use it takes some time. Another important thing is learning its limitations with regard to what you are trying to do: things like what aspects of the database are exposed for the polling engine and any scripting that is involved. Does the shell return a success if the command is issued, or when the shell ends successfully, or is it the return code of the script that was executed? These are just a few things that are important to know. Then you get into fun times with maintenance windows: how are they handled by the tool, and can you set up a template that you apply to various monitors and/or alerting rules? How does it handle varying maintenance windows over a week's time? These are some of the challenges everyone faces with new tools.
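The return-code question above can be tested outside any particular tool. The following is a minimal Python sketch, not tied to any monitoring product, that separates "the command could not be started" from "the script ran and returned a non-zero exit code". The helper name `run_check` is hypothetical, and the two sample commands simply invoke the current Python interpreter.

```python
import subprocess
import sys


def run_check(args):
    """Run a command and report whether it launched and what it returned."""
    try:
        completed = subprocess.run(args, capture_output=True, text=True)
    except OSError as exc:
        # The command could not be started at all (for example, not found).
        return {"launched": False, "exit_code": None, "error": str(exc)}
    return {
        "launched": True,
        "exit_code": completed.returncode,  # the script's own return code
        "error": completed.stderr.strip(),
    }


# A script that starts fine but signals failure through its exit code.
failing = run_check([sys.executable, "-c", "import sys; sys.exit(3)"])
# A script that starts and ends successfully.
passing = run_check([sys.executable, "-c", "print('ok')"])
```

Comparing `failing` and `passing` shows why it matters which of these a tool reports: both commands were issued successfully, but only one of them actually succeeded.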
>> Other (comment below)
Absolutely our biggest headache was understanding certain facts:
There's other stuff but the above are key.