3 Replies Latest reply on Apr 7, 2011 11:13 AM by byrona

    The adoption of Orion as a dependable tool

    JaredC

      Good morning,

      My company is readying a data center for prime time.  By year end, there will be close to 5000 end users across several states using it for email, Internet, and security services.  Since this is a brand new operation, the support tools are also new, and for many of the staff here, Orion has never been in their everyday troubleshooting routine.

      As we prepare for the first users to come on board, we are experiencing last-minute issues, much like "hell week" before the opening of a theatrical production.  The most bizarre issues are with an HP C7000 and an APC power strip, causing extreme power outages.  I have set up Orion to monitor the IP addresses on each of the APC devices in the data center, with email triggers set up to send these alerts to the engineers who can best investigate.

      Earlier this week, two APCs were disconnected from the LAN and emails were sent out, but nobody seemed to care.  The emails were ignored, and the information was second-guessed as potentially inaccurate.

      What have some of you here on the forums done to integrate Orion, its website, and the alerts into your IT environment?  

      If it can't be trusted, or won't be given the chance to be, paying for the maintenance and upkeep of Orion year over year just isn't worth it. 

      Management has said to all of us that if an end-user is how we first learn of a problem, we have failed.  Orion is a great tool to mitigate that potential failure, but if the guys don't want to use it, or don't trust it when they do, it's pointless.

      Any suggestions or stories on how you accomplished it would be greatly appreciated. 

      Knowledge is power, but you have to trust your source, no matter how critical or sensitive the information you've received may be.

      Thanks,

      -JaredC

        • Re: The adoption of Orion as a dependable tool
          Miron

          Jared,

          I think that your problems come not from the technical solution but from the need to ensure that your procedures and policies are well defined, communicated, and understood by the staff. The roles and responsibilities should be clear.

          In your example, before you went live on the system, you should have run a number of incident handling tests to ensure that in the event of a failure everyone understands what they should do.

          "Nobody seemed to care" = "Nobody understood their role and the role of the technical solution"

          I am not sure of your specific environment, and a lot depends on whether the technical solution was implemented as part of the service provision and fully endorsed by management, or whether it was simply installed alongside a whole lot of other tools that people are using.

          As part of our monitoring service we chose two monitoring solutions at the time (Nagios for Application & SW for Network) and then integrated a Nagios view into the main SW portal. This was then made available to all the support staff and helpdesk, with appropriate training.

          Any alerts coming out of those two solutions were directed at the relevant teams who had the responsibility to respond to the alert.

          If people don't trust the system because it has been considered unreliable and they think it is crying wolf, then make sure you are comfortable with the system yourself and relaunch it, with appropriate training, as the de facto alerting system.

          Kind Regards

           

          Miron

            • Re: The adoption of Orion as a dependable tool
              netlogix

              @Miron - "nobody seemed to care" - kinda like:

              There was an important job to be done and Everybody was sure that Somebody would do it.
              Anybody could have done it, but Nobody did it.
              Somebody got angry about that because it was Everybody's job.
              Everybody thought that Anybody could do it, but Nobody realized that Everybody wouldn't do it.
              It ended up that Everybody blamed Somebody when Nobody did what Anybody could have done.

              Jared,

                Another thing to try is to find out why they feel it is unreliable.  Do they get too many alerts?  Are they informed about the reset of an alert?  Does their job depend on keeping the network operational (especially if they were alerted to the issue prior to a user complaint)?
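                One rough way to answer the first two questions is to tally the alert history per node and see which alerts never got a reset. This is only a sketch: the `history` list and the node names are made up for illustration, and a real export from your NMS will look different.

```python
from collections import Counter

# Hypothetical alert history as (node, event) pairs, oldest first.
# "triggered" and "reset" are assumed event names, not an Orion export format.
history = [
    ("apc-rack-01", "triggered"),
    ("apc-rack-01", "reset"),
    ("apc-rack-02", "triggered"),
    ("apc-rack-01", "triggered"),
    ("apc-rack-01", "reset"),
    ("apc-rack-01", "triggered"),
]

def summarize(history):
    """Return (trigger count per node, nodes whose latest alert was never reset)."""
    triggered = Counter(node for node, event in history if event == "triggered")
    last_event = {}
    for node, event in history:
        last_event[node] = event  # later entries overwrite earlier ones
    still_open = sorted(n for n, e in last_event.items() if e == "triggered")
    return triggered, still_open

counts, still_open = summarize(history)
print(counts.most_common())  # noisiest nodes first
print(still_open)            # nodes nobody was told had recovered
```

                A node that triggers far more often than its neighbors is a candidate for tuning, and anything left in the second list is an alert the recipients never saw close.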

                I think Miron might be right: accountability is key.

              Steve

            • Re: The adoption of Orion as a dependable tool
              byrona

              I have to agree with Miron: your problem isn't technical.  If you abandon Orion and move to a different solution, you will likely have the same problem.

              We have Orion create tickets in our ticketing system assigning those tickets to the appropriate groups.  Those groups are then accountable for handling those tickets appropriately as part of their job.  If they don't handle the tickets appropriately, they have not done their job properly and that is then handled as an HR issue.
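              The routing idea itself is simple enough to sketch. The following is not our actual integration or any ticketing system's API; the `OWNERS` table, the group names, and the `Ticket` class are all hypothetical, just to show an alert being turned into a ticket with a clear owner.

```python
from dataclasses import dataclass

# Hypothetical mapping from device type to the group accountable for it.
OWNERS = {"pdu": "facilities", "blade-chassis": "server-team"}
DEFAULT_OWNER = "noc"  # assumed catch-all group for unmapped device types

@dataclass
class Ticket:
    group: str
    summary: str
    status: str = "open"

def ticket_for_alert(device_type: str, node: str, message: str) -> Ticket:
    """Build a ticket assigned to the group that owns this device type."""
    group = OWNERS.get(device_type, DEFAULT_OWNER)
    return Ticket(group=group, summary=f"{node}: {message}")

t = ticket_for_alert("pdu", "apc-rack-01", "node is down")
print(t)
```

              The point of the default group is that no alert ends up without an owner; an unassigned alert is exactly the "Somebody will do it" situation described above.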

              In my experience, even with the best NMS you will still get false positives.  The important things are to always try to minimize the false positives and to make sure your team is trained to understand that there will be some; however, it's still critically important to properly investigate every incident that your NMS alerts on.  If you can't do this, then why even have an NMS at all?

              Hope this helps!