7 Replies Latest reply on May 23, 2017 10:16 AM by RichardLetts

    Multi-Tenant Support

    spyfly

      I work for a large organization and we are looking to offer Orion as a service to other IT departments across the company. We will be the owners of the Orion application; however, we would like other departments to be able to add/remove their devices and set up monitors and alerts. We are in the process of setting up a new instance running NPM 12, and while doing this we would like to make sure that we set it up correctly for that purpose.

       

      Is this something we can do with NPM 12? Is there a how-to post or KB on how to do this?

       

      Thanks

        • Re: Multi-Tenant Support
          orioncrack

          The average user should not be given rights to set up monitors and alerts.

           

          One bad config can cause epic damage.

           

          I saw it happen with an alert configured as "any/any" without any kind of logic: it generated 5 million alert emails and brought down a large company's mail system for 4 hours.

           

          You cannot easily give normal non-SolarWinds admins the ability to create any kind of monitors.

           

          None of my DBAs, Help Desk staff, or engineers who do not work with SolarWinds, and have not the slightest passion for it, could manage it.

           

          It takes dedicated people who know what they're doing.

          • Re: Multi-Tenant Support
            sja

            Hi

             

            We use a CP (custom property) "owner" to multi-tenant the different user groups.

            If the company has some type of central CMDB, you could minimize the risk by automating some parts.

            orioncrack has a point about who should be allowed to do what,

            but I don't see that as a problem with the software; it is more a problem of setting ground rules for the users,

            multi tenant or not :-)

              • Re: Multi-Tenant Support
                RichardLetts

                Yes, we do the same with a CP 'sector'.

                 

                We make use of JIRA to manage requests for new alerts.

                Alerts require a run-book, and using Jira we can subtask that out so someone has to write documentation on what the alert means, the <5 things to check, and the ways it can be fixed.

                No runbook, no new alert...
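The "no runbook, no new alert" rule can be sketched as a simple gate. This is only an illustration, not the poster's actual Jira workflow: the `Runbook` class, its field names, and both functions are hypothetical, and the "fewer than five checks" limit is taken from the post above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Runbook:
    """Hypothetical runbook record; the field names are assumptions."""
    meaning: str = ""                                 # what the alert means
    checks: List[str] = field(default_factory=list)   # the things to check
    fixes: List[str] = field(default_factory=list)    # the ways it can be fixed


def runbook_problems(runbook: Optional[Runbook]) -> List[str]:
    """Return the reasons to reject an alert request (empty list means OK)."""
    if runbook is None:
        return ["no runbook"]
    problems = []
    if not runbook.meaning.strip():
        problems.append("missing description of what the alert means")
    if not runbook.checks:
        problems.append("no checks listed")
    elif len(runbook.checks) >= 5:
        problems.append("too many checks (keep it under five)")
    if not runbook.fixes:
        problems.append("no fixes listed")
    return problems


def may_create_alert(runbook: Optional[Runbook]) -> bool:
    """No runbook, no new alert."""
    return not runbook_problems(runbook)
```

A request with no runbook, or with an incomplete one, is rejected along with the list of what is missing, which is roughly what a Jira subtask would track by hand.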

                  • Re: Multi-Tenant Support
                    spyfly

                    RichardLetts, how detailed is your run book? Do you manage a NOC or a system/network team? Who uses the run books? The monitoring team? Does your monitoring team do basic troubleshooting?

                      • Re: Multi-Tenant Support
                        RichardLetts

                        I manage a NOC for 7 different networks stretching from Tokyo to Chicago, and also four hospitals including Harborview Medical Center. We're here 24*365.25

                        The run book is NOT detailed -- see The Checklist Manifesto for the ethos behind this (https://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/0312430000)

                        i.e. it doesn't say anything about acknowledging alerts, creating tickets, or other basic tasks everyone should get right every time.

                         

                        The monitoring team is the same as the NOC, and we do advanced troubleshooting, firmware upgrades, router reloads, planned traffic re-routes. i.e. the things one would expect a highly qualified NOC to do.

                        If we were just sat here waiting for alarms to trigger and then page someone I'd outsource it.

                         

                        Here is a VERY typical runbook for a low-level alert that we use for preemptive action. As you can see, it says why the alert triggered and why you should care about it, reminds people to make sure the device is up, to read the syslog message, and to use their Juniper account to see what the message means (actually, now that I read it, I see I should remind them to open a JTAC case if the message is not documented), and then suggests appropriate actions.

                         

                        The KB# appears in the alert message, and is tied to our ticketing system so there's no ambiguity about which KB article this alert applied to. In this case this document has been referred to 22 times, which tells me it's not something we do very often.
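As a rough sketch of how a KB number embedded in an alert message could be picked out for linking to a ticket: the pattern below assumes the number always looks like the KB0024064 in the runbook that follows ("KB" plus seven digits). The function name and the sample message are made up for illustration and are not part of Orion or of any ticketing system.

```python
import re
from typing import Optional

# Assumed format: "KB" followed by exactly seven digits, e.g. KB0024064.
KB_PATTERN = re.compile(r"\bKB\d{7}\b")


def kb_number(alert_message: str) -> Optional[str]:
    """Return the first KB article number found in the message, or None."""
    match = KB_PATTERN.search(alert_message)
    return match.group(0) if match else None


# Hypothetical alert text; real alert messages will differ.
print(kb_number("SWO: Juniper Chassis on core-rtr-1, see KB0024064"))
```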

                         

                        This is a continual service improvement process, and I try to update an article each week or otherwise clean up an alert.

                         

                        --------------------------

                        SWO: Juniper Chassis

                        KB0024064

                        22 views

                        Juniper Chassis

                        This alert triggers when a Juniper router or switch reports more than ten (10) syslog messages from chassism in an hour (60 minutes).

                        The alert clears if fewer than ten (10) syslog messages from chassism are received in the last hour (60 minutes).

                        Impact to Customers:

                        This normally indicates some serious issue that needs to be investigated. In most cases there might not be an immediate impact but it could indicate some Access Points are not getting power, that one power supply has failed, or some other foreshadowing of something major.

                        Remediation Steps:

                        1. Check the device is reachable
                        2. Check syslog for the messages
                        3. Use your Juniper account to search for the message and determine its impact,
                          the default impact should be 3 - Low
                          if adjacent devices (Access points, UPS) are impacted increase the impact to 2 - Medium
                        4. Consider if an out of hours page is necessary and increase urgency to 1 if needed.

                        Contact Info

                        Network Core / Layer_3 for routers, and icas switches

                        NIM -- for other switches

                        Alert Definition

                        [links to JIRA and alert definition]
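The trigger and clear conditions in the runbook above amount to a count over a sliding one-hour window. A minimal sketch of that logic follows, assuming timestamps in seconds that arrive in order. The class and method names are hypothetical and this is not how Orion evaluates the alert internally. The runbook does not say what happens at exactly ten messages, so the sketch simply keeps the previous state in that case.

```python
from collections import deque
from typing import Deque

WINDOW_SECONDS = 60 * 60  # one hour (60 minutes), per the runbook
THRESHOLD = 10            # ten messages, per the runbook


class ChassisAlert:
    """Tracks chassism syslog timestamps for one device (illustrative only)."""

    def __init__(self) -> None:
        self._times: Deque[float] = deque()
        self.active = False

    def record(self, timestamp: float) -> None:
        """Note one chassism syslog message; timestamps must be non-decreasing."""
        self._times.append(timestamp)

    def evaluate(self, now: float) -> bool:
        """Update and return the alert state as of `now`."""
        # Drop messages that are an hour old or older.
        while self._times and self._times[0] <= now - WINDOW_SECONDS:
            self._times.popleft()
        count = len(self._times)
        if count > THRESHOLD:
            self.active = True    # more than ten in the last hour: trigger
        elif count < THRESHOLD:
            self.active = False   # fewer than ten in the last hour: clear
        # Exactly ten: unspecified in the runbook, so keep the previous state.
        return self.active
```

Keeping the state unchanged at exactly ten gives a small amount of hysteresis, so the alert does not flap when the count sits on the boundary.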

                  • Re: Multi-Tenant Support
                    spyfly

                    Loving the ideas, keep them coming. I really like the runbook idea; maybe I can get this added to our change request process.

                     

                    Really liking the JIRA idea. Currently we run a home-brew system that is outdated, and the original creator is no longer around. It was designed more for networking devices, and the flow does not work well for application or server monitoring.