
    Success Stories of gaining operational value from LEM

    byrona

      I would really love to hear specific success stories where people have gained operational value from LEM.  I am hoping that by sharing some stories or examples we might all be able to gain new insights from each other.

       

      Thanks in advance to anybody who takes the time to share!

        • Re: Success Stories of gaining operational value from LEM
          nicole pauls

          We just surveyed our customers, and customers of related products, who use log data for SIEM/IT Ops/Compliance, and got a lot of interesting insight into which features people are using. We're hoping to get some cool stories as part of ongoing research; hopefully we'll get some we can share.

           

          Here are some that come to mind while people start chiming in:

           

          1. Company has a situation where downtime directly costs them money but doesn't invoke any regulatory compliance issues. A virus, an outage, etc., means people are literally not spending money with them, and seconds tick by fast. A security issue causes at least an hour's worth of downtime and could put them out for the entire day. For all these reasons, they have service accounts whose passwords they have to share (think whiteboard with passwords that have admin access to a set of servers) so that a set of operators can fix issues quickly. Their IT team thought they had restricted usage of these accounts via GPO, since they were highly privileged. Not so - we were able to audit usage of these accounts and found people logging on with them and making unexpected changes to their own systems (like adding themselves to local admins, installing software, etc.).

           

          2. We used an example in the first SolarWinds Lab episode of a customer whose firewall kept going down, down, down, regardless of what they did. Their connections were being used up so quickly they thought there was a bug in their firmware. Interface utilization was off the charts. We were able to figure out it was actually a worm - almost every machine in their infrastructure was infected. They were a healthcare org, so it wasn't entirely business-crippling, but all of their remote sites/clinics were isolated (they connected back via VPN, which couldn't establish or maintain connections), which affected patient care, access to records, etc. We were able to resolve it by identifying infected machines, cleaning them up, then continuing to filter and monitor for new infections.

           

          3. When I managed IT for TriGeo before the acquisition, I ran into all kinds of stuff that would have taken me forever without a system aggregating logs. However, the most amusing thing was everyone knowing we used it and coming to me when issues happened before I really knew they were a problem - because they assumed that, in a "big brother" sort of way, I already knew. (Sometimes I did, sometimes I didn't - yet.) I could probably drag out a bunch of stories of how logs saved my bacon or really sped up my job; I'm pretty sure that without it we'd have had to hire "real" IT people other than myself and a couple of people who helped on both helpdesk-level support and our hardware burn-in/imprint process.

           

          Next...

            • Re: Success Stories of gaining operational value from LEM
              byrona

               Thanks for sharing, Nicole!  I would be interested in hearing specifically how the system was configured, or what it was configured to look at, in the specific scenarios that led to the success.

              • Re: Success Stories of gaining operational value from LEM
                nicole pauls

                I've had "respond with more thoughts" on my to-do list for a while but never got around to it... so here are some thoughts on the examples I listed.

                1. Company has a situation where downtime directly costs them money but doesn't invoke any regulatory compliance issues. A virus, an outage, etc., means people are literally not spending money with them, and seconds tick by fast. A security issue causes at least an hour's worth of downtime and could put them out for the entire day. For all these reasons, they have service accounts whose passwords they have to share (think whiteboard with passwords that have admin access to a set of servers) so that a set of operators can fix issues quickly. Their IT team thought they had restricted usage of these accounts via GPO, since they were highly privileged. Not so - we were able to audit usage of these accounts and found people logging on with them and making unexpected changes to their own systems (like adding themselves to local admins, installing software, etc.).

                 

                In this case, their accounts were named consistently - say "svc_XXXXX". We identified them by:

                1. Creating a filter looking for Auth Audit Events or Change Management Events with Source or Destination Account of "svc_*"
                2. Doing an nDepth search for "svc_*" (or "User Name = svc_*")
                3. Running the Resource Configuration and Authentication master reports and filtering to source or destination account of "svc_*" - sometimes we'd just run spot checks of things like UserLogonFailure or UserLogon by User instead of the full big reports that can take quite a while to run.
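
                If it helps to see the matching spelled out, here's a rough Python sketch of the kind of wildcard match those filters/searches/reports are doing. This is purely illustrative - the field names and sample events are hypothetical, and LEM's filter engine does this for you:

                from fnmatch import fnmatch

                SERVICE_ACCOUNT_PATTERN = "svc_*"   # same wildcard used in the filter/search above

                def involves_service_account(event):
                    """True if the event's source or destination account looks like svc_*."""
                    return any(
                        fnmatch(str(event.get(field, "")).lower(), SERVICE_ACCOUNT_PATTERN)
                        for field in ("SourceAccount", "DestinationAccount")
                    )

                # Hypothetical, already-normalized events:
                events = [
                    {"EventName": "UserLogon", "SourceAccount": "jsmith", "DestinationAccount": "svc_backup"},
                    {"EventName": "UserLogon", "SourceAccount": "jsmith", "DestinationAccount": "jsmith"},
                ]
                print([e for e in events if involves_service_account(e)])   # only the svc_backup logon is flagged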

                 

                Since they did have a set of machines this was allowed on, we were more restrictive when we built rules. With filters/searches/reports it wasn't so bad since there wasn't a TON of volume, but alerts had to be more specific. They had agents on all their servers, so we created a Connector Profile of the core systems these accounts COULD log on to (you could also use a User-Defined Group), and then there were two things we were most interested in (both sketched in code after the list below):

                1. Interactive User Logons by these users to machines that weren't in the approved list
                  1. Criteria: UserLogon.LogonType = *Interactive AND UserLogon.DestinationMachine <> List of Approved Machines AND UserLogon.DestinationAccount = svc_*
                  2. In their case they had the action set to email temporarily, then once they were comfortable they set it to use the Log Off User active response.
                2. Any changes made by these users to machines that weren't in the approved list
                  1. Criteria: Change Management Events.SourceAccount = svc_* AND Change Management Events.DestinationMachine <> List of Approved Machines
                  2. In this case, the action was just to email, though they were thinking about logging off the user here too.
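
                As promised above, here's a rough sketch of both rule criteria in plain Python, just to make the logic explicit. The machine list, field names, and action functions are stand-ins - in LEM the rule builder and the active responses (email, Log Off User) do the real work:

                from fnmatch import fnmatch

                APPROVED_MACHINES = {"APP01", "APP02", "DB01"}   # stands in for the Connector Profile / UDG

                def send_email(event):       # placeholder for the email action
                    print("ALERT:", event)

                def log_off_user(event):     # placeholder for the Log Off User active response
                    print("LOG OFF:", event.get("DestinationAccount"), "on", event.get("DestinationMachine"))

                def evaluate(event):
                    svc_account = fnmatch(str(event.get("DestinationAccount", "")), "svc_*") or \
                                  fnmatch(str(event.get("SourceAccount", "")), "svc_*")
                    off_list = str(event.get("DestinationMachine", "")).upper() not in APPROVED_MACHINES

                    # Rule 1: interactive logon by a service account to a non-approved machine
                    if (event.get("EventName") == "UserLogon"
                            and str(event.get("LogonType", "")).endswith("Interactive")
                            and svc_account and off_list):
                        log_off_user(event)

                    # Rule 2: any change event by a service account on a non-approved machine
                    if event.get("EventGroup") == "ChangeManagement" and svc_account and off_list:
                        send_email(event)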

                 

                We looked at creating a rule for logons that would tell us when they might have logged on to other systems (like their own workstations), but we didn't put it in production while I was working with them. In that case you'd create a second UserLogon rule without the LogonType restriction (you also can't use the Log Off User action there, which is why we created a second rule - in their case they used filters pretty extensively, so they set up a filter notification instead).

                 

                2. We used an example in the first SolarWinds Lab episode of a customer whose firewall kept going down, down, down, regardless of what they did. Their connections were being used up so quickly they thought there was a bug in their firmware. Interface utilization was off the charts. We were able to figure out it was actually a worm - almost every machine in their infrastructure was infected. They were a healthcare org, so it wasn't entirely business-crippling, but all of their remote sites/clinics were isolated (they connected back via VPN, which couldn't establish or maintain connections), which affected patient care, access to records, etc. We were able to resolve it by identifying infected machines, cleaning them up, then continuing to filter and monitor for new infections.

                 

                In this case the "canary in the coal mine" was their firewall grinding to a halt and people complaining. Not the best warning, but that's reality, right?

                 

                What we did next was look at their Console and create a filter for ALL firewall events from their firewall (so Any Event.DetectionIP = <firewall's IP> - same idea as the stock "All Firewall Events" filter) and saw data just RIPPING through. (We actually didn't even need to create the filter - it was pretty clear just LOOKING at the Console, but it was hard to tell whether that was normal since we hadn't worked with them before... it just didn't smell right.)

                After seeing that, we headed over to the "All Network Traffic" filter so we could do some more digging. There's a stock "Network Event Trends by Source Machine" widget there, which showed a TON of Source Machines - not entirely useful since we didn't know their network beforehand. We created a similar widget that showed "Network Event Trends by Source Port" (and another for Destination Port) - for a line chart, use Field: Event Name, Show: Count, Versus: Time, Split By: SourcePort (or DestinationPort). One port (source or destination - I can't remember which virus it was...) dominated most of their traffic by a huge margin. A quick Google search for that port and, voilà, malware.
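
                The widget is basically doing a group-by on port. If you wanted to reproduce that idea outside the Console, a rough Python equivalent (with made-up sample data and a made-up port) would look something like this:

                from collections import Counter

                def port_breakdown(events, field="SourcePort"):
                    """Count events per port - the worm's port showed up as one dominant spike."""
                    return Counter(e.get(field) for e in events if e.get(field)).most_common(10)

                # Hypothetical firewall events:
                events = [
                    {"SourceMachine": "10.0.5.21", "SourcePort": 9999},
                    {"SourceMachine": "10.0.5.33", "SourcePort": 9999},
                    {"SourceMachine": "10.0.9.10", "SourcePort": 443},
                ]
                print(port_breakdown(events))   # [(9999, 2), (443, 1)]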

                 

                What we did next was build a rule that looked for several hits to that port and dropped the SourceMachine into a UDG (sketched in code after the list below). This was our "known bad machines" UDG and what they used for cleanup efforts.

                1. Criteria: Network Audit Events.SourcePort = <the bad port we identified before>, threshold: 5 in 30 seconds
                2. Action: Add to UDG, Network Audit Events.SourceMachine
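
                Here's the sketch mentioned above - the same threshold-plus-UDG idea expressed in plain Python. The port number, timestamps, and field names are hypothetical; in LEM the rule engine and the Add to UDG action do this for you:

                from collections import defaultdict, deque

                BAD_PORT = 9999             # placeholder for the port identified earlier
                THRESHOLD, WINDOW = 5, 30   # 5 hits within 30 seconds

                recent_hits = defaultdict(deque)   # source machine -> timestamps of recent hits
                known_bad_machines = set()         # stands in for the "known bad machines" UDG

                def process(event):
                    if event.get("SourcePort") != BAD_PORT:
                        return
                    src, ts = event["SourceMachine"], event["Timestamp"]
                    hits = recent_hits[src]
                    hits.append(ts)
                    while hits and ts - hits[0] > WINDOW:   # slide the 30-second window
                        hits.popleft()
                    if len(hits) >= THRESHOLD:
                        known_bad_machines.add(src)          # cleanup crew works from this list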

                 

                This built a nice list of systems infected (it was bad - there were a lot of them at first). As they cleaned a system from the list, they removed it from the UDG, and slowly it dwindled. They also had a filter for Network Audit Events.SourceMachine = <known bad machines UDG> to show them whether there was still traffic coming to/from those systems. (You could also create one just for the known bad port.)

                 

                To figure out the history of what had happened, we identified the specific port and traffic, and used nDepth to dig back (e.g. a search for Network Audit Events.SourcePort = <the bad port>). We slid the window back far enough to find when it started. We also tried to dig for malware events from their firewall or AV, but we never spotted any (e.g. a search for VirusAttack events over the same timeframe).
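
                Conceptually, that nDepth dig is just "filter the stored events by the bad port and find the earliest hit." A tiny illustrative sketch (the stored events and port are hypothetical):

                def first_seen(stored_events, bad_port):
                    """Return the earliest timestamp of traffic on the bad port, or None."""
                    hits = (e["Timestamp"] for e in stored_events if e.get("SourcePort") == bad_port)
                    return min(hits, default=None)

                stored = [
                    {"SourcePort": 9999, "Timestamp": 1467100000},
                    {"SourcePort": 443,  "Timestamp": 1467090000},
                ]
                print(first_seen(stored, 9999))   # slide the search window back until this stops moving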

                 

                3. When I managed IT for TriGeo before the acquisition, I ran into all kinds of stuff that would have taken me forever without a system aggregating logs. However, the most amusing thing was everyone knowing we used it and coming to me when issues happened before I really knew they were a problem - because they assumed that, in a "big brother" sort of way, I already knew. (Sometimes I did, sometimes I didn't - yet.) I could probably drag out a bunch of stories of how logs saved my bacon or really sped up my job; I'm pretty sure that without it we'd have had to hire "real" IT people other than myself and a couple of people who helped on both helpdesk-level support and our hardware burn-in/imprint process.

                 

                I used the console a lot, and a lot of our stock rules were really born out of stuff I used on a day-to-day basis. I'll have to rack my brain on what I was actually using, but it included (one of these is sketched in code after the list):

                • Notifications on Account Lockouts (stock rule) - when I got notified of one, I'd check the machine name to see if it matched an expected system and usually just unlock the account if I had time (we had an auto-unlock policy, but people are impatient).
                • Any interactive logons directly to my servers - since only a few people should actually be doing this, if I got an email and didn't know it was happening, I jumped on it immediately.
                • Any logons using domain admin accounts, especially the stock account - our domain admin account was renamed and there were really only a few limited domain admin accounts, so we used runas or a logon directly to a server/DC to do account maintenance. When these accounts were used, it meant business. I would often look at the source account to see who was using it, especially if it was from a workstation and not a server (so they were using "runas").
                • Any viruses - even if they got cleaned, since it might indicate someone ... internet-promiscuous.
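
                The domain admin one is the easiest to show in code - roughly the check below. The account and server names are made up; in LEM this was just a rule/filter on logon events:

                DOMAIN_ADMIN_ACCOUNTS = {"da_admin1", "da_admin2"}   # renamed/limited admin accounts
                SERVERS = {"DC01", "DC02", "FILE01"}

                def check_domain_admin_logon(event):
                    if event.get("DestinationAccount") not in DOMAIN_ADMIN_ACCOUNTS:
                        return
                    source = str(event.get("SourceMachine", "")).upper()
                    note = " (likely runas from a workstation)" if source not in SERVERS else ""
                    print(f"Domain admin logon by {event.get('SourceAccount')} from {source}{note}")

                # Hypothetical example event
                check_domain_admin_logon({
                    "DestinationAccount": "da_admin1",
                    "SourceAccount": "nicole",
                    "SourceMachine": "WKS042",
                })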

                 

                I had filters for....

                • Blocked web traffic (we had a web content filter, so I would watch for blocked content); allowed web traffic (here I'd use widgets that would break it down by hostname or username)
                • Network traffic, with widgets by event type, source/destination ports
                • USB-Defender stuff - who was using USB devices and what files/processes they were accessing (I could actually see if someone was running apps off their USB key, which often wasn't a big deal, but a couple of times we did see stuff you could call "out of policy").

                 

                If I think of more I'll have to add it.

              • Re: Success Stories of gaining operational value from LEM
                fd6232

                Nicole,

                I would also like to know how that was configured. That's the biggest challenge I have with LEM. The built-in reports are great and it's an extremely powerful tool, but I would like to hear more about how your system was configured to address the scenarios listed above.

                • Re: Success Stories of gaining operational value from LEM
                  olgab

                  We have Check Point firewalls. After a while spent trying to make the block-IP active response work, and several support cases opened with SolarWinds, we found that the response works wonders. Hackers often try to issue all kinds of commands against our public servers to try to bring them down or deface the web pages. I was blocking the IPs but couldn't verify where the block command was going - it wasn't in the firewall's SmartDashboard. Well... I found out that the command shows up under the SmartView Monitor Suspicious Activity Rules - voilà, all the IPs I had tried to block were there. From then on, if I see anything suspicious that is repetitive and gets stopped at the cleanup rule, I check the IP, and if it is blacklisted or not yet analyzed I block it. This has helped traffic flow a lot better, and it takes just a click of the mouse.

                  • Re: Success Stories of gaining operational value from LEM
                    George S

                    One of our challenges is that we have limited VPN access licenses. Knowing who is connecting and when works for us on two levels: 1. we can find out who is on and for how long, and 2. our auditors love it!!!

                    • Re: Success Stories of gaining operational value from LEM
                      alaskan

                      I know the original post was a looooong time back, but I wanted to provide some info on what we use LEM for and how it's helped.

                      We currently use LEM as a central location for all of our logs, and off of that we generate alerts for changes we want or need to be made aware of.

                      For example, if someone removes a user from one of the admin groups or adds one, we want an alert for that.

                      We also created some alerts for user lockouts and a few for firewall rules.

                      The only issue I have is that we are a small IT shop and I can't dedicate 100% of my time to LEM, so when a request for a new alert comes up I have to spend a good 30 minutes to an hour relearning how to create LEM alerts.

                      The way the alerts/filters are created takes some getting used to because, to be honest, it's not intuitive (at least it wasn't to us).

                      This is not a huge hit against the product; I just wish it were a bit easier to work with.

                      Also, it would be nice if it could connect to a SQL database. But I imagine that would require an entire rewrite of the product...

                       

                      Overall a good product, though.

                      • Re: Success Stories of gaining operational value from LEM
                        agusst

                        I'll jump in on this even though it's an older post...maybe people will look back on it....

                         

                         

                        I used LEM to solve an issue that seemed to come up every now and then with our production websites.  "Somehow" files were changing on production sites and causing outages... nobody knew who was doing it.  I set up a FIM monitor that triggers and sends an email alert when the production files are changed in any way.  Next thing you know... I got a few alerts... we ID'd who was doing it and corrected the issue.
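
                        For anyone curious what that boils down to conceptually: LEM's FIM connector watches the files for you, but the underlying idea is "baseline the file hashes, alert on any difference." A rough stand-alone sketch of that idea (the web root path is made up):

                        import hashlib
                        import os
                        import time

                        WEB_ROOT = r"C:\inetpub\wwwroot"   # hypothetical production site path

                        def snapshot(root):
                            """Map every file under root to its SHA-256 hash."""
                            hashes = {}
                            for dirpath, _, filenames in os.walk(root):
                                for name in filenames:
                                    path = os.path.join(dirpath, name)
                                    try:
                                        with open(path, "rb") as f:
                                            hashes[path] = hashlib.sha256(f.read()).hexdigest()
                                    except OSError:
                                        continue   # skip files we can't read
                            return hashes

                        baseline = snapshot(WEB_ROOT)
                        while True:
                            time.sleep(60)
                            current = snapshot(WEB_ROOT)
                            changed = [p for p in current if baseline.get(p) != current[p]]
                            changed += [p for p in baseline if p not in current]   # deleted files
                            if changed:
                                print("Production files changed:", changed)   # this is where the email alert fires
                                baseline = current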

                         

                        We had another issue where users were ending up in the Domain Admins group... I cleaned up the group and set up a rule to alert me when someone was added to it.  A few weeks later we found the guy doing it... after reviewing permissions in our AD structure, we found that he was part of a group that had been delegated way too many permissions.  We've since cleaned that up.
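
                        Same idea in rough form: watch Windows security events for a member being added to a sensitive group and alert. The event ID below (4728, "A member was added to a security-enabled global group") and the field names are from memory and illustrative only - verify against your own domain controllers' logs; in LEM itself this was just a rule on the group-change events:

                        WATCHED_GROUPS = {"Domain Admins"}
                        GROUP_ADD_EVENT_ID = 4728   # member added to a security-enabled global group

                        def check_group_change(event):
                            if event.get("EventID") == GROUP_ADD_EVENT_ID and event.get("GroupName") in WATCHED_GROUPS:
                                print(f"ALERT: {event.get('SubjectAccount')} added "
                                      f"{event.get('MemberAccount')} to {event.get('GroupName')}")

                        # Hypothetical example event
                        check_group_change({
                            "EventID": 4728,
                            "GroupName": "Domain Admins",
                            "SubjectAccount": "CORP\\jdoe",
                            "MemberAccount": "CORP\\contractor1",
                        })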
