
    Information Gluttony

    Mrs. Y.

      In a recent Smithsonian Magazine article, "Big Data or Too Much Information," the author compares the copious amounts of digital information we're collecting to kudzu, the invasive Asian perennial brought to the U.S. in a misguided attempt to prevent soil erosion.


      From the year 2003 and working backwards to the beginning of human history, we generated, according to IBM’s calculations, five exabytes–that’s five billion gigabytes–of information.

      By last year, we were cranking out that much data every two days. By next year, predicts Turek, we’ll be doing it every 10 minutes.


      To anyone dealing with log correlation, this should sound painfully familiar, albeit on a slightly smaller scale. The writer makes the point that our collection capabilities outpace our ability to process the data gathered.


      Maybe this isn't about collecting the right amount of data, but the right kind of data, where "right" means useful.  The trick is developing algorithms and applications that can make that determination.  Otherwise, we're going to end up as nothing more than data fetishists crushed under the weight of our own obsessive-compulsive desire to collect.

        • Re: Information Gluttony
          Scott McDermott

          Lee Damon and Evan Marcus had an interesting presentation on essentially the same topic at LISA '06.


          Slides: http://static.usenix.org/event/lisa06/tech/slides/damon.pdf

          MP3: http://static.usenix.org/media/events/lisa06/tech/mp3/damon.mp3

            • Re: Information Gluttony
              Mrs. Y.

              Interesting, but I'm not so concerned with keeping it all. I've written a blog post regarding my thoughts on the subject from a security perspective: Thin Slicing Security Data

                • Re: Information Gluttony
                  nicole pauls

                  I wish I could find the blog or news article I read arguing this point, but it's just a glimmer in the back of my head now. The gist of the argument was, "if you're not using the data and can't YET conceive of a concrete reason to use the data, why collect it?" AND that just because you're not collecting it NOW doesn't mean you can't start collecting it in the future as your implementation ramps up or your organization really does have a use case for it.


                  I did find this somewhat relevant post buried in my Twitter retweets; it could easily be a response to this discussion... Idoneous Security: In 50 gigabytes, turn left: data-driven security.

                    • Re: Information Gluttony
                      twgraham

                      I'm an old database guy. It was either Jean-Dominique Warnier or Ken Orr (two of the grandfathers of structured, now relational, database theory) who said "data that is not used will be wrong".  Too much of what we are collecting is just-in-case (or perhaps CYA) data.  Since we are not looking at it, we don't know its accuracy (and, in the case of NMS data, its completeness).  Can we trust it to be reliable in an after-the-fact investigation if we are not verifying it as it is collected?

                    • Re: Information Gluttony
                      byrona

                      Mrs. Y


                      I read your blog post and really enjoyed it.  The indicators suggest that all of the data we are collecting isn't really doing much for us, except maybe providing those outside firms something to work with when they come in and show us where the breach is.  In our experience as a service provider, we have not yet found an effective way to use all of the log data we collect aside from setting the occasional alert on known bad issues; the "unknown unknowns" continue to be our problem as well.


                      Also, with the amount of data we are generating increasing at such a fast rate, it's becoming difficult to meet some of the compliance requirements out there with regard to both collecting it all and then retaining it for a long period of time.

                        • Re: Information Gluttony
                          Mrs. Y.

                          Yes, this is what horrifies me. I think it's time to bring in probability experts to assist us in finding new methods. We clearly aren't succeeding with the status quo. This is why I'm fascinated by the application of thin slicing. Are you utilizing NetFlow or sFlow much? I think that's one way to find anomalies on a network.

                            • Re: Information Gluttony
                              byrona

                              I have worked with the flow technologies before and absolutely love having that data.  Unfortunately, as things currently stand, my company doesn't see enough value in it to go out and purchase a tool such as SolarWinds NTA for managing it... at least not when compared to some of the other priorities we have right now.


                              Regarding logs: I know this may sound odd, but I have actually found it more useful to watch the total number of logs flowing in and look for noticeable deviations in that count than to watch the logs themselves.  I then go take a look at which system(s) are responsible for the increase and often uncover problems.  It seems an odd way to find problems, but I have had some success with it.
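
                              Roughly, the idea looks like the Python sketch below.  This is just an illustration, not what we actually run; it assumes classic "MMM DD HH:MM:SS host ..." syslog text on stdin, and the 60-minute baseline and 2x threshold are numbers I made up for the example.

                              #!/usr/bin/env python
                              # Sketch: flag minutes where total syslog volume jumps well above
                              # the recent baseline, then show which hosts contributed most.
                              import sys
                              from collections import Counter, deque

                              BASELINE_MINUTES = 60   # minutes of history in the baseline (made up)
                              THRESHOLD = 2.0         # alert when a minute is 2x the baseline average

                              history = deque(maxlen=BASELINE_MINUTES)   # totals for recent minutes

                              def check(minute, total, per_host):
                                  # Compare a finished minute against the rolling average.
                                  if history:
                                      avg = sum(history) / float(len(history))
                                      if total > THRESHOLD * avg:
                                          top = ", ".join("%s (%d)" % hc
                                                          for hc in per_host.most_common(3))
                                          print("%s: %d msgs (baseline ~%d)  top sources: %s"
                                                % (minute, total, int(avg), top))
                                  history.append(total)

                              current, total, per_host = None, 0, Counter()
                              for line in sys.stdin:
                                  parts = line.split()
                                  if len(parts) < 4:
                                      continue                # not a syslog-looking line
                                  minute = line[:12]          # "MMM DD HH:MM" from the timestamp
                                  host = parts[3]             # hostname field in classic syslog
                                  if minute != current:
                                      if current is not None:
                                          check(current, total, per_host)
                                      current, total, per_host = minute, 0, Counter()
                                  total += 1
                                  per_host[host] += 1
                              if current is not None:
                                  check(current, total, per_host)

                              You could do the same per source instead of per minute to catch a box going quiet, too.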

                      • Re: Information Gluttony
                        jswan

                        Mrs. Y. -- It's nice to see you here in addition to the Twitterverse. I too enjoyed your post on thin-slicing. I'm curious to hear more about the types of data you find overwhelming in an SP security environment. Are you doing a lot of DDoS detection and mitigation? Have you been able to use flow-spec yet?


                        Flow data is actually pretty small if it's compressed well. For security, the trick is knowing what to do with it and making sure that your queries are producing accurate output in a short-enough time frame that you'll actually spend time with the analyzer.


                        On the traditional logging front, one "thin slicing" tidbit I found interesting from one of the public IR reports (I think it was the Verizon one) was that a decent portion of successful attacks were correlated either with anomalous line lengths in syslog or with a sudden increase or decrease in syslog volume. Obviously not something you can count on heavily, but since it's so easy to alert on both of those heuristics, it's probably worth monitoring.
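
                        For what it's worth, here's a minimal sketch of the line-length half in Python (the window size and cutoff are arbitrary numbers, and it just reads plain log text on stdin); the volume half is little more than a counter per time bucket compared against a rolling average, much like byrona described above.

                        #!/usr/bin/env python
                        # Minimal sketch: flag log lines whose length is far outside the
                        # norm of the last WINDOW lines.  The numbers below are arbitrary.
                        import sys
                        from collections import deque
                        from math import sqrt

                        WINDOW = 5000    # recent lines used as the baseline
                        CUTOFF = 6.0     # flag lines more than 6 standard deviations out

                        lengths = deque(maxlen=WINDOW)
                        for lineno, line in enumerate(sys.stdin, 1):
                            n = len(line.rstrip("\n"))
                            if len(lengths) == WINDOW:
                                # Recomputing the stats each line is slow but keeps it simple.
                                mean = sum(lengths) / float(WINDOW)
                                std = sqrt(sum((x - mean) ** 2 for x in lengths) / WINDOW) or 1.0
                                if abs(n - mean) / std > CUTOFF:
                                    print("line %d: length %d (mean %.0f, stddev %.1f): %s"
                                          % (lineno, n, mean, std, line.rstrip()[:120]))
                            lengths.append(n)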

                          • Re: Information Gluttony
                            Mrs. Y.

                            Thanks for the feedback. Logs (IDS and syslog) seem to be the most overwhelming. We're a small service provider and average 130 GB a day in firewall logs alone. The article I referenced in my original post indicates that this trend is only going to get worse. I've been BEGGING for flow data, but I'm having a hard time convincing people of its value where I work now.


                            Was that information in the Verizon 2012 breach report? Can you get me the reference for that? The anomalous line length or sudden increase/decrease are perfect components of a decision tree for alerting. If you can send me the report, I would really appreciate it. Thanks so much for the tip.

                              • Re: Information Gluttony
                                jswan

                                See page 54 of the Verizon 2012 DBIR. They don't say explicitly, but I would think this would be most useful with host-based logs (web server, host firewall, HIDS, mod_security, etc.) that either produce interesting errors when strange stuff happens or are the logs a successful attacker is likely to tamper with first.


                                Mandiant's Highlighter tool has a nice feature that allows you to visualize the line lengths for an entire file at once if you're inspecting it manually. Otherwise "wc -L" is a quick and dirty way to get line length info. It would be easy enough to write a script doing some more sophisticated analysis too... hmm.
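
                                Something along these lines, maybe; a rough sketch in Python (the bucket width and the number of outliers shown are arbitrary choices) that prints a text histogram of line lengths for a file plus the longest few lines, which is roughly what Highlighter shows visually:

                                #!/usr/bin/env python
                                # Rough sketch of "wc -L with more detail": text histogram of line
                                # lengths for a log file, plus the longest few lines.
                                import sys
                                from collections import Counter

                                BUCKET = 20   # histogram bucket width in characters (arbitrary)
                                TOP = 5       # number of longest lines to print (arbitrary)

                                buckets = Counter()
                                longest = []                      # (length, line number, text)
                                with open(sys.argv[1]) as f:
                                    for lineno, line in enumerate(f, 1):
                                        n = len(line.rstrip("\n"))
                                        buckets[n // BUCKET] += 1
                                        longest.append((n, lineno, line.rstrip()))
                                        longest = sorted(longest, reverse=True)[:TOP]

                                scale = max(buckets.values()) if buckets else 1
                                for b in sorted(buckets):
                                    bar = "#" * max(1, buckets[b] * 50 // scale)
                                    print("%4d-%4d chars: %7d %s"
                                          % (b * BUCKET, (b + 1) * BUCKET - 1, buckets[b], bar))

                                print("")
                                print("longest lines:")
                                for n, lineno, text in longest:
                                    print("  line %d (%d chars): %s" % (lineno, n, text[:100]))

                                Invoked as, say, python linelens.py /var/log/messages (the script name is made up, of course).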