12 Replies Latest reply on Jan 30, 2014 12:44 PM by nicole pauls

    After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"

    ian.brown

      Hi, we've been using LEM v5.6.0 for the last three to four months, and we have noticed that after an undetermined period of time (anywhere from as little as 24 hours to 15 days) the LEM console stops receiving data from the LEM virtual machine and displays "Sample Data" instead. The virtual machine is still running and, from what I can tell, is fully functioning (I can get through the command-line menus etc.), and the events per minute are even still being reported, but lots of other data is missing and shown as "Sample Data". A quick reboot of the appliance via the virtual machine console resolves this, but I'm pretty sure I shouldn't have to reboot a monitoring tool, and this is far from an uncommon occurrence.

      Here is a screen shot of the "Sample Data" -

      Sample_Data_error.JPG

       

      I've tried raising the VM memory from 8 GB to 16 GB of RAM to see if it was running out of memory, but it's made little to no difference to this error. The specification of the server we're running the VM on is as follows:

       

      OS Name    Microsoft Windows Server 2008 R2 Standard   

      Version    6.1.7601 Service Pack 1 Build 7601  

      OS Manufacturer    Microsoft Corporation   

      System Name    *snip*   

      System Manufacturer    HP   

      System Model    ProLiant DL380p Gen8   

      System Type    x64-based PC   

      Processor    Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz, 2900 Mhz, 6 Core(s), 12 Logical Processor(s)   

      BIOS Version/Date    HP P70, 8/20/2012   

      SMBIOS Version    2.7   

      Windows Directory    C:\Windows   

      System Directory    C:\Windows\system32   

      Boot Device    \Device\HarddiskVolume1   

      Locale    United States   

      Hardware Abstraction Layer    Version = "6.1.7601.17514"   

      User Name    Not Available   

      Time Zone    GMT Standard Time   

      Installed Physical Memory (RAM)    32.0 GB   

      Total Physical Memory    32.0 GB   

      Available Physical Memory    12.8 GB   

      Total Virtual Memory    63.9 GB   

      Available Virtual Memory    39.0 GB   

      Page File Space    32.0 GB   

      Page File    C:\pagefile.sys   

       

      Has anyone else had this? It's a real pain having to reboot the appliance to ensure it keeps working. Also, for what it's worth, LEM 5.6.0 has Hotfix 1 installed (I had originally hoped that would solve this particular problem).

        • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
          curtisi

          The sample data usually shows up when there's no real data for the time period.  The first thing I'd check in this scenario, though, is the date config on the LEM.  Under the APPLIANCE menu, enter DATECONFIG and then hit enter four (4) times.  The LEM will return the current date, time and timezone.  Are these correct?  Are they drifting over the course of 24 hours to 15 days?  By default the LEM syncs with the virtual machine host for the time.  Is your host set to sync with a good NTP source?  Your system info says GMT and your profile says the UK, but I would start there.
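To answer the "are they drifting?" question with numbers, it may help to note the DATECONFIG output next to a trusted clock on two or more occasions and compare the offsets. The snippet below is only a rough sketch of that arithmetic: the function names and the sample timestamps are made up for illustration, and nothing here talks to the LEM itself.

```python
from datetime import datetime


def clock_offset_seconds(reference: datetime, appliance: datetime) -> float:
    """Return the appliance clock offset; positive means the appliance is ahead."""
    return (appliance - reference).total_seconds()


def drift_rate_seconds_per_day(samples):
    """Estimate drift from (reference_time, appliance_time) pairs, oldest first.

    Only the first and last pair are used: the result is the change in offset
    divided by the elapsed reference time, in seconds per day.
    """
    (ref_first, app_first), (ref_last, app_last) = samples[0], samples[-1]
    elapsed_days = (ref_last - ref_first).total_seconds() / 86400
    if elapsed_days <= 0:
        raise ValueError("need at least two samples taken at different times")
    change = (clock_offset_seconds(ref_last, app_last)
              - clock_offset_seconds(ref_first, app_first))
    return change / elapsed_days


# Hypothetical readings: reference clock vs. the time reported by DATECONFIG.
readings = [
    (datetime(2014, 1, 6, 9, 0, 0), datetime(2014, 1, 6, 9, 0, 0)),
    (datetime(2014, 1, 8, 9, 0, 0), datetime(2014, 1, 8, 9, 1, 30)),
]
print(drift_rate_seconds_per_day(readings))  # seconds gained per day
```

A positive rate means the appliance clock is running ahead of the reference, a negative rate means it is falling behind.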

           

          There's also a big drop in your "Events per Minute" chart.  Have you run a Database Maintenance in the Reports console to see if events are still getting into the database?  If you run a DISKUSAGE under the APPLIANCE menu, are there events being queued?  Do you have a partition close to or at 100% full?

            • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
              ian.brown

              curtisi wrote:

               

              There's also a big drop in your "Events per Minute" chart.  Have you run a Database Maintenance in the Reports console to see if events are still getting into the database?  If you run a DISKUSAGE under the APPLIANCE menu, are there events being queued?  Do you have a partition close to or at 100% full?

              The events per minute are still active; it's just that I caught it at the start of the counter (set to a constant/live update), so every minute it moves over. As for disk space, it's running on a separate array from the OS drive, with about 400 GB of space where the image and database are all stored.

               

               

              curtisi wrote:

               

              The sample data usually shows up when there's no real data for the time period.  The first thing I'd check in this scenario, though, is the date config on the LEM.  Under the APPLIANCE menu, enter DATECONFIG and then hit enter four (4) times.  The LEM will return the current date, time and timezone.  Are these correct?  Are they drifting over the course of 24 hours to 15 days?  By default the LEM syncs with the virtual machine host for the time.  Is your host set to sync with a good NTP source?  Your system info says GMT and your profile says the UK, but I would start there.

               

              I've taken a look at this now, as it has actually happened again, probably around an hour after I posted this thread. The time is being shown correctly, although I'll have to check again when it next fails. The problem is that it can fail at any point, so it's hard to diagnose. Typically, when I hover over the red attention icon, the error says something along the lines of "failed to retrieve" something (I'll find out exactly what next time it goes on the blink, as I know that's a bit unhelpful). Once the appliance is restarted, all the data from the period between it last working and it showing the "Sample Data" notice is displayed, which makes it a tad confusing.

            • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
              garrethcoleman

              Hi Ian,

               

              I would agree with the others in pointing to time drift. I have seen this behaviour on customers' systems when their host time is significantly adrift. It is also worth checking all your log sources to ensure they are all configured with an NTP server. If you focus all your efforts on the LEM appliance and do not consider the network devices generating the events, potentially with a bad time stamp, you will be missing a trick.

               

              LEM does run an NTP daemon, but you need to disable the time sync with host in the VM properties before you can set an NTP server in the command line.

               

              Also check which RAID level the array hosting LEM is configured with. RAID 5 and RAID 6 are not advised because the parity calculations increase disk write latency. With any database, it is advisable to use RAID 1+0 (min 4 spindles) to ensure disk performance does not impact the functionality of the application on top of the database. If you are only monitoring a handful of nodes, you should not see an issue, but if your events per minute count is high and you intend to run a number of reports and searches simultaneously on top of the events being correlated, you may see high disk I/O latency if not using RAID 1+0.
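As a rough illustration of why parity RAID hurts write-heavy workloads, the sketch below applies the commonly quoted rule-of-thumb write penalties (2 back-end I/Os per write for RAID 1+0, 4 for RAID 5, 6 for RAID 6). Those penalty values, the function name and the per-disk IOPS figure are assumptions for illustration, not measurements from this thread, and the estimate ignores controller caching.

```python
# Rule-of-thumb back-end I/Os needed per front-end write (assumed values).
WRITE_PENALTY = {"RAID 1+0": 2, "RAID 5": 4, "RAID 6": 6}


def effective_iops(disks: int, iops_per_disk: float,
                   write_fraction: float, raid_level: str) -> float:
    """Estimate front-end IOPS for an array given its read/write mix.

    write_fraction is the share of front-end I/O that is writes (0.0 to 1.0).
    Reads cost one back-end I/O, writes cost WRITE_PENALTY[raid_level].
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0.0 and 1.0")
    raw_iops = disks * iops_per_disk
    penalty = WRITE_PENALTY[raid_level]
    return raw_iops / ((1.0 - write_fraction) + write_fraction * penalty)


# Hypothetical array: 4 spindles at 150 IOPS each, 50% writes.
for level in WRITE_PENALTY:
    print(level, round(effective_iops(4, 150, 0.5, level), 1))
```

The more write-heavy the workload (and event ingestion is mostly writes), the wider the gap between RAID 1+0 and the parity levels becomes.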

               

              FWIW I would consider upgrading to LEM 5.7 as it is available at no cost if your maintenance subscription is valid. It takes all of 20 minutes to upgrade the appliance and replace the AIR console and reports application on your workstation.

               

               

              Kind Regards

              Garreth

                • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
                  ian.brown

                  Thanks for the replies. I would have replied sooner, but I've only returned to work this week.

                   

                  In response to Garreth: we are running LEM on a RAID 5 array and do have quite a high number of events, so it could well be latency causing the issues over time. It does seem odd, though, that sometimes once the time starts slipping it does so repeatedly (in the space of a couple of hours), while at other times it is fine for days.

                   

                  Also, as you suggested, I upgraded today to 5.7.0 and applied the .jar file hotfixes to see if this helps resolve any issues. I'll report back my findings after using 5.7.0 for a bit.

                    • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
                      garrethcoleman

                      Ian,

                       

                      If you are able to plan a change from the RAID 5 array to RAID 1+0 for the datastore hosting the LEM appliance, I strongly recommend it. If this is not feasible, then you will need to carefully monitor your disk I/O latency to see if problems persist in 5.7.0. If you don't have many nodes or a high rate of events per minute, you shouldn't see an issue, except potentially when running nDepth searches spanning date ranges greater than 24 hours, as this requires compressed data to be decompressed and written to memory before it can be searched. You can imagine what this does to disk I/O and, in turn, your latency.

                       

                      Garreth

                        • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
                          ian.brown

                          Hi Garreth, at present I don't think we would change the array from its RAID 5 setup, but we would obviously consider doing so as a last resort, depending upon the success of 5.7.0 over the next few weeks. So far I've been impressed, as everything is still functioning correctly since the upgrade on Friday, which is a marked improvement. We did have similar periods before, though, so I'll reserve my judgment until it's been working consistently for a few weeks.

                           

                          Just one quick question, though, about the time slip: what I found slightly odd was that the appliance clock ended up ahead of our system time. I could understand latency making the VM slower, but it seems odd for it to make the time run faster?

                    • Re: After an undetermined period of time LEM 5.6.0 stops providing real data and shows "Sample Data"
                      ian.brown

                      Well, I thought I'd update this post, as it seems (fingers and toes crossed) that the issue has been resolved by upgrading all the agents and the LEM software to the latest release, v5.7.0. I've now been running the system without error for nearly 19 days, which is a vast improvement on the situation in my original post. Thanks for the help given with this problem.