12 Replies Latest reply: Mar 14, 2012 10:39 AM by ecklerwr1

Linux Memory Utilization Monitors and You

bobross

Our client is wondering why the values in Solarwinds do not reflect the values found on their servers:

top - 17:58:42 up  1:44,  1 user,  load average: 0.03, 0.06, 0.06
Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.7%us,  0.2%sy,  0.0%ni, 94.8%id,  1.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8174656k total,  1725996k used,  6448660k free,    39772k buffers
Swap:  8388600k total,        0k used,  8388600k free,   285544k cached

= ~21% Utilization

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7983       1684       6298          0         39        278
-/+ buffers/cache:       1366       6616
Swap:         8191          0       8191

= ~21% Utilization

Solarwinds = 17% utilization

Figuring that this was just a case of SNMP sending slightly different data, I tried a basic snmpwalk against memory:

$ snmpwalk -v 2c -c xxxxxxxxxx localhost Memory
UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6446020
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14834620
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
UCD-SNMP-MIB::memShared.0 = INTEGER: 0
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 42552
UCD-SNMP-MIB::memCached.0 = INTEGER: 285616
UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:

1-(memAvailReal/memTotalReal) = ~21%
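
The arithmetic above can be checked with a short script. This is only a sketch: the helper names `parse_ucd_memory` and `raw_utilization` are made up for illustration, and the sample text is a trimmed copy of the snmpwalk output above (values in kB).

```python
import re

# Trimmed copy of the snmpwalk output shown above (values are in kB).
SNMPWALK_OUTPUT = """\
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6446020
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 42552
UCD-SNMP-MIB::memCached.0 = INTEGER: 285616
"""

# Matches lines such as "UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656".
LINE_RE = re.compile(r"^UCD-SNMP-MIB::(\w+)\.0 = INTEGER: (\d+)")


def parse_ucd_memory(text):
    """Return {object_name: integer_value} for the INTEGER lines in the walk."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def raw_utilization(mem):
    """1 - (memAvailReal / memTotalReal), expressed as a percentage."""
    return 100.0 * (1 - mem["memAvailReal"] / mem["memTotalReal"])


mem = parse_ucd_memory(SNMPWALK_OUTPUT)
print(f"raw utilization: {raw_utilization(mem):.1f}%")
```

This only reproduces the "free memory alone" calculation; it makes no claim about what Solarwinds does with the same values.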

Even when I manually enter the OIDs I receive the same basic results. 

$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.5.0
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.6.0
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6400580

= ~21%

I'm having a hard time explaining to our client why Solarwinds is reporting a 4% lower utilization than they are seeing on the server itself.  4% could be the difference between an alert being generated or not, so you can see where the dilemma is coming from.

We have seen similar situations on Linux disk monitors, but in that case we are able to see how the values are being pulled more or less directly from SNMP.  When we can fall back on Solarwinds using the SNMP reported data we are able to explain why utilization levels in Solarwinds do not reflect those on the server itself.  In this case we are really at a loss for an explanation.

Is Solarwinds using a different OID? If so, is there a way to change the OID that is being used to the ones I just showed above without resorting to a Universal Device Poller (UnDP) or something?  Can someone provide me with the formula that is being used to calculate Memory Used on the CPU Load & Memory Utilization module?

Thanks in advance,

Bob

  • Re: Linux Memory Utilization Monitors and You
    bobross

    So I did some more investigating.  I ended up resorting to sniffing packets to find out what OIDs showed up during a repoll.

     

    1.3.6.1.4.1.2021.4.5.0 => memTotalReal
    Value (Integer32): 8174656
    1.3.6.1.4.1.2021.4.6.0 => memAvailReal
    Value (Integer32): 5961904
    1.3.6.1.4.1.2021.4.14.0 => memBuffer
    Value (Integer32): 264164
    1.3.6.1.4.1.2021.4.15.0 => memCached
    Value (Integer32): 472976

    Taking this data I was able to approximate the 18% utilization shown by Solarwinds (Current utilization calculated in top was ~27%)...

    1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%

    Is this the correct formula? If so, why was this chosen?  I would prefer to have a formula that can be verified with simple system commands like top or free, but simple confirmation that this is the correct formula would be enough to explain to the administrator why he is seeing different results.
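
A minimal sketch of the comparison, using the four values captured during the repoll above. The function name `utilization_pct` is hypothetical, and whether Solarwinds computes it exactly this way is the open question of this post, not something the code confirms.

```python
# Values captured from the repoll above, in kB.
MEM_TOTAL_REAL = 8174656
MEM_AVAIL_REAL = 5961904
MEM_BUFFER = 264164
MEM_CACHED = 472976


def utilization_pct(total, avail, buffers=0, cached=0):
    """Percentage of memory counted as used.

    Anything passed as buffers or cached is treated as free memory.
    """
    counted_as_free = avail + buffers + cached
    return 100.0 * (1 - counted_as_free / total)


# Free memory alone, the way top and the first row of free present it.
raw = utilization_pct(MEM_TOTAL_REAL, MEM_AVAIL_REAL)

# Buffers and cache counted as free, which approximates the Solarwinds figure.
adjusted = utilization_pct(MEM_TOTAL_REAL, MEM_AVAIL_REAL, MEM_BUFFER, MEM_CACHED)

print(f"free counted alone:                {raw:.1f}%")
print(f"buffers and cache counted as free: {adjusted:.1f}%")
```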

    • Re: Linux Memory Utilization Monitors and You
      bobross

      It turns out that both SNMP and free pull their data directly from /proc/meminfo, which does not contain actual utilization levels.  free calculates used memory by subtracting free memory from total memory.  That is explanation enough for me to give the admin.

      I'd still like to know why it was decided to use the above formula for memory utilization in Solarwinds.

      Thanks!

      • Re: Linux Memory Utilization Monitors and You
        bobross

        A system admin recently sent in a ticket which claims that Solarwinds is not reporting memory data properly for a Linux server.  Wanting to see what the issue was I decided to dive in and see what information I could find. 

        This is what the admin wrote in detailing the issue:
        Solarwinds is reporting 20% memory used but in reality almost all of the memory is used. There also do not appear to be any alerts sent on this issue.

        $ free -m
                     total       used       free     shared    buffers     cached
        Mem:          7983       7580        402          0        666       5206
        -/+ buffers/cache:       1707       6275
        Swap:         8191          0       8191


        It’s pretty easy to see where he was seeing high utilization:

        Used/total = % Utilization
        7580/7983 = 0.949 = ~95%

        Since Solarwinds was only showing ~20% utilization this is obviously cause for concern… But let’s dig a bit deeper.

        I decided to take a look at the same system that the admin was referencing and crunch some numbers of my own…

        Solarwinds says there is 18% Memory Utilization…
         
        Let’s take a look at free!
        $ free -m
                     total       used       free     shared    buffers     cached
        Mem:          7983       2207       5775          0        306        494
        -/+ buffers/cache:       1406       6576
        Swap:         8191          0       8191


        Hmmm… Using our above formula for memory utilization we get ~28%!  That’s a full 10% difference.

        Let’s check top instead!

        $ top -b | head -8
        top - 09:34:01 up 1 day, 17:20,  1 user,  load average: 0.00, 0.00, 0.00
        Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
        Cpu(s):  0.3%us,  0.1%sy,  0.0%ni, 99.1%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
        Mem:   8174656k total,  2261088k used,  5913568k free,   313432k buffers
        Swap:  8388600k total,        0k used,  8388600k free,   506568k cached


        Uh oh… top is showing the same 28% utilization.  This is not looking good for us.  But, we all know that Solarwinds is just relying on SNMP data that is being returned by the system, right?

        $ snmpwalk -v 2c -c xxxxxx localhost memory
        UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
        UCD-SNMP-MIB::memErrorName.0 = STRING: swap
        UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
        UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
        UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
        UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 5913956
        UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14302556
        UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
        UCD-SNMP-MIB::memShared.0 = INTEGER: 0
        UCD-SNMP-MIB::memBuffer.0 = INTEGER: 313432
        UCD-SNMP-MIB::memCached.0 = INTEGER: 506568
        UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
        UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:


        Wait a second… There is no value for Memory Used… All SNMP is seeing are the Total and Available!?  Oh well, let’s see what happens when we calculate for utilization using these values…

        1 – (memAvailReal / memTotalReal) = ~28%

        Hmmmm… that’s not good, is it?  Solarwinds must be doing something different with that data.

        We could try Solarwinds’ not so useful SNMPWalk.exe to try and get a huge dump of all the SNMP data that is returned, but that won’t really tell us what OIDs Solarwinds is really polling against…  Let’s take a break and sniff some packets.

        Fire up Wireshark and set the display filter: ip.src == xxx.xxx.xxx.xxx

        After we force a repoll on the device we get about 15 hits, but we’re only concerned with these four…

        1.3.6.1.4.1.2021.4.5.0
        1.3.6.1.4.1.2021.4.6.0
        1.3.6.1.4.1.2021.4.14.0
        1.3.6.1.4.1.2021.4.15.0

        What could these possibly relate to?

        1.3.6.1.4.1.2021.4.5.0 => memTotalReal
        1.3.6.1.4.1.2021.4.6.0 => memAvailReal
        1.3.6.1.4.1.2021.4.14.0 => memBuffer
        1.3.6.1.4.1.2021.4.15.0 => memCached


        So it looks like Solarwinds is pulling not only the data for Total and Available but also Buffer and Cache.  With a little bit of creative formulating we find this….

        1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%

        Wowee!  So it turns out that Solarwinds is actually counting Buffered and Cached memory as unutilized space. 

        But you might be asking yourself (or have an admin asking you)… “Well, why is that number different from the Used values in the free and top commands?”

        To answer that question let’s do a little dumpster errm… code diving!

        After downloading the source files for procps utilities we find a few little nuggets of wisdom:

        sysinfo.c

        kb_main_used = kb_main_total - kb_main_free;


        Oh… so the only reason that free is showing used is because some dude coded it that way?  Yep.

        Digging a little deeper we find this gem:
        free.c
        "-/+ buffers/cache: %10Lu %10Lu\n",
        S(kb_main_used - buffers_plus_cached),
        S(kb_main_free + buffers_plus_cached)
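
To see what those two expressions produce, here is a rough Python rendering of the same arithmetic. It is a sketch, not a port of procps: the function names are invented, the input values are taken from the /proc/meminfo listing further down in this post, and the kB-to-MB conversion assumes that `free -m` truncates with a 10-bit shift.

```python
def kb_to_mb(kb):
    # Assumption: free -m drops the fractional part, i.e. a right shift by 10 bits.
    return kb >> 10


def free_rows(mem_total_kb, mem_free_kb, buffers_kb, cached_kb):
    """Rebuild the 'Mem:' and '-/+ buffers/cache:' rows of `free -m` (values in MB)."""
    used_kb = mem_total_kb - mem_free_kb          # kb_main_used
    buffers_plus_cached = buffers_kb + cached_kb  # buffers_plus_cached
    return {
        # total, used, free
        "Mem": (kb_to_mb(mem_total_kb), kb_to_mb(used_kb), kb_to_mb(mem_free_kb)),
        # used minus buffers/cache, free plus buffers/cache
        "-/+ buffers/cache": (
            kb_to_mb(used_kb - buffers_plus_cached),
            kb_to_mb(mem_free_kb + buffers_plus_cached),
        ),
    }


# MemTotal, MemFree, Buffers and Cached from the /proc/meminfo listing in this post.
rows = free_rows(8174656, 5914452, 313432, 506568)
for label, values in rows.items():
    print(label, values)
```

With those inputs the rebuilt rows line up with the `free -m` output shown earlier in this post (7983 / 2207 / 5775 and 1406 / 6576).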


        You might be wondering why this is important… remember the free command we ran back at the start?
        $ free -m
                     total       used       free     shared    buffers     cached
        Mem:          7983       2207       5775          0        306        494
        -/+ buffers/cache:       1406       6576
        Swap:         8191          0       8191


        Let’s try something creative…

        “-/+ buffers/cache.used” / Mem.Total = ~18%

        So it turns out the data was there for the admin to use the whole time… he was just looking at the wrong row.  When we run the numbers he originally reported through the same formula (1707 / 7983) we get ~21%, which is about on par with what he reported Solarwinds as showing (and, without a screenshot of what he was seeing, well within the margin of error for a guesstimate).
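
For anyone who wants to run both ratios straight from `free -m` output, here is a small sketch. The `parse_free` helper is hypothetical, and it assumes the older output layout shown in this post; as far as I know, newer procps-ng releases drop the "-/+ buffers/cache" row in favor of an "available" column, so this would need adjusting there.

```python
# The admin's original `free -m` output from the ticket (values in MB).
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7580        402          0        666       5206
-/+ buffers/cache:       1707       6275
Swap:         8191          0       8191
"""


def parse_free(text):
    """Map each row label ('Mem', '-/+ buffers/cache', 'Swap') to its integer columns."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        columns = rest.split()
        if sep and columns:  # the header row has no colon and is skipped
            rows[label.strip()] = [int(value) for value in columns]
    return rows


rows = parse_free(FREE_OUTPUT)
total = rows["Mem"][0]

# Used / total, the number the admin was looking at.
naive = 100.0 * rows["Mem"][1] / total

# "-/+ buffers/cache" used / total, which treats buffers and cache as free.
adjusted = 100.0 * rows["-/+ buffers/cache"][0] / total

print(f"used / total:                   {naive:.0f}%")
print(f"-/+ buffers/cache used / total: {adjusted:.0f}%")
```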

        Without getting into the details, top and vmstat are also part of the same utility package (procps).  vmstat is nice because it doesn’t do any utilization calculations on its own.

        Let’s tie this all together and see where all of these different resources are pulling their memory data from…

        $ cat /proc/meminfo
        MemTotal:      8174656 kB
        MemFree:       5914452 kB
        Buffers:        313432 kB
        Cached:         506568 kB
        SwapCached:          0 kB
        Active:        1803944 kB
        Inactive:       333024 kB
        HighTotal:           0 kB
        HighFree:            0 kB
        LowTotal:      8174656 kB
        LowFree:       5914452 kB
        SwapTotal:     8388600 kB
        SwapFree:      8388600 kB
        Dirty:              68 kB
        Writeback:           0 kB
        AnonPages:     1317016 kB
        Mapped:          30708 kB
        Slab:            90220 kB
        PageTables:       6608 kB
        NFS_Unstable:        0 kB
        Bounce:              0 kB
        CommitLimit:  12475928 kB
        Committed_AS:  1784324 kB
        VmallocTotal: 34359738367 kB
        VmallocUsed:    267160 kB
        VmallocChunk: 34359470839 kB
        HugePages_Total:     0
        HugePages_Free:      0
        HugePages_Rsvd:      0
        Hugepagesize:     2048 kB


        Wild stuff… /proc/meminfo contains just about the rawest human-readable data relating to memory utilization.  And guess what, no utilization information is included…  It is up to the end user to figure out how they want to calculate utilization.  Most system admins rely on a generic formula that simply subtracts free memory from total memory.  Solarwinds chose to count buffered and cached memory as free.
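
Both calculations can be derived from the same four fields of /proc/meminfo. The sketch below parses a trimmed copy of the listing above; `parse_meminfo` and `utilizations` are illustrative names, and the second figure is only my approximation of what Solarwinds reports.

```python
# Trimmed copy of the /proc/meminfo listing above.
MEMINFO_SAMPLE = """\
MemTotal:      8174656 kB
MemFree:       5914452 kB
Buffers:        313432 kB
Cached:         506568 kB
"""


def parse_meminfo(text):
    """Return {field_name: value} using the first number after each colon."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def utilizations(fields):
    """Return (free-only utilization %, buffers-and-cache-as-free utilization %)."""
    total = fields["MemTotal"]
    free_only = fields["MemFree"]
    free_with_cache = free_only + fields["Buffers"] + fields["Cached"]
    return (
        100.0 * (1 - free_only / total),
        100.0 * (1 - free_with_cache / total),
    )


simple, cache_as_free = utilizations(parse_meminfo(MEMINFO_SAMPLE))
print(f"total minus free:          {simple:.1f}%")
print(f"buffers and cache as free: {cache_as_free:.1f}%")
```

On a live Linux host the sample string could be swapped for the contents of /proc/meminfo, for example `open("/proc/meminfo").read()`.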

        For a little more in depth discussion about why buffered and cached memory are counted as unutilized, visit this handy website: http://www.linuxatemyram.com/

        Even if you don’t need it you should visit it just to see the awesome title pic.  And ya, apparently this issue is so common that some dude actually used it as the domain name.

        I hope this helps you understand Solarwinds memory utilization monitors on Linux.  Please feel free to relay this information to any admins that get into a huff about Solarwinds showing a different value than they think is correct.  Of course, be sure that SNMP is returning the correct data, but now that you know the proper formula this can be easily calculated from their top or free information.
