Linux Memory Utilization Monitors and You

Our client is wondering why the values in Solarwinds do not reflect the values found on their servers:

top - 17:58:42 up  1:44,  1 user,  load average: 0.03, 0.06, 0.06
Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.7%us,  0.2%sy,  0.0%ni, 94.8%id,  1.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8174656k total,  1725996k used,  6448660k free,    39772k buffers
Swap:  8388600k total,        0k used,  8388600k free,   285544k cached

= ~21% Utilization

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7983       1684       6298          0         39        278
-/+ buffers/cache:       1366       6616
Swap:         8191          0       8191

= ~21% Utilization

Solarwinds = 17% utilization

Figuring that this was just a case of SNMP sending slightly different data I tried a basic snmpwalk against memory:

$ snmpwalk -v 2c -c xxxxxxxxxx localhost Memory
UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6446020
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14834620
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
UCD-SNMP-MIB::memShared.0 = INTEGER: 0
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 42552
UCD-SNMP-MIB::memCached.0 = INTEGER: 285616
UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:

1-(memAvailReal/memTotalReal) = ~21%

Even when I manually enter the OIDs I receive the same basic results. 

$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.5.0
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.6.0
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6400580

= ~21%

I'm having a hard time explaining to our client why Solarwinds is reporting utilization 4 percentage points lower than they are seeing on the server itself.  Those 4 points could be the difference between an alert being generated or not, so you can see where the dilemma is coming from.

We have seen similar situations on Linux disk monitors, but in that case we are able to see how the values are being pulled more or less directly from SNMP.  When we can fall back on Solarwinds using the SNMP reported data we are able to explain why utilization levels in Solarwinds do not reflect those on the server itself.  In this case we are really at a loss for an explanation.

Is Solarwinds using a different OID? If so, is there a way to change the OID that is being used to the ones I just showed above without resorting to a Universal Device Poller (UnDP) or something?  Can someone provide me with the formula that is being used to calculate Memory Used on the CPU Load & Memory Utilization module?

Thanks in advance,

Bob

  • So I did some more investigating.  I ended up resorting to sniffing packets to find out what OIDs showed up during a repoll.

     

    1.3.6.1.4.1.2021.4.5.0 => memTotalReal
    Value (Integer32): 8174656

    1.3.6.1.4.1.2021.4.6.0 => memAvailReal
    Value (Integer32): 5961904

    1.3.6.1.4.1.2021.4.14.0 => memBuffer
    Value (Integer32): 264164

    1.3.6.1.4.1.2021.4.15.0 => memCached
    Value (Integer32): 472976

    Taking this data I was able to approximate the 18% utilization shown by Solarwinds (Current utilization calculated in top was ~27%)...

    1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%

    Is this the correct formula? If so, why was this chosen?  I would prefer to have a formula that can be verified with simple system commands like top or free, but simple confirmation that this is the correct formula would be enough to explain to the administrator why he is seeing different results.
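For anyone who wants to check the arithmetic, here is a minimal Python sketch using the four values captured during the repoll. The function names are made up for illustration, and the "adjusted" formula is only the one inferred from the packet capture, not a confirmed Solarwinds implementation.

```python
# Values captured from the repoll (kB), as listed in the post above.
mem_total_real = 8174656
mem_avail_real = 5961904
mem_buffer = 264164
mem_cached = 472976


def naive_utilization(total, avail):
    """Used fraction when only truly free memory counts as available."""
    return 1 - avail / total


def adjusted_utilization(total, avail, buffers, cached):
    """Used fraction when buffers and cache are also treated as available.

    This is the formula inferred from the capture, not a documented one.
    """
    return 1 - (avail + buffers + cached) / total


naive = naive_utilization(mem_total_real, mem_avail_real)
adjusted = adjusted_utilization(mem_total_real, mem_avail_real, mem_buffer, mem_cached)

print(f"naive:    {naive:.1%}")     # what top/free "used" implies
print(f"adjusted: {adjusted:.1%}")  # what Solarwinds appears to show
```

The first figure lands near the ~27% seen in top and the second near the ~18% shown by Solarwinds.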

  • It turns out that both SNMP and free are pulling data directly from /proc/meminfo which does not contain actual utilization levels.  free calculates used space by subtracting free memory from total memory.  That is explanation enough for me to give the admin.

    I'd still like to know why it was decided to use the above formula for memory utilization in Solarwinds.

    Thanks!

  • A system admin recently sent in a ticket which claims that Solarwinds is not reporting memory data properly for a Linux server.  Wanting to see what the issue was I decided to dive in and see what information I could find. 

    This is what the admin wrote in detailing the issue:
    Solarwinds is reporting 20% memory used but in reality almost all of the memory is used. There also do not appear to be any alerts sent on this issue.

    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          7983       7580        402          0        666       5206
    -/+ buffers/cache:       1707       6275
    Swap:         8191          0       8191


    It’s pretty easy to see where he was seeing high utilization:

    Used/total = % Utilization
    7580/7983 = 0.949 = ~95%

    Since Solarwinds was only showing ~20% utilization this is obviously cause for concern… But let’s dig a bit deeper.

    I decided to take a look at the same system that the admin was referencing and crunch some numbers of my own…

    Solarwinds says there is 18% Memory Utilization…
     
    Let’s take a look at free!
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          7983       2207       5775          0        306        494
    -/+ buffers/cache:       1406       6576
    Swap:         8191          0       8191


    Hmmm… Using our above formula for memory utilization we get ~28%!  That’s a full 10 percentage points higher than Solarwinds.

    Let’s check top instead!

    $ top -b | head -8
    top - 09:34:01 up 1 day, 17:20,  1 user,  load average: 0.00, 0.00, 0.00
    Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.3%us,  0.1%sy,  0.0%ni, 99.1%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   8174656k total,  2261088k used,  5913568k free,   313432k buffers
    Swap:  8388600k total,        0k used,  8388600k free,   506568k cached


    Uh oh… top is showing the same 28% utilization.  This is not looking good for us.  But, we all know that Solarwinds is just relying on SNMP data that is being returned by the system, right?

    $ snmpwalk -v 2c -c xxxxxx localhost memory
    UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
    UCD-SNMP-MIB::memErrorName.0 = STRING: swap
    UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
    UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
    UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
    UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 5913956
    UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14302556
    UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
    UCD-SNMP-MIB::memShared.0 = INTEGER: 0
    UCD-SNMP-MIB::memBuffer.0 = INTEGER: 313432
    UCD-SNMP-MIB::memCached.0 = INTEGER: 506568
    UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
    UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:


    Wait a second… There is no value for Memory Used… All SNMP is seeing are the Total and Available!?  Oh well, let’s see what happens when we calculate for utilization using these values…

    1 - (memAvailReal / memTotalReal) = ~28%

    Hmmmm… that’s not good, is it?  Solarwinds must be doing something different with that data.

    We could try Solarwinds’ not so useful SNMPWalk.exe to try and get a huge dump of all the SNMP data that is returned, but that won’t really tell us what OIDs Solarwinds is really polling against…  Let’s take a break and sniff some packets.

    Fire up Wireshark and set the display filter: ip.src == xxx.xxx.xxx.xxx

    After we force a repoll on the device we get about 15 hits, but we’re only concerned with these four…

    1.3.6.1.4.1.2021.4.5.0
    1.3.6.1.4.1.2021.4.6.0
    1.3.6.1.4.1.2021.4.14.0
    1.3.6.1.4.1.2021.4.15.0

    What could these possibly relate to?

    1.3.6.1.4.1.2021.4.5.0 => memTotalReal
    1.3.6.1.4.1.2021.4.6.0 => memAvailReal
    1.3.6.1.4.1.2021.4.14.0 => memBuffer
    1.3.6.1.4.1.2021.4.15.0 => memCached


    So it looks like Solarwinds is pulling not only the data for Total and Available but also Buffer and Cache.  With a little bit of creative formulating we find this….

    1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%

    Wowee!  So it turns out that Solarwinds is actually counting Buffered and Cached memory as unutilized space. 
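If you would rather not do this by hand every time, a rough Python sketch like the one below can pull the four values out of saved snmpwalk output and apply the same formula. The helper names `parse_snmpwalk` and `memory_utilization` are hypothetical, the sample text is copied from the walk above, and the formula is the one inferred here rather than anything confirmed by Solarwinds.

```python
import re

# A few lines copied from the snmpwalk output in this post.
SNMPWALK_OUTPUT = """\
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 5913956
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 313432
UCD-SNMP-MIB::memCached.0 = INTEGER: 506568
"""

# Only INTEGER-valued scalar objects are of interest; STRING lines are skipped.
LINE_RE = re.compile(r"^UCD-SNMP-MIB::(\w+)\.0 = INTEGER: (\d+)")


def parse_snmpwalk(text):
    """Return {object name: integer value} for each INTEGER line."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def memory_utilization(values):
    """Treat free + buffers + cached as unutilized (the inferred formula)."""
    unutilized = values["memAvailReal"] + values["memBuffer"] + values["memCached"]
    return 1 - unutilized / values["memTotalReal"]


mem = parse_snmpwalk(SNMPWALK_OUTPUT)
print(f"{memory_utilization(mem):.1%}")
```

With the values from this walk the result comes out just under 18%, in line with what Solarwinds displayed.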

    But you might be asking yourself (or have an admin asking you)… “Well, why is that number different from the Used values in the free and top commands?”

    To answer that question let’s do a little dumpster errm… code diving!

    After downloading the source files for procps utilities we find a few little nuggets of wisdom:

    sysinfo.c

    kb_main_used = kb_main_total - kb_main_free;


    Oh… so the only reason that free is showing used is because some dude coded it that way?  Yep.

    Digging a little deeper we find this gem:
    free.c
    "-/+ buffers/cache: %10Lu %10Lu\n",
    S(kb_main_used - buffers_plus_cached),
    S(kb_main_free + buffers_plus_cached)
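To see what those two snippets do with real numbers, here is a small Python sketch of the same arithmetic. The kB values are taken from the /proc/meminfo dump further down in this post. The names `kb_main_buffers` and `kb_main_cached`, the `to_mb` helper, and the assumption that `free -m` simply truncates kB to whole MB are mine, so treat this as an illustration rather than a port of procps.

```python
# kB values from the /proc/meminfo dump in this post.
kb_main_total = 8174656
kb_main_free = 5914452
kb_main_buffers = 313432  # assumed name, mirrors the snippet's naming style
kb_main_cached = 506568   # assumed name, mirrors the snippet's naming style


def to_mb(kb):
    """Truncate kB to whole MB (assumed to match what `free -m` prints)."""
    return kb >> 10


# sysinfo.c: "used" is simply total minus free.
kb_main_used = kb_main_total - kb_main_free

# free.c: the "-/+ buffers/cache" line moves buffers and cache from used to free.
buffers_plus_cached = kb_main_buffers + kb_main_cached

print("Mem:              ", to_mb(kb_main_total), to_mb(kb_main_used), to_mb(kb_main_free))
print("-/+ buffers/cache:", to_mb(kb_main_used - buffers_plus_cached),
      to_mb(kb_main_free + buffers_plus_cached))
```

The printed figures line up with the `free -m` output shown earlier in this post (7983, 2207, 5775 on the Mem line and 1406, 6576 on the -/+ line).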


    You might be wondering why this is important… remember the free command we ran back at the start?
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          7983       2207       5775          0        306        494
    -/+ buffers/cache:       1406       6576
    Swap:         8191          0       8191


    Let’s try something creative…

    “-/+ buffers/cache.used” / Mem.Total = ~18%

    So it turns out the data was there for the admin to use the whole time… he was just looking at the wrong line.  When we run the numbers he reported originally through the same formula we get ~21%, which is about on par with what he reported Solarwinds as showing (and, without a screenshot of what he was seeing, well within the margin of error for guesstimations).
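The same check can be scripted against the admin's original `free -m` output. The sketch below is only an illustration: `parse_free` is a hypothetical helper, and it assumes the older output layout shown in this post, which includes the "-/+ buffers/cache" line.

```python
# The admin's original `free -m` output, as quoted earlier in this post.
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7580        402          0        666       5206
-/+ buffers/cache:       1707       6275
Swap:         8191          0       8191
"""


def parse_free(text):
    """Return (total, used, used_minus_buffers_cache) in MB."""
    total = used = adjusted_used = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            total, used = int(fields[1]), int(fields[2])
        elif fields[0] == "-/+":
            # fields look like ['-/+', 'buffers/cache:', used, free]
            adjusted_used = int(fields[2])
    return total, used, adjusted_used


total, used, adjusted_used = parse_free(FREE_OUTPUT)
print(f"used / total:                   {used / total:.0%}")
print(f"-/+ buffers/cache used / total: {adjusted_used / total:.0%}")
```

The first line reproduces the ~95% that alarmed the admin and the second the ~21% that matches Solarwinds. Newer versions of free no longer print the "-/+ buffers/cache" line, so this parser would need adjusting there.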

    Without getting into the details, top and vmstat are also part of the same utility package.  vmstat is nice because it doesn’t do any utilization calculations on its own.

    Let’s tie this all together and see where all of these different resources are pulling their memory data from…

    $ cat /proc/meminfo
    MemTotal:      8174656 kB
    MemFree:       5914452 kB
    Buffers:        313432 kB
    Cached:         506568 kB
    SwapCached:          0 kB
    Active:        1803944 kB
    Inactive:       333024 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:      8174656 kB
    LowFree:       5914452 kB
    SwapTotal:     8388600 kB
    SwapFree:      8388600 kB
    Dirty:              68 kB
    Writeback:           0 kB
    AnonPages:     1317016 kB
    Mapped:          30708 kB
    Slab:            90220 kB
    PageTables:       6608 kB
    NFS_Unstable:        0 kB
    Bounce:              0 kB
    CommitLimit:  12475928 kB
    Committed_AS:  1784324 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed:    267160 kB
    VmallocChunk: 34359470839 kB
    HugePages_Total:     0
    HugePages_Free:      0
    HugePages_Rsvd:      0
    Hugepagesize:     2048 kB


    Wild stuff… /proc/meminfo contains just about the rawest human-readable data relating to memory utilization.  And guess what, no utilization information is included…  It is up to the end user to figure out how they want to calculate utilization.  Most system admins rely on a generic formula that simply subtracts free space from total space.  Solarwinds chose to count buffered and cached space as free space.
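Since everything ultimately comes from /proc/meminfo, both calculations can be reproduced straight from that file. The sketch below is a rough illustration with made-up helper names (`parse_meminfo`, `utilization`). It runs against the four relevant lines of the sample above, and the second formula is the inferred Solarwinds behavior, not a confirmed one.

```python
# The four relevant lines from the /proc/meminfo dump above.
MEMINFO_SAMPLE = """\
MemTotal:      8174656 kB
MemFree:       5914452 kB
Buffers:        313432 kB
Cached:         506568 kB
"""


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def utilization(fields, count_cache_as_free):
    """Used fraction, optionally treating buffers and cache as free."""
    available = fields["MemFree"]
    if count_cache_as_free:
        available += fields["Buffers"] + fields["Cached"]
    return 1 - available / fields["MemTotal"]


info = parse_meminfo(MEMINFO_SAMPLE)
# To use the live file instead: info = parse_meminfo(open("/proc/meminfo").read())
print(f"total - free:          {utilization(info, False):.1%}")
print(f"buffers/cache as free: {utilization(info, True):.1%}")
```

The two printed percentages correspond to the ~28% from top/free and the ~18% from Solarwinds.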

    For a little more in depth discussion about why buffered and cached memory are counted as unutilized, visit this handy website: http://www.linuxatemyram.com/

    Even if you don’t need it you should visit it just to see the awesome title pic.  And ya, apparently this issue is so common that some dude actually used it as the domain name.

    I hope this helps you understand Solarwinds memory utilization monitors on Linux.  Please feel free to relay this information to any admins that get into a huff about Solarwinds showing a different value than they think is correct.  Of course, be sure that SNMP is returning the correct data, but now that you know the proper formula this can be easily calculated from their top or free information.

  • bobross,

    Thank you for this excellent piece of code archeology. I'll get with Dev to confirm and determine how best we can provide info more clearly and more in-depth for you and your users/customers.

    Thanks,

  • FormerMember, in reply to pacetti

    Hi,

    This is something that's annoyed and plagued us for ages.

    I'm looking at a Linux box now and the CPU and Memory stats show 322 MB used, 751 MB available but then the volume info shows Physical memory 964 MB used.

    I was told once that the Memory statistics area was more about the amount of memory the running processes were using which wasn't the same info as reported under the Volume information. Linux allocates all physical memory to itself and then dishes it out after and this was what gave the different results.

    Hopefully the Solarwinds devs can clarify once and for all the mystery of where all the memory stats come from and which one to trust when deciding whether a server should be upgraded.

    Thanks

    Jase

  • From my above investigation I concluded (personally of course) that the standard CPU & Memory Utilization monitor is a better indicator of overall memory utilization.  I didn't have a chance to test the Physical Memory 'Volume' monitor, but from your numbers I would assume that it is counting Buffered and/or Cached Memory as "Used".

    With that said, I would also like clarification as to whether or not I have the formula correct.  It is handy that we can now tell some of our more difficult system admins about how Net-SNMP is reporting the data, but a firm answer from SW about how SW itself is calculating the output would be great.

  • bobross, you hit it right on the money, though I've never seen it so beautifully laid out and articulated. If you don't mind, I'd love to have one of our technical document writers steal pretty liberally from your posting to create a KB article that describes this in as much detail. This truly is very helpful information for new customers who wonder how memory usage is calculated in Orion NPM and SAM. Thank you for sharing!

  • Feel free... but I doubt that a technical writer will be able to capture the adventurous tone :D

  • I've been informed about this lovely posting by the SAM PM, AlterEgo. As the SAM tech writer, I will indeed write a KB on this next week, "borrowing" your hard work. And I was a fan of the real Bob Ross...I know how he expresses himself. (Unfortunately, tech-writing doesn't permit a great deal of flair. I'll be sure to beat the devil out of it though.) Thanks.

  • bobross,

    We do what we can...sometimes we even go outside to see what that might be like...