20 Replies Latest reply on Jul 6, 2018 10:27 AM by bishopolis

    Linux Memory Utilization Monitors and You

    bobross

      Our client is wondering why the values in Solarwinds do not reflect the values found on their servers:

      top - 17:58:42 up  1:44,  1 user,  load average: 0.03, 0.06, 0.06
      Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
      Cpu(s):  3.7%us,  0.2%sy,  0.0%ni, 94.8%id,  1.2%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem:   8174656k total,  1725996k used,  6448660k free,    39772k buffers
      Swap:  8388600k total,        0k used,  8388600k free,   285544k cached

      = ~21% Utilization

      $ free -m
                   total       used       free     shared    buffers     cached
      Mem:          7983       1684       6298          0         39        278
      -/+ buffers/cache:       1366       6616
      Swap:         8191          0       8191

      = ~21% Utilization

      Solarwinds = 17% utilization

      Figuring that this was just a case of SNMP sending slightly different data I tried a basic snmpwalk against memory:

      $ snmpwalk -v 2c -c xxxxxxxxxx localhost Memory
      UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
      UCD-SNMP-MIB::memErrorName.0 = STRING: swap
      UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
      UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
      UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
      UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6446020
      UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14834620
      UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
      UCD-SNMP-MIB::memShared.0 = INTEGER: 0
      UCD-SNMP-MIB::memBuffer.0 = INTEGER: 42552
      UCD-SNMP-MIB::memCached.0 = INTEGER: 285616
      UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
      UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:

      1-(memAvailReal/memTotalReal) = ~21%
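For anyone who wants to check the arithmetic, here is a minimal Python sketch that runs the three sets of figures quoted above (top, free -m, snmpwalk) through the simple "used over total" calculation. The variable and function names are made up for illustration; only the numbers come from the output in this post.

```python
# Figures copied from the top, free -m and snmpwalk output in this post.
top_total_kb, top_used_kb = 8174656, 1725996
free_total_mb, free_used_mb = 7983, 1684
snmp_total_kb, snmp_avail_kb = 8174656, 6446020


def pct(fraction):
    """Convert a 0..1 fraction to a percentage rounded to one decimal."""
    return round(fraction * 100, 1)


top_pct = pct(top_used_kb / top_total_kb)          # used / total
free_pct = pct(free_used_mb / free_total_mb)       # used / total
snmp_pct = pct(1 - snmp_avail_kb / snmp_total_kb)  # 1 - avail / total

# All three land at roughly 21%, so the raw sources agree with each other.
print(top_pct, free_pct, snmp_pct)
```

Since the three sources agree, the gap to the 17% shown by Solarwinds has to come from how the figures are combined, not from the data itself.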

      Even when I manually enter the OIDs I receive the same basic results. 

      $ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.5.0
      UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
      $ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.6.0
      UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6400580
      

      = ~21%

      I'm having a hard time explaining to our client why Solarwinds is reporting a 4% lower utilization than they are seeing on the server itself.  4% could be the difference between an alert being generated or not, so you can see where the dilemma is coming from.

      We have seen similar situations on Linux disk monitors, but in that case we are able to see how the values are being pulled more or less directly from SNMP.  When we can fall back on Solarwinds using the SNMP reported data we are able to explain why utilization levels in Solarwinds do not reflect those on the server itself.  In this case we are really at a loss for an explanation.

      Is Solarwinds using a different OID? If so, is there a way to change the OID that is being used to the ones I just showed above without resorting to a UnDP or something?  Can someone provide me with the formula that is being used to calculate Memory Used on the CPU Load & Memory Utilization module?

      Thanks in advance,

      Bob

        • Re: Linux Memory Utilization Monitors and You
          bobross

          So I did some more investigating.  I ended up resorting to sniffing packets to find out what OIDs showed up during a repoll.

           

          1.3.6.1.4.1.2021.4.5.0 => memTotalReal
          Value (Integer32): 8174656
          
          1.3.6.1.4.1.2021.4.6.0 => memAvailReal
          Value (Integer32): 5961904
          
          1.3.6.1.4.1.2021.4.14.0 => memBuffer
          Value (Integer32): 264164
          
          1.3.6.1.4.1.2021.4.15.0 => memCached
          Value (Integer32): 472976

          Taking this data I was able to approximate the 18% utilization shown by Solarwinds (Current utilization calculated in top was ~27%)...

          1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%
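A quick way to sanity-check this with the sniffed values is the small Python sketch below. The function names are hypothetical, and this is only my reading of the arithmetic, not confirmed Solarwinds code. Note the leading `1 -`: the sum divided by the total is the unutilized share (~82%), and subtracting it from 1 gives the ~18% utilization.

```python
def used_pct_excluding_cache(total_kb, avail_kb, buffer_kb, cached_kb):
    """Percent used when buffers and cache are counted as unutilized."""
    unutilized_kb = avail_kb + buffer_kb + cached_kb
    return (1 - unutilized_kb / total_kb) * 100


def used_pct_naive(total_kb, avail_kb):
    """Percent used when only memAvailReal is counted as unutilized."""
    return (1 - avail_kb / total_kb) * 100


# Values captured from the repoll: memTotalReal, memAvailReal, memBuffer, memCached.
total, avail, buf, cached = 8174656, 5961904, 264164, 472976

solarwinds_style = used_pct_excluding_cache(total, avail, buf, cached)
top_style = used_pct_naive(total, avail)

print(round(solarwinds_style), round(top_style))
```

The first number matches the ~18% shown by Solarwinds and the second matches the ~27% worked out from top.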

          Is this the correct formula? If so, why was this chosen?  I would prefer to have a formula that can be verified with simple system commands like top or free, but simple confirmation that this is the correct formula would be enough to explain to the administrator why he is seeing different results.

            • Re: Linux Memory Utilization Monitors and You
              bobross

              It turns out that both SNMP and free are pulling data directly from /proc/meminfo which does not contain actual utilization levels.  free calculates used space by subtracting free memory from total memory.  That is explanation enough for me to give the admin.

              I'd still like to know why it was decided to use the above formula for memory utilization in Solarwinds.

              Thanks!

                • Re: Linux Memory Utilization Monitors and You
                  bobross

                  A system admin recently sent in a ticket which claims that Solarwinds is not reporting memory data properly for a Linux server.  Wanting to see what the issue was I decided to dive in and see what information I could find. 

                  This is what the admin wrote in detailing the issue:
                  Solarwinds is reporting 20% memory used but in reality almost all of the memory is used. There also do not appear to be any alerts sent on this issue.

                  $ free -m
                               total       used       free     shared    buffers     cached
                  Mem:          7983       7580        402          0        666       5206
                  -/+ buffers/cache:       1707       6275
                  Swap:         8191          0       8191


                  It’s pretty easy to see where he was seeing high utilization:

                  Used/total = % Utilization
                  7580/7983 = 0.949 = ~95%

                  Since Solarwinds was only showing ~20% utilization this is obviously cause for concern… But let’s dig a bit deeper.

                  I decided to take a look at the same system that the admin was referencing and crunch some numbers of my own…

                  Solarwinds says there is 18% Memory Utilization…
                   
                  Let’s take a look at free!
                  $ free -m
                               total       used       free     shared    buffers     cached
                  Mem:          7983       2207       5775          0        306        494
                  -/+ buffers/cache:       1406       6576
                  Swap:         8191          0       8191


                  Hmmm… Using our above formula (Used/total) for memory utilization we get ~28%!  That’s a full 10 percentage points higher than what Solarwinds shows.

                  Let’s check top instead!

                  $ top -b | head -8
                  top - 09:34:01 up 1 day, 17:20,  1 user,  load average: 0.00, 0.00, 0.00
                  Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
                  Cpu(s):  0.3%us,  0.1%sy,  0.0%ni, 99.1%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
                  Mem:   8174656k total,  2261088k used,  5913568k free,   313432k buffers
                  Swap:  8388600k total,        0k used,  8388600k free,   506568k cached


                  Uh oh… top is showing the same 28% utilization.  This is not looking good for us.  But, we all know that Solarwinds is just relying on SNMP data that is being returned by the system, right?

                  $ snmpwalk -v 2c -c xxxxxx localhost memory
                  UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
                  UCD-SNMP-MIB::memErrorName.0 = STRING: swap
                  UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
                  UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
                  UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
                  UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 5913956
                  UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14302556
                  UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
                  UCD-SNMP-MIB::memShared.0 = INTEGER: 0
                  UCD-SNMP-MIB::memBuffer.0 = INTEGER: 313432
                  UCD-SNMP-MIB::memCached.0 = INTEGER: 506568
                  UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
                  UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:


                  Wait a second… There is no value for Memory Used… All SNMP is seeing are the Total and Available!?  Oh well, let’s see what happens when we calculate for utilization using these values…

                  1 - (memAvailReal / memTotalReal) = ~28%

                  Hmmmm… that’s not good, is it?  Solarwinds must be doing something different with that data.

                  We could try Solarwinds’ not so useful SNMPWalk.exe to try and get a huge dump of all the SNMP data that is returned, but that won’t really tell us what OIDs Solarwinds is really polling against…  Let’s take a break and sniff some packets.

                  Fire up Wireshark and set the filter: ip.src == xxx.xxx.xxx.xxx

                  After we force a repoll on the device we get about 15 hits, but we’re only concerned with these four…

                  1.3.6.1.4.1.2021.4.5.0
                  1.3.6.1.4.1.2021.4.6.0
                  1.3.6.1.4.1.2021.4.14.0
                  1.3.6.1.4.1.2021.4.15.0

                  What could these possibly relate to?

                  1.3.6.1.4.1.2021.4.5.0 => memTotalReal
                  1.3.6.1.4.1.2021.4.6.0 => memAvailReal
                  1.3.6.1.4.1.2021.4.14.0 => memBuffer
                  1.3.6.1.4.1.2021.4.15.0 => memCached


                  So it looks like Solarwinds is pulling not only the data for Total and Available but also Buffer and Cache.  With a little bit of creative formulating we find this….

                  1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%
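As a rough illustration, the Python sketch below keys the four values by the OIDs seen in the capture and applies the formula. The `poll` dictionary is a hypothetical stand-in for a poll result, filled with the numbers from the snmpwalk output in this post. It does no SNMP itself and is not taken from Solarwinds.

```python
# The four OIDs seen in the packet capture, mapped to their UCD-SNMP-MIB names.
OID_NAMES = {
    "1.3.6.1.4.1.2021.4.5.0": "memTotalReal",
    "1.3.6.1.4.1.2021.4.6.0": "memAvailReal",
    "1.3.6.1.4.1.2021.4.14.0": "memBuffer",
    "1.3.6.1.4.1.2021.4.15.0": "memCached",
}

# Hypothetical poll result keyed by OID; values are from the snmpwalk above.
poll = {
    "1.3.6.1.4.1.2021.4.5.0": 8174656,
    "1.3.6.1.4.1.2021.4.6.0": 5913956,
    "1.3.6.1.4.1.2021.4.14.0": 313432,
    "1.3.6.1.4.1.2021.4.15.0": 506568,
}


def utilization_from_poll(result):
    """Apply 1 - ((avail + buffer + cached) / total) to an OID-keyed result."""
    values = {OID_NAMES[oid]: v for oid, v in result.items() if oid in OID_NAMES}
    unutilized = values["memAvailReal"] + values["memBuffer"] + values["memCached"]
    return (1 - unutilized / values["memTotalReal"]) * 100


print(round(utilization_from_poll(poll), 1))
```

Dropping memBuffer and memCached from the sum brings the result back to the ~28% that top and free report.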

                  Wowee!  So it turns out that Solarwinds is actually counting Buffered and Cached memory as unutilized space. 

                  But you might be asking yourself (or have an admin asking you)… “Well, why is that number different from the Used values in the free and top commands?”

                  To answer that question let’s do a little dumpster errm… code diving!

                  After downloading the source files for procps utilities we find a few little nuggets of wisdom:

                  sysinfo.c

                  kb_main_used = kb_main_total - kb_main_free;


                  Oh… so the only reason that free is showing used is because some dude coded it that way?  Yep.

                  Digging a little deeper we find this gem:
                  free.c
                  "-/+ buffers/cache: %10Lu %10Lu\n",
                  S(kb_main_used - buffers_plus_cached),
                  S(kb_main_free + buffers_plus_cached)
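To make those two snippets concrete, here is a small Python port of the same arithmetic (a sketch, not the procps code itself). The input figures are the MemTotal, MemFree, Buffers and Cached values from the /proc/meminfo dump in this post. The `to_mb` helper assumes that `free -m` scales kB values by integer division by 1024, which is my assumption about what the S() macro does in that mode.

```python
def free_rows(kb_main_total, kb_main_free, kb_buffers, kb_cached):
    """Reproduce the 'used' figures that free derives from the raw counters."""
    kb_main_used = kb_main_total - kb_main_free        # as in sysinfo.c
    buffers_plus_cached = kb_buffers + kb_cached
    return {
        "mem_used": kb_main_used,
        "bc_used": kb_main_used - buffers_plus_cached,  # "-/+ buffers/cache" used
        "bc_free": kb_main_free + buffers_plus_cached,  # "-/+ buffers/cache" free
    }


def to_mb(kb):
    """Assumed scaling for free -m: integer division by 1024."""
    return kb // 1024


rows = free_rows(8174656, 5914452, 313432, 506568)
print(to_mb(rows["mem_used"]), to_mb(rows["bc_used"]), to_mb(rows["bc_free"]))
```

With those inputs the three results line up with the 2207, 1406 and 6576 figures in the free -m output for this server.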


                  You might be wondering why this is important… remember the free command we ran back at the start?
                  $ free -m
                               total       used       free     shared    buffers     cached
                  Mem:          7983       2207       5775          0        306        494
                  -/+ buffers/cache:       1406       6576
                  Swap:         8191          0       8191


                  Let’s try something creative…

                  “-/+ buffers/cache.used” / Mem.Total = ~18%

                  So it turns out the data was there for the admin to use the whole time… he was just looking at the wrong line.  When we run the numbers he originally reported through the same formula (1707 / 7983) we get ~21%, which is about on par with the ~20% he reported Solarwinds as showing (and, since we don’t have a screenshot of what he was seeing, well within the margin of error for guesstimation).
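If you want to hand an admin something they can run against their own output, the Python sketch below pulls both figures out of the old-style `free -m` layout shown in this thread. The function names are hypothetical, and it only understands output that still has the "-/+ buffers/cache" line.

```python
ADMIN_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:          7983       7580        402          0        666       5206
-/+ buffers/cache:       1707       6275
Swap:         8191          0       8191
"""

SECOND_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:          7983       2207       5775          0        306        494
-/+ buffers/cache:       1406       6576
Swap:         8191          0       8191
"""


def parse_free(output):
    """Return (total, used, used_minus_buffers_cache) from old-style free output."""
    total = used = bc_used = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            total, used = int(fields[1]), int(fields[2])
        elif fields[:2] == ["-/+", "buffers/cache:"]:
            bc_used = int(fields[2])
    return total, used, bc_used


def utilization_pair(output):
    """Return (Used/total, '-/+ buffers/cache' used / total) as percentages."""
    total, used, bc_used = parse_free(output)
    return used / total * 100, bc_used / total * 100


for sample in (ADMIN_OUTPUT, SECOND_OUTPUT):
    naive, adjusted = utilization_pair(sample)
    print(round(naive), round(adjusted))
```

The first figure on each line is what the admin was looking at, and the second is the one that tracks what Solarwinds shows.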

                  Without getting into the details, top and vmstat are also part of the same utility package.  Vmstat is nice because it doesn’t do any utilization calculations on its own.

                  Let’s tie this all together and see where all of these different resources are pulling their memory data from…

                  $ cat /proc/meminfo
                  MemTotal:      8174656 kB
                  MemFree:       5914452 kB
                  Buffers:        313432 kB
                  Cached:         506568 kB
                  SwapCached:          0 kB
                  Active:        1803944 kB
                  Inactive:       333024 kB
                  HighTotal:           0 kB
                  HighFree:            0 kB
                  LowTotal:      8174656 kB
                  LowFree:       5914452 kB
                  SwapTotal:     8388600 kB
                  SwapFree:      8388600 kB
                  Dirty:              68 kB
                  Writeback:           0 kB
                  AnonPages:     1317016 kB
                  Mapped:          30708 kB
                  Slab:            90220 kB
                  PageTables:       6608 kB
                  NFS_Unstable:        0 kB
                  Bounce:              0 kB
                  CommitLimit:  12475928 kB
                  Committed_AS:  1784324 kB
                  VmallocTotal: 34359738367 kB
                  VmallocUsed:    267160 kB
                  VmallocChunk: 34359470839 kB
                  HugePages_Total:     0
                  HugePages_Free:      0
                  HugePages_Rsvd:      0
                  Hugepagesize:     2048 kB


                  Wild stuff… /proc/meminfo contains just about the rawest human readable data relating to memory utilization.  And guess what, no utilization information is included…  It is up to the end user to figure out how they want to calculate utilization.  Most system admins rely on a generic formula that simply subtracts free space from total space.  Solarwinds chose to count buffered and cached space as free space.
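To show how both answers fall out of the same raw counters, here is a minimal Python sketch that parses /proc/meminfo-style text and applies each formula. The sample text holds the four relevant lines from the dump in this post, and the function names are made up for illustration.

```python
SAMPLE_MEMINFO = """\
MemTotal:      8174656 kB
MemFree:       5914452 kB
Buffers:        313432 kB
Cached:         506568 kB
"""


def parse_meminfo(text):
    """Map each 'Name: value [kB]' line to an integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def generic_used_pct(info):
    """The 'total minus free' view that top and free report as used."""
    return (1 - info["MemFree"] / info["MemTotal"]) * 100


def cache_as_free_used_pct(info):
    """The view that counts buffers and cache as unutilized space."""
    unutilized = info["MemFree"] + info["Buffers"] + info["Cached"]
    return (1 - unutilized / info["MemTotal"]) * 100


info = parse_meminfo(SAMPLE_MEMINFO)
print(round(generic_used_pct(info)), round(cache_as_free_used_pct(info)))
```

On a live box you could pass the contents of /proc/meminfo to `parse_meminfo` instead of the sample text.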

                  For a little more in depth discussion about why buffered and cached memory are counted as unutilized, visit this handy website: http://www.linuxatemyram.com/

                  Even if you don’t need it you should visit it just to see the awesome title pic.  And ya, apparently this issue is so common that some dude actually used it as the domain name.

                  I hope this helps you understand Solarwinds memory utilization monitors on Linux.  Please feel free to relay this information to any admins that get into a huff about Solarwinds showing a different value than they think is correct.  Of course, be sure that SNMP is returning the correct data, but now that you know the proper formula this can be easily calculated from their top or free information.

                    • Re: Linux Memory Utilization Monitors and You
                      pacetti

                      bobross,

                      Thank you for this excellent piece of code archeology. I'll get with Dev to confirm and determine how best we can provide info more clearly and more in-depth for you and your users/customers.

                      Thanks,

                        • Re: Linux Memory Utilization Monitors and You
                          jase4772

                          Hi,

                          This is something that's annoyed and plagued us for ages.

                          I'm looking at a Linux box now and the CPU and Memory stats show 322 MB used, 751 MB available, but then the volume info shows Physical memory 964 MB used.

                          I was told once that the Memory statistics area was more about the amount of memory the running processes were using which wasn't the same info as reported under the Volume information. Linux allocates all physical memory to itself and then dishes it out after and this was what gave the different results.

                          Hopefully the Solarwinds devs can clarify once and for all the mystery of where all the memory stats come from and which one to trust for when a server should be upgraded.

                          Thanks

                          Jase

                            • Re: Linux Memory Utilization Monitors and You
                              bobross

                              From my above investigation I concluded (personally of course) that the standard CPU & Memory Utilization monitor is a better indicator of overall memory utilization.  I didn't have a chance to test the Physical Memory 'Volume' monitor, but from your numbers I would assume that it is counting Buffered and/or Cached Memory as "Used".

                              With that said, I would also like clarification as to whether or not I have the formula correct.  It is handy that we can now tell some of our more difficult system admins about how Net-SNMP is reporting the data, but a firm answer from SW about how SW itself is calculating the output would be great.

                                • Re: Linux Memory Utilization Monitors and You
                                  neurovish

                                  Hi Bob, I realize this is digging up an old post, but I didn't really see anybody say if your formula was correct ( 1-(memfree+membuffer+memcache)/totalmem = memused% ). Since you dove into pretty great detail, even unleashing the wireshark, I am surprised you did not hit upon the answer (or maybe you did and I missed it). The formula is correct, and nobody should ever trust what Linux reports as "free memory" since it is being very pedantic. True, that will be the amount of memory not used by anything, but it is not the amount that is available for use by an application if needed.

                                  You need to look at Linux's disk caching algorithms, which can be summarized as "Is there memory available? Yes? Well, let's cache some data!". Memory that is not used for anything is not helping you at all. Linux will use this to cache data from disk. Since the memory was unused to begin with, there is no loss, and if you want to read one of those blocks again, then it is already in memory and will save you a disk read. If an application comes along and needs more memory than the system has 100% free, then it will flush cache pages and hand them over. The buffer value is similar, but contains data waiting for writes. Flushing these values will involve the disk, so it is more costly than just dumping the cache. These are both used by the kernel on an as-needed and as-available basis. If a server is well and truly running low on memory, then there will be very little in cache if anything. I normally watch these three values and swap utilisation.

                                   

                                  ...and the reason I ended up on this page in the first place is related, but I did not find the answer here. I am trying to set up an SNMP memory alert in Solarwinds SAM without using an agent, but I don't see a way to use anything other than what SNMP reports...so I am not able to create anything that says 1-(memfree+membuffer+memcache)/totalmem = memused%. I guess I'll keep looking.
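For illustration only, the threshold check described here could be sketched outside of any monitoring product as below. This is plain Python, not a SolarWinds feature: the function name, the `count_cache_as_free` switch and the 90% threshold are all hypothetical. The sample figures are the ones from the admin's `free -m` output earlier in the thread.

```python
def memory_alert(total, free, buffers, cached, threshold_pct,
                 count_cache_as_free=True):
    """Return (alert, used_pct) for 1 - (free [+ buffers + cached]) / total."""
    reclaimable = buffers + cached if count_cache_as_free else 0
    used_pct = (1 - (free + reclaimable) / total) * 100
    return used_pct >= threshold_pct, used_pct


# Admin's figures in MB: total, free, buffers, cached. Threshold is made up.
adjusted = memory_alert(7983, 402, 666, 5206, threshold_pct=90)
strict = memory_alert(7983, 402, 666, 5206, threshold_pct=90,
                      count_cache_as_free=False)

print(adjusted[0], round(adjusted[1]))
print(strict[0], round(strict[1]))
```

With buffers and cache counted as free the box sits near 21% and stays quiet; counting only truly free memory puts it near 95% and trips the same threshold.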

                                   

                                  Edit: ...and I think I found what I was looking for under the NPM, which seems more useful for monitoring servers than the SAM.

                            • Re: Linux Memory Utilization Monitors and You
                              aLTeReGo

                              bobross, you hit it right on the money, though I've never seen it so beautifully laid out and articulated. If you don't mind, I'd love to have one of our technical document writers steal pretty liberally from your posting to create a KB article that describes this in as much detail. This truly is very helpful information for new customers who wonder how memory usage is calculated in Orion NPM and SAM. Thank you for sharing!

                              • Re: Linux Memory Utilization Monitors and You
                                rschroeder

                                Even as Syndrome (in The Incredibles) claimed to be "geeking out" over Mr. Incredible's creative hiding behind a dead super hero, from a robot searching for a live super hero, so too am I geeking out over your techno-sleuthing.

                                 

                                 

                                Nicely done!

                                • Re: Linux Memory Utilization Monitors and You
                                  pratikmehta003

                                  Hi bobross

                                   

                                  Since you mentioned that SW calculates the memory which includes buffer, cache and free components, is there any way that we can poll only the free component and alert based on that?

                                   

                                  Our Linux admins have recently questioned us because SW is showing very high utilization, whereas according to them plenty of memory is free, and they don't want us to consider the cache and buffer part in it....

                            • Re: Linux Memory Utilization Monitors and You
                              Leon Adato

                              In the interest of leveraging new NPM features, I've created a poller (not UnDP, but actual replacement poller) that you can use INSTEAD OF the built-in SolarWinds RAM poller that uses the "simpler" calculation. You can download it here: linuxatemyram

                               

                              Enjoy!