
Our client is wondering why the values in Solarwinds do not reflect the values found on their servers:
top - 17:58:42 up 1:44, 1 user, load average: 0.03, 0.06, 0.06
Tasks: 94 total, 1 running, 93 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.7%us, 0.2%sy, 0.0%ni, 94.8%id, 1.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8174656k total, 1725996k used, 6448660k free, 39772k buffers
Swap: 8388600k total, 0k used, 8388600k free, 285544k cached
= ~21% Utilization
$ free -m
total used free shared buffers cached
Mem: 7983 1684 6298 0 39 278
-/+ buffers/cache: 1366 6616
Swap: 8191 0 8191
= ~21% Utilization
Solarwinds = 17% utilization
Figuring that this was just a case of SNMP sending slightly different data I tried a basic snmpwalk against memory:
$ snmpwalk -v 2c -c xxxxxxxxxx localhost memory
UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6446020
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14834620
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
UCD-SNMP-MIB::memShared.0 = INTEGER: 0
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 42552
UCD-SNMP-MIB::memCached.0 = INTEGER: 285616
UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:
1-(memAvailReal/memTotalReal) = ~21%
Even when I manually enter the OIDs I receive the same basic results.
$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.5.0
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
$ snmpwalk -v 2c -c xxxxxxx localhost .1.3.6.1.4.1.2021.4.6.0
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 6400580
= ~21%
I'm having a hard time explaining to our client why Solarwinds is reporting a 4% lower utilization than they are seeing on the server itself. 4% could be the difference between an alert being generated or not, so you can see where the dilemma is coming from.
We have seen similar situations on Linux disk monitors, but in that case we are able to see how the values are being pulled more or less directly from SNMP. When we can fall back on Solarwinds using the SNMP reported data we are able to explain why utilization levels in Solarwinds do not reflect those on the server itself. In this case we are really at a loss for an explanation.
Is Solarwinds using a different OID? If so, is there a way to change the OID that is being used to the ones I just showed above without resorting to a UDP or something? Can someone provide me with the formula that is being used to calculate Memory Used on the CPU Load & Memory Utilization module?
Thanks in advance,
Bob
So I did some more investigating. I ended up resorting to sniffing packets to find out what OIDs showed up during a repoll.
1.3.6.1.4.1.2021.4.5.0 => memTotalReal Value (Integer32): 8174656
1.3.6.1.4.1.2021.4.6.0 => memAvailReal Value (Integer32): 5961904
1.3.6.1.4.1.2021.4.14.0 => memBuffer Value (Integer32): 264164
1.3.6.1.4.1.2021.4.15.0 => memCached Value (Integer32): 472976
Taking this data I was able to approximate the 18% utilization shown by Solarwinds (Current utilization calculated in top was ~27%)...
1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%
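For anyone who wants to check the arithmetic, here is a minimal Python sketch of both calculations using the four values captured during the repoll. The function names are mine and purely illustrative, and whether Solarwinds computes it exactly this way is still an assumption at this point.

```python
def raw_utilization(total_kb, avail_kb):
    """Utilization the way top and free report it: everything not free counts as used."""
    return 1 - avail_kb / total_kb


def adjusted_utilization(total_kb, avail_kb, buffer_kb, cached_kb):
    """Utilization with buffers and cache treated as available memory."""
    return 1 - (avail_kb + buffer_kb + cached_kb) / total_kb


# Values (in kB) taken from the sniffed repoll above.
mem_total_real = 8174656
mem_avail_real = 5961904
mem_buffer = 264164
mem_cached = 472976

raw_pct = round(raw_utilization(mem_total_real, mem_avail_real) * 100)
adjusted_pct = round(
    adjusted_utilization(mem_total_real, mem_avail_real, mem_buffer, mem_cached) * 100
)
print(f"raw: ~{raw_pct}%  adjusted: ~{adjusted_pct}%")
```

The raw figure lines up with top (~27%) and the adjusted one with what Solarwinds displayed (~18%).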
Is this the correct formula? If so, why was this chosen? I would prefer to have a formula that can be verified with simple system commands like top or free, but simple confirmation that this is the correct formula would be enough to explain to the administrator why he is seeing different results.
It turns out that both SNMP and free are pulling data directly from /proc/meminfo which does not contain actual utilization levels. free calculates used space by subtracting free memory from total memory. That is explanation enough for me to give the admin.
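As a rough illustration of that point, the sketch below parses /proc/meminfo-style text and derives "used" the same way, as total minus free. The helper names are hypothetical, and the sample text is built from the snmpwalk numbers earlier in this thread rather than captured from the real file.

```python
def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integers (kB)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        fields[key.strip()] = int(rest.split()[0])
    return fields


def used_kb(fields):
    """'Used' is not stored anywhere; it is derived as total minus free."""
    return fields["MemTotal"] - fields["MemFree"]


# Illustrative sample only, assembled from values quoted in this thread.
sample = """MemTotal:        8174656 kB
MemFree:         6446020 kB
Buffers:           42552 kB
Cached:           285616 kB
"""

info = parse_meminfo(sample)
used_pct = round(used_kb(info) / info["MemTotal"] * 100)
print(f"used: {used_kb(info)} kB (~{used_pct}%)")
```

On a live box the same function could be fed the contents of /proc/meminfo instead of the sample string.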
I'd still like to know why it was decided to use the above formula for memory utilization in Solarwinds.
Thanks!
A system admin recently sent in a ticket which claims that Solarwinds is not reporting memory data properly for a Linux server. Wanting to see what the issue was I decided to dive in and see what information I could find.
This is what the admin wrote in detailing the issue:
Solarwinds is reporting 20% memory used but in reality almost all of the memory is used. There also do not appear to be any alerts sent on this issue.
$ free -m
total used free shared buffers cached
Mem: 7983 7580 402 0 666 5206
-/+ buffers/cache: 1707 6275
Swap: 8191 0 8191
It’s pretty easy to see where he was seeing high utilization:
Used/total = % Utilization
7580/7983 = 0.949 = ~95%
Since Solarwinds was only showing ~20% utilization this is obviously cause for concern… But let’s dig a bit deeper.
I decided to take a look at the same system that the admin was referencing and crunch some numbers of my own…
Solarwinds says there is 18% Memory Utilization…
Let’s take a look at free!
$ free -m
total used free shared buffers cached
Mem: 7983 2207 5775 0 306 494
-/+ buffers/cache: 1406 6576
Swap: 8191 0 8191
Hmmm… Using our above formula for memory utilization we get ~28%! That’s a full 10% difference.
Let’s check top instead!
$ top -b | head -8
top - 09:34:01 up 1 day, 17:20, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 94 total, 1 running, 93 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.1%sy, 0.0%ni, 99.1%id, 0.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8174656k total, 2261088k used, 5913568k free, 313432k buffers
Swap: 8388600k total, 0k used, 8388600k free, 506568k cached
Uh oh… top is showing the same 28% utilization. This is not looking good for us. But, we all know that Solarwinds is just relying on SNMP data that is being returned by the system, right?
$ snmpwalk -v 2c -c xxxxxx localhost memory
UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 8388600
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 5913956
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 14302556
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000
UCD-SNMP-MIB::memShared.0 = INTEGER: 0
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 313432
UCD-SNMP-MIB::memCached.0 = INTEGER: 506568
UCD-SNMP-MIB::memSwapError.0 = INTEGER: 0
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING:
Wait a second… There is no value for Memory Used… All SNMP is seeing are the Total and Available!? Oh well, let’s see what happens when we calculate for utilization using these values…
1 - (memAvailReal / memTotalReal) = ~28%
Hmmmm… that’s not good, is it? Solarwinds must be doing something different with that data.
We could try Solarwinds’ not so useful SNMPWalk.exe to try and get a huge dump of all the SNMP data that is returned, but that won’t really tell us what OIDs Solarwinds is really polling against… Let’s take a break and sniff some packets.
Fire up Wireshark and set the display filter: ip.src == xxx.xxx.xxx.xxx
After we force a repoll on the device we get about 15 hits, but we’re only concerned with these four…
1.3.6.1.4.1.2021.4.5.0
1.3.6.1.4.1.2021.4.6.0
1.3.6.1.4.1.2021.4.14.0
1.3.6.1.4.1.2021.4.15.0
What could these possibly relate to?
1.3.6.1.4.1.2021.4.5.0 => memTotalReal
1.3.6.1.4.1.2021.4.6.0 => memAvailReal
1.3.6.1.4.1.2021.4.14.0 => memBuffer
1.3.6.1.4.1.2021.4.15.0 => memCached
So it looks like Solarwinds is pulling not only the data for Total and Available but also Buffer and Cached. With a little bit of creative formulating we find this….
1 - ((memAvailReal + memBuffer + memCached) / memTotalReal) = ~18%
Wowee! So it turns out that Solarwinds is actually counting Buffered and Cached memory as unutilized space.
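If you want to reproduce the number yourself, here is a small Python sketch that reads snmpwalk-style text and applies the formula above. It is only an approximation of what Solarwinds appears to do based on the packet capture, not a confirmed implementation, and the function names are made up for illustration.

```python
import re

# The four relevant lines from the snmpwalk output above.
SNMPWALK_OUTPUT = """\
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 8174656
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 5913956
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 313432
UCD-SNMP-MIB::memCached.0 = INTEGER: 506568
"""

LINE_RE = re.compile(r"^UCD-SNMP-MIB::(\w+)\.0 = INTEGER: (\d+)")


def parse_snmpwalk(text):
    """Map object names such as 'memTotalReal' to their integer values (kB)."""
    values = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def memory_used_percent(values):
    """Count free, buffer, and cached memory as unutilized."""
    unutilized = values["memAvailReal"] + values["memBuffer"] + values["memCached"]
    return (1 - unutilized / values["memTotalReal"]) * 100


mem = parse_snmpwalk(SNMPWALK_OUTPUT)
solarwinds_style = memory_used_percent(mem)
top_style = (1 - mem["memAvailReal"] / mem["memTotalReal"]) * 100
print(f"Solarwinds-style: {solarwinds_style:.1f}%  top-style: {top_style:.1f}%")
```

The first figure rounds to the ~18% shown in Solarwinds and the second to the ~28% shown by top and free.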
But you might be asking yourself (or have an admin asking you)… “Well, why is that number different from the Used values in the free and top commands?”
To answer that question let’s do a little dumpster errm… code diving!
After downloading the source files for procps utilities we find a few little nuggets of wisdom:
sysinfo.c
kb_main_used = kb_main_total - kb_main_free;
"-/+ buffers/cache: %10Lu %10Lu\n", S(kb_main_used - buffers_plus_cached), S(kb_main_free + buffers_plus_cached)
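To make those two lines concrete, here is a short Python sketch that mirrors the same arithmetic. It is my own restatement, not procps code, and the function name is hypothetical. The inputs are the kB values from the top output above.

```python
def buffers_cache_row(total_kb, free_kb, buffers_kb, cached_kb):
    """Mirror the '-/+ buffers/cache' row: used minus cache, free plus cache."""
    used_kb = total_kb - free_kb  # 'used' is derived, never read directly
    buffers_plus_cached = buffers_kb + cached_kb
    return used_kb - buffers_plus_cached, free_kb + buffers_plus_cached


total_kb = 8174656
used_adj, free_adj = buffers_cache_row(
    total_kb, free_kb=5913568, buffers_kb=313432, cached_kb=506568
)
adjusted_pct = round(used_adj / total_kb * 100)
print(f"used: {used_adj} kB  free: {free_adj} kB  (~{adjusted_pct}% used)")
```

The adjusted "used" figure works out to roughly the same ~18% that Solarwinds reports, which suggests the "-/+ buffers/cache" row is the one to compare against.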
bobross,
Thank you for this excellent piece of code archeology. I'll get with Dev to confirm and determine how best we can provide info more clearly and more in-depth for you and your users/customers.
Thanks,
Hi,
This is something that's annoyed and plagued us for ages.
I'm looking at a Linux box now and the CPU and Memory stats show 322 MB used, 751 MB available, but then the volume info shows Physical memory 964 MB used.
I was told once that the Memory statistics area was more about the amount of memory the running processes were using which wasn't the same info as reported under the Volume information. Linux allocates all physical memory to itself and then dishes it out after and this was what gave the different results.
Hopefully the Solarwinds devs can clarify once and for all the mystery of where all the memory stats come from and which one to trust when deciding whether a server should be upgraded.
Thanks
Jase
From my above investigation I concluded (personally of course) that the standard CPU & Memory Utilization monitor is a better indicator of overall memory utilization. I didn't have a chance to test the Physical Memory 'Volume' monitor, but from your numbers I would assume that it is counting Buffered and/or Cached Memory as "Used".
With that said, I would also like clarification as to whether or not I have the formula correct. It is handy that we can now tell some of our more difficult system admins about how Net-SNMP is reporting the data, but a firm answer from SW about how SW itself is calculating the output would be great.
bobross, you hit it right on the money, though I've never seen it so beautifully laid out and articulated. If you don't mind, I'd love to have one of our technical document writers steal pretty liberally from your posting to create a KB article that describes this in as much detail. This truly is very helpful information for new customers who wonder how memory usage is calculated in Orion NPM and SAM. Thank you for sharing!
Feel free... but I doubt that a technical writer will be able to capture the adventurous tone :D
I've been informed about this lovely posting by the SAM PM, AlterEgo. As the SAM tech writer, I will indeed write a KB on this next week, "borrowing" your hard work. And I was a fan of the real Bob Ross...I know how he expresses himself. (Unfortunately, tech-writing doesn't permit a great deal of flair. I'll be sure to beat the devil out of it though.) Thanks.
bobross,
We do what we can...sometimes we even go outside to see what that might be like...
Great article!
Outside... what's that?
Great post by the way...