Extreme Variations in information collected/shown by Cisco Prime vs. Solarwinds

Question

We have used Solarwinds (SW) for several years. Recently (about 6 months ago) our VAR gave us a free copy of Cisco Prime (CP), but I'm seeing discrepancies between the two. Today, according to Cisco Prime, one of my switches, in an office where there are complaints of slow performance, had 1 of the 4 CPUs on the 3850 pegged at 98% utilization numerous times. But Solarwinds says no, it's never been that high, not ever. According to Cisco Prime, the highest spike was at 7:15 this morning. When I zoom into that time range in Solarwinds, it says it never happened.

Someone isn't being honest about the information they are getting. I thought maybe it had to do with averaging over a period of time and with how often stats are collected. CP collects every 15 minutes; SW collects every 10. Both use SNMP pollers to pull the health statistics, and both default to a 24-hour reporting period. But if I zoom in on SW to a 1-hour window around the suspected event, it NEVER shows that the CPU was in any kind of crisis. If I continued to use only SW, might I be missing something I should be paying attention to?
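To make the polling-interval idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the per-minute CPU series is invented (a 15% baseline with a 3-minute spike to 98%), and the helper names `sample` and `window_average` are hypothetical. It does not model how either product actually polls or stores data.

```python
def sample(series, interval, offset=0):
    """Values a poller would record if it read one instantaneous value
    every `interval` minutes, starting at minute `offset`."""
    return [series[t] for t in range(offset, len(series), interval)]


def window_average(series, interval):
    """Average of each consecutive `interval`-minute window."""
    averages = []
    for start in range(0, len(series), interval):
        window = series[start:start + interval]
        averages.append(sum(window) / len(window))
    return averages


# Hypothetical data: one hour of per-minute CPU %, 15% baseline,
# with a short spike to 98% during minutes 14, 15 and 16.
cpu = [15] * 60
for minute in (14, 15, 16):
    cpu[minute] = 98

every_10 = sample(cpu, 10)   # reads minutes 0, 10, 20, 30, 40, 50
every_15 = sample(cpu, 15)   # reads minutes 0, 15, 30, 45
avg_10 = window_average(cpu, 10)

print("10-minute samples, max:", max(every_10))
print("15-minute samples, max:", max(every_15))
print("10-minute averages, max:", max(avg_10))
```

With this made-up series, the 15-minute samples happen to land on the spike while the 10-minute samples miss it entirely, and a 10-minute average flattens the spike to well under 50%. Whether that is what is happening here is a separate question, since the CLI output below shows a core staying high rather than spiking briefly.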

Looking at the show processes output, indeed CPU1 is high and seems to stay that way...

MIA_23FL_CORE#show processes cpu sort | exclude 0.0
Core 0: CPU utilization for five seconds: 14%; one minute: 18%;  five minutes: 17%
Core 1: CPU utilization for five seconds: 97%; one minute: 96%;  five minutes: 93%
Core 2: CPU utilization for five seconds: 3%; one minute: 12%;  five minutes: 15%
Core 3: CPU utilization for five seconds: 98%; one minute: 90%;  five minutes: 47%
PID    Runtime(ms) Invoked  uSecs  5Sec     1Min     5Min     TTY   Process
5855   3326174     33968608 151    26.37    26.36    26.41    1088  fed
11947  518512      97207917 155    2.39     3.60     3.39     0     iosd
5857   1915298     28724628 154    0.49     0.46     0.46     0     stack-mgr
11940  3252699     20479147 13     0.10     0.13     0.14     0     wcm

But SW says it isn't happening now and never has.

What's up with this?  NPM is 12.1

jbrannen · Answer

In this case I would believe Prime and the CLI for the actual data on the cores. However, your 3850 may have 1 CPU with 4 cores, with Prime and the CLI reading data for each individual core. Solarwinds usually uses a MIB object higher in the tree that reports the utilization of the CPU as a whole, not of the individual cores.
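As a rough illustration of why a whole-CPU figure can look calm while a core is pegged, here is a small Python sketch using the five-second per-core values from the CLI output in the question. The assumption that the whole-CPU value is a plain mean of the cores is mine, purely for illustration; the object Solarwinds actually polls may be computed differently. The function names are hypothetical.

```python
# Five-second per-core values copied from the "show processes cpu" output
# in the question.
core_five_sec = {"Core 0": 14, "Core 1": 97, "Core 2": 3, "Core 3": 98}


def whole_cpu_average(per_core):
    """ASSUMPTION: the device-level figure is a simple mean of the cores."""
    return sum(per_core.values()) / len(per_core)


def cores_over(per_core, threshold):
    """Names of the cores at or above `threshold` percent."""
    return [name for name, pct in per_core.items() if pct >= threshold]


aggregate = whole_cpu_average(core_five_sec)
hot = cores_over(core_five_sec, 90)

print(f"whole-CPU average: {aggregate:.1f}%")
print(f"cores at or above 90%: {hot}")
```

Under that assumption the four cores average out to 53%, even though Core 1 and Core 3 are both above 90%. A chart of the aggregate alone would never show anything close to 98%.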

From the screenshot of your Solarwinds charts, I would say something is not right with the way Solarwinds is recognizing the device: you are showing different types of CPUs in the chart, and the CPU names do not match and are not complete. You can perform a MIB walk and probably find the OID that has a table of utilization by core, then create a custom poller to collect that data and present it in a custom poller chart to track the individual cores. But if I were you, I would try a rediscovery followed by examining the resources (List Resources) to make sure there are no legacy or orphaned resources attached to that node in Solarwinds.

Cisco has a doc on high CPU on the 3850: Catalyst 3850 Series Switch High CPU Usage Troubleshoot.

d09h · Answer

This may not scale for you, but you could validate the polled information against any traps that you can get sent. That could show you which polling to believe.

neomatrix1217 · Answer

Did you at one point have 6 CPUs and then reduce the number to 3?