
Errors in Event Log from servers with > 32 logical CPUs

I have been noticing a lot of EventID 2006 entries in the Application Event Logs of our 40-core Dell R910 servers.  These machines run Windows Server 2008 R2 Enterprise SP1.

"Unable to read Server Queue performance data from the Server service. The first four bytes (DWORD) of the Data section contains the status code, the second four bytes contains the IOSB.Status and the next four bytes contains the IOSB.Information."

I believe these are coming from the Orion polling engine, as it seems to be a 32-bit process.  This KB article from Microsoft seems to explain what I'm dealing with: 32-bit application cannot query performance "Server Work Queues" counters on Windows Server 2008 R2-based computer that has more than 32 processors

My guess would be that Orion encounters this issue when doing the individual CPU core polling for the machine view in Managed Nodes.  For these systems, SAM only finds data for the first 32 cores.  I thought at first that it was some limitation of the chart type and forgot about it a long time ago, but now I'm not so sure.
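The 32-core ceiling described above is consistent with a DWORD-sized processor mask: a 32-bit value has bits for CPUs 0 through 31 only, so anything beyond that is simply not representable. A minimal sketch of that limitation (the function name is mine, purely for illustration):

```python
def visible_cpus_32bit(total_cpus):
    """Return the CPU indices representable in a 32-bit (DWORD) mask.

    Illustrative only: models why a 32-bit consumer of per-CPU data
    tops out at 32 logical processors, whatever the machine has.
    """
    mask = 0
    for cpu in range(total_cpus):
        if cpu < 32:          # bits 0..31 are all a DWORD can hold
            mask |= 1 << cpu
    return [i for i in range(total_cpus) if mask & (1 << i)]

# On a 40-core box, only the first 32 cores are addressable:
print(len(visible_cpus_32bit(40)))  # prints 32
```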

Is this a known issue with Orion SAM and many-core machines? 

  • This does appear to be an issue with WMI-based polling and >32 CPUs. Unfortunately I don't have a 40-core box to test with offhand. If you switch to SNMP polling, are you able to return data for all CPUs?

  • Good idea.  I have another 40-core server that was using SNMP already, but I hadn't thought to check.  Indeed I do see data on all 40 cores for it in Manage Nodes.  Strange, though: that server also has all the 2006 entries filling up the Application Event Log.  Maybe they're coming from somewhere else after all?

    Next I will unmanage the node and see if the messages are still logged on that machine.  If they are, then I guess I can't blame SAM.
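The SNMP result above makes sense: per-core load over SNMP comes from the HOST-RESOURCES-MIB hrProcessorTable (hrProcessorLoad, OID 1.3.6.1.2.1.25.3.3.1.2), which has one row per logical processor and no 32-entry ceiling. A hedged sketch of the OIDs a poller would walk; the device indexes here are assumed, since real agents assign their own hrDeviceIndex values:

```python
HR_PROCESSOR_LOAD = "1.3.6.1.2.1.25.3.3.1.2"  # HOST-RESOURCES-MIB::hrProcessorLoad

def processor_load_oids(device_indexes):
    """Build the per-CPU hrProcessorLoad OIDs an SNMP poller would GET.

    device_indexes: hrDeviceIndex values of the processor rows
    (illustrative here; on a real agent you would walk the table).
    """
    return [f"{HR_PROCESSOR_LOAD}.{idx}" for idx in device_indexes]

# With 40 logical CPUs the agent exposes 40 rows:
oids = processor_load_oids(range(1, 41))
print(len(oids))  # prints 40
```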

  • It may be that these Event IDs are being generated by Windows Service or Process component monitors in SAM. You may want to try changing the applications assigned to that node to use the 64-bit Job Engine and see whether the Windows events stop appearing. Another option is to change the fetching method for those process and service monitors to RPC instead of WMI.

    (Screenshots: 64bit.png, RPC.png)

  • Hey, interesting.  I knew about the 32/64-bit polling option a long time ago when writing a bunch of PowerShell components, but forgot all about it. :)  I'll give that a shot.

  • Hello,

    Just one note: node properties such as CPU, memory, volumes, interfaces, hardware health information, and asset inventory information are polled through a 32-bit process. You can switch to a 64-bit process only for SAM applications. We will consider the possibility of using a 64-bit process for node properties as well (tracked internally as FB275825).

    Lukas Belza (SolarWinds development)

  • Thank you for the information.  My test of unmanaging the node did prove that the events were coming from SAM polling.  I also ran the same test with our Nagios server and found that Nagios's check_nt plugin causes the same issue, so it's not like this is uncommon.

  • I see this post is a year old now, but is it still the case that the node properties (CPU, memory, volumes, interfaces, hardware health information, asset inventory information) cannot be polled by a 64-bit process?

    I'm having the same problem as the original poster but with just the basic node properties from NPM.  The host that is being polled has 40 logical processors so the 32-bit process is returning the above error every 5 minutes.

    MS released the following patch for the 64-bit processes if you have more than 64 logical processors, but I don't see one for 32-bit: http://support.microsoft.com/kb/2733563

  • I am seeing a lot of these as well. Wondering what a fix might be?

  • You may consider changing the node polling method from WMI to SNMP for nodes that exhibit this behavior.

  • The problem is that even when you use SNMP, it still accesses WMI after it connects to the node.  SNMP connects and then uses WMI to grab performance data, which generates the error.