12 Replies Latest reply: Nov 7, 2013 10:12 AM by bwicks

SAM component login randomly fails on certain queries

hilbertd21

Current config is: Orion Platform 2013.1.0, SAM 5.5.0, NCM 7.1.1, NPM 10.5, NTA 3.11.0, IVIM 1.6.0, VNQM 4.0.1

 

I'm running a custom application monitor template built from SQL Server User Experience Monitor components, 14 in total, each with a unique query. Consistently, the same 3-4 components randomly go to 'Down', falsely triggering alerts.

 

The template I am using tries SQL Authentication first. The queries are nothing special, and the account logs in and tests fine.

 

SAM EDI BIS template.PNG

The monitor credentials for the target machine, as stored in the SAM credential library, are for a SQL Authenticated account with full sysadmin privileges, and the target machine has unlimited connections enabled.

 

I was able to reproduce the problem from a separate NPM instance, which produced similar errors and 'Down' state events, but at slightly different times.

 

After running debug logging on the application monitor and reviewing both the login errors as seen from the target instance and its security log, it appears that when poll requests go out, these 3-4 queries report login failures (the others log in just fine), are denied access, and go to a 'Down' state. It's totally random.

 

One of our server admins shared the following (I've edited out the account and domain):

 

"This is pretty standard log for audit failure. Really it is pretty cut and dry, the user account or password are wrong. There are 16,000 of these going back just to 9/5 for this account. Almost all of these are from dxxxxxxxx. Some have the user account in the request like this:

 

Account For Which Logon Failed:

  Security ID: NULL SID

  Account Name: example

  Account Domain: 10.10.10.10 (example)

 

 

Failure Information:

  Failure Reason: Unknown user name or bad password.

  Status: 0xc000006d

  Sub Status: 0xc0000064

 

Others the account info is missing:

 

Subject:

  Security ID: NULL SID

  Account Name: -

  Account Domain: -

  Logon ID: 0x0

 

I have only seen this when the information is in fact missing or incorrect in the login request."

 

The SAM 5.5 debug logging shows that SQL Authentication is tried first, fails, and then Windows Authentication is tried. What our admin means by 'others', I think, is those attempts by SAM to use Windows Authentication, which doesn't pass anything along, so the server sees blanks. But the first 'failure reason' for unknown user name or bad password is, I think, associated with the first SQL Authentication attempts.

SAM EDI BIS login failure.PNG
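
For reference, the Status and Sub Status values in the admin's log excerpt can be decoded with a small lookup. This is only a minimal sketch: the table holds three well-known NTSTATUS values, and the function names are made up for illustration. Note that 0xc0000064 means the account name was not found at all, rather than that the password was wrong, which would fit a SQL login name being presented in a Windows logon attempt (that reading is an assumption, not something the logs prove).

```python
# Well-known NTSTATUS values that appear in Windows logon-failure events.
# Only a small subset is listed here; this is not a complete table.
NTSTATUS_NAMES = {
    0xC000006D: "STATUS_LOGON_FAILURE",   # generic logon failure
    0xC0000064: "STATUS_NO_SUCH_USER",    # account name does not exist
    0xC000006A: "STATUS_WRONG_PASSWORD",  # account exists, password is wrong
}


def describe_status(code: str) -> str:
    """Translate a hex string such as '0xc000006d' into a symbolic name."""
    value = int(code, 16)
    return NTSTATUS_NAMES.get(value, f"unknown (0x{value:08X})")


def summarize_failure(status: str, sub_status: str) -> str:
    """Combine the Status and Sub Status fields of one event into one line."""
    return f"{describe_status(status)} / {describe_status(sub_status)}"


# Values copied from the log excerpt above.
print(summarize_failure("0xc000006d", "0xc0000064"))
```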

I'm wondering if this is perhaps somehow performance related:

 

1.) I do have a modestly high disk queue length on the Orion SQL instance, it's a RAID 5 setup with NTA, and the tables are in need of a rebuild.

2.) This target node in particular has four application monitors running on it, all using the exact same SQL Authenticated account, so I intentionally disabled some unrelated SQL 'health' monitoring components, thinking it would reduce the frequency of 'Down' states, but I just saw one of the four custom queries go down again. Argh!

 

I'm thinking this may be performance related, so I'm attacking the problem from that angle. But could having, say, four different SQL Authenticated applications, with only about 33 component monitors in total, cause some timeout, or bork the parsing of the username and password at times?

 
  • Re: SAM component login randomly fails on certain queries
    aLTeReGo

    By default, SQL authentication is tried first. If that fails, Windows Authentication is used. If the credentials you are using for the SQL User Experience Monitor are in fact Windows credentials, I recommend enabling "Use Windows Authentication first, then SQL authentication" within the component monitor settings. This should eliminate the authentication errors from appearing in the debug logs and Windows Event Viewer of the monitored SQL server.

     

    SQL User Windows Authentication.png
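
To illustrate why the order matters, here is a minimal sketch of a try-in-order login. It is not SAM's actual code: `try_in_order`, `sql_login`, and `windows_login` are hypothetical stand-ins, and the example assumes the stored credential is really a Windows account. Every method that fails before one succeeds would show up as a login failure on the monitored server.

```python
from typing import Callable, List, Tuple

# One attempt = (label, function returning True when the login succeeds).
Attempt = Tuple[str, Callable[[], bool]]


def try_in_order(attempts: List[Attempt]) -> Tuple[str, List[str]]:
    """Return (method that succeeded or "", methods that failed before it)."""
    failed: List[str] = []
    for name, connect in attempts:
        if connect():
            return name, failed
        failed.append(name)
    return "", failed


# Hypothetical stand-ins: pretend the credential only works as a Windows login.
def sql_login() -> bool:
    return False


def windows_login() -> bool:
    return True


default_order = try_in_order([("sql", sql_login), ("windows", windows_login)])
windows_first = try_in_order([("windows", windows_login), ("sql", sql_login)])

print(default_order)   # ('windows', ['sql'])  -> one failed attempt gets logged
print(windows_first)   # ('windows', [])       -> no failed attempt
```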

    • Re: SAM component login randomly fails on certain queries
      hilbertd21

      Thanks for writing, however

      1.) It is a SQL Authenticated account, and I have 'false' for 'Use Windows Authentication first, then SQL authentication.' (shows blank checkbox as shown by you)

      2.) It works fine for a while, then fails randomly on the same 3-4 specific components.

      3.) I ran duplicate monitoring to same target node, on a different NPM instance, using same template, and it also has similar errors, however, at different times.

      4.) Errors on main NPM instance from today show locking with a process ID of 188

       

      Orion Loop is !=finished error 2nd one.PNG

      and the debug log shows the failure. It looks like it's attempting SQL Authentication first, then Windows, based on the order.

       

      Orion loop log file finished error 2nd one.PNG

      I should also note that I'm running multiple application monitors that coincidentally use the same 'SQL Authenticated' service account, logging into the same instance but different catalogs.

      Orion loop is != example of multiple app mons-diffrnt-catalogs-same-sql-acct.PNG

      The differences are that some have the SQL Server Instance name field filled in, while others don't and simply use the port number to connect. I'm not sure if that matters, as each one is polling something unique (I already checked for duplicates), or maybe they are locking each other?! The odd thing is that I'm not seeing 'Down' state events for the components of the other two application monitors, only for the 3 or 4 components on 'EDI SQL BIS PROD DB-Instance' that randomly go 'Down'.

       

      Here's the log from the SQL instance that was the target of the query, generated at the same time the locking errors occurred.

       

      Orion Loop is !=finished error from SQL instance 2nd one.PNG

      I set up the 2nd NPM polling to see whether the errors were happening at the same time, and found that some overlapped and some did not. Not particularly helpful, so I'm going to disable it.
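
Comparing the two pollers by eye is tedious, so here is a rough sketch of how the 'Down' timestamps from both could be matched up. The function name, the one-minute tolerance, and the sample timestamps are all made up for illustration; real timestamps would have to come from the event logs of each NPM instance.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def overlapping_failures(
    poller_a: List[datetime],
    poller_b: List[datetime],
    tolerance: timedelta = timedelta(minutes=1),
) -> List[Tuple[datetime, datetime]]:
    """Pair up failures from two pollers that occurred within `tolerance`."""
    return [
        (a, b)
        for a in poller_a
        for b in poller_b
        if abs(a - b) <= tolerance
    ]


# Invented sample data, NOT taken from the real logs.
main_poller = [datetime(2013, 9, 5, 10, 0, 0), datetime(2013, 9, 5, 11, 0, 0)]
second_poller = [datetime(2013, 9, 5, 10, 0, 30), datetime(2013, 9, 5, 12, 0, 0)]

pairs = overlapping_failures(main_poller, second_poller)
print(len(pairs), "of", len(main_poller), "failures on the main poller overlap")
```

If most failures overlap, the target server or the network path is the more likely suspect; if few do, the cause is more likely on each poller's side.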

       

      What do you think?

    • Re: SAM component login randomly fails on certain queries
      hilbertd21

      Thanks. Just posted new info. Appreciate your time and interest. It's driving me nuts!

  • Re: SAM component login randomly fails on certain queries
    hilbertd21

    Didn't work. Same components going to 'Down' state. Will plan on trying the upgrade angle for SAM 6. Thanks for the feedback and input.

  • Re: SAM component login randomly fails on certain queries
    ariabell

    I'm suddenly having the same problem with SQL User Experience Monitors on just one SQL server. In the SQL logs it looks like SolarWinds is passing NULL credentials:

     

    Subject:

                    Security ID:                            NULL SID

                    Account Name:                     -

                    Account Domain:                 -

                    Logon ID:                               0x0

     

    Logon Type:                                          3

     

    Account For Which Logon Failed:

                    Security ID:                            NULL SID

                    Account Name:                     solarwinds_sql

                    Account Domain:                 10.10.10.107

     

    Failure Information:

                    Failure Reason:                      Unknown user name or bad password.

                    Status:                                    0xc000006d

                    Sub Status:                             0xc0000064

     

    Process Information:

                    Caller Process ID:  0x0

                    Caller Process Name:            -

     

    Network Information:

                    Workstation Name:              BUR2SW02

                    Source Network Address:    -

                    Source Port:                           -

     

    Detailed Authentication Information:

                    Logon Process:                     NtLmSsp

                    Authentication Package:     NTLM

                    Transited Services:                -

                    Package Name (NTLM only):              -

                    Key Length:                           0

     

    I have upgraded to SAM 6.0, and the components are set to not use Windows Authentication first. Also, it's not a named instance.

    Each time something alerts, I test it live and don't get an error. I keep trying to catch it but can't recreate the error. It appears that this error is returned intermittently and randomly, and then clears on the next poll.

     

    Physical memory usage is high, but that is nothing new; I double-checked the graph to make sure. The error returned when it alerts is:

     

    SQL Server returned an error. A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)

     

    Another one I saw:

    Unexpected error occurred. A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.107:1433