6 Replies Latest reply on Feb 22, 2018 9:07 PM by th3rush

    Services stop for no reason

    th3rush

      Upgraded to version 12.2 yesterday - the upgrade went through smooth as silk and, I have to say, was a very painless process.

       

      Everything was working as expected when I left the office - got in the next day and website was a no go.

       

      Jumped on the server and all the services were stopped - clicked the "make it go" button - services restarted and SW was up and going again.

       

      Looked in event logs and nothing other than "Services Stopped" recorded - zero information I can find in event viewer and the Orion logs as to what may have caused it.

       

      Looking for some ideas or a pointer in the right direction, as it's not ideal to have your monitoring system down when you need it most!

       

      Cheers,

       

      Garreth

        • Re: Services stop for no reason
          neomatrix1217

          Strange that all the services were stopped. In the Orion logs, do you see an entry for the business layer, one that might have a timestamp after you left the office?

            • Re: Services stop for no reason
              th3rush

              I can't find anything specific - I did find this in the log file BusinessLayerHost-SOLARWINDS.NPM.BUSINESSLAYER.DLL.log; however, I'm not sure if it's reporting that there was an error while stopping or that it's stopping due to an error.

               

              2018-02-21 23:01:19,610 [83] INFO  SolarWinds.BusinessLayerHost.PluginInstanceAppDomain - Stopping plugin: "NPM Business Layer"

              2018-02-21 23:01:20,860 [83] ERROR SolarWinds.BusinessLayerHost.PluginInstanceAppDomain - Error detected while stopping "NPM Business Layer" plugin: System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '01:00:00'. ---> System.IO.IOException: The write operation failed, see inner exception. ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '01:00:00'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host

                 at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)

                 at System.ServiceModel.Channels.SocketConnection.Write(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout)

                 --- End of inner exception stack trace ---

                 at System.ServiceModel.Channels.SocketConnection.Write(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout)

                 at System.ServiceModel.Channels.BufferedConnection.WriteNow(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, BufferManager bufferManager)

                 at System.ServiceModel.Channels.BufferedConnection.Write(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout)

                 at System.ServiceModel.Channels.ConnectionStream.Write(Byte[] buffer, Int32 offset, Int32 count)

                 at System.Net.Security._SslStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)

                 at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)

                 --- End of inner exception stack trace ---

                 at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)

                 at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)

                 at System.ServiceModel.Channels.StreamConnection.Write(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout)

                 --- End of inner exception stack trace ---

               

               

              Server stack trace:

                 at System.ServiceModel.Channels.StreamConnection.Write(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout)

                 at System.ServiceModel.Channels.StreamConnection.Write(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout, BufferManager bufferManager)

                 at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnSendCore(Message message, TimeSpan timeout)

                 at System.ServiceModel.Channels.TransportDuplexSessionChannel.OnSend(Message message, TimeSpan timeout)

                 at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout)

                 at System.ServiceModel.Dispatcher.DuplexChannelBinder.Request(Message message, TimeSpan timeout)

                 at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)

                 at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)

                 at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

               

               

              Exception rethrown at [0]:

                 at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)

                 at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)

                 at SolarWinds.InformationService.Contract2.IInformationService.Delete(String uri)

                 at SolarWinds.InformationService.Contract2.InfoServiceProxy.DeleteImpl(String uri)

                 at SolarWinds.InformationService.Contract2.InfoServiceProxy.Delete(String uri)

                 at SolarWinds.NPM.BusinessLayer.NPMBusinessLayerPlugin.Stop()

                 at SolarWinds.BusinessLayerHost.Contract.BusinessLayerPlugin.Stop()

                 at SolarWinds.BusinessLayerHost.PluginInstanceAppDomain.StopPlugin(BusinessLayerPlugin plugin)
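
One way to approach the "error while stopping vs. stopping due to an error" question is to pull the nested exception chain and the SolarWinds frames out of the trace. The sketch below is not part of any SolarWinds tooling; the function names `exception_chain` and `frames_with_prefix` are made up for illustration. It assumes the usual .NET layout, where inner exceptions are joined with ` ---> ` and the most recent call is listed first. In the trace above, the SolarWinds frames run from `StopPlugin` up to `InfoServiceProxy.Delete`, which suggests (but does not prove) that the socket error was raised during a stop that was already in progress.

```python
import re

# Matches a .NET exception type name followed by a colon,
# e.g. "System.Net.Sockets.SocketException:".
EXCEPTION_TYPE = re.compile(r"([A-Za-z_][\w.]*Exception):")


def exception_chain(message):
    """Return exception type names, outermost first.

    Assumes nested exceptions are joined with ' ---> ', as in the
    log entry quoted above.
    """
    chain = []
    for part in message.split(" ---> "):
        match = EXCEPTION_TYPE.search(part)
        if match:
            chain.append(match.group(1))
    return chain


def frames_with_prefix(lines, prefix="SolarWinds."):
    """Return stack frames ('at ...' lines) whose method starts with prefix."""
    frames = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("at " + prefix):
            frames.append(stripped[3:])  # drop the leading "at "
    return frames
```

The last entry returned by `exception_chain` is the innermost cause, and the last frame returned by `frames_with_prefix` is the outermost caller.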

                • Re: Services stop for no reason
                  neomatrix1217

                  Did it happen again today? I would agree with dgsmith80 and open a support ticket.

                  • Re: Services stop for no reason
                    cahunt

                    Core.BusinessLayer.Log might have some correlating messages around (possibly before) that time of 11:01 PM. If you can find an associated message with a node ID, name, or IP, it might point you to the device that did not respond so well. If the device is the issue, you would more than likely see evidence in the device log files as well.

                     

                    Logs have been adjusted in new releases, so for a quick check find the log files with the next time stamp after your 11 PM spot and see if there are errors just before or at 11 PM that point you in a certain direction.

                    *I would start with the BusinessLayerHost-SOLARWINDS.Core.Businesslayer.dll.log; then check other logs that look like they might have entries (based on timestamp).

                     

                    If you have any jobs running at that time, check your job logs.
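
The timestamp check described above can be roughed out in a short script. This is only a sketch: it assumes the other log files use the same entry header as the excerpt earlier in the thread (`date time,ms [thread] LEVEL logger - message`), the log folder is passed in as an argument because its location varies by install, and the names `parse_entries` and `entries_near` are illustrative, not SolarWinds APIs.

```python
import re
import sys
from datetime import datetime, timedelta
from pathlib import Path

# Entry header as seen in the excerpt, e.g.
# "2018-02-21 23:01:19,610 [83] INFO  Some.Logger - message text"
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) \[(\d+)\] (\w+)\s+(\S+) - (.*)$"
)


def parse_entries(lines):
    """Group raw lines into entries; non-header lines (stack frames)
    are attached to the entry that precedes them."""
    entries = []
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamp += timedelta(milliseconds=int(match.group(2)))
            entries.append({
                "time": stamp,
                "level": match.group(4),
                "logger": match.group(5),
                "message": match.group(6),
                "extra": [],
            })
        elif entries and line:
            entries[-1]["extra"].append(line)
    return entries


def entries_near(entries, center, before_minutes=15, after_minutes=5,
                 levels=("WARN", "ERROR", "FATAL")):
    """Keep entries at the given levels inside a window around center."""
    start = center - timedelta(minutes=before_minutes)
    end = center + timedelta(minutes=after_minutes)
    return [e for e in entries
            if start <= e["time"] <= end and e["level"] in levels]


if __name__ == "__main__" and len(sys.argv) > 1:
    log_dir = Path(sys.argv[1])                  # folder holding the *.log files
    center = datetime(2018, 2, 21, 23, 1, 19)    # time of the "Stopping plugin" entry
    for path in sorted(log_dir.glob("*.log")):
        with open(path, errors="replace") as handle:
            for entry in entries_near(parse_entries(handle), center):
                print(path.name, entry["time"], entry["level"], entry["message"][:120])
```

Widening `before_minutes` is the obvious knob if nothing shows up in the default window; the window sizes here are arbitrary starting points.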

                • Re: Services stop for no reason
                  David Smith

                  Hi Garreth,

                  These cases are hard to diagnose without digging through your logs. The best approach might be to run a diagnostic and open a support case with SolarWinds so they can examine your logs for you. Alternatively, if you have a maintenance partner, you can contact them to do something similar.

                  • Re: Services stop for no reason
                    th3rush

                    Thanks for the replies all.

                     

                    In response to them - the issue did not occur again today. This is what I suspected might happen; the issue looks like a one-off quirk of some sort.

                     

                    There are no jobs running at that time - the first job, which is switch config backups, kicks off around 1am, so there's room to spare on that one.

                     

                    Agree support ticket is likely the best route on this sort of issue.

                     

                    Thanks again for the feedback - good to know you're not going mad when looking at "fun" issues