57 Replies Latest reply on Jun 25, 2008 5:48 PM by leemcclendon

    Netflow v3 Service Crash

    bbusbey

      My Netflow v3 service crashed today.


       Event Type: Error
      Event Source: NetFlowService
      Event Category: None
      Event ID: 0
      Date:  4/30/2008
      Time:  9:53:58 AM
      User:  N/A
      Computer: ORION1-CORP
      Description:
      Critical error in NetFlow listener: Exception of type 'System.OutOfMemoryException' was thrown.


      For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


      Event Type: Error
      Event Source: NetFlowService
      Event Category: None
      Event ID: 0
      Date:  4/30/2008
      Time:  9:54:01 AM
      User:  N/A
      Computer: ORION1-CORP
      Description:
      Exception: error occurred during packet processing. Exception of type 'System.OutOfMemoryException' was thrown.


      For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


      Event Type: Information
      Event Source: NetFlowService
      Event Category: None
      Event ID: 0
      Date:  4/30/2008
      Time:  9:54:09 AM
      User:  N/A
      Computer: ORION1-CORP
      Description:
      NetFlow Receiver Service [ORION1-CORP] Stopped


      For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


        • Re: Netflow v3 Service Crash
          mark wiggans

           Are you actually seeing that the system is running out of memory?  Is the service growing in memory size during the day?  We need to determine whether this is a memory leak or just an exception caused by some other resource being exhausted.  Are there a lot of flows coming to the receiver?  If so, about how many?

          • Re: Netflow v3 Service Crash
            davidmaltby

            This may be an issue that we saw in the beta where the SQL Server wasn't able to keep up with all the NetFlow traffic that we were trying to store.  The database bottleneck was causing NTA to queue up more and more data that it was trying to feed to the database, but over time it caused the NTA service to grow and grow in memory size until the system finally ran out of memory.

              • Re: Netflow v3 Service Crash
                bbusbey

                David,


                This could be the problem: netflowservice.exe is now at 836 MB.


                • Re: Netflow v3 Service Crash
                  bbusbey

                  David,


                  I can't use NetFlow and I'm still waiting for support to call me.


                  What is going on? 


                   


                    • Re: Netflow v3 Service Crash
                      mark wiggans

                       We just received your diagnostics and from the first indications it appears that the transaction log is full and subsequently kicking off the remaining errors. The DEV team should have a better understanding of what is causing this shortly.

                        • Re: Netflow v3 Service Crash
                          davidmaltby

                          Yes, reviewing your diagnostics, I see that your Summary2 table has over a billion rows.  Because of the issue in 2.2.1 where data in Summary2 never got collapsed into Summary3 and hence was never groomed out of the system, this table has grown to this size.  I've given your tech support rep instructions and T-SQL on how to clean this up.  Keep in mind that the T-SQL that I sent will groom all data older than 7 days from all 3 summary tables.  Running that will be a big hit on your database.  I've written the script so that you can adjust those numbers.  So that you don't impact your transaction log greatly, you may first want to delete all data that is 180 days or older.  Then adjust the numbers down and run the script again.  Keep adjusting down until you reach your desired 7 days or older.
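The staged grooming described above (start at 180 days, then step down to 7) can be sketched as a list of retention thresholds to apply one run at a time. This is only an illustration: the function name and the 30-day step size are assumptions, not part of the script that was sent through support.

```python
def retention_steps(start_days, target_days, step_days):
    """Return the sequence of "delete older than N days" thresholds.

    Hypothetical helper: each value is meant to be plugged into the
    grooming script for one run, so that no single run deletes
    everything at once.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    if target_days > start_days:
        raise ValueError("target_days must not exceed start_days")

    steps = []
    days = start_days
    while days > target_days:
        steps.append(days)
        days -= step_days
    steps.append(target_days)  # always finish on the desired retention
    return steps


if __name__ == "__main__":
    # 180 days first, then smaller windows, ending at 7 days.
    print(retention_steps(180, 7, 30))
```

A smaller step means more runs but less transaction log growth per run.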


                           


                          Thanks,


                          David


                            • Re: Netflow v3 Service Crash
                              bbusbey

                              David,


                              The grooming did not solve the problem. NetflowService continues to build up memory and crash.


                                • Re: Netflow v3 Service Crash
                                  davidmaltby

                                  Can you open Microsoft's PerfMon by going to Start\Run and typing in PerfMon?  Then add a new counter called SolarWinds\Raw Packet Queue.  Start the service and see if it keeps rising as the memory goes up.  This will indicate whether the NTA service is receiving so much flow traffic that it can't send it to the SQL database fast enough.


                                   


                                  Thanks,


                                  David


                                    • Re: Netflow v3 Service Crash
                                      bbusbey

                                      What object is it under?

                                      • Re: Netflow v3 Service Crash
                                        bbusbey

                                        I need to get Netflow working ASAP!


                                        Are you going to have a patch for v3 today?


                                        Can I go back to v. 2.2.1 without damage? 


                                          • Re: Netflow v3 Service Crash
                                            davidmaltby

                                            So far, yours is an isolated case.  We still need to figure out the cause of your crash.  I've provided an image of the performance counter that I'm talking about.  I am not yet convinced that this isn't just an issue where SQL Server can't keep up with NTA.  We probably have a slight performance problem with the NTA service because of the new support for sFlow, v9, etc.  With the large number of interfaces that you're collecting from and possibly a SQL Server that was near its resource limits, we think that this degradation in performance is enough to cause the NTA service's queue that sends its data to SQL Server to become overrun.


                                            Yes, you will be able to go back to 2.2.1.


                                            If you could just look at the attached image and try to find the performance counter that I mentioned before downgrading, we would appreciate it.


                                              • Re: Netflow v3 Service Crash
                                                davidmaltby

                                                There is one other thing that we could try, that I forgot about.  In the NetFlowService.exe.config file, there is a line which looks like the following:

                                                <processingPool threadsPerProcessor="10" packetQueueSize="0" />

                                                Now that I think about it, we didn't have the 10-thread limit on thread pool size in 2.2.1.  This may actually be the bottleneck.  Could you try increasing that to 20?  That should allow NTA to process the queue more quickly.  The second thing that you could change here is to set a limit on the size that the queue can grow to.  For example, to set the limit to 500 MB, the line should be changed to <processingPool threadsPerProcessor="10" packetQueueSize="500" />

                                                 I'm thinking that if we just change the thread pool limit, you might not experience the crash anymore.  Can you try that before going back to 2.2.1?

                                                 

                                                Thanks,

                                                 

                                                 

                                                 

                                                 

                                                  • Re: Netflow v3 Service Crash
                                                    denny.lecompte

                                                     Bruce,

                                                    I spoke with Support, and you do have a ticket open.  David M. is working with Support on your issue through that ticket.  We will probably need more info from you, but we should keep the conversation in Support so that all of the information is tracked with your ticket.  Thwack is great for some problems, but you've got a serious and urgent issue, and Thwack is not the best venue for getting that resolved.

                                                    Could you send the information you've posted here by email through your support ticket?

                                                     

                                                    Thanks,

                                                    Denny

                                                     


                                                     

                                                      • Re: Netflow v3 Service Crash
                                                        jeff.stewart

                                                        Any word on this?  I have a case open already but I'm really trying to get this fixed ASAP.  My NetFlow service seems to be crashing every two hours or so.

                                                          • Re: Netflow v3 Service Crash
                                                            Andy McBride

                                                            Hi Jeff,


                                                            Customer Support and Dev are working this as a top priority. I'll continue to track and do whatever I can to help you get NTA stable again.


                                                            Andy

                                                              • Re: Netflow v3 Service Crash
                                                                davidmaltby

                                                                Jeff,


                                                                We have tested a fix for one customer and it allowed him to run 27 hours, instead of the few hours like you are seeing.  Not sure if his new issue is related or not.  It appears to be different and we're investigating that now.  There is one thing that you can play around with to help mitigate the issue...


                                                                In the NetFlowService.exe.config file, look for the following:


                                                                <processingPool threadsPerProcessor="10" packetQueueSize="0" />


                                                                 If you raise the threadsPerProcessor value, more CPU will be used to help keep the memory from growing.  Also, if you set packetQueueSize to anything other than 0, it will put a maximum, in MB of memory, on how far this one queue in the service can grow.  Any data collected over that max queue size will be thrown away until the queued data gets sent to the SQL server.  Don't know if that is good enough.


                                                                Playing with threadsPerProcessor is probably your best bet for now.  We were using numbers like 40 and 50 with the other customer to help him out, but there will still need to be a fix to improve performance here.
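To make the queue behaviour described above concrete, here is a minimal simulation sketch. It is not NTA's actual implementation: it counts abstract items per tick instead of MB, and the function and parameter names are invented for illustration. It only shows why an unbounded queue (packetQueueSize="0") keeps growing whenever data arrives faster than it is written out, and how a cap trades that growth for dropped data.

```python
def simulate_queue(arrivals_per_tick, drained_per_tick, ticks, max_items=0):
    """Return (queue_length, dropped) after `ticks` steps.

    max_items=0 means "no limit", mirroring packetQueueSize="0".
    All units are abstract items; the real setting is in MB.
    """
    queued = 0
    dropped = 0
    for _ in range(ticks):
        queued += arrivals_per_tick
        if max_items and queued > max_items:
            # Anything over the cap is thrown away.
            dropped += queued - max_items
            queued = max_items
        # The consumer (the database writer) removes what it can.
        queued = max(0, queued - drained_per_tick)
    return queued, dropped


if __name__ == "__main__":
    print(simulate_queue(10, 8, 100))                # unbounded: keeps growing
    print(simulate_queue(10, 8, 100, max_items=50))  # capped: bounded, but drops
    print(simulate_queue(8, 10, 100))                # consumer keeps up
```

In this model, raising the drain rate (the role threadsPerProcessor is meant to play) is the only change that avoids both growth and loss. The cap just limits memory.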


                                                                 


                                                                Thanks,


                                                                  • Re: Netflow v3 Service Crash
                                                                    jeff.stewart

                                                                    threadsPerProcessor is set to 50 now.  I'll let you know if the service crashes again.  Thanks.

                                                                    • Re: Netflow v3 Service Crash
                                                                      achrich

                                                                      My service grew to over 1.5 GB of RAM and I'm only monitoring 3/4 routers with it.

                                                                       
                                                                      I've asked for this in previous posts but didn't get any response: can SolarWinds provide any detail on rolling back to version 2.2.1?

                                                                      Frankly, the new "features" are not worth the performance hit and crippling NPM completely.  I've had to disable it now just so we can get basic alerts and monitoring working again.

                                                                       

                                                                      Also, will SolarWinds be releasing to all customers any scripts that reduce the size of the database from the previous versions (i.e. orphaned data, etc.)?

                                                                       

                                                                       

                                                                       
                                                                       

                                                                        • Re: Netflow v3 Service Crash
                                                                          davidmaltby

                                                                          Tech support has assisted with rolling back to 2.2.1.  I'd call in a ticket for that. 


                                                                          As for reducing the size of the database, where is most of your data, based on row counts?  Are you really just talking about orphaned data?  If so, to get a list of monitored detail tables, you can run the following T-SQL:


                                                                          -- List the detail tables that belong to enabled NetFlow sources
                                                                          SELECT DISTINCT 'NetFlowDetail_' + CONVERT(nvarchar, i.NodeId)
                                                                          FROM NetFlowSources nfs
                                                                          JOIN Interfaces i ON nfs.InterfaceID = i.InterfaceID AND nfs.Enabled = 1


                                                                           All you need to do then is do a TRUNCATE TABLE on the tables NOT in this list that match the table name like NetFlowDetail_%
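The "tables NOT in this list" step is a set difference, sketched below. The helper names are hypothetical, and both input lists would have to come from your own database (the monitored list from the query above, the full list from the database's table catalog). The sketch only builds the statements as text so they can be reviewed before anything is run.

```python
def orphaned_detail_tables(all_tables, monitored_tables, prefix="NetFlowDetail_"):
    """Return detail tables that exist but are not in the monitored list."""
    monitored = set(monitored_tables)
    return sorted(
        name for name in all_tables
        if name.startswith(prefix) and name not in monitored
    )


def truncate_statements(tables):
    """Build (but do not execute) one TRUNCATE TABLE statement per table."""
    return [f"TRUNCATE TABLE [{name}]" for name in tables]


if __name__ == "__main__":
    # Example data only; real names depend on your node IDs.
    existing = ["NetFlowDetail_1", "NetFlowDetail_2", "NetFlowDetail_7", "NetFlowSummary1"]
    monitored = ["NetFlowDetail_2"]
    for statement in truncate_statements(orphaned_detail_tables(existing, monitored)):
        print(statement)
```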


                                                                           Otherwise, if most of your data is in the summary tables, then you can groom them by running the following T-SQL


                                                                          -- Number of days of data to keep in each summary table
                                                                          DECLARE @DeleteDaysOlderSummary1 int
                                                                          DECLARE @DeleteDaysOlderSummary2 int
                                                                          DECLARE @DeleteDaysOlderSummary3 int


                                                                          SET @DeleteDaysOlderSummary1 = 7
                                                                          SET @DeleteDaysOlderSummary2 = 7
                                                                          SET @DeleteDaysOlderSummary3 = 7


                                                                          -- Delete every row older than the retention window
                                                                          DELETE FROM NetFlowSummary1 WHERE StartTime < DATEADD(day, -1 * @DeleteDaysOlderSummary1, GetDate())
                                                                          DELETE FROM NetFlowSummary2 WHERE StartTime < DATEADD(day, -1 * @DeleteDaysOlderSummary2, GetDate())
                                                                          DELETE FROM NetFlowSummary3 WHERE StartTime < DATEADD(day, -1 * @DeleteDaysOlderSummary3, GetDate())


                                                                          Set the variables to any number of days that you'd like to keep for the 3 different summary tables before running this.  Now if the row counts in the Summary tables are many hundreds of millions, then you might not want to DELETE all those rows at once.  If so, then I can get you some modified T-SQL that will delete it in chunks of rows at a time.  This will help prevent your transaction log from growing like crazy.
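The modified T-SQL mentioned above is not shown in this thread, so the following is not that script. It is a sketch of the general idea of deleting in chunks, using Python's built-in sqlite3 module as a stand-in database so that it can be run anywhere. The table and column names are borrowed from the script above, and the chunk size and integer day numbers are made up for the demo. On SQL Server the same loop would more likely be written with DELETE TOP (n), but that variant is not tested here.

```python
import sqlite3


def delete_in_chunks(conn, table, cutoff, chunk_size):
    """Delete rows with StartTime < cutoff in batches; return rows deleted.

    Each batch is committed on its own, so no single transaction has to
    cover the whole delete. `table` must be a trusted name because it is
    interpolated into the SQL text.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f"SELECT rowid FROM {table} WHERE StartTime < ? LIMIT ?)",
            (cutoff, chunk_size),
        )
        conn.commit()  # end this batch's transaction
        if cur.rowcount == 0:
            break  # nothing older than the cutoff is left
        total += cur.rowcount
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE NetFlowSummary2 (StartTime INTEGER)")
    # Demo data: one row per "day", numbered 0..99.
    conn.executemany("INSERT INTO NetFlowSummary2 VALUES (?)", [(d,) for d in range(100)])
    conn.commit()
    print(delete_in_chunks(conn, "NetFlowSummary2", cutoff=93, chunk_size=10))
```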


                                                                          Thanks,


                                                                          • Re: Netflow v3 Service Crash
                                                                            Andy McBride

                                                                            achrich,


                                                                            Remove 3.0 and install 2.2.


                                                                            Andy

                                                                              • Re: Netflow v3 Service Crash
                                                                                achrich

                                                                                 David,

                                                                                Is there any way we can just get rid of all the data in NetFlow and effectively start again?

                                                                                We don't use it for anything other than real-time checks anyway, so history is not important to us.

                                                                                I'm happy to wait for 3.0 SP2.

                                                                                Cheers

                                                                                  • Re: Netflow v3 Service Crash
                                                                                    davidmaltby

                                                                                    I would set the values for compression and data retention down to their minimum values.  To do that, go to the NetFlowGlobalSettings table and change the following rows to the following values:


                                                                                    RetainUncompressedDataInMinutesDefault = 16 (minutes)
                                                                                    CollapseTrigger2InHours = 2 (hours)
                                                                                    CollapseTrigger3InDays = 2 (days)
                                                                                    RetainCompressedDataInDays = 3 (days)


                                                                                    After changing these values, restart the NetFlow service.


                                                                                    Hopefully you won't have to wait for SP2 very long.  It is about to go to QA for testing.


                                                                                     


                                                                                    Thanks,


                                                                                    David

                                                    • Re: Netflow v3 Service Crash

                                                      Mine is behaving similarly and our workaround was to use a scheduled task to restart the service every so often (daily in this case).  The counter you mentioned has, after about 19 hours, climbed to ~206,000+ and is growing about 5-10K per hour.  I am only collecting flows from three routers though, so our traffic is not what it would be if we collected traffic from more devices.  Only one of the routers has any real traffic on it.  The other two have very little at all.


                                                      Before the daily service restarts, it would run 3-4 days and have allocated ~1.3GB of RAM and the Orion machine would begin to grind to a halt due to the low free memory condition.


                                                      I will set the threads to 20 and see if things change for the better...


                                                       


                                                      For what it is worth, the NetFlow V5 Packet counter is running only 2-3 packets per second and the flows are running less than 100  (~75 now) and yet the numbers for the other counter and the memory usage are still climbing...

                                                        • Re: Netflow v3 Service Crash

                                                          After changing the thread count to 20, the number of threads in the NetFlow Service went from a solid 30 (according to PerfMon) to fluctuating between 42-45.  Still, though, the Raw Packet Queue length grows...  It is up to about 1000 after only a few minutes (~5).


                                                           


                                                          Setting it to 30 raised the process thread count to 52 and still the packet queue is rising.  I raised it to 50 and that brought the thread count up to 73 and still the packet count is rising...  The rate of rise in the Raw queue does not seem to be much affected by this change...


                                                           


                                                           


                                                            • Re: Netflow v3 Service Crash
                                                              davidmaltby

                                                              We should have the service pack out very soon, it is currently in testing and it addresses this issue.  Can you run the SQL Server Profiler for a minute or so and attach that to this thread?  I would like to make sure that our fix to the service addresses your issue.


                                                               Thanks,


                                                                • Re: Netflow v3 Service Crash
                                                                  jeff.stewart

                                                                  Looks like we are still having the same problem even after changing the thread count and with the new config file.  It does allow the service to run longer without crashing.  This last time we went two days.  Have you all figured anything else out?

                                                                    • Re: Netflow v3 Service Crash
                                                                      Andy McBride

                                                                      SP1 is in test now and should address the issue

                                                                        • Re: Netflow v3 Service Crash
                                                                          davidmaltby

                                                                          Right.  We have identified a performance issue and corrected it for SP1.  Changing the thread count is a means of using extra CPU cycles to try to mitigate the performance problem, but it does not prevent the performance slowdown.  It took a code change in the service to fix that problem.


                                                                          Thanks,


                                                                            • Re: Netflow v3 Service Crash

                                                                              Do you know yet when it will actually be released? It is annoying having to restart the service constantly, not to mention the loss of data when I forget to check it.

                                                                                • Re: Netflow v3 Service Crash

                                                                                  As a workaround, I just use a scheduled task to stop and restart the service before it runs the server out of memory...


                                                                                   Not perfect, but lets me sleep better for now.  My real concern in this is that if I am getting ~350K unprocessed blocks of data when I restart, am I really losing that information?  If so, what information is it that is being lost?

                                                                                    • Re: Netflow v3 Service Crash

                                                                                      For what it is worth, SP1 does not seem to really help much if any...  The service restarted at 4am this morning and already has ~45K Raw packets in the queue...  @ 30 flow updates per UDP/PDU packet that is ~1.3 million flow updates not being processed so far today...  This is with an input packet rate of only 8 packets per second...?  That seems a mighty low flow rate to cause such a backup of data on the SQL side...
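The back-of-the-envelope numbers in the post above can be checked with a few lines. The figure of 30 flow records per packet is the poster's own assumption, and the function names are made up for this sketch.

```python
def backlog_flow_records(queued_packets, records_per_packet=30):
    """Flow records sitting in the queue, assuming a fixed count per packet."""
    return queued_packets * records_per_packet


def seconds_of_input(queued_packets, packets_per_second):
    """How many seconds of incoming traffic the queued packets represent."""
    return queued_packets / packets_per_second


if __name__ == "__main__":
    queued = 45_000  # "~45K Raw packets in the queue"
    print(backlog_flow_records(queued))         # 1350000 flow records
    print(seconds_of_input(queued, 8) / 60.0)   # 93.75 minutes of input at 8 packets/s
```

At 8 packets per second, 45,000 queued packets correspond to about 94 minutes of incoming traffic, or roughly 1.35 million flow records at 30 records per packet.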

                                                                                        • Re: Netflow v3 Service Crash

                                                                                          Are you seeing a significant increase in SQL CPU utilization since NetFlow V3?

                                                                                            • Re: Netflow v3 Service Crash

                                                                                              Indirectly. What we are seeing is an increase in I/Os.  This SQL server we are hitting now is very small and would have a much busier CPU if the I/O subsystem could handle more workload.  Possibly it is too small given the way it is behaving, but with only 8 Netflow packets per second from only 1 production router we didn't think we would kill the SQL server like we are.


                                                                                               


                                                                                              These are just my opinions though...

                                                                                                • Re: Netflow v3 Service Crash

                                                                                                  We are giving up.  Our NetFlow service must now be disabled until a fix is found.


                                                                                                  NetFlow V3 with or without SP1 is causing so many performance and timeout issues with deadlocked processes, a SQL server that grinds to a halt due to the large number of I/Os, etc...  We get timeouts on startup with the services,  also VoIP and Orion Website hangs, etc... that the pain is greater than the benefit, potential or otherwise...


                                                                                                   A performance fix or whatever in SP2 would be much appreciated as the product right now is simply no longer usable for us with even one router on a 10Mbps WAN pipe...  To me at least this seems a pretty serious problem for such a small traffic flow rate...