This can be down to so many things, but I can tell you it is not down to NetPath, as I have tested this numerous times. The most common causes of issues like this are a misconfigured load balancer, a proxy server, or even an incorrect VLAN configuration. I would set up a new probe in a different location from the destination and see if you get the same issue.
I kinda agree that it has to be something between Orion and the destination.
No load balancers or proxy servers involved - just an ASA with no NAT between our sites on an internal (not internet) MPLS WAN. We have 3 different Orion installs at different WAN sites with different local topologies.
We do use LACP to connect Orion to a dual core switch at each location, and we see the expected 2 cores on the diagram. But that's the 1st hop. The problem I have is with the last hop.
I finished upgrading the rest of the Orion modules (NCM, NTA, SAM) on one install, and the behavior did not change. I also added the following to the ASAs:
set connection decrement-ttl
so that the hop for the ASA is visible.
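For context, that command lives under a policy-map's connection settings on the ASA. A hedged sketch of where it fits (the policy-map and class names here are the common defaults, not taken from this thread; adjust to your own config):

```
policy-map global_policy
 class class-default
  set connection decrement-ttl
```

With TTL decrement enabled, the ASA shows up as its own hop in traceroute-style probing instead of being invisible in the path.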
I've tried the following ports so far:
None of those have inspects on an ASA. So I tried ASA `inspect netbios` (TCP 139). No joy.
So I added an additional poller that has a single NIC, and the last-hop symptom disappeared. So I suspect the problem is with the LACP EtherChannel on the Orion server. Our LACP is set to hash on source and destination IP. That might be a curve ball for NetPath.
Other than the oddity on the last hop, I really love NetPath. I do think NetPath will reveal other issues like this besides delays.
Yes, it is very good alright, I love the ability to see the hops within the ISP. Try from another probe. It is very good for diagnosing drops on the ATM path and verifying VLAN configuration as well.
It does appear to be related to LACP on the Orion probe server. Our switch chassis is set:
# show port-channel load-balance
Non-IP: src-dst mac
IP: src-dst ip rotate 0
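To illustrate why src-dst IP hashing matters here, below is a minimal sketch of how that kind of load balancing pins each flow to one port-channel member link. The XOR-and-modulo function is purely illustrative, not Cisco's actual hash algorithm:

```python
# Hedged sketch: "src-dst ip" port-channel load balancing picks a member
# link from the source/destination IP pair. The hash below is illustrative
# only; real switch ASICs use their own (undocumented) functions.
import ipaddress

def member_link(src_ip: str, dst_ip: str, num_links: int = 2) -> int:
    """Pick a port-channel member index from the src/dst IP pair."""
    s = int(ipaddress.ip_address(src_ip))
    d = int(ipaddress.ip_address(dst_ip))
    return (s ^ d) % num_links

# Every packet between a given src/dst pair hashes to the same member link,
# so all NetPath probes to one target leave on one physical port, while
# probes to a different target may take the other member.
print(member_link("10.1.1.10", "10.9.9.1"))
print(member_link("10.1.1.10", "10.9.9.2"))
```

The practical consequence: a probe server on an LACP bundle can have its traffic to one destination consistently placed on one member link, which single-NIC probe servers never experience.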
I tried killing one ethernet port of the LACP port channel of a different probe server, and it made no difference.
A probe server connected to edge switches with a single NIC works fine to the same destinations.
I have opened a medium priority support case # 1048405 on this.
I'll post results as time permits.
Any differences in the transit to the first L3 hop (including the first L2 segment, as you're talking about) should not impact the performance numbers NetPath provides for the last hop.
Out of curiosity, are the min, median, and max latency values for the endpoint within 20% of each other, or do they vary wildly?
I suspect we'll have to look at packet captures to solve this, so Support case as you've done is probably the best bet.
No difference to 1st hop - under 1msec.
I have a case open. We turned on some debugs in Orion, and did a couple of packet captures from the Orion server. I'm awaiting results.
I learned some stuff - like how to enable debug level logging in Orion (run LogAdjuster.exe), and how to see the NetPath debugs on the web page (append ?debug to the URL for the particular "service").
I don't like calling them "services". I'd rather call them "operations" or "instances".
ah, what device is the latency showing on?
The last device.
Oh I know, but is it a firewall, router, or WAN switch?
I think I found a fix for this particular symptom:
Edit C:\ProgramData\Solarwinds\Orion\NetPath\NetPathAgent.cfg and change:
"SendResetPerRoundStandard" to true. It is normally false.
It's the very last setting in the file. I then stopped and started all Orion services via Orion Service Manager, though I wish I knew exactly which ones need to be restarted after an edit to this file.
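For reference, the relevant fragment of NetPathAgent.cfg would look like the sketch below once edited. Only the one key is taken from this thread; the file contains many other settings that are omitted here:

```json
{
  "SendResetPerRoundStandard": true
}
```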
We had noticed that our ASAs were logging hits from Orion to the NetPath target as TCP SYN attacks, as part of the ASA Attack Guard settings. It's not clear why this should happen more with LACP-connected servers than single-NIC servers. But now the ASA is happy, and NetPath is happy.
This might help with other NetPath operations traversing an ASA with Attack Guards enabled. It's worth a try.
Yes. I came across a similar case before and used the same workaround.
The cause of this is:
1. NetPath standard probing sends multiple rounds of SYN packets to the endpoint. The endpoint responds to each SYN with a SYN-ACK.
2. Typically, the OS on the NetPath probe catches the SYN-ACK and responds with a RESET packet, which clears the half-open connection on the ASA. So the ASA treats the next SYN as a new half-open connection.
3. But in some rare cases, the OS doesn't send the RESET. The ASA then sees an extra SYN for the existing half-open connection and may consider it an anomaly.
The "SendResetPerRoundStandard" flag forces the NetPath probe to send the RESET packet itself, so it clears out the half-open connection on the ASA.
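The three steps above can be sketched as a toy model of the ASA's half-open (embryonic) connection bookkeeping. This only tracks SYN/RST accounting to show why a missing RESET trips the anomaly counters; real TCP intercept behavior is considerably more involved:

```python
# Hedged toy model: why a probe that never sends RST makes the ASA's
# embryonic-connection tracking flag repeated SYNs as anomalous.

class EmbryonicTable:
    def __init__(self, limit: int = 100):
        self.half_open = set()   # (src, dst, sport, dport) 4-tuples
        self.limit = limit
        self.anomalies = 0

    def syn(self, flow):
        if flow in self.half_open:
            # an extra SYN for an existing half-open connection:
            # the ASA may count this as an attack indication
            self.anomalies += 1
        elif len(self.half_open) >= self.limit:
            self.anomalies += 1
        else:
            self.half_open.add(flow)

    def rst(self, flow):
        # a RESET clears the half-open entry, so the next probing
        # round's SYN looks like a brand-new connection
        self.half_open.discard(flow)

flow = ("10.1.1.10", "10.9.9.1", 49152, 443)

# probe rounds WITHOUT a per-round RST: rounds 2 and 3 look anomalous
asa = EmbryonicTable()
for _ in range(3):
    asa.syn(flow)
print(asa.anomalies)    # 2

# probe rounds WITH SendResetPerRoundStandard-style behavior: all clean
asa2 = EmbryonicTable()
for _ in range(3):
    asa2.syn(flow)
    asa2.rst(flow)
print(asa2.anomalies)   # 0
```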
Embryonic TCP sessions that traverse an ASA will remain open for minutes, I believe. Since we use TCP intercept to defend against DoS attacks, there's a limit to how many embryonic sessions. The Orion server has TCP sessions to other things via the ASA as well. That's why some sites may see this problem, and others may not.
Sessions that have been ACKed stay open longer. So the ASA messes with your head when you try to model end-to-end TCP sessions built from crafted packets.
Look at embryonic sessions and `show threat-detection statistics tcp-intercept`:
A lot depends on how many packets are sent how often, and how far into the path the ASA is. In our case, it's pretty early for a site that's 10 hops or more away.
The one puzzling thing is how this differs on LACP from a single NIC probe node on an edge switch. I would have needed to capture packets at both the ASA and the probe node to see. Since our LACP hashes based on src and dest IP, returns from intermediate hops where TTL has expired may traverse a different path in the core switch. The ASA has
Earlier in the problem, I had also enabled decrement-ttl to see the ASA itself:
set connection decrement-ttl
That's not the default setting.
I also tried enabling and disabling inspect icmp and inspect icmp-error. I saw no difference.
All this also inspires me to see if I can do a UnDP for TCP Intercepts to catch stuff like this, plus real DoS attacks, faster.
Aha, it seems enabling that setting, rather than leaving it off, makes us into the good net citizen we should be.
I don't know what version of ASA code you're running, but in later versions, the embryonic limits don't control a hard cap but instead control when TCP cookies are used instead of TCP intercept. With TCP cookies, the ASA derives the state information of a TCP session from an ack (as the third packet in the TCP 3 way handshake) so that it doesn't have to store state information. This allows the ASA to protect from syn floods up to the limits of its CPU and bandwidth, rather than the limits of some state table.
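The SYN-cookie idea described above can be sketched in a few lines: the protecting device derives its initial sequence number from the connection 4-tuple and a secret, so it stores no per-connection state until a valid ACK arrives. Real implementations (per RFC 4987) also encode a timestamp and an MSS index; this is a deliberately simplified model with hypothetical values:

```python
# Hedged sketch of stateless SYN-cookie validation: state is derived
# from the ACK (3rd packet of the handshake) instead of being stored.
import hashlib
import hmac

SECRET = b"rotating-secret"   # assumption: a periodically rotated key

def cookie_isn(src: str, dst: str, sport: int, dport: int) -> int:
    """Initial sequence number computed from the 4-tuple and a secret."""
    msg = f"{src}|{dst}|{sport}|{dport}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")

def ack_is_valid(src: str, dst: str, sport: int, dport: int, ack: int) -> bool:
    # the final ACK must acknowledge cookie ISN + 1; if it does, the
    # session state can be (re)created from the ACK alone, no table needed
    return ack % 2**32 == (cookie_isn(src, dst, sport, dport) + 1) % 2**32

isn = cookie_isn("10.1.1.10", "10.9.9.1", 49152, 443)
print(ack_is_valid("10.1.1.10", "10.9.9.1", 49152, 443, isn + 1))  # True
print(ack_is_valid("10.1.1.10", "10.9.9.1", 49152, 443, isn + 2))  # False
```

This is why, as described above, the protection scales with CPU and bandwidth rather than with the size of a state table: a forged ACK that doesn't match the cookie is simply dropped, and nothing was ever stored for it.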
Lan and I will have to talk some more about how we deal with latency impact of syn flood protection. Just another example where simple latency numbers can be more than a little bit complex to get right!
I think you're on the right track about the LACP. After LACP, as a control protocol, selects the member links for the data plane to use, simple hashing to select the egress port is the only way the traffic is functionally affected. However, when a specific port is selected, it means the traffic hashes in a specific way that likely also determines which links are used in load-balancing decisions later along the transit path.
We generally run very recent recommended code.
I'd like to see a Cisco reference on the SYN cookies thing, if you have one handy. I may have some reading to do ...
The idea that you had to turn on a setting for NetPath to be a good net citizen bothered me. I spoke to Dev about it more. Turns out in most cases the operating system, through some twist of fate, ends up sending a reset anyway even though it was not aware of our conversation (actually, that is why it sends the reset) so NetPath is acting appropriately. However, we have found in some strange cases the operating system does not send the reset, so we added this setting so we can have NetPath do it if needed. Now I feel better.
To confirm, I opened a case with SolarWinds and sent them sniffer dumps. I learned from them about C:\ProgramData\Solarwinds\Orion\NetPath\NetPathAgent.cfg, and I experimented and changed "SendResetPerRoundStandard" to true (it is normally false), as I noted in my post above, dated Sep 30, 2016 5:43 PM.
I cannot say that I see any downside to this setting.
I don't see this issue with probes that are configured on our Orion server, but I do see it with the NetPath Agent probe running on an external PC. I set C:\ProgramData\Solarwinds\Orion\NetPath\NetPathAgent.cfg "SendResetPerRoundStandard": to true on the Orion server, but it didn't make any difference on the external NetPath probe. I don't see this setting on the external PC. Is there a way to force TCP RST on an external PC running the NetPath Agent?
Sorry, I missed your post.
Is this still acting the same way?
Sometimes when we make a setting change, I wonder how it propagates and which services must be restarted. In your case, it sounds like the external PC is an agent. I wonder if the agent needs a restart.
When in doubt, reboot the whole state .... if I could just find the command ....