
How does Node Down Work?

I'm not certain I understand how NPM determines if a node is down.

The node is pinged in a regular ping cycle (120 seconds). If the node does not respond, NPM sets the node status to Warning and starts a fast ping cycle for that node. NPM fast-pings the device for the time set in the Settings interface as Node Warning Level. This is 120 seconds by default.

That means the node would be down for 240 seconds before we get a node down alert. Am I correct? How many ping packets are sent during each ping sweep?

Any clarification would be appreciated.

-John

  • John,

    Mike's link is correct in holding the answer. Interpreting it for your environment (default ping interval of 120 seconds), you'll see this behavior and alert time (numbers indicate poll count):

    • The regular 120-second poll fails (#1, max elapsed downtime = 120 seconds, assuming the node failed immediately after the previous poll)
    • The node enters the fast-polling state and a ping is sent 10 seconds later (#2, max elapsed downtime = 130 seconds)
    • Fast polling continues (#3, 140 secs; #4, 150 secs; #5, 160 secs; #6, 170 secs; #7, 180 secs; #8, 190 secs; #9, 200 secs)
    • The 10th failed poll marks the node down and alerts (#10, 210 seconds)

    Granted, your downtime could be as small as 90 seconds if the node failed just before the first failed poll, so your range is 90-210 seconds before the alert (if my math is correct).

    --Chris
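
The arithmetic in Chris's timeline can be checked with a short sketch. The function and parameter names are hypothetical, and the count of 10 failed polls before the node is marked down is taken from Chris's reply, not from any NPM API or documented setting.

```python
from typing import Tuple


def alert_delay_range(poll_interval: int = 120,
                      fast_interval: int = 10,
                      failed_polls_to_down: int = 10) -> Tuple[int, int]:
    """Return (shortest, longest) downtime in seconds before the alert.

    Assumes, as in the timeline above, that the first failed poll is a
    regular poll and the remaining failed polls are fast polls.
    """
    # Time spent in fast polling: one fast interval per poll after the first.
    fast_phase = (failed_polls_to_down - 1) * fast_interval
    # Best case: the node fails just before the regular poll fires.
    shortest = fast_phase
    # Worst case: the node fails just after the previous regular poll.
    longest = poll_interval + fast_phase
    return shortest, longest


print(alert_delay_range())
```

With the defaults this gives the same 90 to 210 second range as the reply above.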

  • Do you know how many ping packets are sent on a normal poll and how many on a fast poll?

    Also, do you know the size of the packets? Thank you for the explanation.

    -John

    This SW KB article seems to imply one ICMP packet of up to 1 KB per poll. It also shows the bandwidth used by other polling elements. As I understand it, each poll is one packet.

    SolarWinds Knowledge Base :: How much bandwidth does SolarWinds require for monitoring?

    --Chris

  • An Orion status poll sends out one ICMP Echo Request packet. If an ICMP Echo Reply is received within the defined timeout (2.5 s by default), the node is responding and Orion marks its status as UP. If no reply is received, Orion applies its retry policy and sends the ICMP Echo Request again. The default number of retries is 1, which means Orion sends two ICMP packets, but only when the first one gets no response. The same methodology applies to fast polls as well.
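
The retry policy described above can be sketched as follows. This is only an illustration: `status_poll` and the `send_echo` callback are hypothetical names standing in for whatever actually sends the ICMP Echo Request, and only the 2.5 s timeout and the single retry come from the post.

```python
from typing import Callable, Tuple


def status_poll(send_echo: Callable[[float], bool],
                timeout: float = 2.5,
                retries: int = 1) -> Tuple[bool, int]:
    """Return (node_is_up, packets_sent) for one status poll.

    send_echo(timeout) is a hypothetical callback that sends one ICMP
    Echo Request and returns True if a reply arrives within the timeout.
    """
    packets_sent = 0
    # One initial attempt plus the configured number of retries.
    for _ in range(1 + retries):
        packets_sent += 1
        if send_echo(timeout):
            return True, packets_sent
    return False, packets_sent


# A responding node costs one packet; a silent node costs two by default.
print(status_poll(lambda timeout: True))
print(status_poll(lambda timeout: False))
```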

    The ICMP packet length is determined by the header size (fixed) and the packet payload (variable). The payload is configurable on the Polling Settings page: you can enter a custom string in the "ICMP Data" text box, and it is sent as the ICMP packet payload.

    If you lower Node Warning Level from 120 to 30, Orion will be able to fit just 3 polls into the 30 s interval. Orion remembers the time when the first fast poll was sent, adds Node Warning Level to it, and keeps sending ICMP Echo Requests until this new time is reached.

    Please note that each unanswered ICMP packet increases packet loss by 10%. If you change Node Warning Level to 30, packet loss won't be able to reach 100% within 2 minutes, as it does with the default configuration. It will grow to 30% quickly during the fast poll, but then each additional 10% increase will take 2 minutes (the default normal node status poll interval). I mention this only because the default thresholds are set for a 120 s Node Warning Level.
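
The figures in this reply can be reproduced with a rough sketch. All function names are hypothetical. The 10 s fast poll interval is taken from other replies in this thread, and the sketch assumes only fast polls count toward the packet loss reached during the warning window, which the post does not state explicitly.

```python
def fast_polls_in_window(warning_level: int = 120, fast_interval: int = 10) -> int:
    """How many fast polls fit into the Node Warning Level window."""
    return warning_level // fast_interval


def packet_loss_percent(unanswered: int, step: int = 10) -> int:
    """Packet loss after a number of unanswered packets, capped at 100%."""
    return min(100, unanswered * step)


def regular_polls_to_full_loss(warning_level: int = 30,
                               fast_interval: int = 10,
                               step: int = 10) -> int:
    """Regular polls still needed to reach 100% loss after fast polling ends."""
    loss = packet_loss_percent(fast_polls_in_window(warning_level, fast_interval), step)
    # Ceiling division without importing math.
    return -(-(100 - loss) // step)


polls = fast_polls_in_window(30)
print(polls, packet_loss_percent(polls))      # fast polls and loss after them
print(regular_polls_to_full_loss(30) * 120)   # further seconds at 120 s per regular poll
```

Under these assumptions a 30 s warning level gives 3 fast polls and 30% loss, with 7 more regular polls (14 minutes at the default interval) needed to reach 100%.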

  • Hi Cmgurley,

    Please correct me if I'm wrong, but I want to check that I have understood your explanation properly. It's like a student repeating an explanation back to the teacher so the teacher can confirm it was understood. I want to consider two scenarios: in the first, a node responds at the first poll of the polling engine; in the second, the node fails at the first poll of the polling engine.

    Also, I would like to assume a default polling interval of 120 seconds.

    Let me begin with the first scenario. During the first poll the node responds, and immediately after that it fails. This means the polling engine has to wait the full 120 seconds before conducting another poll, so the downtime at that point is 120 seconds.

    Now, this is where my question comes in. At the end of that first 120 seconds, does it enter the fast polling state by polling immediately and then wait 10 seconds before the next poll? Or does it wait 10 seconds before its first fast poll and then another 10 seconds before the next one?

    For the second scenario, where the node fails to respond at the first poll, the initial downtime should still be 120 seconds, because the polling engine has to wait for its default interval (120 seconds) before the next poll. This is my second question: at the start of the next polling interval, does the polling engine enter the fast polling state by polling immediately and then wait 10 seconds before the next poll? Or does it wait 10 seconds before its first fast poll and then another 10 seconds before the next one?

    I just hope you understand my questions.

    Many thanks in advance for your response.

  • 1) At the end of the first 120 seconds it does a regular poll, finds out that the node is down, and goes into fast poll mode. In fast poll mode it waits 10 seconds and then polls the node, waits another 10 seconds and then polls the node, and so on.

    2) I can't see any difference here. When the node fails to respond to a regular poll, polling is immediately switched to fast poll mode, which means it waits 10 seconds and then polls the node.

    When the node responds in fast poll mode, the node status is set to UP and fast polling ends. The next regular poll will commence after 120 seconds.
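
The switching between regular and fast polling described in this reply can be sketched as a small simulation. The names are hypothetical, and the sketch deliberately ignores the Node Warning Level expiry (the point at which the node is marked down), modelling only the wait-then-poll timing.

```python
from typing import Callable, List, Tuple


def poll_schedule(is_up: Callable[[int], bool],
                  horizon: int,
                  poll_interval: int = 120,
                  fast_interval: int = 10) -> List[Tuple[int, str, bool]]:
    """Return (time, mode, responded) for every poll up to horizon seconds.

    is_up(t) is a hypothetical callback reporting whether the node
    answers a poll sent at time t.
    """
    events = []
    t = poll_interval      # the first regular poll fires after one interval
    fast = False
    while t <= horizon:
        responded = is_up(t)
        events.append((t, "fast" if fast else "regular", responded))
        # A failed poll switches to (or stays in) fast mode; a reply ends it.
        fast = not responded
        # Always wait first, then poll: 10 s in fast mode, 120 s otherwise.
        t += fast_interval if fast else poll_interval
    return events


# Node is unreachable from t=100 up to (not including) t=135.
for event in poll_schedule(lambda t: not (100 <= t < 135), horizon=300):
    print(event)
```

In this example the regular poll at 120 s fails, fast polls follow at 130 s and 140 s, and after the reply at 140 s the next regular poll happens 120 seconds later, at 260 s.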


  • Hi Tisonet,

    Many thanks for your clarification.

    I'm no longer confused.

    Regards,