Better Pingdom results on failed requests

When a Pingdom check fails, it provides no context about the request, even after a timeout occurs. This makes it hard to identify the root cause.

Scenario:

Pingdom health check is set up and running fine
Pingdom health check begins to fail
Results Log and Test Log show no fine-grained details on why
The only result shown could be "Socket timeout" or the infamous "tslv1" response

This does not help root-cause an issue. Even with a 30-second timeout, Pingdom health checks should still wait for the response, say up to a global, configurable 5-minute timeout. This would let users see the response headers, which may contain unique IDs that help track down individual requests.
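To make the request concrete, here is a minimal sketch of a check that marks itself as failed once it passes a soft threshold but keeps waiting up to a longer hard timeout, so the response headers are still captured. It is not how Pingdom works internally; `probe`, `ProbeResult`, and both parameter names are hypothetical, and the 30-second and 5-minute defaults simply mirror the numbers above.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProbeResult:
    status: Optional[int]          # HTTP status, or None if no response arrived
    elapsed_s: float               # wall-clock time spent on the request
    over_threshold: bool           # True if the soft threshold was exceeded
    headers: dict = field(default_factory=dict)
    error: Optional[str] = None    # network-level error text, if any


def probe(url, soft_timeout_s=30.0, hard_timeout_s=300.0):
    """Fetch url, flagging slow responses instead of abandoning them.

    Note: urllib applies hard_timeout_s to each blocking socket operation,
    not to the request as a whole, so this is only an approximation of a
    global timeout.
    """
    start = time.monotonic()
    status, headers, error = None, {}, None
    try:
        with urllib.request.urlopen(url, timeout=hard_timeout_s) as resp:
            status = resp.status
            headers = dict(resp.headers.items())
    except urllib.error.HTTPError as exc:
        # The server answered with an error status; keep its headers too.
        status = exc.code
        headers = dict(exc.headers.items())
    except (urllib.error.URLError, OSError) as exc:
        # No usable response at all (DNS failure, refused, hard timeout).
        error = str(exc)
    elapsed = time.monotonic() - start
    return ProbeResult(status, elapsed, elapsed > soft_timeout_s, headers, error)
```

With something like this, a slow request would still be reported as down after 30 seconds, but the log entry could include whatever request ID header the server eventually returned.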

We used a competitor to gather this information, because they let health checks finish even after crossing the 30-second threshold. Using them helped us track down a potential root cause much more quickly than with Pingdom alone.

It would be a powerful capability to introduce to the platform.

Status: None

Comments

ttorkelson

@rdeleonzebra I would also like to see this. As of right now, the root cause feature is quite useless, and I have also noticed that some competitors already give better detail and feedback on failed uptime alerts.

We only need the failed response's status code and body to consider this a useful feature. Right now, my alerts are indistinguishable from any other non-200 status code; this is what is called a "false negative", and when responding to downtime under an SLA, every second is potentially priceless!

srs

Also jumping on this bandwagon on behalf of the company I work for.

We seem to have intermittent errors, and for that kind of error the root cause analysis is useless. This is even well documented at https://help.pingdom.com/hc/en-us/articles/115003437069-Known-Issues and https://help.pingdom.com/hc/en-us/articles/203810521. The test result log error messages and timestamps (as referred to in both of those pages) are NOT detailed enough to provide any meaningful debugging.

There are two possible feature requests that could solve this issue:

Root cause analysis of the actual web request that caused the error.
Analysis results for every test log entry, including GET response headers and body (when available; this doesn't make sense if the server hasn't responded, in which case a traceroute should suffice).
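As a rough illustration of the second request, the sketch below shows the kind of detail a test log entry could carry: status code, headers, and a truncated body when the server answered, and only the network error when it did not. This is an assumption about what would be useful, not a description of any Pingdom API; `describe_failure` and its parameters are hypothetical names.

```python
import urllib.error
import urllib.request


def describe_failure(url, timeout_s=30.0, max_body_bytes=2048):
    """Return None if the check passed, otherwise a dict describing the failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            # urlopen raises HTTPError for error statuses such as 4xx/5xx,
            # so reaching this point means the check succeeded.
            return None
    except urllib.error.HTTPError as exc:
        # The server responded: keep status, headers, and a capped body.
        body = exc.read(max_body_bytes)
        return {
            "kind": "http_error",
            "status": exc.code,
            "headers": dict(exc.headers.items()),
            "body": body.decode("utf-8", errors="replace"),
        }
    except (urllib.error.URLError, OSError) as exc:
        # No response at all; headers and body do not exist here, so a
        # network-level diagnostic (e.g. a traceroute) is the next step.
        return {"kind": "no_response", "error": str(exc)}
```

Capping the body keeps log entries small while still surfacing the error message or request ID that most failing responses put near the top.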