Hi,
I ran into a frightening scenario today which I hope you can shed some light on.
We use SolarWinds Orion NPM and APM to monitor our network and applications. When I tried to dig some statistics out of APM today, I discovered it had completely stopped recording data!
All the nodes were still up, but every single component had no response time or statistic value since 4pm April 6th.
When I tried to 'Test' any component, it just sat there and spun. I found I could simulate the tests fine myself from the monitoring server by telnetting to the target on port 80 and running a GET. APM, however, could not connect, and furthermore a netstat revealed only administrative network connections being established (remote desktop and such) - nothing monitoring related.
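For anyone wanting to repeat that manual check without telnet, here is a minimal Python sketch of the same idea: open a TCP connection to port 80, send a bare GET, and read back the status line. The function names and the hostname in the usage comment are hypothetical, and this is not how APM itself performs its tests; it only mimics what I typed by hand.

```python
import socket


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request, similar to one typed into telnet."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_status_line(response: bytes) -> str:
    """Return the first line of the response, e.g. 'HTTP/1.1 200 OK'."""
    return response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")


def manual_http_check(host: str, port: int = 80, path: str = "/",
                      timeout: float = 10.0) -> str:
    """Connect, send a GET, and return the status line of the reply."""
    # The timeout applies to the connect and to each recv, so a stalled
    # read raises an exception instead of blocking forever.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        buffer = b""
        # Read until the status line is complete or the server closes.
        while b"\r\n" not in buffer:
            data = sock.recv(4096)
            if not data:
                break
            buffer += data
    return parse_status_line(buffer)


# Hypothetical usage (replace the hostname with a real monitored target):
# print(manual_http_check("target.example.com"))
```

If this succeeds from the monitoring server while the APM 'Test' button hangs, it points at the monitoring software rather than the network path.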
Stopping and starting all the SolarWinds processes made no difference.
I noticed the server was running low on disk space on C: (about 7% free), so I started clearing stuff up. The second I got above 10% free, everything leapt into life. The APM test I had running in a browser window suddenly completed, I started getting alerts for about 15 triggers which had gone off, statistics were being recorded, and a netstat revealed hundreds of connections, which is what I'd expect.
Now, the 10% thing is a bit anecdotal - it may be that as I was clearing stuff up I deleted a stale lock file (I was only deleting files from 2009 though, and we were working up until April 6th, so this seems unlikely), or triggered something else that woke the system up. It seemed to happen right at that point in time though, so it seems suspicious.
Summary of facts:
- APM fails to make outbound TCP connections
- Can make outgoing TCP connections manually from the server just fine
- APM's 'Test' attempts appear to just hang, waiting
- No alerts triggered
- Restarting the service didn't help
- Clearing up disk space resolved the issue without any other intervention
Are there any known issues that would cause APM to behave like this? We're obviously very concerned, as we've had a total lack of monitoring for the last 2 weeks and were completely unaware of the problem (indeed, anything short of drilling down as far as the response time graphs gave no indication), and only stumbled on it by accident.
We've now put an additional alert in place to let us know if one of our monitoring servers drops below 12% free disk on C:\ - but this is not ideal.
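For anyone who wants a similar check that does not depend on Orion itself, a minimal Python sketch of the idea follows. The 12% threshold and the C:\ path come from our setup; the function names are hypothetical, and how the result is reported (email, scheduled task, and so on) is left out.

```python
import shutil


def free_percent(path: str) -> float:
    """Return the percentage of free space on the volume containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Guard against odd filesystems that report a zero size.
        return 0.0
    return usage.free / usage.total * 100.0


def is_low_on_disk(path: str, threshold_percent: float = 12.0) -> bool:
    """True when free space on the volume is below the threshold."""
    return free_percent(path) < threshold_percent


# Hypothetical usage on the monitoring server:
# if is_low_on_disk("C:\\"):
#     print(f"WARNING: only {free_percent('C:\\\\'):.1f}% free on C:")
```

Running something like this from a second machine, or from a scheduled task, would at least avoid relying on the monitoring system to report its own failure.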