Application Performance Monitor silently stops recording data

PLCOperations

Hi,

I ran into a frightening scenario today which I hope you can shed some light on.

We use SolarWinds Orion NPM and APM to monitor our network and applications. When I tried to dig some statistics out of APM today, I discovered it had completely stopped recording data!

All the nodes were still up, but every single component had no response time or statistic value since 4pm April 6th.

When I tried to 'Test' any component, it just sat there and spun. I found I could simulate the tests fine myself from the monitoring server by telnetting to the target on port 80 and running a GET. APM, however, could not connect, and a netstat revealed only administrative network connections being established (remote desktop and such), nothing monitoring related.
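
For anyone wanting to repeat the manual check described above without telnet, here is a minimal sketch in Python. It assumes plain HTTP on port 80; the function names `build_get_request` and `probe_http` are made up for illustration and are not part of APM or Orion.

```python
import socket


def build_get_request(host, path="/"):
    """Build the same minimal HTTP/1.1 GET one would type into a telnet session."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def probe_http(host, port=80, path="/", timeout=5.0):
    """Return the HTTP status line, or None if the connection fails or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_get_request(host, path))
            data = b""
            # Read only until the end of the first line (the status line).
            while b"\r\n" not in data:
                chunk = sock.recv(1024)
                if not chunk:
                    break
                data += chunk
    except OSError:
        # Covers refused connections, DNS failures, and socket timeouts.
        return None
    status_line = data.split(b"\r\n", 1)[0].decode("latin-1")
    return status_line or None
```

Run from the monitoring server, a non-None result from `probe_http` for a monitored target would suggest the network path is fine and the problem sits with the monitoring process itself.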

Stopping and starting all the SolarWinds processes made no difference.

I noticed the server was running low on disk space on C: (about 7% free), so I started clearing stuff up. The second I got above 10% free, everything leapt into life. The APM test I had running in a browser window suddenly completed, I started getting alerts for about 15 triggers which had gone off, statistics were being recorded, and a netstat revealed hundreds of connections, which is what I'd expect.

Now, the 10% thing is a bit anecdotal. It may be that as I was clearing stuff up I deleted a stale lock file (I was only deleting files from 2009 though, and we were working up until April 6th, so this seems unlikely), or triggered something else that woke the system up. It seemed to happen right at that point in time though, so it seems suspicious.

Summary of facts:

- APM fails to make outbound TCP connections

- Can make outgoing TCP connections manually from the server just fine

- APM's 'Test' attempts appear to just hang, waiting

- No alerts triggered

- Restarting the service didn't help

- Clearing up disk space resolved the issue without any other intervention

Are there any known issues that would cause APM to behave like this? We're obviously very concerned by this, as we've had a total lack of monitoring for the last 2 weeks and were completely unaware of the problem (indeed, anything short of drilling down as far as the response time graphs gave no indication), and we only stumbled on it by accident.

We've now put an additional alert in place to let us know if one of our monitoring servers drops below 12% free disk space on C:\, but this is not ideal.
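
A rough sketch of the kind of free-space check described above, written as a standalone script rather than an Orion alert. The 12% threshold comes from the post; the function names are hypothetical, and this only illustrates the percentage test, not how Orion evaluates its own alerts.

```python
import shutil


def free_percent(path):
    """Return the percentage of free space on the volume containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def is_low(free_pct, threshold_pct=12.0):
    """True when free space has dropped below the alert threshold."""
    return free_pct < threshold_pct
```

On the monitoring server this could be called as `is_low(free_percent("C:\\"))` from a scheduled task, with the notification mechanism left to whatever is already in place.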


jiri.tomek

Hello,
what you described is really strange. So far I'm not aware of any issues related to low disk space on the monitoring server. However, 10% is not precise information. How much space was it in absolute terms? 10 GB, 100 MB? APM should work just fine if all other parts of the system are OK. Please open a support ticket so we can look into it and try to find the reason for this failure.

Thank you

FormerMember

How many components do you monitor, and what do your previous events look like (i.e. tons of ups and downs without triggered alarms)?

My biggest problem with APM is the scalability issue... it can get in a right mess and write GBs of event and performance data to the APM process_detail and stats tables.

I purged all of these tables, and APM came back to life again.

PLCOperations

We actually monitor a very small set - I wouldn't have thought scaling would be an issue at the level we monitor at.

We typically have a very good ratio of alerts vs actually reaching trigger thresholds (or seem to - I haven't done an audit to confirm).

APM:

Total Number of Component Monitors: 521

Polling interval varies from 2 minutes to 2 hours - tip of the bell would be about 5 minutes, with very steep edges.

NPM:

Version	9.5.0
Release	Orion 9.5.0 May 2009

Network Elements	273
Nodes	114
Interfaces	30
Volumes	129

PLCOperations

Our C: is 19.5 GB, so it would have been at about the 1.9-2 GB free stage that things came back to life.

I'll open a ticket now.

PLCOperations

I should also point out that this is just the system disk.

We have a separate 50 GB drive for the database, backups, and such, and that still has 70% free.

chris.lapoint

Windows may reserve disk space that prevents APM from writing to the system (e.g. app/OS swap space), so not having enough disk space could certainly create problems for Windows and applications (not just APM). The problem is, if the application can't write to disk, how can it write to the Windows Event Log or Orion Event Log? In any case, it's definitely worth capturing diagnostics and investigating further to confirm whether this is the case.

The recommended percentage of free disk space will differ by environment, depending on the amount of physical memory and the size of the disk.

Anybody watching this thread have any rough guidelines around this?

PLCOperations

There are no gaps in the event logs that I can see, and no other indication of out-of-disk related issues - operating system, other applications, SQL Server - even NPM - all seemed to be chugging along just fine. Just APM, which couldn't connect out at all.

It's possible the 10% thing is just coincidence - this is the only experience I've had of this issue, and a sample set of one does not conclusive evidence make! All theories welcome!

For example - could an APM thread be alive but dormant, waiting on some lost file system activity, with previous APM connections erroneously marked as still open and accounted to this thread? We would then hit the OS limit for connections per process until the thread in question died, freeing them up.

Convoluted and contrived, I'll grant you, but something along those lines?

chris.lapoint

Have you already opened a support case on this item?

familyofcrowes

I am having a very similar experience. I am in a meeting showing the capabilities to management and there is no data. If I unmanage and then remanage it starts working again, but that is unacceptable.

How can I alert that this is occurring? How can I prevent it? I have over 400 application monitors and over 1000 components so I need something fast.

We are removing Tivoli Jan 1st and Orion will assume all alerting for all servers and applications. It CANNOT fail.
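
One generic way to catch this kind of silent failure, independent of the disk-space theory, is to alert when no new samples have arrived for longer than expected. The sketch below assumes the last-sample time for each component has already been obtained somehow (how to get that out of Orion is not covered here); the `stale_components` helper, the component names in the usage note, and the cutoff value are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_components(last_seen, now, max_age):
    """Return the sorted names of components whose newest sample is older than max_age.

    last_seen: dict mapping a component name to the datetime of its latest sample.
    now:       the current time, passed in so the check is easy to test.
    max_age:   a timedelta; it should comfortably exceed the longest polling interval.
    """
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age)
```

For example, with `max_age=timedelta(minutes=15)`, a component last seen three minutes ago would be ignored while one last seen two weeks ago would be reported. If every component shows up as stale at once, that points at the poller rather than at the monitored applications.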

danielleh

Hi familyofcrowes--

I have marked this for the PM to review and hopefully have some answers and feedback for you soon.

Thanks,
DH

aLTeReGo

familyofcrowes, can you open a case with support and post the case number here? I'll ensure your case is handled appropriately.