Need Help with Cortex consuming all disk space.

lcsw2013

Hi,

All of our Windows systems are on agent monitoring, and recently we've noticed an increase in alerts for disk space being fully consumed. When we checked, the Cortex part of the agent had consumed over 30 GB of disk space on every agent, and on many servers that was enough to use up all available disk space. In other words, SolarWinds was crashing our tools and servers because the Cortex service out of the blue decided to crap out.

More specifically, the files seem to be related to the cache for volume polling. We've already opened two tickets with support but aren't getting anywhere, so I wanted to ask the community.

The cache files appear to be DB files that balloon out of control. As far as we've been able to tell, the agent doesn't lose its connection, so we can't understand why it isn't flushing this cache. Support gave us a script to run along with a reboot of the agents, and we followed their recommendations step by step, to no avail. The issue continues.

I feel like we just scratched the surface instead of attacking the root of this problem.

When we chart disk space we see a slow, steady increase until one day all disk space is taken. As a temporary measure we have been deleting the files manually. This seems to make the agents partially unstable: as soon as we delete the files, the agent reports that it cannot flush the files because they don't exist. That doesn't make sense if the files only ever grow.
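
To put a number on that slow, steady increase, here is a minimal Python sketch that fits a straight line to a few size samples and estimates how many days remain before the free space runs out. The function name `days_until_full` and the sample values are hypothetical, and it assumes the growth really is roughly linear, which is only what our charts suggest.

```python
from datetime import datetime

def days_until_full(samples, free_bytes):
    """Estimate days until free_bytes is consumed.

    samples: list of (datetime, cache_size_in_bytes) tuples, oldest first.
    Returns None if there are too few samples or the cache is not growing.
    """
    if len(samples) < 2:
        return None
    t0 = samples[0][0]
    # Convert timestamps to days elapsed since the first sample.
    xs = [(t - t0).total_seconds() / 86400 for t, _ in samples]
    ys = [size for _, size in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples taken at the same instant
    # Least-squares slope: bytes of growth per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None  # flat or shrinking, nothing to project
    return free_bytes / slope
```

Feeding it one sample per day from whatever already collects the volume data would be enough to see which servers are closest to filling up.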

I can provide more details if needed.


lcsw2013

@mesverrum Not sure if you could help. I'm reaching out to see if anyone has encountered this before. I suspect that the Cortex services are corrupt in our environment, but I have no smoking gun or root cause. And I don't understand the Cortex service in enough depth to know how it works on the system. Parsing the logs only adds to my confusion, as the logs seem to loop and go nowhere.

This is tied to agent monitoring. I hate agents, as SolarWinds hasn't done enough to get agents to the same stability as WMI or SNMP monitoring. We have all our Windows servers on agent monitoring; that's 100-plus servers all causing me a huge headache.

Part of me thinks that security is affecting how SolarWinds functions. Our customer is federal and has layer upon layer of security tied to the entire infrastructure. And SolarWinds recently removing their stance on adding exceptions has only caused more of a headache, with our team butting heads with our security folks because they are applying full-force security to our system.

Anyway, my focus right now is on fixing this problem, as SolarWinds is actually crashing other servers by consuming all their disk space. It seems related to volume polling. Unfortunately, this is as far as I've gotten.

Thanks in advance for any help you can offer!

jpaluch

We had a similar issue about a year ago; I'm not sure of the exact build number. We were given the following workaround:

For the workaround can we perform the following:

1) Clean Volumes from Cortex DB with following query:
delete from Cortex_Documents where Data LIKE '%"ModelType": "Orion.Volume"%' OR Data LIKE '%"Type": "Orion.NodeToVolumes"%'
2) Stop Agent/Cortex on Agent machine
3) Delete all *.db files in c:\programdata\SolarWinds\Cortex_Agent\
4) Start Agent/Cortex again

Please note that it is safe to delete the Volume information from Cortex_Documents tables because in this case, the Cortex is used only for realtime polling. Realtime polling is able to create correct records as they are needed. The *.db files are just cached data. They probably contain incorrect records and the Cortex is polling Volumes even when it shouldn't.
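
If you have to repeat step 3 across many agents, a small script with a dry-run mode may be less error-prone than deleting by hand. The following is only a sketch under assumptions: the function names `find_cache_files` and `clean_cache` are made up for illustration, it only looks at `*.db` files directly inside the folder you pass in, and it assumes the Agent/Cortex service has already been stopped as in step 2.

```python
from pathlib import Path

def find_cache_files(cache_dir):
    """Return (path, size_in_bytes) for each *.db file directly in cache_dir, largest first."""
    files = [
        (p, p.stat().st_size)
        for p in Path(cache_dir).glob("*.db")
        if p.is_file()
    ]
    return sorted(files, key=lambda item: item[1], reverse=True)

def clean_cache(cache_dir, dry_run=True):
    """Delete the *.db cache files, or only report them when dry_run is True.

    Returns the total number of bytes freed (or that would be freed).
    """
    total = 0
    for path, size in find_cache_files(cache_dir):
        total += size
        if dry_run:
            print(f"would delete {path} ({size} bytes)")
        else:
            path.unlink()
    return total
```

Running `clean_cache(r"c:\programdata\SolarWinds\Cortex_Agent")` first with the default `dry_run=True` shows what would be removed and how much space would come back; pass `dry_run=False` only once the list looks right.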

I would suggest upgrading to the current version if you are not already on it, as this was fixed later on.

lcsw2013

Support had us do this, although your query looks a little different from the one we were provided. We ran it and saw the disk space reclaimed across all our servers, but the issue came right back. It wasn't resolved.