We recently had a crisis in our environment. A drive filled up and caused a database server to crash its web app... and we had no idea, even though the drive was monitored via NPM and alerts were set up to e-mail us when the space reached 2 separate thresholds.
How is this possible? I had no idea until I started investigating. A couple things of note are as follows:
1. On your Orion website, you will have no visual indication that a drive is not reporting any data.
2. If the drive fills up while it's in this state, you will receive no alerts.
3. There is no "out of box" report to tell you which drives need attention.
4. If a monitored drive fails, your boss WILL come to you asking wtf?
So, what happens you might ask? I don't know the "behind the scenes" things but basically (what I was told) if the serial number changes on the drive, Orion is unable to continue to monitor it. In a virtual environment where some drives are re-sized or moved around on different resources/hosts, this happens quite often. With our situation, when I looked at the server in question thru Orion's eyes, all 4 drives were healthy and at < 80% full. Yet when I clicked on the drive to get detailed graphs about the usage, I received the message "No Data for Selected Time Period".
Needless to say, this shocked me a bit. The report was defaulted to "Last 7 Days" so I started expanding the timeline and it turns out the data stopped being retrieved from the drive almost 3 months earlier. The graph has data 83 days and older but nothing newer. The next thing that struck me as odd was when I went into the "List Resources" section. None of the 4 drives were checkmarked. Back on the Node Page, all 4 drives were healthy and within the "Green" threshold. I went back in and checked the drives so they would be monitored. Back on the Node page, I now had duplicate names for each of the 4 drives in question. The same drive names for each one. The same size, same serial number, everything was the same, except the amount of space consumed on each drive. One of them was in the red, full.
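The gap described above (a volume that still shows green on the node page while its newest sample is months old) is something you can check for mechanically. Here is a minimal sketch in Python, assuming you have already pulled each volume's most recent sample timestamp from wherever your installation keeps volume statistics. The function name, the input shape and the one-day cutoff are all hypothetical and not part of Orion:

```python
from datetime import datetime, timedelta

def find_stale_volumes(last_samples, now, max_age=timedelta(days=1)):
    """Return (volume, age) pairs whose newest sample is older than max_age.

    last_samples maps a volume label to the datetime of its most recent
    sample, or None if no sample was ever recorded (hypothetical shape).
    """
    stale = []
    for volume, last_seen in last_samples.items():
        if last_seen is None:
            stale.append((volume, None))  # never reported at all
        elif now - last_seen > max_age:
            stale.append((volume, now - last_seen))
    # Oldest data first; volumes with no data at all sort to the top.
    return sorted(stale, key=lambda item: item[1] or timedelta.max, reverse=True)
```

Fed with the situation from this post, a drive whose last sample is 83 days old would land in the returned list, regardless of what colour the node page shows.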
At this point, the application was repaired, drive was cleared of useless data, questions were answered etc...
I began to investigate this situation and opened a ticket. The support rep said he would contact the devs regarding this issue. I was told that if this (Orion loses data collection from a hard drive) happens, you will have no idea and there is no way of finding out what is going on until it's too late. He was very nice and said he would put in a feature request to notify the user if a drive "disappears" from a node.
Disgruntled by this, I began to investigate on my own and found two very interesting columns in the "Volumes" table. They are called "VolumeIndex" and "VolumeResponding". An interesting note is (from what I could see) every single drive with a VolumeIndex of "0" has a VolumeResponding of "N", so the query can be shortened to either one if you choose.
I then created a simple query as follows:
SELECT Nodes.NodeID, Nodes.IP_Address, Nodes.DNS, Nodes.SysName, Nodes.Application, Volumes.VolumeID, Volumes.VolumeIndex, Volumes.Caption, Volumes.VolumeDescription, Volumes.VolumeResponding, Volumes.FullName
FROM Volumes
INNER JOIN Nodes ON (Volumes.NodeID = Nodes.NodeID)
WHERE Volumes.VolumeIndex = 0 AND Volumes.VolumeResponding = 'N'
I have over 2000 servers in Orion and the result of that query was a whopping 300. Now, to be fair, this resulting list also included both Physical & Virtual memory, as Orion sees each of these as a volume. Still, if a node moves to another resource or the drive is re-sized, Orion loses the ability to track its usage.
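If the memory entries clutter the list, the same query can be narrowed to disks only. The sketch below runs a trimmed version of it from Python against an in-memory SQLite database that stands in for the Orion database. Everything in it is an assumption for illustration: the rows are made up, and the `VolumeType` column with the values 'RAM' and 'Virtual Memory' is what I would expect to find, so check your own Volumes table before relying on it.

```python
import sqlite3

# Same idea as the query above, minus the memory "volumes".
# VolumeType and its values are assumptions; verify them in your schema.
QUERY = """
SELECT n.NodeID, n.SysName, v.VolumeID, v.Caption, v.VolumeType
FROM Volumes AS v
INNER JOIN Nodes AS n ON v.NodeID = n.NodeID
WHERE v.VolumeIndex = 0
  AND v.VolumeResponding = 'N'
  AND v.VolumeType NOT IN ('RAM', 'Virtual Memory')
ORDER BY n.SysName, v.Caption
"""

def build_demo_db():
    """Create an in-memory stand-in with made-up rows (not real Orion data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, SysName TEXT);
        CREATE TABLE Volumes (
            VolumeID INTEGER PRIMARY KEY, NodeID INTEGER, VolumeIndex INTEGER,
            Caption TEXT, VolumeType TEXT, VolumeResponding TEXT);
        INSERT INTO Nodes VALUES (1, 'db01'), (2, 'web01');
        INSERT INTO Volumes VALUES
            (10, 1, 0, 'C:', 'Fixed Disk', 'N'),
            (11, 1, 2, 'D:', 'Fixed Disk', 'Y'),
            (12, 1, 0, 'Physical Memory', 'RAM', 'N'),
            (13, 2, 0, 'Virtual Memory', 'Virtual Memory', 'N'),
            (14, 2, 0, 'E:', 'Fixed Disk', 'N');
    """)
    return conn

def find_forgotten_disks(conn):
    """Return the disk volumes that have stopped responding."""
    return conn.execute(QUERY).fetchall()
```

Against the real database you would run only the SQL text (through SQL Server Management Studio or the Database Manager); the Python wrapper is just there so the filter can be tried safely on throwaway data.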
This is just an FYI for those out there who have Orion working with a clustered environment.
On a side note, could some of you please run the query above and post the results here? I want to make sure this isn't just a simple problem with my installation. I'm very curious to see the results of other setups.
Thanks for reading this long post and I hope my experience has assisted someone out there.
PS: If any of the information I supplied here is in error, please don't hesitate to correct me.
I'm experiencing the same issue. On several occasions my server team would ask me if I'm monitoring a certain volume that was recently added, or ask why I'm still monitoring a volume that was moved. My initial thought was always that I had forgotten to check that particular resource when the server was added, but it turns out that SW does not keep track of volumes that are added/moved. I have set up the above alert from smargh (thanks by the way) and I have to explain to my server team that there may still be volumes out there that SW has basically forgotten about because it isn't adapting to changing volumes.
The problem is not that we aren't alerted for volumes that "stop responding", it's that Orion cannot handle a Windows volume being renamed on the monitored host itself (e.g. from "OS" to "Windows"), nor can it handle the volume ID changing. This is, of course, assuming that this is actually the root cause in your particular case. I haven't tested this problem this year, so it is possible that I'm now wrong.
You can work around this by using a script to update the Volumes table with the new volume description that's returned in SNMP. Orion should really handle this natively - it's quite embarrassing. Every time I've raised this with Support, they've always insisted that this is a feature and they wouldn't escalate it. Perhaps if you have more money invested in Orion products, they'll give this bug - and I consider it a bug - some attention. Maybe it's just difficult to fix from a software architecture perspective.
I started writing a PowerShell script to poll a node by SNMP, compare the result with what's in the Volumes table, and run an UPDATE query if they don't match, but I got too scared that I might do something bad to the database and delete everything, or forget a WHERE clause. I have, however, tested manually renaming a volume in the Volumes table and it does indeed make Orion resume monitoring. The last time I sniffed the traffic to/from my Orion poller, it did retrieve all volume info, including renamed volumes, but Orion only seems to look at the full returned name of the volume (it's not just "C" or "C:" - it's the volume description as in the SNMP response) and ignores it if the contents don't match exactly with the existing record in the Volumes database table.
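For anyone who wants to pick up that idea without the fear of a missing WHERE clause, one option is to split it in two: first build a plan of the updates, review it by eye, and only then run anything. Below is a minimal dry-run sketch in Python rather than PowerShell. The function names, the input shapes, the matching on drive letter, and the example description format are all my assumptions for illustration; it never touches the database itself.

```python
def drive_key(description):
    """Text before the first colon, upper-cased (the drive letter on Windows)."""
    return description.split(":", 1)[0].strip().upper()

def plan_volume_renames(stored, polled):
    """Pair stored volume rows with changed SNMP descriptions.

    stored: list of (volume_id, node_id, description) read from the Volumes table.
    polled: dict mapping node_id to the list of descriptions SNMP returned.
    Returns (sql, params) tuples to review; nothing is executed here.
    """
    plan = []
    for volume_id, node_id, old_desc in stored:
        current = {drive_key(d): d for d in polled.get(node_id, [])}
        new_desc = current.get(drive_key(old_desc))
        if new_desc is not None and new_desc != old_desc:
            # Parameterized, and always scoped to exactly one VolumeID.
            plan.append((
                "UPDATE Volumes SET VolumeDescription = ? WHERE VolumeID = ?",
                (new_desc, volume_id),
            ))
    return plan
```

Because every statement carries its own `WHERE VolumeID = ?`, the worst case of a bad match is one wrong row, not the whole table. The `?` placeholder style depends on the database driver you use, and you would still want a backup before executing anything.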
I use this alert to inform me of volumes which are renamed or have the ID changed, causing Orion to forget about them:
Vendor is equal to Windows
Full Name contains :\
Node Status is equal to Up
Trigger alert when ANY of the following apply:
Volume Status is equal to Down
Volume Status is equal to Unknown
Volume Responding is equal to N
Do not trigger this action until condition exists for 40 minutes.
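To make the grouping explicit, here is how I read that trigger, written as a small Python predicate. This is only a sketch of the logic and not how Orion evaluates alerts: it assumes the first three conditions must all hold, that any one of the last three is enough, and that the Full Name condition is looking for the drive-letter separator (the `name_marker` default is a guess). The dictionary keys are made up, and the 40-minute delay is not modelled.

```python
def should_alert(volume, name_marker=":"):
    """Evaluate the trigger condition for one volume (a plain dict)."""
    # All of these must hold: Windows node, drive-style name, node is up.
    in_scope = (
        volume["vendor"] == "Windows"
        and name_marker in volume["full_name"]
        and volume["node_status"] == "Up"
    )
    # Any one of these is enough to flag the volume.
    in_trouble = (
        volume["volume_status"] in ("Down", "Unknown")
        or volume["volume_responding"] == "N"
    )
    return in_scope and in_trouble
```

The Node Status check matters: without it, every volume on a server that is simply down would fire this alert too.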
Our NPM maintenance with SolarWinds expired a while ago, so I can't chase support about this on your behalf - sorry.
I have been experiencing this issue much more frequently now that my company is preparing for a data center migration. The issue is twofold as I see it.
1. Loss of monitoring. Clearly this is critical for the reasons cited by the original poster. When the volume data changes, Orion loses sight of it and effectively ignores it. This can and does cause alert status definitions to fail to trigger. Yes, I can run a SQL query to MANUALLY go in and add the newly named volumes but that leads to the second issue...
2. Loss of data history. Once the volume name changes and I add the new name via "List Resources", I am left with a nonresponsive volume that just sits there (taking up an Orion license I might add). Another simple matter of just deleting that volume via the Database Manager. Problem is that I then lose ALL historical data from that volume. The new name is exactly that, brand new. No more trend analysis is possible. I would not like to even think of the messy data merges that would require.
What I do not understand is why Orion locks in on the SNMP returned VALUE rather than allowing it to be dynamic. The OID is static (.126.96.36.199.188.8.131.52.184.108.40.206) and each volume is just the GETNEXT starting at 220.127.116.11. The DATA is dynamic based on the returned value of the SNMP query, so why can the NAME not be handled the same way?