We recently had a crisis in our environment. A drive filled up on a database server and crashed its web app... and we had no idea, even though the drive was monitored via NPM and alerts were set up to e-mail us when usage reached either of 2 separate thresholds.
How is this possible? I had no idea until I started investigating. A couple things of note are as follows:
1. On your Orion website, you will have no visual indication that a drive is not reporting any data.
2. If the drive fills up while it's in this state, you will receive no alerts.
3. There is no "out of box" report to tell you which drives need attention.
4. If a monitored drive fails, your boss WILL come to you asking wtf?
So, what happens, you might ask? I don't know the "behind the scenes" details, but basically (from what I was told) if the serial number on the drive changes, Orion is unable to continue monitoring it. In a virtual environment where drives are re-sized or moved between resources/hosts, this happens quite often. In our situation, when I looked at the server in question through Orion's eyes, all 4 drives were healthy and < 80% full. Yet when I clicked on a drive to get detailed usage graphs, I received the message "No Data for Selected Time Period".
Needless to say, this shocked me a bit. The report defaulted to "Last 7 Days", so I started expanding the timeline, and it turned out that data had stopped being retrieved from the drive almost 3 months earlier. The graph had data from 83 days ago and older, but nothing newer. The next thing that struck me as odd was when I went into the "List Resources" section. None of the 4 drives were checked. Back on the Node page, all 4 drives were healthy and within the "Green" threshold. I went back in and checked the drives so they would be monitored. Back on the Node page, I now had duplicate entries for each of the 4 drives in question. Each pair had the same name, same size, same serial number; everything was the same except the amount of space consumed. One of them was in the red, full.
At this point, the application was repaired, the drive was cleared of useless data, questions were answered, etc.
I began to investigate this situation and opened a ticket. The support rep said he would contact the devs regarding this issue. I was told that if this happens (Orion loses data collection from a hard drive), you will have no idea, and there is no way of finding out what is going on until it's too late. He was very nice and said he would put in a feature request to notify the user if a drive "disappears" from a node.
Disgruntled by this, I began to investigate on my own and found two very interesting columns in the "Volumes" table. They are called "VolumeIndex" and "VolumeResponding". An interesting note: from what I could see, every single drive with a VolumeIndex of "0" has a VolumeResponding of "N", so the WHERE clause can be shortened to either condition if you choose.
I then created a simple query as follows:
    SELECT Nodes.NodeID, Nodes.IP_Address, Nodes.DNS, Nodes.SysName, Nodes.Application,
           Volumes.VolumeID, Volumes.VolumeIndex, Volumes.Caption, Volumes.VolumeDescription,
           Volumes.VolumeResponding, Volumes.FullName
    FROM Volumes
    INNER JOIN Nodes ON (Volumes.NodeID = Nodes.NodeID)
    WHERE Volumes.VolumeIndex = 0 AND Volumes.VolumeResponding = 'N'
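If you want to check on your own install whether the two columns really do move together before dropping one of the conditions, a quick cross-tab along these lines should show it. This is only a sketch and uses nothing beyond the Volumes columns mentioned above; the IndexState and VolumeCount labels are made up for the output.

```sql
-- Sketch: count volumes for each combination of "index is zero or not"
-- and VolumeResponding. If the observation holds, the only row with
-- IndexState = 'index = 0' should have VolumeResponding = 'N',
-- and no row with 'index <> 0' should have 'N'.
-- IndexState and VolumeCount are illustrative labels, not Orion columns.
SELECT
    CASE WHEN VolumeIndex = 0 THEN 'index = 0' ELSE 'index <> 0' END AS IndexState,
    VolumeResponding,
    COUNT(*) AS VolumeCount
FROM Volumes
GROUP BY
    CASE WHEN VolumeIndex = 0 THEN 'index = 0' ELSE 'index <> 0' END,
    VolumeResponding
ORDER BY IndexState, VolumeResponding;
```

If any other combination shows up, it is probably safer to keep both conditions in the WHERE clause.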
I have over 2000 servers in Orion and that query returned a whopping 300 rows. Now, to be fair, this list also included both Physical & Virtual memory, as Orion sees each of these as a volume. Still, if a node moves to another resource or a drive is re-sized, Orion loses the ability to track usage on the affected drives.
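To cut the memory entries out of that list, something like the following might work. Treat it as a sketch: it assumes the Volumes table has a VolumeType column and that disks are labelled 'Fixed Disk', neither of which is confirmed here, so look at the distinct values in your own database first.

```sql
-- Sketch only: same join as the query above, limited to disk volumes.
-- ASSUMPTION: Volumes.VolumeType exists and disks carry the value 'Fixed Disk'.
-- Verify first with: SELECT DISTINCT VolumeType FROM Volumes;
SELECT Nodes.NodeID, Nodes.SysName, Nodes.IP_Address,
       Volumes.VolumeID, Volumes.Caption, Volumes.VolumeResponding
FROM Volumes
INNER JOIN Nodes ON (Volumes.NodeID = Nodes.NodeID)
WHERE Volumes.VolumeIndex = 0
  AND Volumes.VolumeResponding = 'N'
  AND Volumes.VolumeType = 'Fixed Disk'   -- assumed column and value
ORDER BY Nodes.SysName, Volumes.Caption;
```

Sorting by node name keeps all of a server's silent drives together, which makes the list easier to work through node by node.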
This is just an FYI for those out there who have Orion working with a clustered environment.
On a side note, could some of you please run the query above and post the results here? I want to make sure this isn't just a simple problem with my installation. I'm very curious to see the results from other setups.
Thanks for reading this long post, and I hope my experience helps someone out there.
PS: If any of the information I supplied here is in error, please don't hesitate to correct me.