Hi all,
I'm implementing an NPM and APM environment to monitor many hundreds of servers.
I'm trying to work out the best way to alert on servers. Most servers will use the default thresholds for disk space, memory, etc., but some machines will need to either ignore those alerts or override the default with a different value (e.g. 99% disk utilization rather than 95%).
For example, on some Oracle servers the memory utilization will always be 100% (preallocated), so we'll want to ignore memory in the alert. On another machine, a preallocated page file fills a disk to 98%, so we'll want to raise the disk utilization threshold for that particular server.
To do this, do I need to create a new alert for each machine that needs different thresholds? If so, the list of alerts could become huge. I could potentially create one complex rule for the default alert that triggers on different thresholds, but it would become a massive rule as we add more overrides to it (and I assume it probably won't perform very well).
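As an illustration only, the behaviour described above (one default per metric, plus a small table of per-server exceptions) could be sketched in Python like this. It is not NPM/APM syntax; the server names, the metric keys, and the 95% memory default are all made up for the example.

```python
# Hypothetical sketch: none of these names come from NPM or APM.
# Default thresholds (percent). The 95% disk default is from the post;
# the memory default is an assumed placeholder.
DEFAULT_THRESHOLDS = {"disk_pct": 95.0, "memory_pct": 95.0}

# Per-server exceptions. A value of None means "ignore this metric here".
OVERRIDES = {
    "oracle-db-01": {"memory_pct": None},  # memory preallocated, always 100%
    "app-server-07": {"disk_pct": 99.0},   # page file preallocated on the disk
}


def effective_threshold(server, metric):
    """Return the threshold for this server and metric, or None if ignored."""
    server_overrides = OVERRIDES.get(server, {})
    if metric in server_overrides:
        return server_overrides[metric]
    return DEFAULT_THRESHOLDS[metric]


def should_alert(server, metric, value):
    """True when the metric is not ignored and has reached its threshold."""
    threshold = effective_threshold(server, metric)
    return threshold is not None and value >= threshold
```

The point of the sketch is that the alert logic itself stays the same size no matter how many exceptions exist; only the override table grows.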
Does anyone have any experience with this and can offer some tips or advice?
Thanks,
Chris.