As a quick check, I would suggest querying the alertlog table in the DB and seeing whether the times of the alert triggers and actions actually line up with the time the node went down and the time the email was sent. This would at least confirm that the alerts are triggering late and rule out the possibility that the emails are just getting delayed somewhere.
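A rough sketch of that comparison in Python, assuming you have already exported the node-down times and the logged alert action times yourself. The node names, timestamps, and the 5-minute threshold below are made-up sample values, not anything taken from Orion:

```python
from datetime import datetime, timedelta

def alert_lags(down_events, alert_actions, max_lag=timedelta(minutes=5)):
    """Pair each node-down time with the first alert action at or after it.

    Returns (node, lag, is_late) tuples. lag is None when no action was
    logged after the outage began, which is also counted as late.
    """
    results = []
    for node, down_at in down_events:
        later = sorted(t for n, t in alert_actions if n == node and t >= down_at)
        lag = later[0] - down_at if later else None
        results.append((node, lag, lag is None or lag > max_lag))
    return results

# Hypothetical sample rows: (node name, timestamp)
down_events = [
    ("core-sw1", datetime(2024, 1, 8, 9, 0, 0)),
    ("edge-rtr2", datetime(2024, 1, 8, 14, 30, 0)),
]
alert_actions = [
    ("core-sw1", datetime(2024, 1, 8, 9, 2, 0)),
    ("edge-rtr2", datetime(2024, 1, 8, 15, 10, 0)),
]

for node, lag, late in alert_lags(down_events, alert_actions):
    print(node, lag, "LATE" if late else "ok")
```

If the lag is small in the morning and grows through the day, that points at the alert engine rather than at mail delivery.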
Have you verified that there is no delay in your SMTP server settings, i.e. in how your email server processes the requests from the Orion server? It seems like with every update I get some alerts that route to my Outlook Spam folder.
Also, I would ask: are you using the alerting feature in Orion or the Advanced Alerting features on the Orion server itself? Advanced alerting has many features which can delay sending if misconfigured.
Have you checked the alert properties?
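One way to see where a mail delay happens is to look at the Received headers of a late alert email, since each mail server stamps its own time. A minimal sketch using only the Python standard library; the sample message and host names are invented for illustration:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

def hop_delays(raw_message):
    """Return seconds elapsed between consecutive Received hops, oldest first."""
    msg = message_from_string(raw_message)
    stamps = []
    for header in msg.get_all("Received", []):
        # The timestamp follows the last semicolon in a Received header.
        _, _, date_part = header.rpartition(";")
        stamps.append(parsedate_to_datetime(date_part.strip()))
    stamps.reverse()  # headers are prepended, so the first one is the newest hop
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

# Hypothetical alert email with two hops
SAMPLE = (
    "Received: from mx.example.com by mailbox.example.com; Mon, 08 Jan 2024 09:12:30 +0000\n"
    "Received: from orion.example.com by mx.example.com; Mon, 08 Jan 2024 09:02:05 +0000\n"
    "Subject: Node core-sw1 is Down\n"
    "\n"
    "body\n"
)

print(hop_delays(SAMPLE))
```

A large gap between two hops means the message sat on a mail server, not in Orion. This assumes the server clocks are reasonably in sync.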
Name of alert:
Email me when a Node goes down
Description of alert:
Evaluation Frequency of alert:
Severity of alert:
No Alert Custom Properties defined
Alert Limitation Category
Also make sure there are no time-of-day settings on the alerts that would prevent them from being sent.
Only if the DB is present on this box, but it is a small install... I think the CPU and RAM can take care of this small setup.
We are currently monitoring a total of 91 nodes and 498 sensors.
I wonder what the polling completion rate looks like. I suspect it falls off through the day, based on fazl-e-azeem's comment that alerts are timely after a reboot and are delayed later.
I agree the CPU and RAM are likely the big culprits. If they could increase it to at least 2 cores and 6 GB of RAM there should be improvement.
A few things off the top:
- Not sure about your changing IP addresses on the node and what that's all about... maybe you're setting them up with an in-band IP and changing to an out-of-band address on the node, or another back door, loopback, etc.
- Your VM reset tells me you are losing resources, or that as things start to run and query you have long run times on your filters, alerts, reports, etc. Alerts specifically, if too in-depth or nested, will cause an issue: the evaluation cannot get through the entire DB before the interval set in 'Check this every X minutes (or Seconds)' is up. Check your server application logs for issues there. You can also check out the link to an alert for long-running queries (I don't remember where I came across it), but you will want to edit the conditions: it's currently set with a check of 15 minutes, so any query running longer than 15 minutes will trigger the email. I am sure you want to edit that.
- Verify your actions and what time they trigger. You can use a simple query; customize or adjust it to show a specific alert or other items.
SELECT TOP 1000 *
FROM AlertLog -- table name assumed from the alertlog table mentioned earlier in this thread
WHERE ActionType = 'EMail'
ORDER BY LogDateTime DESC
Yes indeed... that reminds me to suggest checking polling, via Settings or with something like:
convert(varchar, round(nodes.systemuptime/60/60, 2, 1))+' hrs' as Uptime,
Engines.Elements as Elmts, Engines.Nodes, Engines.Interfaces as Int, Engines.Volumes as Vol,
c.custpolls as UnDP, a.samct as SAM,
N.Down_node, I.Down_Int, V.Down_vol, A2.Down_sam,
s.failed as noSNMP,
Engines.PollingCompletion as "%complt",
nodes.nodeid, nodes.CPUload as "%CPU", nodes.percentmemoryused as "%RAM",
e1.PropertyValue as NPM_Rate, e2.PropertyValue as SAM_Rate
join nodes on engines.ip = nodes.ip_address
left join (select engineproperties.engineid, EngineProperties.PropertyValue from EngineProperties where engineproperties.propertyname = 'Orion.Standard.Polling') e1
on engines.engineid = e1.engineid
left join (select engineproperties.engineid, EngineProperties.PropertyValue from EngineProperties where engineproperties.propertyname = 'APM.Components.Polling') e2
on engines.engineid = e2.engineid
Or possibly with the UDT Job Status report by polling engine.
From your original comments, and from analysis by folks like Jfrazier, it seems apparent the Orion VM does not have sufficient resources allocated to accomplish its tasks.
Assign it more memory and CPU and see if the problem decreases.