Incident Epicenter Report

I think it would be really awesome if Orion had the ability to produce incident reports based on the exact time an incident occurs.

Here is my (likely delusions of grandeur) vision of what this would look like...

There would be an Incident Epicenter Report button on the node-level view. Once I click it, I am presented with a form with the following options...

  • Date/Time of Incident
  • Range (both before and after specific incident date/time) in number of hours
  • Which data-points I am interested in (check boxes in different categories)
    • This would include options from modules, such as APM components, as well as UnDPs
    • This would also include Syslog, Traps, and Events
  • Percent Deviation from Standard for a time range
    • Here you would input the specific time range in hours, days, weeks, months
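The form options above could be modeled as a small data structure. This is only a rough sketch of how I picture it, not anything Orion actually exposes: the class name `IncidentReportRequest`, its field names, and the one-week default baseline are all made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class IncidentReportRequest:
    """Hypothetical model of the form; none of these names come from Orion."""

    incident_time: datetime            # Date/Time of Incident
    range_hours: int                   # hours to show before AND after the incident
    data_points: List[str] = field(default_factory=list)  # checked data-points
    baseline: timedelta = timedelta(weeks=1)  # window for Percent Deviation from Standard

    def window(self) -> Tuple[datetime, datetime]:
        """Return the (start, end) of the reporting window around the incident."""
        delta = timedelta(hours=self.range_hours)
        return self.incident_time - delta, self.incident_time + delta


# Example: an incident at 14:30 with a Range of 2 hours
request = IncidentReportRequest(
    incident_time=datetime(2024, 1, 15, 14, 30),
    range_hours=2,
    data_points=["CPU Load", "Syslog", "Events"],
)
start, end = request.window()
print(start, end)
```

With a Range of 2 hours, the report window would run from two hours before the incident to two hours after it.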

After I fill out this form, a report or a temporary one-time Orion dashboard (with the standard output to PDF option) is generated with all of the data-points I selected (in graph form for performance data), covering the number of hours selected as the Range both before and after the specified incident time.  For each performance data-point, it would indicate the percentage of deviation from standard, based on the time range selected for Percent Deviation from Standard, and whether the value was higher or lower.
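To make the Percent Deviation from Standard idea concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes "standard" means the plain average of a data-point over the selected baseline time range, which is just my guess at a reasonable definition; the function names `percent_deviation` and `describe` are hypothetical and not part of Orion.

```python
from statistics import mean
from typing import Optional, Sequence


def percent_deviation(incident_samples: Sequence[float],
                      baseline_samples: Sequence[float]) -> Optional[float]:
    """Percent difference between the incident-window average and the
    baseline average. Returns None when the baseline average is zero,
    since a percentage is undefined in that case."""
    standard = mean(baseline_samples)   # assumption: "standard" = baseline mean
    if standard == 0:
        return None
    observed = mean(incident_samples)
    return (observed - standard) / standard * 100.0


def describe(pct: Optional[float]) -> str:
    """Render the deviation the way the report might label it."""
    if pct is None:
        return "no usable baseline"
    if pct == 0:
        return "equal to standard"
    direction = "higher" if pct > 0 else "lower"
    return f"{abs(pct):.1f}% {direction} than standard"


# Example: CPU load averaged 90% around the incident vs. 50% over the baseline
print(describe(percent_deviation([90, 90], [40, 60])))
```

A real implementation would probably want something smarter than a flat average (same hour of day, same day of week, and so on), but this shows the higher/lower indication I am describing.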

I hope others can picture what I am attempting to explain.  Any questions, improvements, or comments of any type are welcome!

The use case for this is as follows...

I often have folks (both internal and customers) call me and ask me if something happened on some system at some specified time in the past.  What I end up doing is digging through all of the different data in Orion for that time to see if anything odd jumps out at me.

This feature request would basically take all of the manual work out of it and put all of the data together for me to look at.

I'm also wishing for the "tell me what happened at X time" button.

Your report idea would be one way, but it would need to be able to work at the group level or grab dependencies.

What happened to a single node is already available, just tiresome to find. I'm not aware of any way to find what went wrong with multiple items, unless I guess you go off the event log.

My similar idea was "Time Travel" in Atlas.

My use case is that I may get a particular time to start a hunt, but the hunt would involve 5-50 different servers that make up a single app instance and all need to be checked.

I already have real-time status for these collections of nodes in Atlas, so it would make sense if I could just enter my time frame and Atlas could quickly show me lots and lots of data at once.

Your suggestion for Time Travel in Atlas was what got me thinking about this feature request in the first place.  The difference is I wanted to try and articulate what an actual implementation might look like in addition to specifying the use case.

I agree it would be very cool to have this type of feature across multiple systems, but I can't picture what that would look like or how it would be done.  This feature request satisfies a very specific use case that I often find myself spending many hours on; having it automated would be a huge time saver for me.

