I've thought about doing something like this but have never been in a position to actually execute. Unfortunately, I'm more accustomed to actual outages hitting often enough that there's no opportunity to dig out... that being said, I'm very interested in the responses that come back!
Who to involve: If the escalation policy is outlined well, this might be self-governing... and depending on the scope of the test, maybe escalation is the first milestone where a decision is made to end the test and evaluate the response to the simulated outage (or the second escalation, or the third). If that's all spelled out, then you know exactly who needs to know it's a test and who doesn't... if you're testing levels 1-3, make sure escalation level 4 and above are aware.
Preventing harmful response: Depending on the device(s) involved and what you've got for authorization control, maybe you could just revoke access to execute certain commands that could be harmful? Of course, that could be self-defeating... the tech being tested may see that they can't execute a command, think it's a symptom of the problem, walk over to the physical device (if they have access), and "manually power cycle" it.
In my former life as a NOC manager I would "test" my teams from time to time by injecting a false route into the local routing table on my SolarWinds server. This would cause SolarWinds to believe a node was down (even though it wasn't) and allow me to watch my team's response. I'd stop them before they actually called the location, and gauge how long they took to respond to the alert and handle the issue.
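The post doesn't include the exact commands, so treat this as a sketch: on a Windows-based SolarWinds server, the false route would be a /32 host route pointing the monitored node at a gateway that doesn't answer, so polls fail while the node itself stays up. The helper name and addresses below are my own illustration.

```python
def blackhole_commands(node_ip: str, bogus_gateway: str = "192.0.2.1"):
    """Build the Windows `route` commands to fake an outage for one node.

    Adding a host route (mask 255.255.255.255) toward an unreachable
    gateway makes the monitoring server's ICMP/SNMP polls to `node_ip`
    fail, so the node shows as down even though it is fine.
    192.0.2.1 (TEST-NET-1) is a placeholder gateway -- any address that
    won't answer will do.
    """
    start = f"route add {node_ip} mask 255.255.255.255 {bogus_gateway}"
    stop = f"route delete {node_ip}"
    return start, stop

start, stop = blackhole_commands("10.20.30.40")
print(start)  # run in an admin shell on the poller to start the drill
print(stop)   # run to end the drill and restore normal polling
```

The nice property of this trick is that only the monitoring server's view changes; production traffic to the node is untouched.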
I even had customers who would be onsite for a network upgrade and, instead of informing us, would down the equipment to see how long it took us to call... at that point they would inform us of the outage.
Sneaky customers. That can be annoying, but one does have to look at it from their point of view. They want to make sure they're getting the monitoring they're paying for. It's certainly better than the customer getting a call from the NOC half an hour after one of our engineers showed up for disruptive work because someone forgot a change control. Oops.
Widening the scope of this conversation, what about generating the response procedures in the first place?
As a former integrator we had to include these in projects prior to "handing over the keys", but we usually just started with tabletop exercises where you walk through a mock issue and document it. Those documents become the response procedures, and a few of them can be nominated for surprise dry runs.
If people don't know what to do, then you can only hope they do the right thing.
When I managed our NOC we did "fire drills". When we ran these we didn't really stage a true problem; instead we would send a message through the system indicating a problem, and the techs would reply to an email address in the message with the steps they would take to resolve it. We measured both their response time and the accuracy of their responses, and the results for all techs were posted to try and promote a friendly competition.
While this method is not as accurate as simulating a real problem, it's much easier to do without any potential negative side effects.
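A drill like that is easy to score with a small script. This is only a sketch under my own assumptions (the field names and the ranking rule are mine, not the poster's): record when the fake alert went out and when each tech's emailed answer arrived, then rank by accuracy and response time.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DrillResponse:
    tech: str
    sent_at: datetime       # when the fake alert was injected
    replied_at: datetime    # when the tech's email reply arrived
    correct_steps: int      # reply steps matching the expected runbook
    total_steps: int        # steps in the expected runbook

    @property
    def response_minutes(self) -> float:
        return (self.replied_at - self.sent_at).total_seconds() / 60

    @property
    def accuracy(self) -> float:
        return self.correct_steps / self.total_steps

def leaderboard(responses):
    # Most accurate responders first; ties broken by speed.
    return sorted(responses, key=lambda r: (-r.accuracy, r.response_minutes))

drill = datetime(2024, 5, 1, 9, 0)
results = leaderboard([
    DrillResponse("alice", drill, datetime(2024, 5, 1, 9, 12), 5, 5),
    DrillResponse("bob",   drill, datetime(2024, 5, 1, 9, 7),  3, 5),
])
for r in results:
    print(f"{r.tech}: {r.response_minutes:.0f} min, {r.accuracy:.0%} accurate")
```

Whether accuracy should outrank speed (as it does here) is a judgment call; a fast wrong answer is usually worse than a slow right one.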
During my time as a network engineer for a large engineering company in Austin, we created a pretty intensive lab environment mimicking all the various types of configurations we had out in the wild. The lab included all the same routers and switches and was conveniently cabled to be a part of the actual corporate network, so we had the ability to route specific traffic through the lab when we wanted to test configurations before going fully live in production. The lab included some open source traffic shaping and traffic generation software running on a host of Linux machines, so emulating traffic and internet connection types was possible. We were also able to manipulate latency, packet loss, bandwidth, routing, and so on, while also being able to simulate hardware failures. As you can imagine, it wasn't cheap, but luckily management was fully on board.
Having this definitely made simulating events much simpler and thus allowed us to develop processes and documentation around handling them for the various levels of teams throughout IT.
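The post doesn't name the tools, but on Linux the usual open-source way to get that kind of impairment is `tc` with the `netem` qdisc, so take this as a sketch under that assumption. A small helper can build the command for a given latency/loss/bandwidth profile; you'd run the printed command as root on the lab box.

```python
from typing import Optional

def netem_command(iface: str, delay_ms: int = 0, loss_pct: float = 0.0,
                  rate_kbit: Optional[int] = None) -> str:
    """Build a `tc` command that impairs traffic leaving `iface`.

    netem handles added delay and random loss, and (on modern kernels)
    a rate limit can be tacked on too. Clear the impairment later with:
        tc qdisc del dev <iface> root
    """
    parts = [f"tc qdisc add dev {iface} root netem"]
    if delay_ms:
        parts.append(f"delay {delay_ms}ms")
    if loss_pct:
        parts.append(f"loss {loss_pct}%")
    if rate_kbit:
        parts.append(f"rate {rate_kbit}kbit")
    return " ".join(parts)

# Emulate a lossy 100 ms WAN link on eth1 (run the output as root in the lab):
print(netem_command("eth1", delay_ms=100, loss_pct=1.0))
```

Generating and reviewing the command before running it is deliberate: in a lab that is cabled into the corporate network, you want to be very sure which interface you're about to degrade.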