Level 8

"Checking of directory existence failed. "


Just setting up monitoring on my UNIX servers. In this case, it's an HP-UX 11.23 box as part of an MC/Serviceguard cluster.

Now, here's the question. I wrote a script (ksh) that runs fine manually. All the places where it can be tested from within Orion work fine. It runs fine manually on the box. It also runs fine on other boxes. However, in automated mode, that is when run automatically from the software on this one box, I get this error.

I've done a bit of research to find the cause and all indications direct me to add time to the SSH timeout. Essentially, you modify a config file on the Orion server (Windows) and restart the JobEngine V2 services. What this is supposed to do is increase the timeout value for SSH connections. What the error boils down to is that it is taking too long (more than the default 2 seconds) to log in to run the script on the box.
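One way to check whether login time is really the problem is to time a non-interactive SSH login from a machine that can reach the box. The sketch below is only an illustration: `time_command` and `exceeds_prompt_wait` are hypothetical helper names, and the host in the commented usage line is a placeholder. A plain `ssh host true` does not reproduce exactly what Orion does while it waits for a prompt, so treat the number as a rough indicator.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=30.0):
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return completed.returncode, time.monotonic() - start


def exceeds_prompt_wait(elapsed_seconds, prompt_wait=2.0):
    """True if the measured time is longer than the prompt wait (default 2 s, per the thread)."""
    return elapsed_seconds > prompt_wait


# Example usage (placeholder host name; BatchMode avoids hanging on a password prompt):
# code, elapsed = time_command(["ssh", "-o", "BatchMode=yes", "monitored-host.example.com", "true"])
# print(f"exit={code} elapsed={elapsed:.2f}s too_slow={exceeds_prompt_wait(elapsed)}")
```

Running it several times, including around the moment of the automated poll, may show whether the login is only occasionally slow.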

Now here's where it gets troublesome. As I am not an admin on the Orion box (I'm a UNIX guy, not a Windows admin), I am very limited in what I can do there. My thought was that if I could go into the registry, manually change the value of this key from 2 to, say, 10, and then test the script again, I would have something to take to the admin to convince him of the problem. However, when I went to the registry, this key was nowhere to be found.

I read a suggestion that said to create another user on the box and run the script as that user (creating the necessary credentials as well). In this case, the script must run as root as the command the script relies on is only executable as root due to it being a cluster command.

Suggestions?

The "fix" said to edit the following line in the Solarwinds.APM.Probes.dll.config file:

<appSettings>

    <add key="SSH.Monitor.PromptWait" value="2" />

changing the "2" to a "10" to increase the timeout from 2 to 10 seconds.
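Note that the setting quoted above lives in a .config file, not in the registry. For anyone who would rather script the change than edit by hand, here is a minimal sketch in Python. It assumes the file is ordinary XML with an `appSettings` element containing `add` entries, as in the fragment above. The `configuration` wrapper in the sample and the `set_app_setting` helper are illustrative assumptions, not taken from the actual file.

```python
import xml.etree.ElementTree as ET


def set_app_setting(config_xml, key, value):
    """Return config_xml with the <add key=...> entry under <appSettings> set to value."""
    root = ET.fromstring(config_xml)
    for section in root.iter("appSettings"):
        for entry in section.findall("add"):
            if entry.get("key") == key:
                entry.set("value", str(value))
                return ET.tostring(root, encoding="unicode")
    raise KeyError(key)


# Sample text standing in for the real file; only the <add> line comes from the thread.
sample = """<configuration>
  <appSettings>
    <add key="SSH.Monitor.PromptWait" value="2" />
  </appSettings>
</configuration>"""

updated = set_app_setting(sample, "SSH.Monitor.PromptWait", 10)
print(updated)
```

ElementTree drops comments and the XML declaration when it re-serializes, so back up the original file first. As described in the thread, the JobEngine service still has to be restarted before the new value takes effect.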

6 Replies
Level 8

Update -

After more research and with further systems added, this is happening off and on with about 3-4 servers now, out of about 60 or so that I have configured. What's interesting is that of the 3-4 where I'm having this issue, all are Solaris boxes with the exception of one, this HP MC/SG system. And I'm not having the issue on the other side of the cluster. Just this one in this cluster.

Also, after some thought, is there a possibility that I can't find the key in the registry due to not having the proper permissions? Not being a Windows admin, I'm not sure if it will show you the key or not if you don't have the permissions. But then again, I can get into the registry and edit it. So I'm wondering if maybe I just don't have the proper permissions to see/edit certain keys.

I'm really surprised no one has made any suggestions on this at all! This seems like such a basic question...

Product Manager

Slow SSH session connections that require alterations to the timeout values on the Orion server are typically the result of reverse lookups occurring when clients connect to the host via SSH. The additional time spent waiting for the DNS query to time out causes unnecessary lag in the session connection. It can be rectified either by adding a proper DNS entry for the Orion server to the DNS server the Linux/Unix host is configured to use for reverse lookups, or by disabling reverse lookups on SSH connections entirely, which is what many people opt to do.
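To see whether a reverse lookup is actually slow, it can be timed on the monitored host against the Orion server's IP address. The following is a rough sketch, assuming Python is available on that host. `time_reverse_lookup` is a hypothetical helper and the address in the usage comment is a placeholder. On servers running OpenSSH, reverse lookups are normally controlled by the `UseDNS` keyword in `sshd_config`; other SSH implementations may use a different keyword, so check the man page for the one in use.

```python
import socket
import time


def time_reverse_lookup(ip_address):
    """Return (hostname_or_None, elapsed_seconds) for a reverse lookup of ip_address."""
    start = time.monotonic()
    try:
        # Uses the system resolver order (local files, DNS, ...) as configured on the host.
        hostname = socket.gethostbyaddr(ip_address)[0]
    except OSError:
        # Covers socket.herror / socket.gaierror when no name can be found.
        hostname = None
    return hostname, time.monotonic() - start


# Example usage (placeholder address for the Orion server):
# name, elapsed = time_reverse_lookup("192.0.2.10")
# print(f"{name!r} resolved in {elapsed:.3f}s")
```

A result that comes back in milliseconds suggests the lookup is served locally; one that takes seconds or returns nothing points to a resolver timeout.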

Level 8

I gave this idea some consideration and decided to test it. So I added the appropriate entries in both systems' hosts tables (the Orion server and the system that is failing) to eliminate DNS altogether. It still does the lookup, but goes to the local files first, which is essentially instantaneous. Still no-go. Still the failure.

Also, other applications that run against this box work fine. There are 8 systems using this particular application. This is the only one of the 8 that fails. So that too removes support for the idea that it's a DNS issue. If that were the case, other apps against this box would fail as well, and they don't. And I don't think it can be an issue with the app, since it runs fine on the other 7 systems that use it.

Product Manager

Adding an entry to the hosts file will work for forward zones, but will not work for reverse lookups, at least not without some additional labor. Given you have a fair number of what I'll assume are Unix/Linux/Nagios script monitors running against this machine, the issue could be concurrency. I've seen this issue in some environments where the number of simultaneous SSH sessions a single user can have is capped at a fixed number. Double check there are no settings defined in the SSH server that cap the concurrent or total number of simultaneous SSH sessions.
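As a quick way to check for such caps, the sketch below scans the text of an `sshd_config` file for session-limit keywords. It assumes an OpenSSH-style server, where `MaxSessions` and `MaxStartups` are the relevant keywords; other SSH servers may name them differently. `find_session_limits` is a hypothetical helper, and it ignores `Match` blocks and `Keyword=value` syntax.

```python
def find_session_limits(sshd_config_text, keywords=("MaxSessions", "MaxStartups")):
    """Return {keyword: value} for the session-limit keywords set in the config text."""
    wanted = {k.lower(): k for k in keywords}  # sshd keywords are case-insensitive
    found = {}
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() in wanted:
            # sshd uses the first value it encounters for a keyword, so keep the first.
            found.setdefault(wanted[parts[0].lower()], parts[1].strip())
    return found


# Example usage (the path is the usual OpenSSH location, which may differ on your system):
# with open("/etc/ssh/sshd_config") as handle:
#     print(find_session_limits(handle.read()))
```

An empty result means the keywords are not set explicitly and the server's built-in defaults apply.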

Also, above you reference "I get this error" but never state the error you are receiving. If the steps above don't help, I would recommend placing the application monitor into debug mode as pictured below and then reviewing the debug log output in the Orion log directory "C:\ProgramData\Solarwinds\Logs\APM\ApplicationLogs" after a failure occurs. If you find nothing of value in these logs, I suggest you gather a SolarWinds Diagnostic and open a case with support to troubleshoot the issue further.

[Image: Application Debug.png]

Level 8

Alterego, I appreciate your help, but I apparently didn't explain it well enough.

First, the error that I'm getting is what I named this topic, "Checking of directory existence failed." Again, when I run the script manually, all is fine. It doesn't matter if I'm logged on to the system directly or running the script via the various testing spots within the Orion software. If I run it manually, it works fine. It's only when Orion does its automated poll (every 300 seconds) that it times out. And as I said, this application works on other systems, and other applications against this box work as well.

Second, I also checked the network configs to make sure that DNS lookups weren't being used before posting that it was not a DNS issue. In both cases, forward and reverse lookups, the host that I'm having an issue with is using only the local host table (it's configured to use local first, then DNS). I verified this manually by pinging both the IP address and the hostname before posting that it's not using DNS, because, well, it's not!

The debug logging might be of some help though and I'll try that to see. Otherwise, I had posted all pertinent info needed and didn't understand the reason for your questions.

Hope this helps.

Level 8

I talked to the admin for the Orion server and he agreed to increase the timeout for SSH as described in the knowledge base article I mentioned in the beginning of this thread. Essentially, it says to modify the config file and restart the service.

That being done, it appears to have fixed the issue. The application has now been up for over an hour. This has never happened. So I think it's safe to assume it's fixed.

I'll still maintain that it's not a DNS issue. It is certainly a timing issue, but unless someone can convince me that using your local files involves DNS, I'll hold that it's not DNS timing out and causing this issue. As I mentioned, this system is configured to use local files first, BEFORE using DNS, and it has entries for both systems involved in its host table (itself and the Orion server). So for this, DNS shouldn't even be involved.

But it is what it is. And it appears to be fixed at this point.

Thanks, Alterego, for your help. Glad you decided to look at this for me.

-G.G
