JobEngine35.sdf - Design Flaw?

Why is SolarWinds using a SQL Server Compact Edition file as part of its "enterprise-class" Orion APM job engine? 

C:\Documents and Settings\All Users\Application Data\SolarWinds\JobEngine\Data\JobEngine35.sdf is limited in size to 4 GB by Microsoft (see http://en.wikipedia.org/wiki/SQL_Server_Compact).  SolarWinds lowers the maximum size still further to 257 MB (I assume as part of the db connection string in the code), which makes it impossible to assign a large number of application monitors to a large number of nodes.  Hitting the limit kills the Job Scheduler Service (repeatedly), and results in the following error on the Orion website:

The database file is larger than the configured maximum database size. This setting takes effect on the first concurrent database connection only. [ Required Max Database Size (in MB; 0 if unknown) = 257 ]

Why isn't the full version of SQL Server being utilized for this purpose, since Orion uses it anyway?  This is severely impacting our ability to implement this tool in our large environment.  We have multiple tickets open on various aspects of this issue, and cannot move forward until it is resolved.

0 Kudos
16 Replies
Level 19

First, I'm sorry you're having a problem.  Dev is going to look at your ticket to see if we can help you work through it. 

Now why do we have an embedded database in the first place?  Well, we use it to store job scheduler configuration information.  We do not store APM (or any other) polling results in the CE database. 

The use of embedded databases in products like Orion is very common.  In our case, we chose it because it has a very small footprint, certainly smaller than a full-blown SQL server.  We didn't use the main SQL Server because there are circumstances where we need the data in question to be on the same box as the job engine, and the main SQL Server usually is not.

We have run into very few problems with this strategy.

The size limit we used is the default.  It should be big enough for everything you need to do.  We can increase it, but there's probably something else going on.

0 Kudos

I see.  Can the job scheduling be extended to the pollers (i.e. poller-bound)?  That would at least spread out the load a little.

Thanks for the assurance that your team is looking into our problem.  I'm glad to hear there's a fall-back option (increasing the maximum db size) if nothing else works. 

We're currently encountering the problems while attempting to monitor ~1250 servers.  We're planning to expand monitoring to ~5000 servers, and we're very concerned about how Orion will scale.  If we end up increasing the maximum db size to 4 GB, will we hit that maximum also? 

e.g.  For one critical aspect of monitoring, we want to assign 45 monitors (76 total components):

5k servers x 76 components = 380,000 components.  Will the scheduling information storage for these exceed 4 GB? 
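A quick back-of-the-envelope check of that question, as a sketch only: dividing the 4 GB file limit by the projected component count gives the average space each component's scheduling data could take before the file is full. The function name is made up for illustration, "4 GB" is assumed to mean 4 x 1024^3 bytes, and database overhead is ignored.

```python
def bytes_per_component(limit_bytes: int, servers: int, components_per_server: int) -> float:
    """Average bytes available per component before the file reaches limit_bytes."""
    total_components = servers * components_per_server
    return limit_bytes / total_components


FOUR_GB = 4 * 1024**3   # assumption: 4 GB taken as 4 GiB
total = 5000 * 76       # servers x components, from the post above
budget = bytes_per_component(FOUR_GB, 5000, 76)

print(total)                    # total component count
print(round(budget / 1024, 1))  # average budget per component, in KB
```

If the scheduling data stored per component averages more than roughly 11 KB, 380,000 components would not fit in a 4 GB file.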

We also want to monitor many other aspects of our environment, including applications.  Hopefully, as you say, there is something else going on.

0 Kudos

"I see.  Can the job scheduling be extended to the pollers (i.e. poller-bound)?  That would at least spread out the load a little."

Job scheduler can't be, but the job engine can by buying an APM extra poller.

"We're currently encountering the problems while attempting to monitor ~1250 servers.  We're planning to expand monitoring to ~5000 servers, and we're very concerned about how Orion will scale.  If we end up increasing the maximum db size to 4 GB, will we hit that maximum also?"

This is the real source of the problem.  You're using more than 10,000 component monitors, and that's around the upper limit of what APM can do with a single poller.  Adding another poller should help a ton.

0 Kudos

Couldn't find your ticket.  You could open another one, but it seems pretty clear that you've just reached the limits on a single poller.

0 Kudos

OK.  You've got tickets on other things, but you should open one on this specific issue.

0 Kudos

We have 6 APM additional pollers and 2 additional web servers.  Our case number for this issue is #111026.  Polling Engine Mode is set to 'Poller-Bound'.

0 Kudos

We've looked into your case.  You are filling up the CE database because you are using the same script in many places, and we store a copy of the script each time.  That is a design flaw, but it's one that very few people will hit, because they don't use scripts in the volume that you do.  The workaround is for you to increase the maximum size of the SQL CE database.  We'll need to make some changes to prevent the multiple copies of scripts.

0 Kudos

Update: 

The SolarWinds developers kindly provided a method for increasing the maximum size of the SQL CE file:

  1. Stop all Orion services on the main poller. 
  2. Go to C:\Program Files\Common Files\SolarWinds\JobEngine\SWJobSchedulerSvc.exe.config.  Make a backup copy of this file, just in case.  Then right-click the file and open it in Notepad. 
  3. There is a section called appSettings with a line that says:
     
    <add key="ConnectionString" value="Data Source=|DataDirectory|\JobEngine35.sdf"/>
     
    Replace it with this:
     
    <add key="ConnectionString" value="Data Source=|DataDirectory|\JobEngine35.sdf;Max Database Size=512"/>
  4. Restart Orion services.
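
The edit in step 3 can also be scripted. The following is a minimal sketch, not a SolarWinds tool: the function name is hypothetical, it works on the text of the config file only, and reading, backing up, and writing SWJobSchedulerSvc.exe.config (with the Orion services stopped) are left to the caller.

```python
import re

# Matches the JobEngine35.sdf connection string, with or without an existing
# "Max Database Size" setting.
_CONN = re.compile(
    r'(value="Data Source=\|DataDirectory\|\\JobEngine35\.sdf)'
    r'(;Max Database Size=\d+)?(")'
)


def set_max_database_size(config_text: str, size_mb: int) -> str:
    """Return config_text with Max Database Size set to size_mb (in MB)."""
    new_text, count = _CONN.subn(
        lambda m: f"{m.group(1)};Max Database Size={size_mb}{m.group(3)}",
        config_text,
    )
    if count == 0:
        raise ValueError("JobEngine35.sdf connection string not found")
    return new_text


sample = '<add key="ConnectionString" value="Data Source=|DataDirectory|\\JobEngine35.sdf"/>'
updated = set_max_database_size(sample, 512)
print(updated)
```

Running the function again with a different size replaces the existing setting rather than appending a second one.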

Unfortunately, we quickly maxed out 512 MB, so we increased the maximum database size still more and tried again.  The database error did not reappear, but as the db approached 1.4 GB in size, the Job Scheduler and Orion Module Engine services became very unstable.  When we tried assigning more monitors, only a few could be assigned at a time before one or the other or both of the services would stop.

In a later conversation with the SW developers, we were told that the service instability was likely due to the fact that the CE db is entirely copied into memory.  When the db grew so large, it strained the server's memory resources and that caused the services to fail.  The SW developers indicated that these issues would be addressed by a future hotfix or service pack that would eliminate the duplication of scripts stored in the CE db. 

In the interim, we will be working with them to create a copy of each script on all of the pollers that can be referenced in the monitor components.  Since only the reference will be stored multiple times (not each script), the size issues can be ameliorated - a pretty clever workaround.  We'll see how well it works out.
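
To show why storing a reference instead of the script helps, here is a rough sketch of the arithmetic. All three numbers are hypothetical placeholders, not measurements from our environment, and the function name is made up for illustration.

```python
def total_script_storage(assignments: int, bytes_per_copy: int) -> int:
    """Bytes used when one copy of the script text is stored per assignment."""
    return assignments * bytes_per_copy


ASSIGNMENTS = 10_000         # hypothetical number of component assignments
LONG_SCRIPT_BYTES = 20_000   # hypothetical size of the full script
SHORT_SCRIPT_BYTES = 400     # hypothetical size of a short reference script

full = total_script_storage(ASSIGNMENTS, LONG_SCRIPT_BYTES)
wrapped = total_script_storage(ASSIGNMENTS, SHORT_SCRIPT_BYTES)

print(full, wrapped, full // wrapped)
```

The saving scales with the ratio between the two script sizes, so the longer the original script, the more the reference approach helps.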

0 Kudos

The "short" reference script that SolarWinds provided is:

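' "Short" wrapper script: forwards its arguments to the "long" script on the
' poller, then relays that script's output and exit code.
Set shl = CreateObject("WScript.Shell")
c = "cscript /nologo c:\wineventlog.vbs"
Set a = WScript.Arguments
' Append each argument passed to this wrapper to the command line
For n = 0 To a.Count - 1
    c = c & " " & a(n)
Next
WScript.Echo "Exec " & c
Set e = shl.Exec(c)
Set o = e.StdOut
' Relay the long script's output line by line
While Not o.AtEndOfStream
    WScript.Echo o.ReadLine
Wend
' Exit with the long script's exit code
WScript.Quit e.ExitCode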

Where "c:\wineventlog.vbs" is the path to the "long" script on the poller (must be the same for all pollers).

Copying the "long" script to the pollers and referencing it with the "short" script helped with the CE db size, but we still expect Orion to exceed the limit when we import all the remaining servers to be monitored and add all the other layers of monitoring that we want. 

To help us bridge the gap until a new version of Orion or a hotfix is issued that fixes the underlying problem, SolarWinds worked with us (Kate and Sham are great to work with, btw) to re-write our "long" script that monitors Windows event logs.  The re-write enables the script to run once with a long parameter list of events to search for in the event log, rather than having to run (and store) the script multiple times - once for each event to search for in the event log. 
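
The idea behind the re-write, sketched in Python rather than the actual VBScript: take the whole list of event IDs as one parameter and scan the log once. The function names, the comma-separated parameter format, and the event record layout are all assumptions made for illustration, and the sample events are invented.

```python
from typing import Dict, List, Set


def parse_event_ids(arg: str) -> Set[int]:
    """Parse a comma-separated parameter such as '1001,7036' into a set of IDs."""
    return {int(part) for part in arg.split(",") if part.strip()}


def find_matches(events: List[Dict], wanted_ids: Set[int]) -> List[Dict]:
    """Single pass over the events, keeping those whose ID is in wanted_ids."""
    return [event for event in events if event["id"] in wanted_ids]


# Invented sample data standing in for event log entries.
sample_events = [
    {"id": 1001, "source": "AppA"},
    {"id": 4000, "source": "AppB"},
    {"id": 7036, "source": "Service Control Manager"},
]

matches = find_matches(sample_events, parse_event_ids("1001, 7036"))
print([event["id"] for event in matches])
```

One stored script with a longer parameter list replaces one stored copy per event, which is where the space saving comes from.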

It's a big improvement.  We were finally able to assign the monitor to all the Windows nodes currently in Orion.  The 4 GB CE db limit still looms ahead, but we now have a little breathing room.  Thanks SolarWinds!

0 Kudos

Has this problem been fixed in the NPM 9.5.1/APM 3.1 releases?

0 Kudos

Hi dayley--

I've marked this for the product manager to see and review.

M

0 Kudos

We've done a number of things that should help your particular case.  We made some changes in the job engine in NPM 9.5.1 that should make it scale higher.  More importantly, APM 3.1 introduced a first-class event log monitor so you shouldn't need to use the script that you were using.  It was the heavy use of the script that was causing your original problem.

Finally, later this year we expect to make some further changes to the job scheduler that will allow APM performance to scale more linearly as the number of pollers is increased.

0 Kudos

I thought the new Windows event log monitor was RPC-based, not WMI-based:

 

 

We have had some serious CPU usage problems on some of our older servers when we use WMI-based Orion monitors.

0 Kudos

No, it uses WMI to get the logs.

0 Kudos

Please confirm that as of NPM 9.5.1 / APM 3.1 Orion is no longer storing multiple copies of the scripts in the scheduling database.

What is the new maximum size of the JobEngine35.sdf file?

We were no longer pulling event logs with a script anyway.  We have some other scripts that are affected by this though.  We need to run them because the available Orion/Thwack templates are inadequate for our needs. 

For example:  We need to monitor hard drive free space according to eight different thresholds:

System drive MB free space remaining - Warning

System drive MB free space remaining - Critical

System drive % free space remaining  - Warning

System drive % free space remaining  - Critical

Non-system drive MB free space remaining - Warning

Non-system drive MB free space remaining - Critical

Non-system drive % free space remaining  - Warning

Non-system drive % free space remaining  - Critical
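
As a rough illustration of the kind of logic involved, the eight thresholds could be evaluated along these lines. This is a sketch under loud assumptions: every threshold value below is a made-up placeholder, the names are hypothetical, and it does not model the dependence on drive size, the inventory lookup, or the ticketing.

```python
# (drive type, unit) -> (warning, critical); all values are hypothetical.
THRESHOLDS = {
    ("system", "mb"): (2048, 1024),
    ("system", "pct"): (10.0, 5.0),
    ("non-system", "mb"): (5120, 2048),
    ("non-system", "pct"): (15.0, 10.0),
}


def classify(drive_type: str, free_mb: float, total_mb: float) -> str:
    """Return 'critical', 'warning' or 'ok' for one drive."""
    free_pct = 100.0 * free_mb / total_mb
    result = "ok"
    for unit, value in (("mb", free_mb), ("pct", free_pct)):
        warning, critical = THRESHOLDS[(drive_type, unit)]
        if value <= critical:
            return "critical"   # any critical breach wins immediately
        if value <= warning:
            result = "warning"
    return result


print(classify("system", 1500, 10000))
```

Even this simplified version needs per-drive-type logic and both MB and % checks in a single evaluation.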

There is a relationship between the threshold used and the size of the drive being monitored that needs to be followed.  At the same time, we also need to look up the monitored server in a separate inventory database to get its support level, current maintenance status, owner contact information, etc., and use it to augment / suppress Orion alerts as necessary. 

Furthermore, if there is a problem, then a case needs to be auto-generated in our trouble ticketing system.  The recipient of the case depends on the type of alert (e.g. system/non-system drive), the type and level of thresholding used, and time zones and SLA windows. 

We have similar requirements for other types of monitors, such as heartbeats.  All of this can be accomplished through scripting, but not with the built-in monitoring templates alone.

0 Kudos

Answer to my previous question:

"Also, I meant to answer your question about the job engine needing the smaller scripts - for now, this is still the case, but it may change in the next implementation of it.  We'll know more once the beta & QA testing  progress a little further.

 

Regards,

Kate

SolarWinds Support Team"

 

0 Kudos