What's wrong now?
Well, I think I ‘borked’ this whole upgrade and I really don't want to start over, because I'm going to lose everything I've set up so far, so...
Did you try some shananana, dadadadada. No, seriously, did you get the log files?
Yeah, I did. I started to look at them, but there's a few places it could be. So actually, I may break down and open a support ticket at this point.
That wouldn't be a bad idea. There's a lot of things that you could do. Okay, I've heard of having support on speed dial, but that is completely ridiculous.
Yeah, tell me about it. I was on my way to the cafeteria to get some breakfast tacos before they were out. I must have taken a wrong turn. How can I help you Leon?
Well, I was just jumping into this because...
Hold that, one second. There is a whole episode of content here easily. Let's go ahead and kick this off. Welcome to SolarWinds Lab. I'm Patrick Hubbard.
I'm Leon Adato.
And I'm Jason Ferree.
You know what?
I think it's probably going to be easier for you guys to actually do a demo and follow it with a two shot, instead of a three shot, so I'll just wander off here for a minute and then come back at the end. And we're going to show you something that's super new and super-secret in the Orion UI that I think you're really going to want to see.
No, I think you're going to go have breakfast tacos and let us do all the work. So what's going on?
Okay, well, before we actually dig into this, I want to thank you for coming on the show. You know, a lot of the ideas we get for Lab come from viewer comments, but the other thing that we really draw on are the kinds of tickets, and things that you guys hear about, that feeds what we want to show on upcoming episodes. But even so, we can always spend more time on troubleshooting, and there really isn't anybody at the company who knows more about it than you folks.
Well, thanks. The way I look at it is anything I can answer for you here is one less ticket that the customers will open later.
Right, and it's a good point. Now, for everyone watching, you can use live chat that you see over here to ask questions now. In fact, we're probably going to have more than the usual number of support guys watching the show just because we're doing the shout out. But meanwhile, if you don't see a chat window over there, that's because you aren't watching this episode live. And to do that, you want to head over to lab.solarwinds.com, and sign up for reminders so that you can join us for the next show. And you can also leave us some comments, ideas for topics that you think we should cover in future episodes, or just to say hi. But, back to the task at hand. One of the things that confounds SolarWinds admins, even folks who've been using this for a while, is where to go to look for detailed information when things go wrong, or when things just aren't working quite right.
I think what you're looking for is a tour of the log files.
Right, log files. Hang on. I have just the thing. All ready for the log files. [Chainsaw revving]
Okay, log files. One of the most common things that a support rep's going to do is ask you for a diagnostic. Why? Because it dumps a ton of log files into one nice little package that you can then send up to us. And then, based on what behavior you're describing, we'll look at whatever log file we feel we need to.
So what problem are we supposedly troubleshooting here?
So in this particular case, we had a customer going, "My website's slow." So, kind of vague. Could be any number of things. So, at this point, I'm going to start looking around for clues, which hopefully will direct me to some problems, or at least some things to correct. Because many times, it might be, okay, well, I see a problem here. This could account for the slowness, let's fix that and see if we need to keep digging. So, first thing I'm going to look at--the Windows event logs. So, even in the event viewer, I could do this. If I'm live on a system, it's also in the diagnostic package. You can see the most common ones, plus SolarWinds. That's our own, but let me start at the top. We'll look at application first. And so I'm not really concerned about the informational messages. I'm more concerned about the error messages.
And so I see here, where the NetFlowService threw an error. Let me expand this a little bit, and it tells me that it can't connect, or can't open the database. Okay, interesting, requested by the login, login failed. This was the NetFlowService, happened about 2:55 in the afternoon. Down here, another error message a minute earlier. Looks like about the same thing. Again, it's trying to open a connection to the database and can't. So, that's something now that's in the back of my mind. We're trying to connect to the database and we're not able to, is what it looks like. Nothing in particular there. Again, NetFlow, NetFlow, NetFlow. So, I'm seeing a handful of cases where, in this particular case, NetFlow is trying to connect to the database and having problems. So now let's actually look at the SolarWinds logs, see if we get, again, some more information, more details. Error message. There's the flag, and this is again--now I'm back to 2:55 again. Service was unable to open a new database when requested, so again, one of our services is trying to connect to the database and can't. So, that could tie into--okay, if the website, through IIS, is trying to build the webpage, and it's opening a connection to pull all that data, it's populating all the resources and it can't do it. Well, it's going to try again; it's going to keep trying. But again, the perception is going to be, "It's slow, what's going on here?"
So that's where we're at.
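The scan described above (skip the informational rows, keep the errors, note which service logged them and when) can be sketched in a few lines of Python. This is only an illustration: the pipe-separated layout and the sample rows are invented for the example and are not the real format of the Windows event log or the SolarWinds diagnostics.

```python
from collections import defaultdict

# Invented sample rows in an assumed "time | level | source | message" layout.
SAMPLE_LINES = [
    "14:54 | Error | NetFlowService | Cannot open database requested by the login. The login failed.",
    "14:55 | Information | Orion | Service started.",
    "14:55 | Error | NetFlowService | Cannot open database requested by the login. The login failed.",
]


def errors_by_source(lines):
    """Group Error-level rows by source, keeping the time of each one."""
    grouped = defaultdict(list)
    for line in lines:
        parts = [part.strip() for part in line.split("|", 3)]
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed layout
        timestamp, level, source, _message = parts
        if level.lower() == "error":
            grouped[source].append(timestamp)
    return dict(grouped)


if __name__ == "__main__":
    for source, stamps in errors_by_source(SAMPLE_LINES).items():
        print(source, len(stamps), "errors between", min(stamps), "and", max(stamps))
```

Grouping by source makes the pattern from the demo stand out: one service, the same error, clustered around the same time.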
And this is a representation of the directory structure in program files, or program data, or you know, any of those. So it's pulling all that together into that diagnostic zip file.
That's right, right.
That's right, and that's what makes it so handy, because then everything's right here, so sometimes it might be frustrating for a customer who's like, "Ah, I've got to send you diagnostics, but it's such a nice little package." It's almost a bundle, so...
And we're going to talk about it later but, you know, one of the first things that I always do when I see that there's a problem happening is, even before I start making the ticket, I start running the diagnostic file. It just has become second nature for me, so. Because I know that that's what you guys really work on, but we'll talk more about that in a little bit.
Then the advantage is even if we fix the issue, we have the package, we can send it to development and go look, hey, we did have this issue. Yeah, we fixed it with whatever. Can you prevent it from happening next time? So having those diagnostics when you see them is great. So, now I'm going into the Orion diagnostics and we were looking at that BusinessLayer log. So, let's go into that. And now again, I'm going to start looking for error messages and timestamps, but already, I'm at the top. 2:55, I've got an error message. This is where it showed that it put it in the event log, and again, unable to connect to the database. So, now we've got specific services trying to open its connection using this connection string.
So at this point, I mean, I know that Tom isn't on this show but is it time for me to blame the database? Because I love doing that.
Well, and so do we, to be honest. Because, you know, we like to, but...
Because we're network guys and it's not the network.
It's not the network and, you know, the problem is, we've learned, we go to the DBA and he's going to be, nah, SQL server's just fine, it's something else. So while there may be something going on in the database, maybe there are other databases that are running on it and something's overloading it from that end, who knows. But you know what? I still have some breadcrumbs, and some ideas and places to go look to find out, well maybe it is on our end, or maybe it's something's configured wrong, and I'm causing my own issues. So, there's another log that ties in more specifically with the database, and that's the nightly maintenance routine. So every night, by default, about 2:00, 2:15 or so, a database maintenance rolls up some of the old data. So this is a good indication, if I look at this log file, what is actually going on with the data. How long is it taking it to do all that maintenance? Is it even finishing? Because in some cases, it could still be going, and if it's still going, well, that's taking up cycles where it could be pulling my data from my website. So, I'm going to skip the top log because it's 3 p.m. in the afternoon, and it's very tiny. You can see it at only 11KB. I know there's nothing going to be in there. And if you'll look at the timestamp here--2:55--that's the latest modified date for this log file, which kind of tied into the other issues we saw. So let's look in here. All right, at the top it shows at 2 o'clock in the morning is about when it kicked off, and it's got some of the retention settings you can see here. But I'm going to scroll. I typically go to the very bottom because I want to see how long it took, and I can also tell that this is a really large log file. And for database maintenance, it shouldn't be this long, and you can see why. One, at the very bottom, I don't see an indication that the maintenance finished. The last line in this log file should be, "we're done."
Means it's complete, right?
Yeah, and it's not, so it's still going on 12 hours, 13 hours later, what's going on? Something is really crushing this thing. But again, as we can see, some of the same indications, these different events or maintenance routines are trying to run, and they can't run. So, again, some additional evidence that there might be something wrong.
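The two checks made on the maintenance log (does the last line say it finished, and how long has it been running) could be automated roughly like this. The timestamp layout and the closing marker text are assumptions made for the sketch; the real log's wording may differ.

```python
from datetime import datetime

# Assumed closing text; the real log may word its last line differently.
DONE_MARKER = "maintenance complete"
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed: each line starts with this stamp


def maintenance_status(lines, now):
    """Return (finished, hours) for a log whose lines start with a timestamp.

    If the log never reports completion, hours is measured up to `now`.
    """
    rows = [line for line in lines if line.strip()]
    if not rows:
        return False, 0.0
    started = datetime.strptime(rows[0][:19], STAMP_FORMAT)
    finished = DONE_MARKER in rows[-1].lower()
    end = datetime.strptime(rows[-1][:19], STAMP_FORMAT) if finished else now
    return finished, (end - started).total_seconds() / 3600


# Invented example: started at 2:00 a.m., no closing line, checked at 2:55 p.m.
STUCK_LOG = [
    "2015-01-01 02:00:00 Starting database maintenance",
    "2015-01-01 02:05:00 Rolling up old data",
]

if __name__ == "__main__":
    print(maintenance_status(STUCK_LOG, datetime(2015, 1, 1, 14, 55, 0)))
```

A run that is unfinished and many hours old is the same red flag raised in the walkthrough.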
And also, you can imagine that, at this point, you know— 2 o'clock in the morning, the database isn't particularly busy. But at 2 o'clock in the afternoon, that thing is chugging along, depending on the size of this customer's environment. So there's a lot of conflict— conflicts that are happening here.
That's right. Everybody's trying to get a piece of the pie, and there's just not enough to go around. So let's take a closer look at the SQL environment, the database. Which is also in these log files. Database info. SQLServerInfo might give me some information as to, you know, say, how much physical memory it has. Looks like it's got plenty. This actually tells me the name, the server name. And I'll show why this is important here in a minute. But, really, what I want to see is if SQL's installed on the same machine as the Orion Platform.
Okay, so if their polling engine and the database are on the same box, because that would definitely mean that things are getting too busy.
That's right. Again, sharing hardware. So ideally, and in most cases, the SQL box is on its own server. So the last thing that I want to look at when looking specifically at the database is the actual tables themselves. And so I've got this table row count, and in this case, I'm going to sort it by size. I want to look at the large ones. I'm not really concerned with the small ones, and this is a big one.
Wow, that's a little bit large.
That's a little large. We've got over 10 gig of traps. So, huh, okay, now some things might be pieced together as to what might be causing some of these database issues.
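Sorting the table list by size, largest first, is simple to reproduce outside the diagnostics viewer. In this sketch the table names, row counts, and sizes are invented placeholders (only "over 10 gig of traps" comes from the discussion), and the tuple layout is an assumption.

```python
# Invented rows shaped like (table name, row count, size in MB).
SAMPLE_TABLES = [
    ("Nodes", 1_200, 4),
    ("Traps", 25_000_000, 10_400),
    ("Interfaces", 9_800, 35),
    ("SysLog", 6_000_000, 3_200),
]


def largest_tables(rows, top=3):
    """Return the `top` biggest tables, sorted by size in descending order."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]


if __name__ == "__main__":
    for name, row_count, size_mb in largest_tables(SAMPLE_TABLES):
        print(f"{name:<12} {row_count:>12,} rows {size_mb / 1024:>6.1f} GB")
```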
Why is my database so big?
So I want to take a minute, just because this is one of my favorite rants this year is, trap and Syslog. You can see here that the trap and Syslog volume is really large, and we do see this for a lot of customers, because we know that SolarWinds NPM will receive trap and Syslog. But for any sizable environment, you don't want to do that. What you want to do is create what I call a trap and Syslog filtration layer. Now we've talked about this in past episodes. And I've got a nice little chart that we'll put up on the screen here. But what we're talking about is putting a number—not just one, but a number of Syslog and— sorry, a number of servers running Syslog and trap receiver behind a load balancer. So one IP address that will be the IP address you send all your trap and Syslog to, and then that set of boxes running any sort of inexpensive software that does trap and Syslog. Like, you could do an open source one, or Kiwi Syslog, which will do it really handily. Not expensive, and you can do multiple, so you can round robin load balancing. And then you filter out all the garbage, because I'll say 90% of the trap and Syslog you get are meaningless things that you don't need. So you filter all that out, and you only forward the messages that you actually want to act on into the NPM layer, so that's one way to avoid this. Sorry, I just had to talk about that for a minute.
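The core of that filtration layer is a keep-list: forward the few messages worth acting on and drop the rest before they reach the monitoring system. The sketch below shows only that decision logic in Python. The patterns and sample messages are hypothetical, and a real deployment would do this in the Syslog or trap receiver's own rule engine rather than in a script.

```python
import re

# Hypothetical keep-list: only messages matching one of these get forwarded.
KEEP_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"link\s*down", r"power supply", r"authentication failure")
]


def should_forward(message):
    """True if the message matches at least one keep pattern."""
    return any(pattern.search(message) for pattern in KEEP_PATTERNS)


def split_messages(messages):
    """Return (messages to forward, number of messages dropped)."""
    forwarded = [message for message in messages if should_forward(message)]
    return forwarded, len(messages) - len(forwarded)


# Invented sample traffic.
SAMPLE_MESSAGES = [
    "linkDown on GigabitEthernet0/1",
    "Configured from console by admin",
    "Power supply 2 failed",
    "Interface counters cleared",
]

if __name__ == "__main__":
    forwarded, dropped = split_messages(SAMPLE_MESSAGES)
    print("forwarded:", forwarded)
    print("dropped:", dropped)
```

An allow-list is used here instead of a block-list because the point made above is that most of the traffic is noise, so it is easier to name what you want than what you do not.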
No, and I truly appreciate that because that's one of the same type of arguments— or, not arguments, but points that we are going to drive home to the customer. We'll go look, eh, do you really need this?
And it's easy to do, it's not hard to do, and it's cheap, it's not going to break the bank. So I would recommend that everybody, in a sizable environment using trap and Syslog, do that. Okay, so we see that trap and Syslog are a culprit. The tables are huge.
Tables are huge, so, one of the things that I want to revisit is going back to that database maintenance routine, to see exactly what the retention settings are set to. Because it shows me right here at the top the retention settings, so right here, I'm retaining Syslog messages for 60 days and trap messages for 60 days.
So 10 gig of messages for 60 days' worth of...
For 60 days, which is a maybe. Now, understand, we're having problems even running the database maintenance routine, so it may not have been filtering all the old stuff out. It's just sitting there collecting, collecting, collecting.
Right, good point.
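As a rough feel for what the retention window does to table size, here is a back-of-the-envelope calculation in Python. It assumes the table grows linearly with the number of days kept and that the 10 GB really covers 60 days, which, as just noted, may not be true if maintenance has not been purging old rows.

```python
def projected_size_gb(current_gb, current_days, target_days):
    """Scale table size linearly with the retention window (a rough assumption)."""
    if current_days <= 0:
        raise ValueError("current_days must be positive")
    return current_gb * target_days / current_days


if __name__ == "__main__":
    # 10 GB at 60 days, projected down to a 7-day window.
    print(round(projected_size_gb(10, 60, 7), 2), "GB")
```

Under those assumptions, a seven-day window would hold a little over 1 GB instead of 10 GB.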
So, yeah, this raises a flag for me, because by default, it's seven days. So then, I'm going to ask, one of my notes to the customer is, "Is there a reason why you're actually retaining these for 60 days? Because that's well beyond the default." Same with Syslog. So, now the last one thing I want to do with this particular case, just to, again, give the customer some more information that they can go look at, is let me go actually take a look at those traps. So we get a small snippet of both the trap and Syslog tables, not the entire 10 gig, because wow, right. And here I'm looking at, I've got, it looks like a few different IP addresses are sending, not a lot, but the big thing to me is how many am I getting in a small period of time. And that's why I'm going to look at the timestamps and go to a breaking point, like right here. There's a point between 6:09 and 6:11, and then I'm just going to keep scrolling, looking at all the 6:09 timestamps, and I'll keep scrolling and keep scrolling and...
Public messages, and...
Yeah, I'm starting to get even a bigger idea of what the issue might be. Here we are, finally. So I've got almost 200 trap messages that came in within one minute. That's not a lot, but it still questions the why. So sure, maybe we can handle it. Maybe the hardware's not set up to handle it, that's not necessarily the question at the moment. For me it's, what are you doing? Okay, so to answer that question, you might have some rules set up; let's look at the Trap rules. That's the nice thing about this is, ah, look, hey--
You have none.
I have a Trap rule with a default name, New Trap Rule, hmm. That's probably--I was out there playing; I got this new software. And hey, let's create a Trap rule--and hoo, there's a Trap rule.
And it's all sources, all everything, so you're accepting all Traps in. So again, it speaks to the idea of a filtration layer, which, by the way, also helps in terms of retention: if you want to keep 60 days' worth of messages, that's what that filtration layer is for--because it can shove it off to another database or whatever you want to do, but you're not clogging up your alerting and your sort of near-real-time monitoring system with that--that goes into your forensic database.
That's right. Right, many customers, they have to retain for even longer. You know, a year, they just have to, that's their company policy. Great. Don't use Orion to do that. Filter it off and store it somewhere else. Couple other things I want to look at real quick. The actions with these traps. So, I know I'm doing something with every trap that comes in. What is it actually doing? And the action here is it's forwarding the traps. So really all I'm doing is, grab the trap, forward the trap, and store it to the database. Grab the trap, forward the trap, store it to the database. I'm not actually doing anything in Orion with it other than storing it. Now I could put an action in here to delete it, so I don't need it to go to the database. That's one of the nice options with Orion, is that you set up the trap to do all these alerts, but it doesn't actually have to store it.
So, delete it, get it out of there. So with this particular customer, one of the things we need to really do is clean up that traps table.
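Going back to the step of scrolling through the 6:09 timestamps to see how many traps arrived in one minute: that count is easy to get without scrolling. This is a small Python sketch that assumes each row carries an "HH:MM:SS" time string; the sample values are invented.

```python
from collections import Counter


def traps_per_minute(timestamps):
    """Count rows per minute from 'HH:MM:SS' strings (keyed by 'HH:MM')."""
    return Counter(stamp[:5] for stamp in timestamps)


def busiest_minute(timestamps):
    """Return (minute, count) for the busiest minute, or None if empty."""
    counts = traps_per_minute(timestamps)
    return counts.most_common(1)[0] if counts else None


# Invented sample: four traps at 06:09, one at 06:11.
SAMPLE_TIMES = ["06:09:01", "06:09:01", "06:09:30", "06:09:59", "06:11:02"]

if __name__ == "__main__":
    print(busiest_minute(SAMPLE_TIMES))
```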
Now that we've seen this, the next thing that I would do as the user is say, oh, well let me check my Syslog rules. And let me see--are we doing the same store and forward, or what are we doing here? But that gives me a good idea of where we can start to dig into this and really, I can be a little proactive on my own maintenance.
Right, so as the support rep, I've got a handful of these red flags that I found, and I'm going to recommend, "let's clean up those trap tables," get that data out somehow, we need to clean it up. Do some housekeeping, like you mentioned, and then see where we're at.
Wow, seeing all the information that we stuff into those log files makes me wonder if there's anything we can't diagnose from those.
Well, it's good for a lot, that's for sure, but it's certainly not the only trick up our sleeve.
Okay, well, but I feel like showing it here may help people actually not have to open a ticket if they don't need to.
Well, I'm okay with that. I don't get paid by the ticket and there are plenty of calls to go around.
Well I'll bet. You guys get calls on everything about, everything. So actually, you know what? That makes me wonder, what's one of the call types that you get all the time that you could answer here, and help save people the trouble?
Oh, I've got a biggie, and that's issues during upgrades and installations.
What are some of the things that users can do for installation and upgrade that will make their life easier just right off the bat?
Well, there's a handful of tips. A lot of it is planning, and actually reading the documentation. Our info dev team actually does a really good job putting together admin guides that are informative. I use them all the time. I've got a quick link, because I'm going to go grab that page and send that to the customer: here's your answer, right? I just know where those answers are real quick that I can pass on, but most of them are from the administrator guide. So there's a handful of things that you want to do right off the bat. First, say you're a brand-new install. You want to spec out the system properly. Again, back to the admin guide, it tells you what the recommended hardware is. You want to know what modules you're installing. Maybe it's NPM and SAM and NetFlow and NCM, and all this is adding up, so you need more processor, more CPU...
More RAM, more power.
More power, that's right.
You want to properly spec the system. It also depends on how many elements you're going to poll, how many nodes, how many interfaces. You know, you might have 10 nodes, but you're monitoring these giant switches. That's a lot of data that you're going to have to store, and all the polling that you're going to have to do. So, CPU, memory, you want to properly spec out the system. If you need help with that, we'll be happy to help you with that. And more often than not, we're going to give you something to read and then offer some of our advice, but we don't have a crystal ball. We're not going to know exactly what you need in the future because you might grow, you might acquire a company and we've had to merge many times before. You might decide to add a couple of our other modules, as many customers do. And now your hardware is ‘under-specced,’ so if you go with the minimum requirements, you're going to get a minimum performance.
Right, so I just want to reference back that THWACKcamp just recently had a whole session on scalability, so that's another place that people can go look for ideas or tips or tricks or what the pressure points are for a real scalable environment. But I think there's a few other things, I remember as we've been installing things, that you should do on the system, regardless of the hardware. So, can you run some of those down?
That's right. So, the first thing I do when I log in is I want to log in as a local admin.
Not Bob, who's an admin on the box?
I know Bob is a super admin and he's got all the power. He probably doesn't have as much as he thinks he does. More often than not, they'll tell me every time, I logged in as admin. Well admin, or local admin. It actually does make a difference.
Right, and also not domain admin.
Not domain admin either. Because GPOs can change all the time, and that's going to break something down the road. Maybe you move off to bigger and better things and somebody's coming in behind you, and now their permissions are different than yours, and there's a conflict. So, you should have local admin to the box. Hopefully you're a server admin, just do it the right way right off the bat.
Right, and I'm going to emphasize that because it's bitten me a few times, especially in larger environments where the audit or security teams are really stringent, is that it is a requirement. You must log in as local administrator, not a local administrator equivalent, not whatever. And you might get some pushback from your security team, but if you don't, it will--not might, not could--it will cause problems down the road. Six months, eight months, that one upgrade, and all of a sudden now you're scrambling to try to make it work, so just be aware of that.
Yeah, just do it right from the beginning, and to be honest, many times we've solved an issue just by logging back in as local admin, run a repair. Wow, everything's working. Okay, so, almost can't beat that one enough, but I will go on to some of the next things. The next thing you want to do before running the executable, turn off your anti-virus. Anti-virus can get in the way; it can lock some folders down. Go ahead and just turn off anti-virus, it's going to save you some headaches. Next thing we want to do is talk about, you know you can virtualize these environments, there's no problem putting it on a virtual machine. Now again, we talked about earlier that SQL typically is on its own machine, so we don't want to virtualize SQL.
We could, but okay, you have to be really, really good. You have to be on your game to be able to do this. We're not going to say you can't, but most people, especially most SolarWinds admins, who are usually not part of the server team, certainly not part of the DBA team, it's not something that you want to just willy-nilly go into. You have to know your stuff to be able to virtualize a SQL box. So you shouldn't unless you're really there.
Yeah, we've seen some rock star systems that are virtualized but more often than not, it's what's causing maybe some of the performance issues.
Okay, so back to permissions, it's not just about— It's not just about being logged in the right way, the impact of that is that the permissions on the folders, when you create new folders and things like that, that is why you have to be logged in as local admin, right?
That's right. So, many of the services are running under a local system account, and so they're communicating with all kinds of folders and things in the back end, even like IIS, to build your webpage, and if those specific folders that they're trying to access, if they don't have permission to do so, then they fail.
I think the other thing that tends to trip me up is, especially if I have additional pollers and things like that is that I end up grabbing, surprisingly, the wrong file. I've done an RC, and now I'm grabbing the main one or whatever. What can I do to make sure that I have the right install? You know, I've got 11, I've got 10.7, and I want to go to 11.5, how do I get there? Things like that.
Well, we've got this cool new tool that we've put out called the Upgrade Advisor, and what that does is I can put in what products and versions I've got now, what products I want to go to, even my operating system and SQL environment, and see if everything's compatible. And it will also give me a step-by-step upgrade path, and there's—if I log into the customer portal before I use the advisor, it will give me the links.
Okay, so let's take a look at that.
So I've logged in to my customer portal, and I can go to support. And here we have Product Upgrade Advisor. Let me go ahead and put in my operating system, let's say I've just got a Windows 2008 R2 box, let me go with— Just keeping it simple. And of course, I'm upgrading Network Performance Monitor. Now let's pretend that I am new to this company and the last guy didn't bother to do any upgrades.
Which is just a horrible thing to do.
But it's— I've seen it more often than I care to admit.
Yes, so, let's say that they're still back on 10.6 and let's just add a bunch of products here. Network Configuration Manager, Let's say they're—and again, I'm going way down the list, right? But these products are actually compatible with each other way back. Server and Application Monitor, let's say I've got that product and, woof.
You're killing me here.
I know. NetFlow Traffic Analyzer. And let's really go old. All right, so this is my current platform. I want to upgrade to the latest. Now, I don't have to go in the same order, or list them in the same order that I did in the top box. It's any order. It doesn't matter. It's going to tell me what order to upgrade them. And, magic button, tell me what to do.
All right, bam, here's my upgrade path with little notes of different things to highlight because NetFlow Traffic Analyzer did go through a pretty significant change in this path that we've just created, where we've got a new flow storage database and so on, so it's telling me a little bit about that. "Hey, there's some additional prep work you might need to do before getting started," but, as you can see it tells me, okay, the first thing that you need to install is this version of NetFlow Traffic Analyzer. Now you'll notice this isn't the latest. Because of the way the database and schemas and how everything mixes up with each other, I can't just go, boom, I'm going to make the big jump.
Right, and that keeps people from getting themselves into, you know, painted into a corner so to speak, where they— the first thing I would always upgrade is NPM to 11.5, but that may end up precluding me getting anywhere else, I may not actually be able to get there. The other thing I want to repeat, that you said earlier, is that if you're logged in to the SolarWinds portal, you can see your SWID in that upper corner, then you get these download links. If you aren't logged in, you get all the information, just no links, and that's been a conversation on THWACK recently, about people saying, "How come I don't see links?" and "you lied to me," and I didn't understand why they weren't seeing them either. It's the logged in/not logged in piece.
Right, and so even if I wasn't using the upgrade advisor in order to even get the software, I've got to log in anyway. So just log in first, I can run the advisor. But this gives me a nice little roadmap of what I need to upgrade to get current. Now one of the things that we may even pose: if you're a customer and you're calling me and go, "Okay, I'm upgrading from this really old, like years old, software." I might go; do you want to hit reset? Do you want to just start over? Because this is going to take you some time. And this might be a good time to hit reset. So that's an option out there.
Excellent. You know, I am sure that I have opened a ticket about that stuff at least once or twice over the last few years.
Probably. Like I said, those install and upgrade issues account for a high percentage of our ticket volume.
Yeah. You know what, and that takes me to my next question: what can users do to get a faster ticket resolution?
You mean like sending us a plate of brownies or scotch? Because I could certainly get behind that.
Okay, but you know that could add up to a lot of brownies and scotch.
Well, me and the support crew can consume a lot of brownies and scotch.
Okay, fair point. But I was thinking more along the lines of things that users can do before they call, or ways that they can fill out that ticket request so that they cut down on the back and forth, the "are you sure," "can you send me this," that kind of stuff.
Sure. I know it sounds silly, but before you submit a ticket or call us, know what product you're calling about.
Okay, that does sound silly. What do you mean?
So I'm calling about SolarWinds.
Oh, got it. It's like when I call the mechanic and say what kind of car; I have a Ford. Like you know, Focus, F-150, you've got to be a little more specific. Okay, got it.
Completely different beasts. And even when you're drilling down into the product menu, there's, as you know, there's a bunch of different products that we have. Know which one you're talking about. So are you talking about Enterprise Operations Console, Failover Engine, Network Performance Monitor, SAM, know which product you're calling about. And also take a look at the version that you're calling about.
Okay, right, because that can probably get you just in the wrong queue.
Yes, you can get in the wrong queue. And if you get routed into the wrong queue, that's going to take extra time. You'll get the wrong person, and they've got to transfer you, and… So just to make it as simple as possible, know which product specifically you're calling about. As an added tip, one of the things that we did in our call system is, once you get into the phone system for support, you'll hear my voice saying hey, you can enter the first letter on your keypad of the product you're calling about. So, if you're calling about Network Performance Monitor, that starts with an N, hit six, that jumps you straight into that queue. So, nice little tip.
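The keypad shortcut follows the standard telephone letter layout, so the digit for any product can be worked out from its first letter. Here is a small Python sketch of that lookup. The function name is made up for the example, and which products the phone system actually routes this way is not something this sketch knows.

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}


def keypad_digit(product_name):
    """Return the keypad digit for the first letter of a product name."""
    name = product_name.strip()
    if not name:
        raise ValueError("product name is empty")
    first = name[0].upper()
    for digit, letters in KEYPAD.items():
        if first in letters:
            return digit
    raise ValueError(f"no keypad digit for {first!r}")


if __name__ == "__main__":
    print(keypad_digit("Network Performance Monitor"))
```

For Network Performance Monitor the first letter is N, which sits on the 6 key, matching the tip above.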
Oh good, okay. Anything else we need to know before we call?
You need your customer ID.
Okay, like, you know I log in now, we've got the great login so I log in with Leon.Adato@solarwinds.com, like that one?
No, not going to help us much. It's the SWID.
Okay, which if you're logged in as your account, you should still see the SWID up in the corner. So it's not like you can't—if there's a manager who bought the software, and he has the main login, you still will see what the SWID is, so just have that before you call. So, okay, we've got the right product that we know we're calling about, we've got the version number, which you didn't mention, but it's something I always try to tell people. Bottom of the screen, go down to the bottom of the screen, and right at the very bottom, there's all the products and all the version numbers that you've got right there. So you want to do that, and then also know your SolarWinds ID.
Right, because we're going to ask you for all those right off the bat.
Next thing you want, if you're actually submitting the ticket online: same type of things apply, put the right product in there because it could then get overlooked, so I might have a dozen people looking for NPM cases, but you labeled it as Enterprise Operations Console. Now nobody's going to see it as quickly. Screenshots. So if I'm describing something to you, I might think I'm painting this beautiful picture of what the issue is. The support rep's going to read it and go, I don't know what he's talking about, so typically we start this ping pong match of, now I'm asking you a bunch of questions, you're clarifying back to me. As an extreme example, but we do actually still get it, for the subject, "I've got a problem." And for the description, "My web console won't open." Okay, well we're not getting anywhere fast at that point, so give me as much detail as you possibly can right off the bat, and the best way to do that is screenshots.
Okay, so I'm going to be hitting control-shift-print screen and saving it to paint, and that's really a hassle.
It is kind of a hassle, and many times when customers do that, the image ends up as a thumbnail, which again, is not helpful. So now I've got this tiny little picture that I can't do anything with. But one tool that many of our customers may not know about is Problem Steps Recorder, and it's something that's actually built into the Windows platform now.
Right, you were telling me about this earlier and my mind, my head exploded it was really awesome, so I want to take a minute and show that. So we're on my dev server here, we're actually looking at the SolarWinds demo screen. But if I needed to take some screenshots, what would I do? How would I get into it?
So the best way that I bring it up is just to type in "problem steps recorder." There we go. So Steps Recorder is the option in there. Go ahead and launch. It's another tiny little tool. When I'm ready to go, I'm going to hit Start Record, and then I'm going to start to describe my problem to you. So maybe I'm drilling into specific devices, and I'm just using the demo on another web server as an example. And as I drill down into different things, I can even hover over or highlight some different points, and in the recording that it makes, Steps Recorder is going to take a screenshot and highlight that point. So after I'm done, I can just go ahead and hit Stop Record. It's going to bring up another screen that shows the different steps that it recorded, a little summary, and it actually took the screenshots for you, as you can see here.
Oh, wow, yeah.
So then I just hit save. It bundles it up into this nice little zip file package; you send that up to me, it paints the picture for you.
All right, of exactly what steps you were going— the customer's going through, as the problem was being experienced. That's fantastic.
Right, again, because you might get some error messages on the website, and we're going to go, "Well, what were you doing before that?" "What led you to that?" So recreate the issue, send us the recording, maybe along, again, with some diagnostics, and you paint a great picture that helps get your support to resolution much faster.
Well, we'll get to the diagnostic in a second. But one thing that we were looking at, when we were talking about log files, is the time. You should tell the support technician the time that you were doing these things, because they're going to go back into the log files and look around that time for what's going on. So you need to note that in your ticket notes.
That's a very good point and I'm glad you brought it up because yes, we're going to look straight at that time stamp.
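To illustrate why the time stamp matters, here is a minimal sketch of pulling only the log lines near a reported incident time. Everything in it is an assumption for illustration: the function name `lines_near`, the five-minute window, and the log format (each line starting with `YYYY-MM-DD HH:MM:SS`). Real Orion log files may use a different layout.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log line (hypothetical).
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
STAMP_LENGTH = 19  # length of "2024-01-01 10:00:00"


def lines_near(log_lines, incident, minutes=5):
    """Return log lines whose leading timestamp is within +/- minutes of incident."""
    window = timedelta(minutes=minutes)
    hits = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - incident) <= window:
            hits.append(line)
    return hits


if __name__ == "__main__":
    sample = [
        "2024-01-01 09:58:00 INFO polling ok",
        "2024-01-01 10:02:10 ERROR request timed out",
        "2024-01-01 11:30:00 INFO polling ok",
        "    at some.stack.frame()",
    ]
    for line in lines_near(sample, datetime(2024, 1, 1, 10, 0, 0)):
        print(line)
```

The narrower the time you give in the ticket, the smaller the window a technician has to search.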
Okay, and another—so, talking about diagnostics now. One of the things that I learned the hard way is that if you're having a problem, you know my first reaction is restart services, and if restarting services doesn't help, then reboot the server, which you shouldn't have to do but sometimes you do. And then if it's, there's still issues, then I open a ticket and I call. But I forgot something, which is that when you restart services, a lot of those log files start over again.
They start over again, so what could have caused the issue, that little error message that we were looking for earlier, the little time stamp, that went away. So, now you've got to recreate the issue again.
Right. And I've been on the customer side of that call where they say well, wait for it to happen again. What?
Yeah, sorry, I don't—because again, I'm looking for something tangible, some clues, and if you just went [blowing], all the breadcrumbs are gone.
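Since restarting services makes a lot of the log files start over, one way to keep the breadcrumbs is to copy the logs somewhere safe first. The following is a rough sketch, not the SolarWinds diagnostics tool: the function name `snapshot_logs` and the directory arguments are hypothetical, and it only zips a folder of log files under a time-stamped name. Keep the destination outside the log folder so the archive does not include itself.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_logs(log_dir, dest_dir, now=None):
    """Zip everything under log_dir into dest_dir with a time-stamped name.

    Hypothetical helper: run it before restarting services so the
    pre-restart log contents are preserved.
    """
    now = now or datetime.now()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    base = dest / f"logs-before-restart-{now:%Y%m%d-%H%M%S}"
    # shutil.make_archive appends ".zip" and returns the archive path.
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(log_dir)))
```

A typical order would be: take the snapshot (and the real diagnostics), note the time, and only then restart services.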
Right. So again, my little hint to everyone out there is when you are having a problem, you've taken pictures and all that stuff, before you restart—because we all know you've got to get back to work, right? I mean, if you know that restarting services is going to actually clear the issue at least for a little bit of time, first take your diagnostics, save them off, then do what you've got to do, because somebody's breathing down your neck, but now you've got the diagnostic file. So that takes care of that. All right, so that takes us to actually opening a ticket. So here we are on the customer portal, okay, and we're ready to open a ticket. Now, I've taken my diagnostics, I've done those screenshots. What do I do here? What should I know about here to actually get this thing opened and moving fast?
Well, again, like in this particular case, we logged in ahead of time. The advantage to that is it went ahead and populated the customer ID. If you don't log in, please put that customer ID in there. Because, again, if you don't, it's going to get routed to our customer service, they've got to look it up, and then it goes to support. So it saves some time. And if you want to get to a resolution quickly, make sure you actually put a real phone number and a real email address in here, because we are going to try and call you, in many cases. Maybe if we haven't heard from you, it's "Hey, Bob, everything all right?" So real phone number, don't just key smash.
That's right. Again, proper description of the issue but, now we've got this option here as well: attach a file.
And that's going to be for the .zip file with the screenshots, not necessarily for the diagnostic file, right?
Not necessarily. Now, it might fit. Most of the time, though, it doesn't, so you've got an option to upload those to a LeapFILE server that's already there. Many of our customers are already aware of it, they know how to log in and do it. That's an option for you. You don't necessarily have to do LeapFILE. Maybe we don't need the logs but if we do later, we can send you a token because we've got a new Serv-U server. We're using our own, right. Which creates a nice little token, when you upload the log files to us, we get notification, we get direct links, so it's a fast process that way as well.
Right, and I'll add one of my little tricks in there, which is that my habit is, again, problem is happening, I start the diagnostics, I start taking the screenshots, now that I know how to do that. I clear everything out, I do what I've got to do to get back in business, then I go here and I start opening the ticket. Fill everything out, attach my screenshots, open a ticket. You're going to get an automatic email back with a ticket number. Once you have that, I'm going to say go to LeapFILE, solarwinds.leapfile.com, and there, you can upload your diagnostics right away. And you say, referring to ticket number such and such, here are the diagnostics, and put a little note saying I'm just sending these in advance. What that means is that the technician who opens the ticket is going to see that the diagnostics are there. They can actually get to work without calling you, without that ping-pong match of back and forth. You've given them everything they need to get moving right away.
Right. And the good thing to highlight what you had said is, put the ticket number in there. Because again, if you consider all the thousands of customers that we have that might be doing the same thing, how do we find that needle?
Right, always want to preface that. So, then I think the next thing I want to do is— so, my hair is on fire. It's nice to open a ticket and all that stuff, but what happens when it's really a screaming problem? How do I open a ticket faster than that? Can I do that?
Well, you can look at the active diagnostics.
Okay, that's a new idea for me.
That is new. So, one of the things you'll see when you went and created the diagnostics is two buttons now: one says start, one says active diagnostic. So I can demo that and show you what it looks like.
Let's take a look at that.
Okay, so this is the screen that you would go to in order to run an actual diagnostic, but there's another option here: run active diagnostics. Once you do that, you'll have an option down at the bottom. It's kind of greyed out right now because we already did it. It says run diagnostics. And it's almost like Orion for your server all at one time. Take a look. So, I can look at some of these different little icons. So I've got, you know, what's this application certificate? It's in yellow. Why is that? Automatic root certificates, what? Huh, so I drill down in, it gives me a little nugget of what's going on, and look, a link to a KB article.
Oh wow, so it even gets you— so this is something that people can do even before opening a ticket that might help them fix their problem without...
Ever getting there. I love it, okay great.
Right. And so it's got the steps, again, written great, just like our admin guide. Tells you what to do, could fix the issue, and there you go.
Okay, even more crazy than that, my boss is sitting next to me and my job is on the line. How do I fill out a ticket even faster? Like, how do I get it done?
You pick up the phone.
And that's the fastest way you can get support. So, many times, just as a highlight, I'll get a support ticket going "hey, my system is down." I can mark my ticket as system down, so maybe I got the ball rolling, maybe while I was on hold I created a ticket and marked it as system down. We try and get to all the system downs as quickly as possible. But if that person has gone ahead and called in, that's even faster.
Okay, so if it's really urgent, you definitely want to— So, okay, the last question I have is: I've opened my ticket, I've sent you all the files, and I haven't heard from you for six hours, eight hours, it's been a day, is it okay if I call?
Oh sure, it's definitely okay to call. Any time you want to get the thing moving, go ahead and call. Now you mentioned earlier that you get an email, an automated email reply, you can always reply back to that and say hey. But if there's something that you really need to get some traction on, it's best to just call us.
Patrick, did you bring us any breakfast tacos?
No, they were fresh out. [Crashing] Did your guest cover troubleshooting 101?
Yes, and 201, and we got a good start on the 301 curriculum as well.
Yep, I figured during the live chat, I'll be able to start recruiting for support techs on the spot.
Even better, because that actually reminds me, if you don't see a chat box right...
Oh, Patrick, we actually already covered all that.
Even better, that means we have time for one more quick demo. And this time, we're going to show them something that they're really going to like. Something really, really cool. This, ladies and gentlemen, boys and girls, is the new Orion user interface.
And the crowd goes wild. [Pretend cheering]
Now I know you guys are all going to get really excited about this when I show you, so just remember this is a prototype. We're not saying when and if this is going to get built, but— This is running on a server, so go check out the THWACK post for complete details on what we're looking at here. All right. So the first thing is, what is missing?
Where's my gray menu bar? Do you have to scroll up?
No, no, that's what this little black bar is across the top. So what used to be many, many pixels of stuff and tabs and settings buttons and everything else now collapses into a nice little modern bar. And I did scroll down here just a little bit because also, top-level alerts just pop up at the top. Like in this case, it's saying that it's in eval, I mean obviously. But so major licensing and other alerts will pop up there. But the thing to pay attention to, in terms of what you're looking at in general, is My Dashboards has replaced what used to be all the different tabs. That's been one of the criticisms that we've had for a while, especially as we've added more and more modules and a lot of you have set up custom links for all those tabs. So, being able to just zap over and get right to a sub tab? Really, really easy, because now you just pop over here and say, we'll go to SQL Server. I want to look at my Exchange, for AppInsight, or I'm going to go over and look at my home view, or work with groups or my NPM summary, or whatever else. I don't have to kind of tab down and then crawl across...
Yeah, and then you sneezed and you went down too low and you lost it, ugh, go back up again.
Exactly, and you saw that little jump a second ago. That was the page refreshing and the menu stayed down for me too. So even if you have set up for page refresh automatically, it's not going to knock you off this menu. This is just way, way easier to use. And the other thing is, let's say maybe you want to edit that menu. Well, how do you do that? Oh, you click on configure dashboards.
Right? So you can go down here, you can create a brand new one, you can edit these, everything that you would normally have done before. But instead of going down to settings, click, dut-dut-dut, to add, modify menus, it's right here from that link. Really, really, really easy to use, and again, everything we're seeing here has been coming out of the UX sessions that we've been having with customers, so definitely pay attention to those on THWACK. Get in there. You can earn a lot of points; this is just really cool new work. And this is a lot of, also what Joel was talking about during his presentation of the keynote for THWACKcamp, about where we're investing in products. Okay, now you could just say, this is cosmetic and really cool, and then you could say well, okay there's been some additional user improvements. For example, let's say I want to look at my alerts, right? Now there's a nice little notification bar for that.
Now when I drill into there, I'm going to get all the detail for each one of those, but it puts it right there. And you might ask yourself, well, what if there are a whole bunch of them? You get a scroll bar. So I can close them right here, or I can click and drill into them. So again, consolidation of all of our notification alerts into that little alert tab right there. The other thing—boop, want to dismiss that. The other thing here, global search.
Global search is there!
Global search is there in this pre-prototype.
Yes, so as stipulated.
Yes, and we've talked about global search before. So global search is actually in the copy of the Orion products, the Orion core that you already have. So NPM or SAM, or one of the other products, it's just turned off by default. And if you want to play with it, go dig down into the documentation. You can turn it on, but here it's actually front and center. Now, again, maybe it's there, maybe it's not. But finally integrated right there on the bar, so if you wanted to search for something like 10.199.zer— No, I'm sorry. So even IP address-based machine, whatever else, it'll take you right to that. I won't click on it because this NPM is a little slow. Okay, the other thing that's really cool here is, how many times do you need to do network discovery, and we spend a lot of time sub-selecting the results of a sonar discovery before we decide what's in? Well, I'm going to show you some big changes. So this goes to the question of, is it just cosmetic or are there actually big new features that they're considering? So one of those is, let's go ahead and look at a new discovery wizard. Now everyone knows what this wizard looks like but no, you don't. First of all, it's going to walk you through the process. So if you're not accustomed to doing this, or maybe you're an administrator who occasionally logs in and does this--but you're inheriting your Orion system from someone else who set up a lot, and you're going to do it for the first time. It'll now walk you through this process, say what's going to happen. And so I'll click start, as people often do. Look at this, nice and clean, and you're going to notice this all over the place—I'm going to pull the dialogue here— that it's much flatter and much wider and simpler than it was to navigate. So, for example, instead of an IP address or subnet, I'm just going to search for a particular IP. So I'll put in an IP address and we'll just validate that, make sure it can say hi. There it goes successfully. 
Now one of the things that you can do first: add Active Directory domain controller to query. This is pretty cool, so when you click on this box. Ah, nice, clean modal, right? So you'll add your Active Directory domain controller, and then you can walk through the OUs to find...
Your system OU, your computer OU, yeah.
To find the computer OU, and definitely go check out, again, the blog post; there's a picture of how that works. There's actually a screenshot of that populated. I'm not going to do it now, to save some time, but you can literally walk through, so if there are 10,000 candidate machines, instead of running a discovery against all 10,000, you can sub-select the few that you are actually looking for, so that's really cool. But now, here's another new screen. Check this out. Normally what would happen is the discovery would run, and then it gives you the long list of the results, and then you can decide what's included and what's not. So do you want volumes? Do you want your loopback? Do you want whatever else? Well, you can leave this radio button set here, and it will continue to give you the results in the new sub-select. So if you are going to do a discovery where you can get back anything in the world, and you want really specific selections of what you're going to monitor, you can still do that. And again, this is per discovery settings, so if this is a regularly scheduled one or a one-time, you're going to get different results. Or I can say, "Automatically monitor based on my monitoring," and then define the settings. So I'm going to walk through the wizard and it's going to walk me through which interfaces do I want, do I want volumes, do I want any particular applications that it's going to find. And then I'm going to say next, and finally, which ports am I going to include, and when I do that, the next time that it runs discovery, I don't have to go through and sub-select the results anymore. It's just going to automatically start monitoring those. So especially combining Active Directory, and then being able to do this, especially for, also, Linux boxes and network gear, you're going to only end up monitoring what you normally need to monitor, without going through that extra step of going through the results.
So you could actually set this up to only monitor your servers?
Set it up as a scheduled job where all you're interested in is new physical disks, and you have a way of onboarding physical disks that have changed. Because this is one of the big things that I see all the time in larger environments: the people who are provisioning and building machines don't open a ticket or don't tell the monitoring team. The people who are racking and stacking might tell the audit team, they might tell other groups, but they don't tell the monitoring team, oh by the way, I've added three more disks to this server, which is already in existence. So you never get a monitor until it crashes and there's a fire drill and stuff like that. So what you're saying is you can set up a scheduled job that scans your existing servers and just automatically adds those new physical disks.
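The idea of a scheduled scan that only picks up new disks on servers you already monitor comes down to a set difference. This is a sketch of the logic only, not how Orion discovery is implemented: the function name `new_volumes` and the dictionary shapes (server name mapped to a set of volume names) are assumptions made for illustration.

```python
def new_volumes(monitored, discovered):
    """Return volumes found by a scan that are not yet monitored.

    Both arguments map a server name to a set of volume names.
    Servers that are not already monitored are ignored, since the
    scenario is onboarding new disks on existing servers only.
    """
    added = {}
    for server, volumes in discovered.items():
        if server not in monitored:
            continue  # brand-new servers are out of scope here
        extra = volumes - monitored[server]
        if extra:
            added[server] = sorted(extra)
    return added


if __name__ == "__main__":
    monitored = {"db01": {"C:", "D:"}, "web01": {"C:"}}
    discovered = {
        "db01": {"C:", "D:", "E:", "F:"},  # two disks added since last scan
        "web01": {"C:"},
        "new99": {"C:"},                   # not monitored yet, so skipped
    }
    print(new_volumes(monitored, discovered))
```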
That's right, the nightly backup, or the nightly scan if you want to do it that way. But actually, you haven't seen this next screen because it just got into this prototype, but let me show you this other cool thing that you're really going to geek out on after that. I mean, this screen's pretty much the same, although it is nice and blue now. We went back and forth with blue and orange, blue and orange, and yeah, you guys were all like, "Blue, please blue." Okay, so to your point about discovery scheduling. For a long time, you could run once, as in I'm going to run it right now, or you could say I'm going to run it hourly or daily. What is advanced? Now this is an example of what happens when we are doing UX design that turns into functionality design, right? So we got a lot of feedback, and when you go to advanced now, you can define exactly what that frequency is. So I can set up the frequency for it to run on specific dates, or maybe I want to do it weekly, and I'm going to say Mondays, Wednesdays, and Fridays. I'm going to give it the time that I want it to run, so instead of it just sort of running when it's going to run on a particular day, I know that I'm going to wait until my backups are all done hammering the network before I start doing discovery. And you can say when you want it to start, so that's really nice, because you can set specific scan schedules for specific periods, right? Like maybe you want to do something up until a date, and then change to a different schedule. You can actually set them to dovetail and it'll automatically do that for you. Another example might be, I'm going to start scanning right now, but I want to end my scanning on a specific date and time. So why would you want to do that? Well, the reason you'd want to do that is, let's say you've got a big data center network change, for example, coming up.
You know you're going to do it, it's going to happen on a particular date, and you know you're going to forget to go in and make those changes. So while you're thinking about it, go ahead and put that date in here, and your discovery will automatically stop on that date, and let you concentrate on making the network changes to that center of your network, and then come back and then add discovery back in again. So again, is that functionality or is that UX? I don't really know. It was driven by UX, but it came out of those meetings with customers. So, that's the discovery engine. The other thing about settings, being able to do--again, you saw me jump right to network discovery--that happened here. These menus here for alerts and activities, those are also quick links for different alert types, reports, and settings. Those are sort of the ones that are built in, so just remember that the My Dashboards tab is basically all of the tabs that you already have. And it'll remember all of the tabs and the links that you already have inside of the My Dashboards link. So that, in a nutshell, is the preview for Orion UI that we're messing around with. Again, check out the post on THWACK, where we're talking about that. We'll put the link below, especially for the screenshots for Active Directory and a couple other things. It's not nearly as cool as all the troubleshooting tips that you guys were talking about. But I thought you'd want to see that.
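The advanced discovery schedule described above (chosen weekdays, a set run time, a start date, and an end date after which discovery stops) can be sketched as a small date calculation. This only illustrates the scheduling idea under stated assumptions; the function name `scheduled_runs` and its parameters are hypothetical and are not part of the Orion prototype.

```python
from datetime import date, datetime, time, timedelta


def scheduled_runs(start, end, weekdays, run_at):
    """List run times from start to end (inclusive) on the given weekdays.

    weekdays uses Python's convention: Monday is 0 and Sunday is 6.
    """
    runs = []
    day = start
    while day <= end:
        if day.weekday() in weekdays:
            runs.append(datetime.combine(day, run_at))
        day += timedelta(days=1)
    return runs


if __name__ == "__main__":
    # Mondays, Wednesdays, and Fridays at 02:00, after backups finish,
    # stopping at the end date (for example, ahead of a network change).
    for run in scheduled_runs(
        start=date(2024, 1, 1),
        end=date(2024, 1, 7),
        weekdays={0, 2, 4},
        run_at=time(2, 0),
    ):
        print(run)
```

Two schedules that dovetail would simply be two calls whose date ranges meet: one ending on the change date, the next starting after it.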
I'm interested to know what customers think and if this is something that they would like to see. If we're going to end up using this design, I need to do a "who moved my cheese" KB article.
Yeah, that's actually a good point. When exactly will we be moving everyone's cheese?
Ah, very interesting. We're going to actually add this to Orion at 1 pm, or next year, or maybe never. So, the thing is, this is a prototype. We're just trying to get feedback, and the main thing is you're going to be wanting to watch the "what are we working on now" section of THWACK, because this is certainly not a product. We just want to get your feedback.
So what if viewers want to know more about this going forward?
Okay, so this particular change, there's a great blog post that's been added to THWACK talking about this that the UX team put up there. They have screenshots of not only this version, but some of the older versions going all the way back to the solarwinds.net version. You know, with the Battlestar Galactica font and the star field. That just feels like old times, doesn't it? Feels like Tulsa, but, so yeah, definitely check that post out. It'll talk about when it should appear, and opportunities for you guys to get further involved.
Okay, yeah, so everyone needs to check out that link. So, we have really covered a lot of ground today. Jason, as our esteemed guest, would you do the honors and take us out?
I'd be honored. For SolarWinds Lab, I'm Jason Ferree.
I'm Leon Adato.
And feeling very supported, I'm Patrick Hubbard, and thanks for watching SolarWinds Lab.