Orion at Scale: Best Practices for the Big League
This session is for anyone who is having issues with performance or considering expanding their deployment to cover a large number of devices and servers across multiple data centers and geographic regions. We will focus on maximizing performance. This means tuning your equipment to optimize its capabilities, tuning your polling intervals to capture the data you need without bogging down the database with less critical data, and adding additional pollers for load balancing and better network visibility. In this session, you will learn best practices for scaling your monitoring environment and learn how to plan monitoring expansion with confidence.
Hello, and welcome to our session, "Orion at Scale: Best Practices for the Big League." I'm SolarWinds Product Manager, Kevin Sparenberg, and with me today is Head Geek, Patrick Hubbard.
Hey, I wouldn't miss this for anything. We kind of shanghaied Kevin to lead this session because you've been presenting a longer version of this at SWUGs now for about a year and a half on three continents.
Yes, yeah, I've presented to a lot of our customers and a lot of the people that come in with our customers to understand. For those who aren't aware, a SWUG is a SolarWinds User Group. That's where we interact directly with the customers in a city. And we really get to chitchat, and they get to ask questions about new features and whether or not things scale properly, and what their best ideas are for...
Troubleshooting with experts.
Exactly. We've helped lots of people. The best part about that in my opinion, though, is when you attend one of these, you get to ask questions of other people in the community. You don't necessarily have to ask us. Maybe someone's already figured it out for you.
Right, and so we thought we'd include this session so that you all have a chance to see what you get by coming out to a SWUG. It is the sort of deep dive content that you're going to get, that you can only get by coming out in person. That's pretty subtle. That's a pretty subtle incentive to get people to come out.
It wasn't bad. It was a little subtle, yeah.
Yeah, not so subtle would be, of course, if you present at one you can get 10,000 THWACK points and if you host one, you can get 20,000, and we'll fly to your city.
Yeah, not subtle at all.
No, it wasn't. Right, so let's get on with it. Of course, all of this content is going to be available later for playback. Kevin, you have a lot of different reference material that we'll have links to at the end of the presentation. Although these recommendations and techniques are going to be especially useful to those of you with very large-scale deployments, most of them can improve the performance of any Orion Platform deployment as well.
I agree 100%.
Awesome. So, where do we start?
I think we need to start by defining actually what at scale means, because it means different things to different people.
So, we're going to qualify what at scale is in the environment.
Right? Then we're going to talk about the necessity to actually plan, like to think this through and be mindful before you get started.
Right? And then, do we need resources?
Yeah, the resources really come in three flavors. You've got the resources you need to assign to the machine, the manpower resources you have, and the development and the extension. And then when we want to talk about virtualization, what do you need to reserve? That's something that a lot of people tend to overlook. So, resources, resources, resources.
Okay, so what else?
A lot of people rush in, stand up a Proof of Concept, and say, hey, this is nice. I'm monitoring 30 things. It looks great. Let's go ahead and put the license on. Yay! We're good. That doesn't really work in a lot of situations.
So, lab is important.
Yeah, make sure you have yourself a lab. The Proof of Concept is actually really good, but when you actually move it to your production environment, you probably want to take the time to rethink the way you've architected the solution.
Okay. We'll talk about the advantages of automation, and I promise I probably won't talk about SWIS. [Laughs] Then scaling-out best practices coming from customers. It's actually a combination of what you've learned talking to customers, as well as looking at Microsoft best practices, and lots and lots and lots of documentation.
And talking to our Product Management Team about how their stuff scales out and the best way to handle that.
Okay, so what do you mean by scale?
Well, scale means a couple of different things. For me, scale either means, hey, I'm a small organization and we've gotten bigger over the years and now we're a large organization, or we're large and we're going to enterprise. So that's really about the growth of your environment: how large it gets. Then you've got your environmental scale. You could have a real small IT team where you've got twenty, thirty people working, but maybe you've got a huge environment. Dozens and dozens of servers, hundreds of thousands of elements that you need to watch. That's the environmental scale that grows out. You also have geographic. Are you global? Are you regional? Are you just local? Are you just in a municipality? These things matter, depending on how you're going to want to monitor it.
And some of those just have speed of light effects and some of those, just lots of complexity and maybe under-performing equipment that may affect the polling performance.
Yeah, because let's be serious. If you have a small office somewhere, a lot of people will go ahead and just put really low-powered gear in there, and that doesn't hold up well if you want to hit it every ninety seconds for statistics. Maybe you need to scale that kind of thing back.
Well, that leads us into utilization because utilization is all about how much you're actually monitoring, how many managed entities you're actually looking at, at any time. And that's based on the number of entities, the type of entities, and also the frequency at which they're polled.
A lot times two is twice as much.
Much as a lot, yes.
That's double a lot, exactly. Then none of you guys ever has any situations where your company has acquired another company or merged two business units together. That never causes an issue.
No, not ever, and I came from Wall Street, where mergers and acquisitions were pretty much like, hey, these guys are doing a great job. We really like working with you. Now you're working for us. That happened a lot. It's not the only industry where it happens. So, the question is, do you take over? Do you supplant them? What happens if they already have Orion and they're using it?
Do you want to undo all the work they've done with it? The answer's probably not. Then we also have the upside of that, is what happens when you go and deal with cloud monitoring? Do you have one, two, three cloud instances? Do you have fifty devices out there? Do you have ten thousand servers out there? That depends on your business model.
Right. And I'm going to cover that in another session, which is "Moving Orion to the Cloud." I'm going to talk about some of the specifics of when you're actually using Orion monitoring, not only to monitor the cloud, but if you are going to, in fact, transition your Orion platform system into cloud systems. All right, resources are really important.
A combination of physical resources and people. What's the list of things that I need to think about when I really am scaling out?
When you're scaling out, there's a couple of things that really come to mind. The big ones for me are--do you have enough people in your monitoring team to actually keep something like this healthy? It doesn't take a ton of people, but it takes people with some in-depth knowledge. The second one is--are you giving enough resources to A) the Orion server, and B) the SQL server that's actually running on the backside?
For those two things, in this age of virtualization, a lot of people unfortunately don't reserve resources. It's all right in a very small Proof of Concept, like we said, but when you actually try to use this in a real environment, it just starts falling down.
And if you're running 20,000, 50,000 elements, somewhere in there, you are definitely, probably using remote pollers. You're using a lot of other elements. You're probably using agents. So, having time just to think about monitoring, and we talk about the discipline of monitoring. Obviously, there's the Monitoring 101 series out there that you can take a look at, but that process of finding people— We've had some customers say they've come up with a way of partitioning the monitoring work out across several people on the team. For example, if they have a generalist team and someone who is a little bit more focused on Linux, that person is going to be thinking more about deploying a Linux agent, for example. So, some percentage of their time is dedicated to making sure that the monitoring framework is working well. Other people will have a dedicated monitoring specialist, or maybe even a combination of someone who's taking care of the engineering side of monitoring and deployment, and then someone else who's maybe sort of an analyst, thinking more about reports and dashboards and customization for particular user groups. Because chances are, if you're actually using Orion at scale, if you are, in fact, big league IT, you probably will benefit from creating custom views for management and other audiences inside of the company. So, that's something that you can hand off to someone else on the team, who maybe is not an admin, maybe not an engineer, but who is good at thinking about the way the data's laid out, and reports, and communication up to management. So they can actually be really helpful, an additional resource as well.
Yeah, because when you really get to that kind of scale when you're talking about all these various teams that actually need to go through and work with this stuff, you have people that are highly segmented. Do you have an executive view? Do you have an executive storage view? Is that different from the executive application view? There are people that, that's their job. They decide what stuff goes on there. Tap those people. Get that information before you start building so you have a checklist of what things you want to get out of the system.
Mmhmm. For somebody who's paying for storage, for example, creating a dedicated report that gives them their burn-down charts for capacity runout on storage, while they're buying Flash and some other technologies that are kind of expensive. It's nice for them to have a dashboard so they can see that their investments are actually working well, and that's because your IT is running large enough that you have somebody who's dedicated, worrying about the expense of storage.
Mmhmm, yeah. Same thing happens when you're talking about physical server counts. Same thing happens when you're talking about power, in a lot of ways. So there's a lot of these things. You can even take this and actually take it a little further and say, well, what about cooling? If you're in a place where you actually have to pay for cooling, you could get that information out. It's not like one-to-one. We're not going to say it's going to cost you six kilowatts, but we're going to tell you this is how much power is being consumed, and then I can take that, turn that into dollars, and have that as a report.
Okay, so that's people. What's another resource?
Another resource is actually the machines themselves. For years and years and years, we've made the recommendation that the SQL server be a physical box. That is still a general recommendation. It's something I like, because then you know there's no chance of resources being stolen. However, if you're going to go SQL virtualized, perfectly fine, but follow the Microsoft best practices. There's a very, very good document, and basically what it says is reserve 100% CPU, 100% memory, and these are the SQL settings you need to change so that the machine understands how to use its memory better.
So, thin provision, and auto-shrink, auto-grow are just fine.
Yeah, they're perfect. No, don't ever do that.
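To make that concrete, here is a hedged sketch of the kinds of settings that Microsoft guidance calls out, run from an elevated command prompt with sqlcmd. The server name "SQLSERVER01", the database name "SolarWindsOrion", and the 28672 MB cap are hypothetical placeholders; size the memory cap to your own hardware and leave headroom for the operating system.

```shell
:: Hedged sketch of typical SQL Server tuning per Microsoft's guidance.
:: Names and sizes below are placeholders, not universal recommendations.
sqlcmd -S SQLSERVER01 -Q "EXEC sp_configure 'show advanced options', 1; RECONFIGURE;"
:: Cap SQL memory so the OS keeps some for itself (example: 28 GB on a 32 GB box).
sqlcmd -S SQLSERVER01 -Q "EXEC sp_configure 'max server memory (MB)', 28672; RECONFIGURE;"
:: And the 'never do that' items: auto-shrink off, auto-close off.
sqlcmd -S SQLSERVER01 -Q "ALTER DATABASE SolarWindsOrion SET AUTO_SHRINK OFF; ALTER DATABASE SolarWindsOrion SET AUTO_CLOSE OFF;"
```

Pre-growing the data and log files to a sensible size, rather than relying on small percentage-based auto-grow, follows the same logic.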
Don't do that. It's funny because that recommendation about always running SQL on metal makes me feel good. It just makes me feel like I'm doing the right thing, but when I look at the implementations that I'm doing, they are now almost always running on VMs, or they're running in— If I'm trying to save money, it's a bring-your-own-license SQL Server running in EC2, like a Windows Server 2016 instance. But when you run it on RDS, it actually runs really, really well too. So there are some other things to consider there, like performance and ease of use versus cost. Manage it yourself and it's going to be less expensive. So there again is a thing where looking at those recommendations for how to set up the SQL server, and spending a little bit of time— and again, you've got a bunch of the tuning docs that you're going to talk about in a second— can really, really help you and give you more options to scale up very, very large, but also keep cost, and how you're going to manage that in your environment, under your control.
That's a huge balancing act. If you want to take the absolute best recommendations for everything, absolutely. Buy yourself three servers. Put them in a couple of clusters, and then throw nothing but Flash disks at them, and then go through and have hundreds, if not thousands, of gigs of RAM and 32 procs.
And that would be the Orion Platform or anything else. It doesn't matter.
Those are going to be the issues. All right, so what's the third resource?
The third resource is really the reservations themselves. We talked about planning the resources around it, but the other one— You've got the people. You've got the VMs themselves, and then you've also got to worry about the virtualization side: who you're actually working with on this, and how to reserve those resources on the machines. Virtualization is here to stay. It's not going to change, and actually being able to reserve that— we mentioned just now about SQL going to 100% CPU and 100% memory. Does the monitoring machine need to be that way? Mm, that's kind of up— this takes a step back. Is monitoring a Tier 1 app?
For me, it's a Tier 1 app because business wants us to go fast.
The number one request from the business right now for IT is "go fast." I know you guys hear that all the time. We need you to move faster. We need you to move faster. It's the old thing of why do cars have brakes? So you can go fast, right? So you need to measure quality to make sure that as you're accelerating your changes in IT, your quality still remains measurable. To me, being able to have metrics, to be able to assure quality, that the services are working and fix them, especially if I'm in an environment with a lot of change— I need to quickly identify issues and maybe troubleshoot novel situations. For me, it feels like it's a Tier 1 application.
Yeah, and I'm with you, but I came from a network monitoring background. For me, that was part of my job. That was the number one thing. But when I compare that with something like Exchange or business intelligence or things like that, is it as high priority? Personally, yes.
Yeah, but I mean if your Point of Sale systems go down— It's that thing of co-equal top priorities. So, is it a Tier 1 application in the sense of my business is going to stop if my monitoring's not working? No, but if my business stops working because of a systems issue, monitoring is the way that I'm going to get it back online.
Mmhmm, yeah. It's the chicken-egg situation. [Laughs] But it is very valid with this. For me, in my environment, and I have a way overloaded virtualized environment, I still reserve 50% CPU and 50% memory. We'll get into that when we talk about how to optimize your individual builds.
One of the things I would ask for is some recommendations, right off the top of your head, that I think are really helpful.
Yeah, so, CPU and memory are cheap nowadays, and I'm talking about now versus five years ago versus ten years ago, when you had to buy an 8-proc motherboard. It was eight sockets. This was expensive. It was immensely expensive. Memory was also very, very high. Nowadays, that's really not the case. Realistically, in 90% of situations, disk I/O is going to be your bottleneck. So, how quickly you can get information on and off the disk. And the number of people using RAID 5 for everything. I understand some industries— it just bogs down. We had a whole discussion last year about why you want to use certain RAID levels for certain storage solutions.
Yeah, that was really interesting. Of course, now you throw Flash into the mix and it gets really interesting. We're going to talk about drive configuration in a second. So there's a lot of things that you can actually do to take advantage of maybe RAID 5 for some data.
Flash for others, and then RAID 10 for others. You can actually segment according to what those volumes are expected to do.
This is all assuming you actually have that capability. So that means either you're working with virtualization, and most of the time you're going to be, and, more importantly, you're running on some type of SAN, where you actually have control over the configuration of the disks and the RAID groups, everything like that. Because we're going to talk about some of the read/write ratios, because not everything that gets written on the monitoring server, the Orion server, really has the same kind of workload. When you really get down to it, Orion, and I jokingly say this but it's not much of a joke, is an OLTP database.
Yeah, like anything else.
It's reading information. It's pulling stuff out of the database. It's writing back to the database constantly.
It's writing tons into the database.
Yep, and it's also reaching across your network to pull those metrics, crunching those numbers, and inserting it. So it's doing a lot of work. So you want to make sure you can plan for that. Part of that, for me, was making sure the disks were built properly.
Absolutely. All right, so, I do want to talk about IIS for a second because we kind of forget the fact that it is a web-based application, which is— Can you imagine if we still had to do fat client installs? [Kevin sighs] But it is a web application, and it's a web application that's actually pulling a lot of data together. So it's not just—I mean, if you think about the richness of the page, like the new interface, which is AngularJS. There's a lot going on there. There's a lot of parts that are moving, and so optimizing the IIS is really important.
Yeah, and out of the box, IIS works great. Realistically, when you add that particular feature to Windows, it works fine. There's no problems with it, but it's not tweaked.
You can actually make it a little better. You can kind of speed things up a little bit, and a lot of that comes with positioning files, where you put files in your environment. Why am I doing a lot of read/write to the same drive where I have the webpages? Web pages are all read. There's no writing back to webpages, but the logs are almost all write, so put them somewhere else that's actually optimized for write. These are the kind of things we're going to break down for that. It's all about placement. Do you want the logs? If you don't really need the logs, if you don't need the IIS logs for something like this, turn them off. It's just overhead you don't need. I like them. [Laughs] I use them all the time.
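Both of those options can be sketched from an elevated command prompt on the Orion server with appcmd. The "L:\IISLogs" path and the "Default Web Site" site name below are assumptions; substitute your own write-optimized volume and site.

```shell
:: Hedged example: relocate IIS logs to a write-optimized volume...
C:\Windows\System32\inetsrv\appcmd.exe set config -section:sites -siteDefaults.logFile.directory:"L:\IISLogs"
:: ...or, if you truly don't need the IIS logs, turn logging off for the site.
C:\Windows\System32\inetsrv\appcmd.exe set config "Default Web Site" -section:system.webServer/httpLogging -dontLog:true
```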
Well, I know why you like them, and actually, one of the things that Kevin and I share in common is you ended up inheriting the demo systems that I built about ten years ago.
So you and I really do care about the HTTP logs, for example, because we want to see where you guys are coming from and the experiences you're having. But it also means things like, when you talk about compression settings, logs can tell you a lot about that. For example, I found that if the server's really hammered, actually turning off dynamic compression allowed me to serve more pages, but then the pages load slower. So it sort of transfers the expense to the end consumer who's sitting in front of the machine.
And logs can let you tweak what the performance looks like.
Yeah, and if you really want to dig in, you get something like Log Parser and you just write in pseudo-SQL and get the information out of them.
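For example, Microsoft's free Log Parser 2.2 takes pseudo-SQL like the line below, which pulls the most-requested pages out of IIS W3C logs. The log path is a placeholder for wherever your logs actually land.

```shell
:: Hedged example of Log Parser's pseudo-SQL against IIS W3C-format logs.
:: L:\IISLogs is a hypothetical path; point it at your own log directory.
LogParser.exe -i:W3C "SELECT cs-uri-stem, COUNT(*) AS Hits FROM L:\IISLogs\u_ex*.log GROUP BY cs-uri-stem ORDER BY Hits DESC"
```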
Yeah, but if you're running it on your intranet and it's running just fine, turn them off. You don't need them.
Yeah, exactly. Your I/O is going to be so slow for your network side that it's going to be near instantaneous anyway.
So then the other thing, of course, you should always just install everything in default locations, right?
In your Proof of Concept, absolutely. I have no problems with this. This is something I never do. This came from my history and the companies I came from. Software never went on the C drive.
We're going to actually have a link to this, this section that we're talking about right here. All these recommendations are actually in the document that you're talking about.
Yeah, I've actually got two different ones. I've got one for building your Orion server virtualized on Hyper-V, and another one for VMware. That way you get to choose which settings you're going with.
So that's going to actually help you figure out specifically where to install for best performance.
Yeah, and these are from my— this is me. This is what I've done. This is what I've found. This is how I've been able to tune it up. It may not apply to every situation.
Just because you've been an MVP.
For many years and installed and been running huge environments, I don't know why you would know anything about that.
Well, the other thing that kind of— Yeah, I can do a little bit. But Windows will also trick you, and I say trick you in kind of a backwards, backhanded kind of way, because you know Program Files, Common Files? And you know ProgramData? Even if you redirect where you're installing stuff, stuff still gets stashed on the C drive.
I found a way to get around that.
So, I'll be sharing that.
And that's covered in the doc.
Okay. So, we'll talk about that. You mentioned just a second ago storage, specifically storage. Walk us through. One of the things that you talked about before is optimizing for read/write performance as well. This is your super optimized drive breakdown for an Orion platform install. Walk us through what these are.
Okay. So, the drive breakdown is the Windows OS alone. It's only the C drive. The big thing here is this can be optimized for read/write. It's the operating system. It's constantly reading and writing back to that drive. You need that. The page file—couple of things on that. I put the page file on a separate disk. Personal preference, because I don't want it accidentally maxing out the C drive and then Windows stops working. If I max out a drive that doesn't have anything else on it, I get an alert, but it doesn't stop working.
I also pre-populate the page file. I set it. This is the minimum size, maximum size. Identical, lock it in. Also, some information about what sizes to set those. I also do technically leave a very small page file on the C drive for the mini kernel dump in case there's a crash.
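That page-file setup can be sketched like this from an elevated command prompt. The P: drive letter and the MB sizes are placeholders; match the sizes to your RAM per the doc, and keep the small file on C: for the kernel minidump.

```shell
:: Hedged sketch: take page-file management manual, pin a fixed-size page file
:: on its own drive (P:), and keep a small one on C: for the kernel minidump.
:: Drive letters and sizes are hypothetical examples only.
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where "name='C:\\pagefile.sys'" set InitialSize=800,MaximumSize=800
wmic pagefileset create name="P:\\pagefile.sys"
wmic pagefileset where "name='P:\\pagefile.sys'" set InitialSize=16384,MaximumSize=16384
```

Setting InitialSize equal to MaximumSize is what locks the file in so it never grows or fragments at runtime.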
Wow, that's great.
Then programs. I install programs on another drive. For me, it's E. You can do it other ways. I also redirect some stuff in the scripts, and then websites, caching, and temp go on the F drive because that's mostly read access. I mean, obviously, whenever we run through the Config Wizard, it's going through and building all of that stuff, but after that, when it's actually running, it's just reading out of it.
Do you find that, especially moving the program files off on their own drive, is it like when you— you know when you're working in a page, for example, and we kind of have our own path that we follow through the UI on a regular basis? You have that one novel issue that you're trying to troubleshoot. You ever have that situation where the page--just that one takes a little while to load and you know it's not an IIS caching issue because you did it before? So that page has already been compiled, and it's sitting out in the disk cache. It's going and maybe pulling a DLL it hasn't opened in a while or something else from the program files directory and actually having that on a disk that's optimized for read. And it's not competing with logging or anything else, or IIS logging, or some of the logs that Orion or some of the modules are generating.
Yeah, we make a lot of logs. IIS logs are bad because they're constant. Our logs are almost as bad, in a good way, because we actually trap that information. It's great for troubleshooting. It's great for finding out what your problems are. I had a joke the other day on a Slack channel that I like to tail -f the Config Wizard log. I just watch it go by because it makes me happy to watch it progress. I like shifting those off. Those are a little harder because those get stuck in ProgramData, and in SolarWinds, and then in the sub-folder, and then in the sub-folder, and so on and so forth. Shuffling that information off, so when you're trying to actually read new webpage files like you were mentioning, you have a clean path to one drive to pull that information while still writing to another drive for the log information, does speed things up. You don't have as much— I don't know if the term's contention, but basically that's what it is. It's when you're trying to do too many things simultaneously to a disk and you just run out.
Okay, so that is an awful lot of specific information.
It is. It's a lot of detail.
And I like to believe that you're actually magical.
But I think what you really did was spend an awful lot of time reading documentation because the recommendations that you're making here, although I know that they will definitely improve performance and allow you to scale very, very large, they actually apply to just about anything that runs on Windows, any large application that runs on Windows. So how did you get to these recommendations?
I read way too many documents. Microsoft actually publishes a great set of documents that they call their Performance Tuning Documents, and they're available on their website free of charge. I believe it's on the MSDN pages, if memory serves. I read through hundreds of pages of these, probably around 130 pages specifically on how to do virtualization, how to tune the Windows Operating System itself, how to tune for specific applications and their loads, and how to tune IIS. The IIS one was actually the most interesting read. It turns out that our configuration wizard does a lot of the best recommendations out of the box, which is great, but there's always a couple things you can tweak.
Right. Well, I know the team spent a lot of time, and actually, a lot of that came out of the QA team, and then there were some UX recommendations, like the one— they actually worked with a lot of customers. When you go through the installer and there's that checkbox that says optimize website.
Right? So that's actually— many of you probably know what that's doing behind the scenes. It's doing an IIS pre-compile, and then building all of these class files out from source. Those sorts of recommendations, while it is annoying to sit there, especially—I just did a Network Automation Manager install. So that's basically all the network management products as a single install, and then watching it sit there and compile. I'm sitting there thinking, okay, this is really annoying because it takes five minutes. But the reality is then the website runs faster. So those recommendations actually come, a lot of them, from the same documentation sources for IIS performance and Windows enhancement.
Yeah, and we're doing all those optimizations. We're constantly looking for ways to squeak a little bit more performance out of it. But one thing we've found in the past, me talking to a lot of customers, and I fell victim to this when I was new to this suite as well, is not doing the virtualization properly. For me, that was not reserving enough resources. Microsoft's best practice for anything that you want to consider Tier 1, and we can go back and forth on whether monitoring is or is not a Tier 1 system. [Patrick laughs] It is for me, and sounds like it is for you.
It is. It may not be for everybody, but reserve at least 50% CPU and at least 50% memory. More is better, obviously, but it can impact the rest of the virtualization infrastructure if you do that.
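On Hyper-V, for example, that reservation can be sketched in two PowerShell lines. The VM name "ORION-MAIN" and the 16 GB figure are hypothetical; vSphere has equivalent CPU and memory reservation settings in the VM's resource configuration.

```shell
# Hedged PowerShell sketch for Hyper-V: reserve 50% of the virtual CPUs and
# pin static memory so the Orion VM can't be starved by noisy neighbors.
# "ORION-MAIN" and 16GB are placeholder values; size to your own deployment.
Set-VMProcessor -VMName "ORION-MAIN" -Reserve 50
Set-VMMemory -VMName "ORION-MAIN" -DynamicMemoryEnabled $false -StartupBytes 16GB
```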
Is it about the same for Hyper-V and VMware? About the same?
The best practice requirement documents from Microsoft. I read the ones that VMware published about running Windows, and I also read the Microsoft ones for Hyper-V, and they actually make the same recommendations for high-load servers.
And I used the same recommendation for the cloud scale-out as well, because the requirements document actually has some sizing recommendations for cloud, especially for AWS. So I did the same thing. I basically said, okay, I'm going to multiply that by two so that I'm using 50% of my resources. Then I tested, and I did actually see some real improvement, especially in CPU, for example.
Yeah. You don't think about it.
You don't think about it, but when you think about what's happening on the system—polling, right? You have these— especially when you're polling at scale, you have this intensely complex set of permutations for what is going to be polled when. It's not happening in a serialized manner, especially if you've got agents; you've got polling all over the place. It's a lot of asynchronous activity that's happening on the server. So in those occasional peaks and valleys where all of a sudden those nodes overlap and you have a high level of concurrency, having a little additional CPU can really be a help.
Yeah, and you know, you can actually run without these reservations, and the system will work. But you're going to notice some weirdness, just like what you mentioned, when you have polling times that happen to synchronize. Like, let's say you're polling a bunch of nodes and you have them set to once every seven minutes or something like that, and then all of a sudden you get 300 of them that land in the same sixty seconds. Then you're going to have this need for all this memory and all this processor. But what happens if you don't actually have that available on the host? That's where these reservations come from.
Is that where part of this conversation about— you know, when you talk about performance in a very large environment, or a very complex environment, or an environment with lots and lots of high frequency polling, for example, or lots of data retention. Is there, whether it's the Orion platform or any other application that's running on a server, is there a point where we almost resist the thought of cost of resources? Where maybe we got away with it for a long time, and everything performed well--and we increased scale, and we increased scale and we increased scale--and then the cost of resources actually became something that we were thinking about, where before we didn't worry about it.
But we're virtualized. It doesn't exist.
A billion here, a billion there, and pretty soon you're talking about real money, right? Do you feel like there's a tipping point in there where maybe we're just sort of accustomed to a certain cost and that paying for resources, provisioning resources— I'm saying paying for because I'm thinking of cloud— but provisioning resources where it is natural and it is a linear investment that follows what we've always been doing. So if we go talk to management about it, we're saying well, or we could hire a whole lot more people. So it actually fits within the cost structure that you would expect, especially at the scale that you're at. But do you feel like we have a tendency in IT to try to do what we can in a thrifty way, and so sometimes we're afraid to go have that conversation about resources when really, management would be able to say based on a linear chart of, well, this is how much monitoring you're doing, so I would expect some other cost to increase proportionately, and they'd be fine with it. Do you think sometimes we just don't go ask that question?
I think we do, because we typically think of monitoring systems as kind of an extra, and this is what I've heard from customers, and the way I felt when I first started with it. It was, "That doesn't do the work. I do the work. That just kind of watches stuff." When in reality, it actually does a lot of work, and if you take a small office, 25, 30 nodes, or whatever, and you add that in, what happens if you're already in that 95th percentile for CPU or memory? You're locked in. If it's a physical server, you're locked in even worse, because you've got to buy a whole new rig to put it in. So, I mean, yeah, we need to have those conversations. Thankfully, we actually have the predictive capacity features we're putting in Orion now, so it can watch itself and actually see, "I'm going to run out of memory at such-and-such time," or "CPU's getting high and I'm going to hit it here."
Interesting. You know what's funny; you mentioned sending logs off or doing a trail on logs.
I'm actually doing that to monitor Orion in the cloud because I need a heartbeat. So I was looking at the module engine log, and about every five minutes, it goes out and tries to do a bunch of back-ups. So I send that off to PaperTrail, and then I've got a haven't-seen-it-in-five-minutes error. So if the heartbeat message doesn't come through in the log, doesn't get piped out, then I know that something's wrong with Orion, even if it's offsite where I can't get to it.
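The heartbeat idea itself is simple enough to sketch. Assuming you can get the timestamp of the last matching log line (the log shipping and the alert itself are whatever your log service provides, Papertrail's inactivity alert in this case), the check reduces to:

```python
from datetime import datetime, timedelta

def heartbeat_missing(last_seen: datetime, now: datetime,
                      max_silence: timedelta = timedelta(minutes=5)) -> bool:
    """True if the expected log heartbeat hasn't shown up within the window."""
    return (now - last_seen) > max_silence

now = datetime(2024, 1, 1, 12, 0, 0)
assert not heartbeat_missing(now - timedelta(minutes=3), now)  # still healthy
assert heartbeat_missing(now - timedelta(minutes=6), now)      # alert: Orion may be down
```

The nice property of a dead-man's-switch like this is that it catches the failure modes where the system is too broken to send anything at all.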
Yeah, still got to check on it.
So, I'm always going to want to recommend a lab. Let me rephrase that. I always recommend that you have a lab.
Especially if you're running a large environment. Many of you actually have lab licenses. That's something that's available to pick up through the maintenance team, and they are really, really helpful. So what's the number one advantage in having a lab?
The number one advantage, for me, has always been getting to see the new features and how they'll work with your environment first. Also, to determine whether or not this is going to be a clean upgrade. Because as silly as it is, you can watch your CPU, memory, and disk I/O before an upgrade, upgrade, and see if it's still the same. If it's a complete plateau, you know it's not going to hurt you anywhere. However, if you get a 30% jump in anything, you're like, hmm, maybe I've got to contact support and see if there's a problem.
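That before-and-after comparison can be as simple as averaging a baseline window and flagging any metric that jumps past a threshold. A minimal sketch, where the metric names and the 30% threshold are just the example from above, not anything Orion-specific:

```python
def flag_regressions(before: dict, after: dict, threshold: float = 0.30) -> list:
    """Compare average resource usage before and after an upgrade;
    return metrics that grew by more than the threshold (e.g. 30%)."""
    flagged = []
    for metric, old in before.items():
        new = after.get(metric, old)
        if old > 0 and (new - old) / old > threshold:
            flagged.append(metric)
    return flagged

before = {"cpu_pct": 40.0, "memory_gb": 12.0, "disk_iops": 900.0}
after  = {"cpu_pct": 42.0, "memory_gb": 16.5, "disk_iops": 910.0}
print(flag_regressions(before, after))  # memory grew 37.5%, worth a support call
```

Averaging over a day or a week on each side smooths out the normal peaks so you're comparing steady state to steady state.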
I mean, that's the same way with all software. The reason you have a lab is to handle these kinds of upgrades.
But for me, I would argue that the reason for having a lab is it's a great chance to break things. If you want to really make those big breakthroughs in reconfiguration, let's say, and you're going to try it in production, the chances of something going wrong are very high.
If you don't have a sandbox to play in, you will only make incremental changes to your environment, so you don't get the benefit of actually thinking about and planning the changes you want to make. In your case, looking at the optimization docs and saying, "Hey, I want to do all of those things." You only make these little bitty changes, and you don't actually see the real benefit. And once you've figured out what you want to do, if you don't have a place to play with it, you're not going to gain the skills that you need to make those changes.
Yeah, and that's the big thing about it. The thing I like about our installer and the fact that you actually have a lab, is you don't have to go through all these documents. You don't have to read all the documents I did. You don't even have to take my advice. Everything will install on C, and if it's a small environment, it'll work fine.
I've read all your documents.
Yes, you have.
And I now follow them and I actually do a lot of the scripting to automate a lot of those installs.
Oh, thank you.
No, thank you. [Laughs] You did all the legwork. But those recommendations, do you recommend that people just try to upgrade to that environment?
[Sighs] That's a really gray area. I will probably say no.
Yeah, no. No, you're going to do a clean install.
And then migrate the database to point at the new install to be able to take advantage of that. If you don't have a lab, you're not going to get a chance to practice that. First, back up the database. Restore it to a stand-by server. Then build out your environment, all optimized. Make sure that your scripts all execute, attach it, and see if it works. If you've never done it before, chances are it'll take a few tries; it took me a couple of passes [laughs] to get it to where I could do it, but now I can successfully do it every time. If I hadn't had that sandbox time to really play with it and kind of get some experience, that would've been really difficult.
Everyone jokingly calls it a sandbox, but when you think about it, a sandbox is an ideal analogy for it because when you build a sandcastle, it just takes a breeze and it all falls down. You made a mistake so you build it up again, and you make another mistake and you build it up again. It just allows you to do that over and over again. You were saying you can take a bunch of those little small incremental changes you talked about, and maybe make them one big script that goes through and makes a ton of these changes, and see if that increases your performance, or if you get a little more stability. Or, maybe you can ramp up your polling frequency or something like that.
Right, well, the other thing I like about it, too, is there's probably people on your team maybe who aren't experts with the platform, but they're really sharp. Maybe they're really great at Linux scripting. Maybe they've got some ideas about storage. Maybe they've got some crazy ideas about the way that reports ought to be created. So cutting them loose with admin credentials in the monitoring environment that is central to your operation, especially if you have customized views for management, there's a little risk. So you might hesitate handing over the reins to someone on your team who maybe is a little more junior but has some great ideas. If you have a lab environment, you don't have to worry about that.
Yeah, and that's the big thing. Make sure you have a lab. Try it out. Let people play in there. There's no reason your lab can't also be joined to the domain. There's no reason you can't use Windows pass-through authentication and keep track of who's doing exactly what. All of that still applies. Maybe you just give them a little more rights on that POC box.
Sure. Okay, so then the next question is, now I'm running this parallel environment. I've got my lab. How do I make sure that it is an accurate representation of what I'm going to be building or what I am actually running in production?
Well, that's just it. You should build it identically. There's two ways this can happen. The way I came from is I walked into my company, became a Network Engineer. They said, "Hey, we just bought this software, NPM, and we would like you to take over." I said, "Cool." I started looking at it. I was like, "I really like this," and I didn't realize they literally did a next-next-finish install. [Laughs] So, SQL Express.
On the box, low process account, low memory account, and they said, "It works great," because they're only watching sixteen things, and that's all they tested for. Then they said, "We'll license it." Great, but they put a license on that box.
So it basically ran out of headroom in like six months, because then they say, "Oh, well, we need a year's worth of syslogs, and we need a year's worth of traps." You're not going to get it in that database. It's not designed for that.
So that's where we kind of get the concept of a Proof of Concept. This is just to make sure it's going to work. When you're talking about a lab build, then you've got the other side of it. It's like, this is how I build servers in production. Therefore, if I'm building server X in production, I need to build server X in my lab identically, or as close as I can. Because you're probably not going to have the same SAN for both if you're doing real segmentation.
I'm kind of starting to describe this as the politically safe way to untangle your boss's implementation. [Laughs] You know what I'm talking about?
Yeah, I do actually.
Because when we see you guys at SWUGs or maybe Cisco Live or in person at other events, we are now seeing— Once upon a time, you'd talk to someone and tell them about a new feature. It's like, "Hey, I remember you from a couple years ago, and we've got this great new way that we can untangle interfaces for stacked switches, for example." It would always be, "Great. I'm going to try that when I get home." Well, now a lot of those conversations are, "Yeah, I actually got somebody full-time, and she's dedicated to managing my Orion environment, and I'll let her know." That admin is inheriting someone's already very large and heavily customized environment, and you guys are building some really, really amazing alerts, reports, custom properties, sometimes even customizing the web interface, and it's really amazing customization. If you come into an existing Orion platform, especially one that's been around for a long time, some of them ten years or more, that can be pretty daunting. So one of the nice things about having a lab is you can actually take that install and take it apart safely on the bench. And no one will see you do it while you learn how it's been customized compared to how it was out of the box.
Yeah, and one of the things that we used to do— I actually spoke with my SQL admin at the last company— is we actually took our production database, and every other day we used an SSIS package that would back it up, move it over to the SQL server that the lab was using, and run a script to basically wipe out stuff. And then I just had to rerun the Configuration Wizard, and it was online, monitoring a nearly identical environment. It's actually pretty helpful playing in the lab that way.
All right. So let's talk about one thing briefly. And I promised I wasn't going to talk about SWIS, so we'll only talk about this for one minute.
Automation can be really, really helpful. So what do you mean by that?
For me, automation is building the same machine the same way every single time.
Now for me, and you know this from history, with eleven products installed on one server, as much time as you can save there, the better.
So for me, it's making sure that the machines in my lab are built identically— same disk sizing, same place on the virtualized host. Make sure that they've already got the CPU reserved. Make sure you've installed the same operating system, unless it's time to change that. Make sure you put everything on the same VLANs. The first time, you do it by hand; that's how you learn how to do it.
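One way to keep those builds honest is to diff each new machine's spec against a golden template. This is purely an illustrative sketch; the field names are made up for the example, not a vCenter or Orion API:

```python
# Hypothetical golden build spec for lab machines
GOLDEN = {"os": "Windows Server 2022", "disk_gb": 200,
          "cpu_reservation_mhz": 4000, "vlan": "lab-mgmt"}

def spec_drift(machine: dict, golden: dict = GOLDEN) -> dict:
    """Return every attribute where a machine differs from the golden build,
    as {attribute: (actual, expected)}."""
    return {key: (machine.get(key), want)
            for key, want in golden.items()
            if machine.get(key) != want}

vm = {"os": "Windows Server 2022", "disk_gb": 150,
      "cpu_reservation_mhz": 4000, "vlan": "lab-mgmt"}
print(spec_drift(vm))  # {'disk_gb': (150, 200)}: rebuild it before it bites you
```

Running a check like this after every build is what makes "identical every single time" verifiable instead of aspirational.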
So we'll have a link to those scripting examples, rather than go into them in amazing detail.
Because they're very, very detailed, and hopefully commented well enough that everyone can follow along. If there are questions, please hit me up on THWACK. I have no problem answering.
I was going to close this section by saying the main reason I think automation is handy is because you can get your automation together and hand it to somebody else, and then you go enjoy your weekend, and let somebody else take care of it. It's nice because they can repeat it. All right, last couple of topics. One of them is geographic distribution. Some of the things that drive scale: the speed of light, the sheer number of hops, and a lot of other complexity.
Yeah, and it really depends on what kind of content's going over to that. The company I came from, we had one additional polling engine, and it did take me awhile to kind of convince people it was worthwhile. But what those kinds of things will allow you to do is say, "Oh, hey, I've got a lot of things here that are local, so why don't we poll those from a local one, have that information summarized up, and then sent back to the main poller to be put back into the SQL database."
Okay. [Dramatic beats] There's a number of components, right?
So you've got the primary server itself, or maybe you have multiple servers in the NOC. You've got remote pollers. How does this break down in a global distribution?
The way I would look at that is that we're going to start up here in the states. This is how your environment starts now, and you haven't grown to incorporate these other ones. Maybe these are mergers. Maybe these are just extensions. But you've got one data center where you've got most of your information, and then you got a couple of branch offices. For this one, single polling engine may be enough, depends on the number of elements and how slow the I/O is. If you have a problem here where those two are connecting but you're only on a T1, maybe you need to, kind of, scale back some stuff on the frequency of polling because you're just never going to be able to chew up that T1 the way you would want to.
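You can sanity-check a slow link like that with back-of-the-envelope math. Assuming a rough payload figure per poll (the bytes-per-poll number below is a made-up placeholder, not a measured Orion figure), the steady-state load on a T1 (~1.544 Mbps) looks like:

```python
T1_BPS = 1_544_000  # T1 line rate in bits per second

def polling_utilization(elements: int, bytes_per_poll: int,
                        interval_seconds: int, link_bps: int = T1_BPS) -> float:
    """Fraction of the link consumed by steady-state polling traffic."""
    bits_per_second = elements * bytes_per_poll * 8 / interval_seconds
    return bits_per_second / link_bps

# 2,000 elements, ~600 bytes of request/response each, every 2 minutes
print(round(polling_utilization(2000, 600, 120), 3))  # ~0.052 of the T1
```

The math also shows why halving the interval doubles the load: if the utilization number gets uncomfortable, stretching the polling frequency on less critical elements is the lever to pull.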
Well, and it's also going to affect your decision to do remote poller versus agent, right?
Because if you only have a few nodes, you can just use the agent, which is going to poll locally— all the numbers that come back for latency, for example, are going to be based on that local instance. Then it doesn't matter if it's a T1. It's going to slowly pass the data back. You don't care about the bandwidth to get the data. You just want to make sure that the polling's happening on time.
And also, we actually— for the agents and for the remote pollers, we'll actually kind of summarize that information before we send it back. So, you're actually sending less content over your links. We still don't send a ton of content over links doing the monitoring, but it can add up if you've got hundreds and hundreds and hundreds of elements.
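That summarize-before-shipping idea can be sketched as rolling a window of raw samples up into one record per interval, which is roughly what cuts the on-the-wire volume down. The record shape here is illustrative, not Orion's actual wire format:

```python
def summarize(samples: list) -> dict:
    """Collapse a window of raw samples into one min/avg/max record,
    so a single summary row crosses the WAN instead of every data point."""
    return {"count": len(samples),
            "min": min(samples),
            "max": max(samples),
            "avg": sum(samples) / len(samples)}

raw = [12.0, 15.0, 11.0, 40.0, 14.0]  # five raw latency samples (ms)
print(summarize(raw))  # one record goes over the link instead of five
```

Keeping min and max alongside the average is what preserves the spikes (like that 40 ms sample) that a plain average would hide.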
Well, or just have a lot of geographic distribution. That's one of the things that I like about Network Automation Manager and Network Operations Manager: they both give you as many pollers as you need as part of the licensing, and it's node-based licensing. So when you need additional flexibility, that's an option that you guys may want to take a look at.
And it just kind of works out of the box that way.
So that's the beginning. But let's say you've actually merged with another company. So then, you've got this in between line here where this is another organization. Two ways I would look at this. Do they already have their own Orion instance? They may. Maybe they just need to upgrade a little bit. But if they don't have their own, then I would just put a polling engine in here. Then we talk to the polling engine. Everything comes back, and everyone's happy. Great. If they already have their Orion instance, get yourself an EOC box and talk to the two of them, and that way you get one place to pull your central reports from everything. So that's the easy one. Now if we continue out and we go all the way around the world, then you've got here. This really depends on your latency. This is something we found out when we went to Australia and actually spoke to people. Latency can be a little weird there, depending on what resources they're hitting, whether it's cloud, local, or just something on mainland Europe or into North or South America. So maybe there, maybe you do need a whole Orion instance just to watch that region. Maybe all you need is an additional polling engine.
This is why we encourage people to come to the SWUGs, because we really want to dig in and talk about all of these possible options for them.
I go back and forth on that. For the cloud session, I set up a bunch of systems in Google Cloud in Sydney because I wanted a bunch of latency for the experiment, especially because I got a VPN and I want to actually look at the effects for that. I don't want to talk about NTs, but that was one of the things going back and forth. One of the advantages is if you put an actual Orion system here, you can actually set the language differently. Now you wouldn't do that in Australia, but if you had one in Japan, or you had one in Germany, and you wanted to actually have it run with all the UI in German, and then pull all of that data together into one EOC instance and be able to see it, that's a handy way to do that.
Now you can actually move the reports back and forth between them. So you built a report that's fantastic in Australia, oh, great. Well, I want to import that into EOC so I can use it everywhere. As long as they don't rely on something like certain custom properties, it'll just go straight through and work great.
So there are guidelines for geographic distribution?
There are several guidelines. We actually cover a lot of it in our scalability documentation. I don't want to go through the whole list because as we release new versions, it's constantly being updated.
Why don't we put that in the information as well?
Give everyone something to read in their down time. In fact, those documents will cover lots of different things. One thing we didn't talk about, beyond where to place your polling engines, is additional websites. This is for when you scale the number of people in IT who are actually using Orion. This is something that I love. It's something that the company I worked with before used because it just sped things up. If you can take that IIS load off of your poller, do it.
And it's relatively inexpensive.
Yeah, we don't talk dollars in this, but it's worth every penny.
Yeah, it's really great, especially if you have people where there's a lot of latency. It just makes the web performance a lot better. The same doc will also talk about high availability, and how to think about it--do you want HA for your primary poller, or secondary, or all of it? That, again, is going to be some additional guidance.
Yeah, we cover all of that, in pretty much every scenario. Like I said, the environment I came from had a primary polling engine, a secondary polling engine in a far data center, and a pair of websites in front of the primary, just behind a load balancer. [Snaps] Nice and snappy web resources after that.
Awesome. I'm so glad that you did this. This was really, really great. I'm pretty sure that you're going to be taking questions from the community. We are going to be on chat for two days.
Back to back.
During THWACKcamp. So there's going to be plenty of opportunity to ask questions about it. And I love answering these questions because, for me, it's the same journey as with you guys. It's discovery. It's like, "What can I do?" "How can I help you get there?" So, definitely, definitely ask questions on THWACK if you're not watching us live, or ask us directly in the chat if you happen to be there on the day.
That's right. We're on THWACK every day. So, to summarize this session: at scale, that could be a factor of scope, complexity, geography, or maybe just collection frequency.
Then the other thing: the recommendations in this guide, again, came from lots and lots of experience. A lot of the recommendations actually come from articles that have been posted on THWACK, not just Microsoft technical documents, but what you guys are actually doing in the field.
Yeah, so people that are actually running their own big league environments.
Definitely big-league environments. So, third: we talked about it several times, but definitely check out all the links in the description, and there'll be step-by-step implementation how-tos on how to achieve all of this in your environment.
And don't forget to check out thwack.com/swug and come see us when we're in your city.
That's right. Or, petition us to come to your city. Or, get 20,000 points for hosting your own. All right, well, I think that might be the best advice of all. Thanks again for attending this session. We look forward to seeing you again soon.