Orion in the Cloud: Hybrid IT User Stories and Best Practices
Join Head Geek™ Patrick Hubbard and Senior Product Manager Chris O’Brien for a technical discussion drawn from dozens of SolarWinds THWACK® members who’ve moved their production SolarWinds Orion platforms to the cloud. Learn how they did it, get expert tips and tricks, and view a hands-on demo of how to move your Orion® server to AWS® or Azure®, manage VPNs, and help ensure continuity of your monitoring services.
Hello, and welcome to our session, Orion in the Cloud: Hybrid IT User Stories and Best Practices. I'm Head Geek, Patrick Hubbard. And joining me today is a good friend and Product Manager of Network Performance Monitor, Chris O'Brien.
Hey, it's great to be here, as the network guy talking about the cloud. Does that even make sense? I think you just like having me co-present the sessions you're most excited about.
That's actually true, and it's also true that I've wanted to present this session at THWACKcamp since 2015. And this is the first year that customers--in surveys, and when we talk to you live at conventions like Cisco Live, Microsoft Ignite, and even DevOpsDays here in Austin--have confirmed that a decent percentage of you have either already moved your Orion servers to the cloud, are planning to, or are at least thinking about it. So, in this session, we're going to do a deep dive into how SolarWinds customers are migrating their systems to AWS, Azure, and Google Cloud. And Chris, you're an expert on the two technologies they're going to need--that's the Orion platform, and believe it or not, networking.
Yeah, it's easy to forget, in all the conversations about the cloud, that cloud really is all about networks. Of course, cloud providers try to abstract the network, so you don't have to worry about the details, but while that's great in theory, in practice, even if you have no traditional on-premises data center, you're still dealing with physical access, VPN tunnels, delivery infrastructure, critical WAN failures-- all of that still applies.
That's all true, and you're also going to end up learning, as part of this process, more about VPN performance troubleshooting than you ever wanted to know. And then there's the monitoring itself. Application performance monitoring, whether it's in the cloud or otherwise, is really all about interfaces and protocols, and all kinds of out-of-band communication. And that's true regardless of which of the two fundamental approaches you take to APM.
It's amazing how, even with cloud-native apps, monitoring still seems to come late in the development cycle. It's a little better with more Dev in DevOps, because you have to understand what's going on to react to it. But I'm still surprised how often we're still monitoring packaged applications or SaaS that smell like repackaged, third-party apps with protocols that aren't clean APIs. Assuring all of that works when the platform hides access to the network takes some getting used to. Two fundamental approaches to APM--tell me you're not going off into cloud-native monitoring.
I'm turning into a little bit of a cloud-native guy, but you could approach it this way. Some would say that you could either monitor applications by watching all the elements of the infrastructure; others suggest that dedicated tracing is the only way to assure that users are happy. All we know is, you have to learn both.
I know you're trying to be funny, but you really are turning into Jeremy Clarkson.
And that would make you the Stig, and I'm okay with that. Okay, so let's get on with it. We're going to be going in and out of a lot of UI in this session, whiteboarding topologies, and covering setup, config, and best practices.
We know it's going to take a lot of information so, of course, this session will be available for replay. We'll have links to some of the tools and how-to guides, all of that stuff you'll want to review when you're doing your migration planning.
Awesome. So, you want to start with the topology that we're going to be using, and then get in the best practices?
Yeah, I like topology.
All right, so let's take a look at this. What I've got here is the environment that I set up for this demo--so there are maybe 80-ish different components that are configured. And I've got three main pieces, right? I've got an AWS environment that's running in the US East region in Virginia. I've got Google Cloud Platform that's actually running in Australia--I thought, you know, latency would be something that would be fun to experiment with, and Sydney should generate plenty of it for us. And then I've got Azure running in California. And I don't know if you've ever seen this many icons and gifs all in one place.
Yeah, you just keep adding people's logos until you look competent--that works.
Yes, that's kind of it. But the other thing too is, remember, documentation is going to be a big part of migrating anything to the cloud. It doesn't matter whether it's Orion or anything else. So one of the great things is that all three of these providers encourage you to download these templates. So, if you just google Amazon PowerPoint--
You will literally get-- I think it's 35 slides with usage guidelines. And they're clear about what you can and can't do, but they want you to use these icons. Same thing for Google, same thing for Microsoft. And Microsoft actually throws in all kinds of other icons, because there's a little bit more kind of lift and shift, and so if you're running something else, they want to be able to show that to you. But it's also handy because many of you, according to the last survey, are actually multi-cloud. I didn't throw IBM Bluemix in here, you know, just because that would be too much. But when you have multiple environments, it's handy in your diagrams to use the vendor iconography that actually maps to each environment, so that you can, at a glance, remember where you are.
So, like, you know, if you're doing DNS debugging, it's kind of handy, when you go in and out of the different colors and icon sets, to know immediately what you're looking at.
Yeah, that makes sense.
So, we've got here-- just sort of in general, I'm treating AWS the way that most of us do. We were sort of early and first with AWS, and so there's a little bit more kind of lift and shift. Something that maybe would have been running in a VMware VM is now running as an EC2 instance--and then a few services, so Route 53, RDS, and a little bit of Lambda.
And you have Docker?
I've got Docker. And this is a build it yourself Docker Swarm, right?
Because if you worked early with Docker, you would have sort of come to Swarm, as opposed to using, like, the container-management service, where it's managed for you. So, in this case, I set up a Swarm. There's a controller, and there's an API to get at it that's actually pretty primitive, and we'll see what that looks like. And then I've got my Orion server and the Orion SQL Server instance. We're going to talk about that and budget, and a couple of other things--deciding, you know, whether it's actually a bring-your-own-license version or RDS. And then, connecting that to the Google Cloud Platform through their VPN virtual gateway. And then there's a virtual gateway over here, corresponding with AWS. AWS gives you two tunnels out-of-the-box, which is kind of nice. They encourage you to set those up. And then Azure, of course, the same thing. We have our virtual gateways. These are all IPsec links, and then the other one that's kind of weird is this one between the Azure instance down here, and this little strongSwan VPN, that's a--
And that's like third party VPN?
Yeah, it's like a VPN appliance--like OpenVPN. There's a whole host of them, I think, in the AMI library for Amazon. I think there are about 60 different appliances. Some of them you build yourself, and some of them you pay for, but--
Why not just use the native one, like you had in the others?
Because that doesn't seem to be easy. I don't know why AWS and Google will speak to each other, and Google and Azure will speak to each other, but Azure and AWS-- oh wait, those are number one and number two; they don't exactly assimilate.
And maybe I just misconfigured it, but that seems to be the way that most people are doing it. And a lot of them will actually use one of the VPN appliance images that are out there--and a number of those images suggest to me, especially in their documentation, that they're intended especially for, like, multi-region setups--
Because remember, each one of these regions is effectively its own cloud. So people will say, "Oh, I'm not multi-cloud, I'm only AWS." Oh, you don't have any replication? "Well, we're in two regions, and three different availability zones." It's like, okay, you're multi-cloud. So all of that communication--and when we get into it, we'll be talking about MTU in just a minute--that's a big part of that. And latency's a big part of that, because applications are not necessarily designed to be holistically monitored in an environment like this.
Okay, so that's what we're going to talk about. And then the only other thing is over here in Google Cloud. This is a little bit more modern--so this is a Kubernetes cluster that's actually running on Google Container Engine. So, it's a managed service providing access to all the containers.
Okay, and just to orient myself for the Orion platform stuff, we've got the Orion Server sitting in AWS.
Orion Scalability Engine, like an additional polling engine?
In Google Cloud.
I use the fancy word for it.
Okay. And the Orion Scalability Engine also in Azure West.
That's right. Remote polling is a big part of that.
And the number one recommendation to think about here is, as you transition, you will probably want to have remote pollers.
You can certainly use agents to do a lot of what we're doing here if your environments are small. And if you haven't taken a look at Network Automation Manager or Network Operations Manager, take a look at those, because there's a little bit more flexibility--there's no base licensing, and you can have as many remote pollers as you need. And you're briefly going to have all of your existing pollers in your infrastructure, and then kind of press out, as we talk about topology here. So, let's talk about the main approach here, right? So, why do you think--the customers that you've spoken to who have migrated, whether it's SolarWinds products, but especially the products that we monitor--why are they moving to the cloud?
Cost is often a big piece of it--optimizing their costs. There's an idea, at least, that costs will be lower in the cloud. Performance is a common driver as well. When there's a big variation in the workload, those sorts of spiky workloads, it's often easier to handle in the cloud, or cheaper--probably both.
So, those are some of the common ones.
Whose idea is it typically? Is it IT's idea, or are they being 'voluntold?'
It's usually not, is it? It's usually voluntold. Like CIOs, those sorts of people, they know where the world is going, and we need to get there as well.
Right, and the budget, it won't be double the budget for a period of overlap during that transition, it's just--
Well, it's cheaper.
It's cheaper, but it's not instantaneously cheaper. And then, what are the typical sort of access network issues that you encounter? Do people tend to use VPNs? Do they tend to use direct connects? Do they build something where they can take advantage of BGP, for example, and actually have what is essentially a distributed access network that pulls all of those together into a single network? Or is it a lot of, kind of, hub and spoke?
We're seeing a lot of VPN, just as you have here. A lot of VPN tunnels. And then connectivity beyond that is handled by the enterprise--usually just one or two VPN tunnels for redundancy to the big hubs they have in the cloud.
Right. That was one of the things that was kind of interesting talking to customers--a number of you are using a direct connect. This is sort of the, let's say, second wave of deployment--where it's almost, you know, kind of rogue IT, or someone set it up, and VPN was something that we as network administrators got asked to implement. And then, when they finally get big enough, they say, "This is ridiculous. My cost for operating the VPN--the bandwidth, the rest of it--is much higher than it would be to have a direct connect." But then you use direct connect, and now you've got a router and probably a switch sitting in a cage somewhere in a colo. So although that connection from the colo provider is all magic and managed for you, now you've still got more distributed hardware.
And that's one of those where they're like, "Oh, I have to have redundant connections to the cage, so that I can fail over to my primary link?"
Yeah, in my cloud deployment.
In my cloud deployment, that's right. So, the basic steps for this are about what they look like for any Orion redeployment. I've got a doc here, and I'm just going to use this. So if you haven't seen this before--some of you, especially if you've been using the Orion platform for a few years, have probably moved it.
Certainly, you know, as part of your hardware refresh. But when you're moving it to a different network, there are some extra considerations. So, out on the Customer Success Center, there's a guide for this, as there is for everything else. And we'll put the links to all of these guides in the description. So one of them is: okay, we're going to migrate to a new IP and hostname.
Yeah, this is actually a good chance to plug these. There's a whole bunch of different migration guides--like, this one's specific to "I want to migrate to a new IP address, a new hostname." That's a very specific scenario. We have several of those migration guides, and they make it really easy, step-by-step.
Yeah, what are you trying to do, and then read just a page or two about what you're trying to do, instead of a very long document.
So this is the guide, and I will net out what the steps are here first. One: prepare your new hardware--and in this case, it's new virtual hardware, or it's somebody else's hardware, right? So, take a look at the sizing guides for a new install, which are also in here, and they will actually give you specifics for installation--recommendations for Azure and AWS. They are surprisingly similar to the requirements that you would have anywhere else, and you can actually map those to the instance sizes.
Yeah, it's almost like Azure hasn't invented its own CPU and RAM systems, so it's very compatible.
It is. And the other nice thing, too, is you can just shut down the instance, change the instance type, and check the performance. So you can actually do a little bit of performance testing. So the first thing: set up the new environment, get it installed, get all the bits where they need to go. Then you're going to go into your existing Orion platform install, and you're going to release all of the product licenses.
If you haven't seen that, the way that you get there is you go into 'Settings,' 'All Settings,' 'License Manager,' which is down here at the bottom. Okay, so in License Manager, select the module that you want to remove, and then click--
Yeah. And then click, 'Deactivate.' Oh, I forgot something. Before you do that, make sure you come over here, and grab this license key. Just throw this into a spreadsheet, and make sure you've got two copies of it before you start deactivating this. You can reactivate it here. The main thing is, just make sure you keep your key, because it's a handy way to do it. You can pull the key off of the Customer Portal, if you lose it. It's not a one-time deal, but it's just eas--
Doesn't it feel better to copy/paste sometimes?
It does. Or just grab the whole page, and that works too. Okay, so you’re going to deactivate all of them, right?
Then, you're going to go back to your new environment. Assuming that you're going to move the database--there's a separate guide for migrating the database, and I'm going to talk about the database next, because there are some special considerations with cloud--you'll relocate the database, if that's what you're going to do. There is one change that you're actually going to want to make--and, of course, back up your database first. There's a script for it right here. Log into the database and actually execute these scripts. These basically take some of those really tight bindings of subnet and IP and hostname out of the Engines table and a couple of the other tables, so that it gets a clean start. Now, in terms of the existing pollers--your remote pollers--chances are you're going to be connecting to the remote pollers, which used to be adjacent in your on-premises network. Now, you're going to be connecting briefly back, and this is also assuming that you're drawing down the size of your internal network, or at least your data center. Your delivery network's still the same, but the data center's getting smaller, which is the whole reason that you're making this transition.
You've got to move something to the cloud.
Yeah, you've passed that tipping point now where you're like, well, more than half of my stuff is now operating in the cloud--I need my monitoring platform to be adjacent. And that's actually... It's interesting talking to you all at conferences, and at SWUGs especially, because I thought the tipping point would be sort of in the middle of the natural curve, right? It would be about 50%. And some of them are actually saying, once they realize the momentum for change--once they see that the business has actually bought in on cloud, and they can see that it's not going to stop--a lot of them are actually getting ahead of it. Because moving the monitoring first--we talk about monitoring as a discipline--but moving that first, having the dashboard first, means that they can more immediately get accurate performance measurements of relocated applications. Instead of: well, now my monitored app is behaving differently than it did. Is it because I've moved it? Or is it because now I've introduced latency or complexity into that monitoring path?
Yeah, that makes sense. And one of the key pieces of information people use to troubleshoot is the data from your monitoring server--so having that local makes sense.
Yep. So they tend to be at about that 30% total transfer--that seems to be the point, about a third in, once they feel the momentum. But anyway--so you execute a couple of scripts that are listed out here. Those will take care of moving the primary engine database records, and then you're essentially going to rerun the Configuration Wizard. It will reconnect to the database, and then you're going to relicense. So take the licenses that you cut and pasted into your spreadsheet, drop them back into the License Manager, and you should be pretty much good to go. Again, it's documented; it's pretty straightforward. So, that's the first step. And just consider this as you would any move.
The other piece of it is, for example, if you're doing NetFlow, there are some other considerations, right? Like, you might want to leave your NetFlow collector where it is in your on-premises network, if that's where most of your network traffic is.
Yeah--for one, it isn't encrypted.
Right, well, that's true. And it's also going to keep your network pretty busy. And because NetFlow is still UDP, adding additional links and the VPN into that route may not be what you want to do. In the long term it may be, as things get simpler--I mean, it may essentially turn into a set of access points and access networks, and that's about it, in which case you might want to go ahead and redirect it. But do think about where you'll relocate that collector, and the access. One thing I do want to mention here--and I'm going to come back to this architecture diagram for one second--is, does latency have any effect on MTU? And I asked the father of NetFlow this question.
MTU? I think there's an interaction between them, right? Because MTU has an impact on whether fragmentation is used, and fragmentation yields more packets to which latency applies.
Mm-hm. Well, that's exactly it. And what I found was that regular, you know, kind of default Windows Server 1,500-byte packets break down a lot on many of these IPsec tunnels. And I thought at first it was me. And then I started thinking about it, and I realized that we were getting packet fragmentation, and that with the latency, the retries then take a lot longer. So if your retry is a millisecond or less, or something in that range, you take the window size down to figure out what's going to reliably transmit. That's one thing. But applications that are already pretty inefficient with retries, like HTTP and some other protocols--bad.
Yeah, and one of the challenges, any time you're developing applications that work across the WANs, you have to be very careful about number of roundtrips. Because roundtrips, where roundtrip number two waits on the result from roundtrip number one, ends up as a multiplier for your latency number.
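Chris's multiplier is easy to put numbers on. This is purely illustrative arithmetic--the function and its name are ours, not anything in Orion:

```python
def total_latency_ms(rtt_ms: float, dependent_roundtrips: int) -> float:
    """Each roundtrip that waits on the previous one adds a full RTT,
    so N serialized roundtrips cost N * RTT before the final response."""
    return rtt_ms * dependent_roundtrips

# A chatty exchange of 20 serialized roundtrips over a ~170 ms link
# spends 3,400 ms just waiting on the wire.
```

That's why a protocol that batches its requests degrades far more gracefully over a high-latency tunnel than one that serializes them.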
That's right. So, I did a lot of experimenting. And it's interesting, this connection between AWS and Google. Now, you would think, this being in Australia, that this link would at least be more durable, because it's a double link--it's got some redundancy, and it's between two pretty large providers. On that one, I could get maybe 1,200 bytes through in a coherent way--and with nearly 170 milliseconds of latency, those retries are taking longer.
1,200 bytes inside of one packet.
Yeah. But this one, through the kind of homegrown VPN appliance--on that one I can do 1,490-ish.
Yeah, so it may well be that the transit between AWS and Google Cloud, versus between AWS and Azure, just has a different minimum MTU, right? Because path MTU is all about the point along the way that has the lowest MTU--because that's your constraint.
Right. Well, I thought that too. And I did a lot of reading, and it does turn out that certain of them--whether it's the strongSwan appliance or Google's virtual gateway--do have a certain minimum MTU. But I started tracking it with a little script and started watching, and they were wobbling.
And they wobbled within a small zone. And I'm still trying to understand that, but--
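A tracking script like the one mentioned here can be little more than a don't-fragment ping wrapped in a binary search. This is a hedged sketch: the `-M do` flag assumes Linux iputils `ping`, and the probe is injected into the search function so the logic can be checked without a live tunnel:

```python
import subprocess

def probe(size: int, host: str) -> bool:
    """One ping with the don't-fragment bit set (Linux iputils syntax);
    True if a payload of `size` bytes makes it through unfragmented."""
    return subprocess.run(
        ["ping", "-c", "1", "-M", "do", "-s", str(size), host],
        capture_output=True).returncode == 0

def path_mtu(can_pass, lo: int = 576, hi: int = 1500) -> int:
    """Binary-search the largest payload size for which can_pass(size)
    is True. The probe is passed in so the search also works against
    a fake in tests."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if can_pass(mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best

# Live use (needs a reachable tunnel peer, address is a placeholder):
# print(path_mtu(lambda s: probe(s, "10.8.0.1")))
```

Run it on a schedule and log the result, and the "wobble" Patrick describes shows up as the returned size drifting between polls.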
Could be Multipath.
It might be Multipath, and we're going to get into that in a second--and into using NetPath to figure that out. But in this case, obviously, this is three clouds. Normally, you would have one or two cloud providers, and then maybe remote offices--or, especially if you're a large environment with multiple data centers, you're probably consolidating those. And cloud is one of those things that's been sold to you: oh, we're going to solve this distributed data-processing environment that you have; you'll just have regions wherever you need them in this wonderful holistic environment. Well, you're just trading latency from one place to another--so that doesn't go away. But where this gets really interesting is DNS. If you're coming from all on-prem, or mostly on-prem, then you probably have a pretty coherent naming service. And one of the things that--I don't know why it didn't occur to me before--is that each one of these has a proprietary mechanism for managing its internal DNS.
Yeah. And so naming, especially for third-party packaged applications, whether it's a hostname or an IP, can get kind of complicated. For most--like, if it's an Oracle database or something else--they tend to stick to maybe an IP address and a hostname, something that they got from somewhere. So worst case, on an app like that, you modify the hosts file if you have to, and you do it once, and you move on. But the point of cloud--especially up here with the Docker Swarm--is that it's an elastically provisioned resource, right? It gets bigger and smaller depending on the workload. Well, I can't go in and modify the host configs so that it knows where to send its agent logs. These are actually all agent-based pollers, and they're talking to the primary Orion server, right? Well, if they don't know how to route that, they're not going to be able to talk to that server, and the fallback address may or may not work. So what I did was use Route 53. Route 53, as it turns out, can do not only external geo-routing, but it can also provide internal VPC routing. And then you can set up a BIND instance, and I have both my Google and Azure environments actually exposed to that, so it provides internal routing for all of those addresses. And the reason that's particularly important when you migrate your Orion platform is that Orion is a device for discovering hostnames. That's what it does. It discovers lots and lots of hostnames, and lots and lots of IP addresses. And depending on the protocol--not so much NetFlow, but IP address management, and especially some of the polling for services on Orion: Windows services, Linux script monitors, the rest of it--narrowing down the number of possible IP changes and host mappings that you have to set up will make all of them operate better.
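The Route 53 side of this is scriptable. The sketch below uses the real boto3 `change_resource_record_sets` call, but the zone ID, record name, and IP are placeholders, and it assumes a private hosted zone already associated with your VPCs:

```python
def upsert_record(name: str, ip: str, ttl: int = 60) -> dict:
    """Build the ChangeBatch for an UPSERT of an internal A record.
    Kept separate from the API call so it can be inspected offline."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }]
    }

def publish(zone_id: str, name: str, ip: str) -> None:
    """Push the record into a private hosted zone (IDs are placeholders)."""
    import boto3  # assumed available wherever this runs
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id, ChangeBatch=upsert_record(name, ip))
```

Something like `publish("Z0HYPOTHETICAL", "orion.internal.example.", "10.0.1.5")` lets elastically provisioned agents resolve the Orion server by name, no matter which cloud they spin up in.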
I mean, with the Windows protocol that's used between the remote pollers and Orion itself, I was counting three or four different hostnames. So DNS matters in any cloud environment, but especially for monitoring--regardless of the monitoring tools or the vendor that makes them, naming services are one of those things that you need to plan for and not overlook. Okay, so one thing that's different here is-- Have you ever heard of the term 'cloud inversion?'
So the idea being: that point where you take services that were-- not the kind of really cool application deconstruction into cloud-native services sort of cloud, but actually just lift and shift. Like moving workloads and moving packaged applications.
Move that old crusty application into the brand new.
Yeah. Where they're still crusty and they still smell, but at least you can't see them or smell them. It's almost a way of--it's not just turning it upside down, it's not just a relocation to a different data center, because a lot of things become opaque. And at the same time, you have the challenge of monitoring a lot of new things. And so that's where you're going to start to experiment with new tools. So what I want to do here is come back over to our Orion instance, and I'm going to go to the application dashboard, because we haven't been there yet, and we'll talk about how to use SAM. SAM is your best friend when monitoring cloud, for a lot of reasons. And I know that I get on my DevOps soapbox occasionally, and I did start as a developer, but I have got to encourage you to start playing with script monitors if you haven't--because there is not anything that you cannot monitor with SAM, if you're willing to do a little bit of scripting. What I'm going to show you here is actually based on Python instead of Bash, but that's just because I like Python better. So when we talked about strongSwan, that kind of junky thing that I built--it's kind of nice to know if that tunnel is up, right?
And although I have APIs that I can use for Google's and AWS's virtual gateways, I don't have that here, so it's effectively blind to me. So the way that I would normally monitor this is I'd come over here to PuTTY, I'd pull up the appliance itself, and then I'll do sudo... All right. So I've got my configuration information for the tunnels that are established here. I just called this one Azure, and it's also telling me the protocol. And then down here it's giving me my bytes in, 'packets in' and 'packets out,' and a little bit of other information, like how that tunnel is configured, and what the inside and outside gateways are, right?
That is not a good way to monitor that.
But what is a great way to monitor that is what we see here. So here it is, monitored inside of SAM. So, here's the data that we were looking at before, right? I can see bytes in, bytes out, route--so that's, like, your interesting traffic--and status. So here are your local and public IPs--or the peer IPs, is that what that is?
And then the tunnel name. And if there were multiple tunnels, it would list those for me. And then over here, I've got a nice chart, so I can actually see what my in and out traffic looks like. That's just much more convenient. And I can alert and report and do all kinds of amazing things on that. And the way that that was built, as you might guess, is with a custom monitoring template. We'll take a look at what that looks like. So, I just created a template so that I can apply it to multiple machines, because chances are I'd have lots of them to back up. This one uses a script monitor. And a script monitor, as you well know, is a script and a little bit of other information. In this one, the connections are based on SSH private keys instead of using a name and password. And then I have a script, and the script basically executes that command that we saw before, and returns the results in the standard format that SAM expects data to come back in. So if you look here, there's a message for bytes in, and a statistic for bytes in--so one of them is essentially the description, and then the data: description, data, description, data.
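As a concrete, simplified illustration of that "description, data" shape: strongSwan's `ipsec statusall` prints per-tunnel `bytes_i`/`bytes_o` counters, and a script monitor just has to reshape them into Message/Statistic pairs on stdout. The sample text and component names below are illustrative, not a drop-in template:

```python
import re

# Illustrative `ipsec statusall` fragment (real output carries more fields).
SAMPLE = "azure{1}:   3622056 bytes_i (41212 pkts), 2874110 bytes_o (39001 pkts)"

def to_sam_output(status_text: str) -> str:
    """Reshape tunnel byte counters into the Message/Statistic line
    pairs a SAM Linux/Unix script monitor expects on stdout."""
    m = re.search(r"(\d+) bytes_i.*?(\d+) bytes_o", status_text)
    bytes_in, bytes_out = (int(m.group(1)), int(m.group(2))) if m else (0, 0)
    return (f"Message.BytesIn: tunnel bytes in\n"
            f"Statistic.BytesIn: {bytes_in}\n"
            f"Message.BytesOut: tunnel bytes out\n"
            f"Statistic.BytesOut: {bytes_out}")
```

SAM picks each `Statistic.<name>` up as a chartable, alertable value, which is exactly what the bytes in/out chart on the dashboard is built from.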
That makes sense.
Yep, and then this one right here, when you create these-- so like in this case, that byte counter increases over time-- is basically just a counter.
There's a check box when you create it, which is 'Count Statistics as Difference.' Set that to 'true.'
Yep, just like we do with interfaces.
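Conceptually, 'Count Statistics as Difference' turns a raw counter into a per-poll delta, the same way interface counters are handled. A sketch of that behavior, including the wraparound case (the helper is ours, not SAM internals):

```python
def counter_delta(prev: int, curr: int, width: int = 64) -> int:
    """Report the increase since the last poll, allowing for the
    counter wrapping past its maximum (as interface byte counters do)."""
    if curr >= prev:
        return curr - prev
    return (1 << width) - prev + curr
```

Without the wrap handling, a counter rollover between polls would show up as a huge negative spike instead of a small positive delta.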
Just like with interfaces. Well, what's really cool is, with cloud, you're going to be monitoring a lot of different technologies that maybe are new to you, right? Maybe a LAMP stack--a lot of other things that you're going to want to go ahead and include. So one of those, for example, might be Docker--well, in our example, Docker Swarm, right? So, I sort of did that thing--set up that first Docker environment that was monitored by hand, and then converted it to Swarm, added a bunch of nodes, and made it elastically provisionable. Well, I'm using that same approach here, where I'm actually using the Docker command line and a custom monitor, and it's coming back with the number of containers, the number of nodes, the state that they're in--some state other than running or shutdown--and you can see that it's changing over time, and I can make changes. Let's actually do that. We'll come up here to a little portal manager that I've got, and we'll go to our service and just scale this thing up--and then we'll come back to this and watch the numbers on the chart actually go up. I'll scale it up to 10. And again, the great thing there is that this environment is going to take care of managing that for me--so I don't actually have to care about that. But this is an example of one where I had to kind of build it myself.
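The Swarm numbers charted here--containers running versus desired--can come straight out of the Docker CLI. A hedged sketch that parses `docker service ls`-style text; the sample output and column positions are illustrative:

```python
# Illustrative `docker service ls` output (real output varies by version).
SAMPLE = """\
ID        NAME    MODE        REPLICAS  IMAGE
x1y2z3    web     replicated  7/10      nginx:latest
a4b5c6    agent   global      3/3       fluentd:v1
"""

def swarm_totals(service_ls: str) -> tuple:
    """Sum running/desired replicas from `docker service ls`-style text;
    returns (running, desired) across all listed services."""
    running = desired = 0
    for line in service_ls.splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 4 and "/" in parts[3]:
            r, d = parts[3].split("/")
            running, desired = running + int(r), desired + int(d)
    return running, desired
```

Emit those two totals as Statistic values and the chart climbing toward 10 after a `docker service scale` is exactly what the demo shows.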
Because it's an old Docker Swarm that I set up. Smarter would be if I wanted to do something like Kubernetes, or use the container-management service from AWS to take care of that for me. So with Kubernetes, the way that looks is there's a pretty clean API for monitoring it. So here is a Kubernetes monitor for the Kubernetes cluster running in Sydney in that Google Cloud instance. And so, you don't like my name here?
K8S? What does that mean?
All the cool kids are saying it, 'Kates.'
Okay, well, SolarWinds is then 'Sates.' But yeah. [Laughs] Okay. So for this one, I'm using a couple of different charts. The data that's coming back is, again, all of those components that we saw. But this API's a little richer, and it gives me memory, so I can actually walk through each one of the pods, each one of the containers, and actually roll all that up.
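Rolling per-container usage up into an aggregate like the one on this dashboard mostly means parsing Kubernetes quantity strings ('250m' CPU, '64Mi' memory). A sketch with hypothetical field names--a real poller would read these from the cluster's metrics API:

```python
def parse_cpu(q: str) -> float:
    """Kubernetes CPU quantities: '250m' is 0.25 cores, '2' is 2 cores."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_mem_mib(q: str) -> float:
    """Memory quantities like '64Mi' or '1Gi', normalized to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return float(q[:-2]) * factor
    return float(q) / (1024 * 1024)  # plain bytes

def rollup(containers) -> tuple:
    """Aggregate per-container usage: (count, total cores, total MiB).
    The 'cpu'/'memory' keys are assumptions for this sketch."""
    return (len(containers),
            sum(parse_cpu(c["cpu"]) for c in containers),
            sum(parse_mem_mib(c["memory"]) for c in containers))
```

Feed the three totals back as Statistic values and you get the "12 containers, 2.6 cores, 744 MB" style of summary Chris reads off the screen.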
So, this is an aggregate. You have 12 containers using 2.6 cores across all of them and 744 megabytes of memory.
Megabytes of memory. That's right. Number of nodes. And then which ones are in a 'not ready' state. What 'not ready' means is that when you make changes, it can take a little while for the reporting to roll back up to the control node. That way, I don't panic if I make a change and it takes five minutes before I start seeing data. I know it's just not reporting yet, so that number might temporarily bump up. And then I just broke this into a couple of different charts, right? This one up here is the same thing we saw for Docker-- containers, nodes, and status-- but I've also got my memory as a chart, and my equivalent cores as a chart, too. So this comes into cost and a couple of other things, but again, using the same approach as with the VPN and anything else I can get to on a command line, I can pull it into SAM. One thing that is a little bit different-- you might ask, "Well, you're monitoring memory in Kubernetes, why aren't you doing that in Docker?" Well, I could through the API, but remember that you've got built-in monitoring for AWS. And so I'm just using that, right? So that makes sure--
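Rolling pod metrics up into the aggregate numbers on the chart might look like this sketch. The input mimics the resource quantities the Kubernetes metrics API reports per container (CPU like `250m`, memory like `128Mi`); the simplified field names are assumptions, not the full API schema.

```python
# Sketch: aggregate per-container Kubernetes resource quantities into
# cluster-wide totals (equivalent cores and MiB of memory).
def cpu_cores(q):
    """Convert a Kubernetes CPU quantity ('250m' or '1') to cores."""
    return int(q[:-1]) / 1000.0 if q.endswith("m") else float(q)

def mem_mib(q):
    """Convert a Kubernetes memory quantity ('128Mi', '1Gi') to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return float(q[:-2]) * factor
    return float(q) / (1024 ** 2)  # plain bytes

# Hypothetical usage snapshot for three containers
pods = [
    {"cpu": "250m", "memory": "128Mi"},
    {"cpu": "1",    "memory": "512Mi"},
    {"cpu": "400m", "memory": "104Mi"},
]
total_cores = sum(cpu_cores(p["cpu"]) for p in pods)
total_mem = sum(mem_mib(p["memory"]) for p in pods)
print(f"{len(pods)} containers, {total_cores:.2f} cores, {total_mem:.0f} MiB")
# → 3 containers, 1.65 cores, 744 MiB
```

The aggregate is what gets charted; the per-pod walk is just a loop over whatever the API returns.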
Less to maintain. This all works out-of-the-box.
Less to maintain, works out-of-the-box, and it's also giving me things like volume information and related events that I otherwise wouldn't get. And remember, it lets me kind of kill off bad actors, right? Because I can come into my management portal right here, go to the 'Cloud' tab, and if you can't reboot an instance through the command line-- or you just don't feel like it-- you can actually stop and start, in fact terminate, instances right here, too.
So action for a bad actor.
For a bad actor. So yeah. That combination of custom Linux monitors and the built-in AWS monitoring makes that really easy. Now, you might want to start thinking about some other new tools, right? Like, I use Papertrail a lot. I don't know whether you've worked with it at all.
So Papertrail is a log aggregation service-- think of it as a giant syslog in the sky.
So all of those workloads, those stress workloads that I'm spitting out in Kubernetes and in Docker-- that's something like 100,000 messages an hour into Papertrail here, right? And a lot of times I don't know, when I migrate applications to the cloud-- and especially when migrating Orion itself-- how it's going to behave. I might see novel issues. And capturing those logs-- well, I don't have to log in through the opaque interfaces of the VPC if I have all the logs where I can get at them. It's really handy.
You can aggregate from multiple clouds, as well as from the machines themselves-- all of that stuff in one spot.
That's right. But it also lets me alert. Because one of the things that you need to worry about is if I now move my primary Orion poller to the cloud, and I lose my VPN connection to my on-prem network, how do I know it's still alive?
I kind of want to know that, right? So, two things. One, you're going to want to set up a dial-up, out-of-band VPN connection just through a gateway. They charge those by the hour, and they're pretty handy-- that way, at least you can use your mobile app to get to it. But the other thing you can do is use the events of Orion itself to raise alerts if Orion goes down. You can use Papertrail for that-- the free tier, even, if you want. So, here are the events that are coming from that Orion poller, right? And I was trying to think, where could I get a heartbeat? And it occurred to me that the business layer engine, the main service, does some backups every five minutes or so.
And it sends a message to tell me that it's doing a backup. And when things break, this service stops. So I'm sending that one log, using just the regular log forwarder, to Papertrail. Then I set up an alert on it-- an 'Orion running' heartbeat. It runs every 10 minutes, and then it alerts when--
No new events match in that 10 minutes. And then go ahead and send events.
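The heartbeat check the saved search is doing can be sketched in a few lines: given the timestamps of matching log events, alert when none landed inside the trailing window. This is an illustration of the logic, not Papertrail's actual implementation.

```python
# Sketch of a heartbeat alert: fire when no matching log event arrived
# inside the trailing window (here, 10 minutes).
from datetime import datetime, timedelta

def heartbeat_missed(event_times, now, window=timedelta(minutes=10)):
    """True if no heartbeat event landed inside the trailing window."""
    return not any(now - t <= window for t in event_times)

now = datetime(2017, 10, 18, 12, 0)
# Last backup message arrived 12 minutes ago -- outside the window
events = [now - timedelta(minutes=m) for m in (25, 18, 12)]
print(heartbeat_missed(events, now))  # → True
```

Invert the scenario (an event five minutes ago) and the check stays quiet, which is exactly the behavior you want from a dead-man's-switch style alert.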
Send events, right. Now, if I'm going to send events, I can't rely on email. I probably want to send an email too, but I can't rely on it, because if my connection's down, I'm not going to get anything. So one of the things you'll want to do, too-- and check out our lab episode on the integration of Orion alerts with Slack, because that will show you how to use REST and third-party services to send messages out. One of the things I really like is Pushover. That's an app, free to use, and you can actually define apps. It's integrated with iOS and Android, so you'll actually get an alert to your screen, and you can push those events out that way. So again, out-of-band monitoring of the monitor becomes really important when you push it out of your building.
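Pushing an alert out-of-band through Pushover's REST API is a small amount of code. This sketch only builds the request; the `APP_TOKEN`/`USER_KEY` values are placeholders you'd replace with your own credentials, and the actual network call is left commented out.

```python
# Sketch: build a Pushover notification request for an out-of-band alert.
# Endpoint and form fields follow Pushover's documented /1/messages.json API;
# token and user values below are placeholders.
from urllib.parse import urlencode
from urllib.request import Request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"

def build_alert(token, user, message):
    payload = urlencode({"token": token, "user": user, "message": message})
    return Request(PUSHOVER_URL, data=payload.encode(), method="POST")

req = build_alert("APP_TOKEN", "USER_KEY",
                  "Orion heartbeat missed for 10 minutes")
print(req.full_url, req.method)
# To actually send it: urllib.request.urlopen(req)
```

An Orion alert action can run a script like this, so the notification path doesn't depend on the VPN or on email delivery.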
Another thing you may want to look at-- I gave the example of how to monitor Linux. And with the Linux agent, it's really pretty easy to monitor just about any type of Linux you can think of with SAM. But if you only have a little bit of Linux, you might also want to take a look at Pingdom Server Monitor, because it's a hosted way to monitor those. So here's my strongSwan server, right? This is that appliance again. What's cool about this is it's giving me most of the metrics, and I can also add plugins for things like Docker and EC2 monitoring, from a pretty long list of plugins I can assign out-of-the-box. But it's all push-based. So you install the agent from the command line over SSH, or push it as part of, say, your Chef or Puppet build-and-deploy process, and it pushes all the metrics up into the dashboard. So again, like Papertrail, it's pushed to the cloud and aggregated in one place, and you don't manage it-- so that's one thing to take a look at. The other thing is, you will probably start to manage data from systems-- especially if you're using elastically provisioned resources-- where you may be spawning and killing thousands of containers a day. You are not going to do what I did before and set up monitors by hand, even if you're using monitor discovery or an API or, as in this case, integrating with Chef to add monitoring for new Docker nodes as they come up. So another thing to look at is Librato for sending those metrics to. These are essentially the same metrics we were looking at before, in a smaller dashboard.
But when you look at something like monitoring ZooKeeper, for example, where you've got huge numbers of reporting elements and you're trying to aggregate them all together, it's a really handy way to build dashboards that take care of dynamically configured resources-- you essentially have named paths. It also does multidimensional, tag-based analysis, which is pretty handy. The only other thing-- and I know this is going to sound like programmer stuff, but it's not-- is distributed tracing, that second way of doing APM monitoring. It's something I think you should all take a look at and learn. There are a number of different products that do it; I'm going to show you what it looks like in TraceView. The idea is that you need to be able to see application performance coming from outside the cloud-- from the user's experience-- trace it back through all the layers, and do aggregated analysis. And especially where applications break, because there may be data that's part of the procedure calls-- the data itself can cause the break-- that's really important, and it will become increasingly important over the next few years. So it's something you should take a little time to learn. What I mean by that is, this is basically a hotel booking service. It has a number of different components all working together, and a couple of them are microservices-- like pricing and availability, and the credit card piece. The booking service that's actually making the reservation is taking most of the time, because it integrates with the most parts. So you would typically start with a layer breakdown. And if you're using WPM or Pingdom, for example, you're used to seeing that waterfall of how long the transaction takes.
Well, when you start to really look at all the transactions, you start to think more like, 'What do these transaction periods look like?' Like, what are the patterns that begin to appear?
So not just the averages, but the outliers. How many outliers are there? How do they contribute to the average? And what's the impact on your application and its usage because of that?
Right. Because when all of the resources are variable, you start to get into some really interesting root-cause analysis. But being able to trace individual requests gets to be important, too. Because when you find those outliers that are way outside, you need to be able to look at them. And we're not looking here at just monitoring from the bottom up in the infrastructure; this is an actual transaction. Now, we can go take a look and see. So this is a Java-based stack, right? We've got Tomcat, Spring, and MongoDB on the bottom end of it. And these are the transaction calls of a single request that was made to the website-- we forget how many of them there are. And back here, we're just hammering that MongoDB, right? So I might want to talk to the application engineers about this, especially if it's homegrown. But I can look to see where I'm spending most of my time-- especially where I have interconnected elements that are each deployed inside a container, inside my cloud provider. Maybe it's the size, the resources dedicated to it. Maybe they're in different zones and need to be closer together. But also, sometimes it's handy to go look at what the actual query was, right? I can actually look at the value. So if I have one that's constantly getting an error, I can go back and say, you know what--
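The core idea of tracing a single request can be sketched without any product at all: wrap each layer in a timed span, record parent/child nesting, and you can rebuild the waterfall. Real tracers like TraceView propagate this context across services; this toy keeps everything in one process, and the span names are invented for the hotel-booking example.

```python
# Toy distributed-tracing sketch: timed spans with nesting depth, enough
# to reconstruct a waterfall view for a single request.
import time
from contextlib import contextmanager

spans = []        # (depth, name, seconds) tuples, recorded as spans close
_depth = [0]

@contextmanager
def span(name):
    depth, start = _depth[0], time.perf_counter()
    _depth[0] += 1
    try:
        yield
    finally:
        _depth[0] -= 1
        spans.append((depth, name, time.perf_counter() - start))

# One simulated booking request touching two downstream layers
with span("booking-request"):
    with span("pricing"):
        pass
    with span("mongodb-query"):
        pass

for depth, name, dur in sorted(spans, key=lambda s: s[0]):
    print("  " * depth + f"{name}: {dur * 1000:.3f} ms")
```

Production tracers add sampled aggregation across hosts and capture call details (like the query text mentioned above), but the span-with-context shape is the same.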
So it's like infrastructure toward app, versus app toward infrastructure.
That's exactly it. That's exactly it. And you have to be able to monitor both when you have a hybrid environment. So that's the other tool to take a look at. Okay, so there's a couple of things to remember here. First, I didn't have time to go into RDS versus bring-your-own-license for the SQL Server. I'll put that up as a THWACK post and link it in the description so you can check it out. There are cost considerations and performance considerations. I could not build an environment on my own that ran as fast as the same hardware did in RDS, but RDS was more expensive. So if you have a large environment and you're already running a lot of RDS, you won't really notice. If you're just starting out, you might want that on a separate machine. If you're using RDS, there are a couple of special considerations: you either have to restore the database into RDS, or you have to pre-prepare the RDS instance with a special script in order to get Orion to install on it, because it doesn't have all the SA-level management stored procedures that are normally part of it.
That makes sense.
And I'll include a script for that, too. But the main thing to remember here-- and I think the thing that's most interesting in talking to customers about it-- is that the reason they're moving their entire platform-- not just SAM, which is really important, but also NPM-- is that they still have a delivery network, right? That's never going to go away. They're always going to have to make applications running off-prem available on-prem, or there's no point to any of it.
You always have the users. What are they going to connect to? Wi-Fi or wired, you still have to connect.
Wi-Fi or wired. And they've got VoIP, and NetFlow considerations, and firewalls. If you're looking at the new ASA monitoring that's built into NPM, for example, that's one of the first things you're going to need as part of your VPN: are my VPN tunnels working the way I expect? So it's very handy to have both of those in one place. And that was the thing I hadn't really expected. I thought customers would start to migrate to some of our cloud-based tools, especially as they do more and more DevOps. What I'm finding instead is that they've been using the Orion platform for a really long time, it runs really well in the cloud, and it provides capabilities they're familiar with that take care of that last mile of the cloud-to-ground part. So, be open and spend some time in the Customer Success Center. Learn to script a little bit. The monitors I showed you here were actually written in Python, because I just prefer an actual language to Bash. I know you prefer Bash, but whatever-- you've got your own thing going on there. But this is something you should experiment with. Set up a lab. The great thing about the environment all of this was built out of is that a lot of it is actually free-tier. I got started with it and set it up, and I can build it and tear it down without worrying about it. Learn these tools before you go and actually do this.
Yeah, that's how they get you addicted-- low price for addiction.
Yeah, they get you on the comeback. All right. So you think we got it all covered?
Yeah, I think so. I'm slightly concerned we were moving too fast in the session. But you and I know Orion really well at this point. You've been running in AWS and Azure a while, I guess.
Four years and two years, yeah.
And if I was watching this as a customer, the takeaway might be more anxiety rather than enthusiasm.
That's possible. I hope that's not the case. And if they experiment, they'll probably find that it's not-- but that's also why there's replay and lots of links in the description here. And of course, we're going to be on live chat. So we would love to hear your questions and to talk with you about your experiences. And there are tons of conversations out on the THWACK community about how and why people are relocating their monitoring systems to the cloud. And really, it's not all that hard. I mean, it parallels most of the rest of the data center workload migration. It has some amazing benefits, and I think it's really cool.
Yeah, I bet right about now, you wish you'd actually grabbed some of the customer interviews at Cisco Live.
Yeah, because basically I'd say, "Hey, check out some of these customer interviews from Cisco Live. Roll that tape." And then you guys would basically hear what Chris and I did, not just at Cisco Live, but at events for the last year-- which is, they basically said, "Oh, I'm kind of nervous. I feel like I'm being forced." And then, actually, it wasn't that hard.
You try it; it's not that hard.
And in some cases, it runs better than it even did on-prem.
Yeah. Well, hopefully you've enjoyed our session today. Please keep your questions and comments coming. We're in the THWACK forums every day. And of course, you can always ping us directly. Patrick, are you sad you're going to have to tear down all of your pretty infrastructure here?
I am, but it was a lot to set up just for one session and for training. There's a certain budget that goes along with this that I don't want to exceed. But really, I built most of this using scripts and the command line tools from all three of the cloud providers. So I could probably respawn maybe 90% of it-- with the exception of the VPNs-- in about an hour.
Yeah, of course. Developers?
I hope to resemble that remark. Well, thank you all for joining our session today. And we'll see y'all in THWACK.