Monitoring is Your Best Friend When Moving to the Cloud
You have decided to move one or more workloads to the cloud. Now what? It is not an easy task. The cloud can be intimidating, but you’re not alone. Monitoring can be your friend. It can inform decisions, confirm that you are on the right path, and make sure everything works.
In this session, you will hear from product experts and from SolarWinds IT about their experience in moving workloads to the cloud. The session will provide a roadmap to help you make a smooth transition. After the session, you will have a better understanding of what needs to be done and a solid grasp of best practices to follow when shifting workloads to any cloud.
Thanks for joining our session today: Monitoring is Your Best Friend When Moving to the Cloud. I'm Steven Hunt, product manager of Server & Application Monitor and Web Performance Monitor. And today, I have Patrick Hubbard, Head Geek, joining me.
Oh, it's always good to be here.
And John Martinich as well.
Thanks for having me.
And so, John, tell us a little bit about who you are and what you do for SolarWinds.
I'm the Web Operations Manager. I manage our customer-facing web properties as well as our business applications. And I've moved several properties into the cloud for various reasons.
Excellent. So, you have quite a bit of experience when it comes to dealing with application workloads and their potential transitions to the cloud.
Yeah, I've used all of our tools because I get them for free. And using those tools has really helped me learn about what types of opportunities we have and how the cloud might solve some of those.
And there's something else to it too, right? You've been in IT for a while. You come from that same traditional, waterfall-based IT. Help desk-driven, maybe slower, traditionally paced for enterprise, right? And as you were transitioning into the cloud, it's the business that's actually pushing you there, because they want to see things done faster, or to be able to adopt new technologies. So it's as much an evolution in your overall approach, learning to adapt to new technologies faster, as it is adopting any one new technology.
Yeah, it's funny. When you're moving to the cloud, some people feel like you're flying to the moon. And other people might think that you're just moving houses. Is it a new data center or is it a whole new world?
Yeah, it's both.
So why would you move to the cloud?
Well, there might be a lot of reasons. But I think what's most important is defining those reasons up front. Because you'll probably have an executive sponsor. And if you don't have an executive sponsor, you really need to know what you're driving toward so that you start measuring it early. Cost could be a reason. But I think if you're moving into the cloud, cost could be a dangerous one, because you're already spending for on-prem, and you might end up spending more. And as that project gets delayed, the more and more you have this overlapping spend. But there are many other reasons, like security offerings and the tools that the cloud offers you. Maybe you need to scale up quickly and scale down. Or if you're a new company and you want the ability to scale up quickly, there are all kinds of offerings there. We've used it just to get data centers in regions. So as we go more international, I can put an application that I can't really cache with a CDN closer to my end-users so they get better performance.
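The overlapping-spend risk John describes can be made concrete with a back-of-the-envelope sketch. All figures here are invented for illustration; substitute your own contract numbers.

```python
# Hypothetical illustration of overlapping spend during a delayed
# cloud migration. All dollar figures are invented for the example.

def overlap_cost(onprem_monthly, cloud_monthly, planned_months, actual_months):
    """Extra cost incurred by running both environments past the plan."""
    overlap_months = max(actual_months - planned_months, 0)
    return overlap_months * (onprem_monthly + cloud_monthly)

# A 3-month slip at $10k/mo on-prem plus $8k/mo cloud adds $54k of
# double spend on top of the planned migration budget.
print(overlap_cost(10_000, 8_000, planned_months=6, actual_months=9))  # 54000
```

The point is not the arithmetic but the shape: every month of schedule slip is billed at the combined rate of both environments, which is why cost alone is a risky headline goal.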
Are those the kinds of things you hear about when you're dealing with our customers out in the field?
Yeah, that's the main thing that drives it: it's either to reduce latency, so they're relocating services closer to end-users, or increasingly, and not to invoke buzzwords like DevOps, but there is now a— The thing about developers actually building more applications, or at least adopting maybe open-source frameworks and platforms, is that those are really prescriptive about how they work. So what happens a lot of times with that transition is not only are you still monitoring your on-prem, but you are now learning how to monitor things you never did before. Like, you can go through a whole career and not do in-memory caching. Or maybe that's a service or something that's built into the underbelly of SQL Server, for example. Well now, you're monitoring Redis, right? And then maybe you've got a horizontally scaled set of microservice applications that are interacting with that. And a MongoDB database, and it runs on a LAMP stack. And there are a lot of new layers to monitor. And so thinking through how that works is something that's a little bit outside the traditional role of IT. But it's really not that different. And because it's applied programmatically, you have an opportunity to actually inject monitoring at the beginning, instead of doing it last. Which is the way I think we always end up doing it in IT, right? It's like, first you buy hardware, then you pay for servers, and then you might have some budget for monitoring, and I guess, what, security's last, right? But this way it gives you a chance, if you're thinking about it ahead of time, to actually inject those monitoring endpoints into your applications as you deploy.
Let's be honest, when we migrate anything to the cloud, it goes right to the C-Levels. These guys are interested in what we're doing. And we need to prepare to have a success story to talk about when we're done. So if we don't plan up front to monitor the metrics we're trying to improve, we don't have that story to tell at the end. So it's so important that whether you're targeting cost or latency that you have those metrics defined up front.
So, do you have a couple of fresh in your mind use cases that we here at SolarWinds have taken on when it comes to transitioning a workload from on-premises to the cloud itself?
Yeah, one example is our online demos. Using WPM, or Web Performance Monitor, I identified that internationally, we had really poor load times over in Ireland and in the APAC regions. But in the US, we had really good load times. I considered putting it in front of a CDN. But I thought that might not give a true experience to our customers.
It's not very static. It's pretty dynamic.
That's right. And so a lot of this content can't be cached. But by putting these, essentially clones of these environments in region, we get much better performance, and so our users see a true experience rather than the latency that they would have if they had to travel across the country.
So it's interesting, you mentioned that you used WPM and recognized that there were some performance issues with the on-premises solution itself. And then identified that the cloud is a potential solution to those performance problems. But then I suspect ultimately, you're going to have to continue monitoring. Not only the same way that you were, but also potentially introduce some additional tools to answer, "Do you have a measure of success?"
Yeah, WPM set up kind of the criteria of success. I could tell that sometimes the page didn't load or it took 30 seconds. Well, that's just not a good experience. And so that was what we were trying to drive down. But then, in the migration, we really needed to pay attention to how are our operations changing? How is the architecture changing? How do I configure a server? How do I deploy new bits? And so, there's a lot of parts of your operations that you have to take into account to make sure that you're not disruptive to the rest of the team trying to chase this goal. Which is super exciting, but you might end up leaving yourself short because you don't plan in advance.
Well, that work he did is highly important for us. Because I know you know as well as I do, when we're out at conferences or interacting with customers, that demo environment is really something we rely upon. So I think you and I have experienced first hand the performance problems associated with it. So, transitioning to the cloud was a good benefit.
Well, and especially because you think about the Orion Platform, for example, right? It's an enterprise application. It's designed— Even if you're a very large customer, you may have a few hundred users that log into it every day. But in the case of the online demo, there are tens of thousands of users. So it's an example, in the way it's used in that role, of a highly stressed enterprise application. And one of the things that is interesting about migrating applications to the cloud, in terms of more elastic provisioning capability, is that it's often the first time you also start thinking about cost, budget, and value delivered per service. So in that example, the way it normally works, and tell me if I'm wrong, is that when you build something on-premises, you make your best guess about what your customer experience is going to be like based on the hardware, software, and storage that you implement. And you spec it all out, and you get a great big Capex order, and you order it, and it arrives, and comes up online. And then you pretty much are going to get the performance you're going to get, based on the guesses you made. I mean, you can tweak it over time, and maybe you're going to be able to extend it. But without going and buying something to add onto it, or maybe switching to flash for storage or something else, that can actually be an issue. But with cloud, you can do those things. Like, let's say you have an event. Or in the case of the demo system, maybe. We did one a few years ago, I think it was a free t-shirt for everybody who visited the demo and ran through a demo challenge.
Generated a lot of traffic.
Yeah, and I was in London at, where was it, it was at Cisco Live. And all of a sudden, 60,000 people came to the site. So then I was scrambling around, because it was all based on VMware in the data center, and I had to scale it out as best I could through the VPN on a terrible hotel network. If the current infrastructure had existed then, I'd have just scaled out a few more servers, turned up the load balancer, maybe paid a premium for three or four days' worth of service, and then turned that back down again. So, cost becomes part of the reporting to management with cloud in a way that maybe it wasn't before, when you were just using Capex to buy stuff.
But also scale, just like you said. You have this one event that blows the scale. And you're able to scale for that short period and pay a burst cost, and then come back down to your typical cost for the rest of the year.
So John, what are the things that are really important to consider before you get started in any transition from on-premises to the cloud with your applications?
The first thing, like I said before, is to meet with the stakeholder and really understand what we're trying to drive. Because they're going to be adamant about making sure it's successful. So you really want to be on the same page they are. That shapes your monitoring, whether you're looking at latency, at new capabilities, at new tools, or whether you just want to establish a new data center. But I really want to be on the same page with my stakeholder, because they're going to be relying on me to deliver something that makes them successful as well.
I think that's something everyone can relate to, constantly having to report back to the stakeholders of how these projects are going.
Then once you establish what you want to measure towards success, you need to put together a project plan. What are you doing, currently, in order to operate this application? When you need a new VM, how is that deployed? How are you going to do that in the cloud? How are you going to provision new users as they come to the team? Is that going to be through Active Directory or some IAM accounts? How are you going to push your new bits? As they get checked into your source control, how does that connect into your new provider so that you can push those bits? Do you need to establish a new VPN tunnel? Well now, you're starting to talk about your lift and shift, having a little bit different of an architecture. And if you start doing geo-routing, so IP geo-routing, maybe you added a new layer, something like Route 53 so that your APAC traffic goes to this set of servers, and your Europe goes to Europe, and your North America comes over here to our data center.
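The geo-routing step John mentions is, at its core, a region-to-endpoint lookup with a sensible fallback. In production a DNS service like Route 53 does this with geolocation records; the sketch below is a toy application-level version with invented region names and endpoints, just to show the shape of the decision.

```python
# Hypothetical sketch of IP geo-routing: pick the nearest deployment
# for a user's region. In practice, DNS geolocation records (e.g.,
# Route 53) make this decision, not application code.

ENDPOINTS = {
    "APAC": "demo-apac.example.com",
    "EU": "demo-eu.example.com",
    "NA": "demo-na.example.com",
}

def route(region: str) -> str:
    # Unknown regions fall back to the North America data center.
    return ENDPOINTS.get(region, ENDPOINTS["NA"])

print(route("EU"))          # demo-eu.example.com
print(route("Antarctica"))  # demo-na.example.com (fallback)
```

The fallback matters: a geo-routing layer that fails closed for unmapped regions means every user gets served somewhere, even if not optimally.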
So with that, are there significant gotchas that you really have to pay attention to? Things that are just significantly different than what you were dealing with when these workloads were on-prem?
Well, I'd be cautious about what your goals are and whether or not you think you can achieve them. If the cost is the target, how could that be a gotcha? Because you're going to be paying for your one-year contract at your data center. And what happens if you go over schedule? Can you go month to month? Because during that whole time, you're spending even more money. So you're losing the cost battle. And what happens if you suddenly get somebody that attacks your site? They're going to cost you a lot of money, whereas previously, the data center really probably helped to ensure that that wasn't going to be a problem.
Is that something that you can effectively monitor as you've made that transition?
You can, but you need to really pay attention that those are the things that you need to monitor. If you're not monitoring those things, they can catch you off guard. And before you know it, you've spent quite a bit of money.
Have you ever experienced increased cost when you've been working on cloud projects?
Never. That just never ever happens. You're never essentially running two duplicate infrastructures that are actually very, very diverse. Well, I was going to say, one thing I do find helpful is to do almost a stack diff, as a way of communicating with management as part of that planning process. Because I think they make decisions about what they want to do. And it might just be, maybe the CIO has a peer who's been doing a lot with cloud and he's heard, oh, cloud's good, do some of that. And there's some budget assigned and there are no specifics about which technologies apply. Sometimes it's a little more prescriptive. But in both cases, there are going to be a lot of decisions that are based on specific technologies, or the demands of specific technologies. And you're having a conversation you've had for many, many years with management about cost. If you can't explain the impact of inheriting a bunch of new technologies, that conversation breaks down. There are going to be learning curves for all of those. There may be new tools that are necessary. And certainly, there's going to be an investment of time to actually extend not only monitoring, but troubleshooting and all the other things you're going to do with that data. So, almost doing a diff of the different platform technologies, and making sure that's one of the things you present as part of that conversation to management, is important. Because you're going to start using these new terms. They don't know: is elastic scale-out expensive from a time perspective? Or am I going to use a service to actually manage that?
You don't think the CIO learned that on the golf course when he was talking with his peer about cloud projects?
No, but it's not only a CIO problem. It tends to be more like a technology director, or it depends on the size of the team. It might be a small team, might be a senior manager, and they're talking to a VP. Or it's the CIO and maybe technology directors or VPs over parts of those units. But the message needs to be: hey, our infrastructure is evolving. The mix of technologies is changing as a part of this evolution. And so here's the taxonomy that's now going to be a part of these conversations about cost and performance and everything else. So, planning ahead of those conversations can really help grease a lot of those transitions.
So, cost is definitely one of those. But what else?
Well, this is going to be kind of an epic. This is such a huge project, moving into the cloud. Everybody can have a different idea of what it's going to provide. And so meeting with that executive sponsor to say this thing is so huge, we might have problems with schedule. How can we reduce this down into maybe a pilot that we can see is being successful to say, the spend is kind of in where we expect. The management of user permissions and provisioning is kind of what we expect. But you really need to kind of get used to riding some training wheels before you just go full bore. Because once you launch it, and the marketing team or whoever finds out about it, then it's production. And next thing you know, you're kind of scrambling to make sure everybody on your team is trained up and that you guys have all the monitoring and operational tasks that used to be automated. You want to make sure that those are automated up front.
I was going to say, you've got a really great example of that. So, John's got a bunch of systems, a set of machines that are a little bit forward-facing. To improve security, you basically blow all those away every night and recreate them. So you've got a lot of dynamic provisioning and you're using Chef for that, right? So you've gone from what used to be GUI-based configuration for systems, or maybe scripting, to using a coordination tool for that. But then management says, "Well, this is going to be really fast," but I've got to learn some Ruby. And everyone on the team needs to learn that. Well, why? Well, because the tool drags Ruby with it. Chef recipes, that's part of it. It's not terribly hard to learn, but it's a cost that goes into the ultimate goal of "go faster." It's like, we're going to go faster, but during this transition, we're going to have to learn some specific skills.
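The idea behind nightly rebuilds with Chef is convergence: you declare a desired state, and the tool applies only the changes needed to reach it, so re-running the same recipe on a fresh or drifted machine is safe. The toy below illustrates that idempotent loop; it is not Chef's actual API (real recipes are Ruby resources), just the concept in miniature.

```python
# Toy sketch of declarative convergence, the idea behind Chef recipes.
# Not Chef's real resource model: here a "recipe" is just desired
# key/value state, and convergence applies only what differs.

def converge(current: dict, desired: dict) -> list[str]:
    """Mutate `current` toward `desired`; return the actions taken."""
    actions = []
    for key, value in desired.items():
        if current.get(key) != value:
            current[key] = value
            actions.append(f"set {key}={value}")
    return actions

node = {"nginx": "absent"}
recipe = {"nginx": "installed", "firewall": "enabled"}
print(converge(node, recipe))  # two actions on the first run
print(converge(node, recipe))  # no actions on the second run: idempotent
```

Idempotence is what makes "blow it away every night and recreate it" cheap: the same recipe converges a blank machine and an already-configured one without special cases.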
Right, just because you automated something doesn't mean you get to set it and forget it. That automation needs to constantly be updated. You need to be monitoring to make sure that none of that's breaking. And always improving it. So, I would say that automation allows you to scale, but it doesn't allow you to just save money.
But things like automation are tools that you have to implement to make that happen. And so potentially in your transition, you're going to have to adopt new tool sets to deliver that same functionality, because the previous one doesn't handle the use case. So it kind of comes back to cost. But also, it's resource consumption, learning curve, and the time and effort to make that happen.
Yeah, let's bring it back to the example about the demos. When I first pursued this idea of the in-geo demos, I was talking about taking our operational responsibility from a set of six servers to a set of 18 servers. Well, that's not really the greatest message. As exciting as cloud is, it's a little bit overwhelming for the team. So, how are you going to ensure that the team is prepared for what it will be like in production when you're at three or four times the number of servers that you previously managed? And you have different access methods as well. You have different dependencies, like your VPN tunnel. Or maybe your database that used to be right next to your servers is now remote. And so you have this additional latency. In fact, NetPath has been a fantastic tool to help me identify the latency. Because when I'm dealing with data centers in different regions, one of the things I can't account for is problems that other people have introduced, because I couldn't identify them before. Using NetPath identifies that the latency exists in a path I'm not in control of. And knowing it isn't mine, I can move forward while that gets resolved. That's really helped us when we're targeting latency.
Actually, I was going to say, in the session on moving Orion to the cloud, I'm actually going to show that. So I've got three-way connections between Google, Azure, and AWS. All three of those VPN tunnels are different. Different sets of technologies, different virtual gateways. And I'm using NetPath to actually map that performance. And it's amazing, because it's so asymmetric, and with the technologies that are involved, you will sometimes see 20%, 30% improvements in latency across those boundaries by being able to monitor them.
So, you're going to show them in your session what that's going to look like.
I will show it live.
Yeah, I find that to be some of the most exciting parts, is leveraging the monitoring to tell the story of success. Whether you're looking at the latency or something else, it really just paints this picture to say that you're doing the right thing. What I'd add to that is you can really reduce a lot of your risk by using the monitoring. The cloud providers now have these proxy tools that allow you to send 5% of your traffic to your new services. So, instead of having a hard cut over date where you say, "January 1st, we're moving to the new infrastructure," you say, "on January 1st, we're going to send 5% of internal traffic to the new set of servers." And so those guys can start telling you about problems. You can really mitigate the risk. And what we've done is we have these performance reports where our guys are looking at the performance of our previous install versus the new install, comparing what type of 400 errors are we getting? What type of 500 errors are we getting? How much traffic? What's the conversion rate? We're really looking to make sure that that site is performing as well or better, so that when we move into the cloud, there's little risk and we've already seen success.
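The 5% traffic split John describes is usually done deterministically, by hashing a stable request attribute such as a user or session ID, so the same user always lands on the same stack and the error-rate comparison stays clean. A minimal sketch of that bucketing, with an invented user-ID scheme (real proxies implement this for you):

```python
import hashlib

def bucket(user_id: str, canary_percent: int = 5) -> str:
    """Deterministically assign a user to the old or new stack."""
    # A stable hash maps each user to a slot 0..99; the same user
    # always gets the same answer, so sessions don't flip-flop.
    slot = hashlib.sha256(user_id.encode()).digest()[0] % 100
    return "new" if slot < canary_percent else "old"

# Roughly canary_percent of users land on the new stack.
users = [f"user-{i}" for i in range(10_000)]
share = sum(bucket(u) == "new" for u in users) / len(users)
print(f"{share:.1%} routed to the new stack")
```

With a split like this in place, you can compare 4xx/5xx rates, traffic, and conversion between the two stacks on live traffic before committing to a full cutover, which is exactly the risk reduction described above.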
But I was going to say one of the things, too, is you got to be open to— You adopt a new technology, but you a lot of times find a lot of advantage when you learn the full capabilities of that technology. So like in your case, Route 53 is really great for geo-routing and a bunch of other things. And I'm discovering that one of the biggest problems, especially with multicloud, is DNS resolution. Just basic name resolution. And especially where you've done lift and shift with package applications and they have a bad habit of using host names. Well, they're connecting across to maybe even a different availability zone, or actually, maybe a different region in AWS. But if you're multicloud like a lot of our customers are, name resolution's the problem. And Route 53 will actually do internal VPC name resolution so that you can actually use it for that and distribute it across multiple environments.
Well, that's a great segue. As you kind of lift and shift, let's say that's the initial pilot. We just want to lift and shift and have an almost identical architecture. But once you get there, start discussing with your stakeholder what opportunities you have. You have the opportunity of a firewall. You have the opportunity of a CDN. You have the opportunity of shifting to Lambda. There are just so many opportunities to take advantage of once you've made it into the cloud, so you may take a little bit of a deficit moving in, but then once you ramp up, you have the opportunity to scale huge.
So it's going to be really important to, along with all those points, is define your goals, right? Make sure that you know what it's going to be for success. But you're going to need to define your goals so once you've gone through this process you can actually have something to measure against.
Okay, that's a great point. So let's pick up there. John, give us an example of an uber goal, a top-level goal that any cloud migration project should try to achieve.
Well, I think you're starting it off right where an exec would start off. They have an uber goal that's probably an epic and it may not be achievable. So, realize that you're going to be the one driving this. Try to figure out what some sort of success story that you can expose with some monitoring is. And make sure your stakeholders bought in. The last thing you want to do is expose a pilot and then that becomes production, so...
That never happens.
What can the pilot achieve that can allow you to move into your next goal? So if I just lift and shift to prove that we can run our application in the cloud, what can I do from there? I could get a WAF in front of it. I could put a CDN in front of it. I could maybe shift over to Lambda. There are all kinds of opportunities, but it's important to expose to your stakeholder where you can get beyond a lift and shift, which on its own really may not show value. But it leads to value.
Well in terms of pilot, we talk about sort of minimum viable product or features. Do you recommend also sort of a minimum viable monitoring approach to that? So that you define at the beginning of that project as a part of those goals, is we will not deploy something new without a base level of monitoring?
Like I said, it's really important to be building that in early. So, if you're talking about monitoring, that's kind of my team's responsibility. No one else is concerned with monitoring, or at least not to the level that I am.
Certainly the CIO isn't.
So I want to build that in early. That's part of my team's competence. Because if it's not built in, we're going to go live anyway because of the value to the company. So I always want to build that in early.
So, are those the things that you're looking towards to define your KPIs to measure that overall success of this transition and this project?
Well, I use monitoring for a variety of reasons. Certainly, I want to make sure I have the KPIs monitored so that I ensure we're delivering on the success criteria. But I also want to make sure that we're up and running. So just up/down monitors are important. And while that's not necessarily— Well, it could be a failure criterion. It's not really a success criterion. It's very important for my team to make sure that we have that same level of monitoring.
But isn't that actually sort of an unintended benefit? Any time you have a big transition in technology, you have an opportunity as a part of IT to restate what those KPIs are, based on experience and an understanding of the business and what management expects to hear. So, it's actually a blessing in disguise in a lot of ways, because you get to go and redefine metrics that were sometimes pressed upon you by previous administrations or other people in the organization. So you can not only pick KPIs that are realistic, but actually show leadership in what you're surfacing and the way you analyze it.
Well not only that, but you might be monitoring something that you intend to be a benefit. And then retroactively, you want to show that as a value. You may not have realized...
You don't have to expose everything.
That the database would perform faster in one of the cloud providers. And if you're not capturing those metrics early on, it's hard to tell that story. And so, I use tools like SAM to get the performance metrics there. Or Papertrail. I find huge value in being able to identify errors that are occurring in my on-prem that may or may not be in the cloud, and vice versa.
Oh, log aggregation without building a whole big data infrastructure to capture it? That's kind of handy.
One of my favorite uses in Papertrail is to try to figure out who was pivoting where. So I'll search for an IP and see what servers they've hit. That's typically for some bad actor, but it's interesting to see if a bad actor is pivoting through our properties. I can see in one pane what systems they're hitting.
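The single-pane search John describes amounts to grouping aggregated log events by source IP across every server. A minimal sketch over synthetic lines (the `server ip request` format here is invented; Papertrail does this over your real, centralized logs):

```python
from collections import defaultdict

# Minimal sketch of tracing one source IP across aggregated logs.
# The "server source-ip method path" line format is invented here.
LOGS = """\
web-01 203.0.113.9 GET /login
web-02 198.51.100.4 GET /
app-01 203.0.113.9 POST /admin
db-01 203.0.113.9 CONNECT
"""

def servers_hit(logs: str, ip: str) -> list[str]:
    """Return, in order, every server an IP touched."""
    hits = defaultdict(list)
    for line in logs.splitlines():
        server, source, *request = line.split()
        hits[source].append(server)
    return hits.get(ip, [])

# One search shows every system a suspected bad actor pivoted through.
print(servers_hit(LOGS, "203.0.113.9"))  # ['web-01', 'app-01', 'db-01']
```

Without aggregation, answering the same question means connecting to each machine and grepping its HTTP logs separately, which is exactly the manual work the next exchange points out nobody has time for.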
You're not hitting a bunch of separate HTTP logs by connecting to machines separately and polling them.
No, while that's... I can technically do that. I don't have time to do that.
So I want to come back to, you were talking about the example where you had to transition the online demo into a cloud environment. What were your goals that you set out to do in that particular scenario? It'll help some of the people watching think about the types of goals that they should be looking towards setting.
Right, it was interesting. As we evaluated the goals, so many things came up as opportunities. And that's why I kind of drive back to trying to set it to something small and chewable. Because the first goal was to make sure that the demo loaded internationally. I mean, if the demo's down, we're not selling products, right. And so we wanted to make sure that just with WPM, that we could prove that the demo was performing faster. One of the other goals that came from that was how do I make sure that my team can continue to operate without adding staff? Because as we pursue this, I don't want to have to ask for a bunch of budget to pursue this. It's an exciting opportunity for the team, but how are we going to do it with the existing staff?
That kind of comes back to the new tools, right. Like if you have to adopt new tools, you potentially have to adopt more staff or more budget to make that happen.
Or at least a new learning curve.
Yeah, planning the learning curve into your budget. You've been on-prem. Now you need to expose all these tools to your team. There's going to be a lot of ramp up. There's different ways of creating users. There's different ways of decommissioning users. How are you going to make sure that when somebody leaves the company that they're no longer hidden in your IAM access?
That's a good point. What other goals did you guys set for yourselves when you were transitioning?
Well, one of the other goals was just the automation aspect to ensure that we didn't have to hire more operations engineers. As we scaled up to three times the environments, how are we going to be able to support that? So, automation and making sure that our Chef configurations were supported in all the environments, whether you have to have a VPN tunnel up or not. Just being able to leverage the automation so that our daily tasks weren't impacted in a way that we couldn't sustain.
So, how would you go about measuring the success of those goals? What can you define in terms of the values of each one of those? So some of them are tangible. Some of them potentially aren't so tangible. So if you can kind of highlight for everyone, what is that— those KPIs, what are the actual values that you're looking for that you can help define?
Yeah, that's interesting because they evolve, right. Initially, you're just looking at load time. You want to make sure that the pages are loading. But after a while, now that you've proved that they're loading, now I want to see an up-tick in users. I want to see that people are actually staying and participating in those demos. And what we've seen over the course of the year is that the online demos over in APAC and Ireland have really shot up. The engagement has taken a major uptick. And so that's another success metric that we've been able to prove.
And that's the kind of metric that maybe you wouldn't have considered in traditional infrastructure monitoring, where you were looking at, let's say, page views per session. That would be a good measure of engagement from one person on that page, the more pages that they click on. So, you typically wouldn't worry about that if you were just measuring on-premises. You say, how long is my average response time? Well if you're actually looking to see user satisfaction or ideally, you're delighting your users, you would see increased engagement. So, it's another example of sort of starting to add in KPIs into IT that traditionally might even be part of another department.
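An engagement KPI like pages per session reduces to a simple before/after comparison. The sketch below uses entirely synthetic numbers (not SolarWinds' actual demo data) just to show how the uplift would be computed and reported:

```python
from statistics import mean

# Sketch of an engagement KPI: average page views per session,
# compared before and after a migration. All data is synthetic.

before = [2, 1, 3, 2, 1, 2]  # pages viewed per session, pre-migration
after = [4, 3, 5, 4, 6, 4]   # pages viewed per session, post-migration

def uplift(pre, post):
    """Relative change in mean engagement."""
    return (mean(post) - mean(pre)) / mean(pre)

print(f"engagement uplift: {uplift(before, after):.0%}")  # about 136%
```

The same pattern applies to any of the KPIs discussed here, including load time and conversion rate: capture the metric before the migration starts, or there is no baseline against which to prove success later.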
So that brings up a really good point. Let's kind of talk about what tools you use to monitor all these metrics that you've got defined for success. So from your perspective again, back to the transition of the online demo. What tools do you utilize, or did you utilize, when you started, ongoing, and helped you measure your overall success?
Yeah, well luckily I get to use so many of SolarWinds' products that I get to take a stab at it from many different angles. But I wanted to start with what is my existing monitoring? Because I don't want to take on a bunch of different monitoring for the things in the cloud and try to measure apples and oranges in terms of graphs.
Coming back to that new tools aspect. Try to leverage what you've got first.
Right, especially because I'm going to present two graphs side by side and say that one is a success over the other. So I started with SAM. I use that to basically tell me if my team needs to be participating in an alert. It shows me, behind the firewall, how my cluster is operating. The end-user doesn't know a problem exists, but one of my servers may be down so my team can work on it. From a different angle, I use Pingdom to tell me when I actually have a sky-is-falling issue. If Pingdom tells me the site is down, I know a customer is experiencing the site being down. And so I get those alerts immediately.
So SAM's giving you what's going on behind the scenes, under the covers — the application and server performance across the entire workload you've got going on. And Pingdom's really giving you the experience of the user, an understanding of how the application looks to the user. Because SAM's going to tell you, like you said, a server's down. But does that mean the actual application is down for the end-user?
Well, I was going to say, or regionally down. I've got the Pingdom dashboard here. When you look at application performance, you can actually go look at it from different areas of the world, especially for a geo-distributed application. Or a waterfall graph of what's taking longest as part of a page load. I mean, once upon a time, a page was this wonderfully monolithic thing that just came with everything in it. And now, especially for applications based on Angular or other frameworks, there are so many components that you might be using a CDN for one component while everything else is dynamically generated off another part of your distributed service. So what appears to be a big user slowdown may actually be one of those subcomponents. It's not enough to just ask how long it takes before the page feels loaded. Being able to isolate which parts of an application are impacting the user experience is the goal.
I'm so glad you mentioned the waterfall. It identifies the slow parts of the application. And so frequently, it identifies these third-party parts of the application that I don't have control over. People say my website's slow, but I have these marketing tracking pixels that come from third parties, and if those are slow, the site can be slow. If I'm trying to improve performance on something I don't have control over, I might not be successful. So it's really important to analyze which parts of your site are actually slow.
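The analysis John describes — rank the waterfall's entries and flag the slow ones you don't control — can be sketched as a quick script. The hostnames, timings, and the `slowest_components` helper below are all hypothetical, a minimal illustration rather than anything a real waterfall tool exports:

```python
from urllib.parse import urlparse

# Hosts we operate ourselves; anything else is third-party (hypothetical list).
OWN_HOSTS = {"demo.example.com"}

def slowest_components(timings, top=3):
    """timings: list of (url, load_ms) as a waterfall would report them.
    Returns the slowest entries, tagged with whether the host is third-party."""
    ranked = sorted(timings, key=lambda t: t[1], reverse=True)
    return [{"url": url,
             "load_ms": ms,
             "third_party": urlparse(url).hostname not in OWN_HOSTS}
            for url, ms in ranked[:top]]

# Invented example waterfall: one slow third-party tracking pixel.
waterfall = [
    ("https://demo.example.com/index.html", 180),
    ("https://demo.example.com/app.js", 240),
    ("https://tracker.adnetwork.example/pixel.gif", 1900),
    ("https://cdn.example.net/framework.js", 420),
]

for entry in slowest_components(waterfall):
    print(entry)  # the tracking pixel ranks first, and it's flagged third-party
```

The point of the `third_party` flag is exactly John's: if the slowest component is one you don't control, tuning your own servers won't help.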
People blaming the performance on somebody else's ads. Yeah, and the other one too is once you start to get into RUM — real user monitoring. That's sort of the beginning of code injection: adding just a little bit of code to one of the base page templates, and then being able to see, in the data sent back, what that end-user actually experienced.
Absolutely. If I want to figure out how the site is performing for people internationally, we're definitely looking at those metrics. As we move into the cloud, I've talked about how important it is for that new site to be performing well. Another tool that we use is Papertrail. We're graphing out the log messages, the errors coming out of each server. Because as you move into the cloud, the VMs are a little different, the database may be a little different, and I want to make sure the application is producing the exact same log patterns as my on-prem environment.
So did you have a baseline before you got started that let you compare what you saw in the on-prem environment with the cloud environment? Did Papertrail give you that baseline on the types of errors you were likely to see, and whether anything new showed up after you transitioned?
Yeah, so one of the things that my team does is run performance reports of all of our sites. And that's not just how fast is it loading? It's what are the top error messages that we're getting? Where are they coming from and why? And so we're constantly giving this feedback to the web dev team so that they can either fix these errors or remove them if they're benign. That way, when we move into the cloud, we're graphing those same things to identify is this application performing differently?
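That baseline idea — fingerprint the error patterns on-prem, then diff them against the cloud environment after cut-over — can be sketched roughly like this. The log lines and the `error_signature` normalization are invented for illustration; a real report would, as John says, come out of the aggregated logs themselves:

```python
import re
from collections import Counter

def error_signature(line):
    """Collapse a log line to a comparable pattern: replace digits so the
    same error always yields the same key, then keep the ERROR/WARN tail."""
    line = re.sub(r"\d+", "N", line)
    match = re.search(r"(ERROR|WARN)\s.*", line)
    return match.group(0) if match else None

def baseline(lines):
    """Count how often each error signature appears."""
    sigs = (error_signature(line) for line in lines)
    return Counter(s for s in sigs if s)

# Invented log samples: before the migration...
on_prem = [
    "2019-01-02 12:01:07 ERROR db timeout after 30s",
    "2019-01-02 13:44:19 ERROR db timeout after 31s",
    "2019-01-03 09:02:55 WARN cache miss rate 12%",
]
# ...and after. One signature here never appeared on-prem.
cloud = [
    "2019-02-02 12:01:07 ERROR db timeout after 30s",
    "2019-02-02 12:05:31 ERROR tls handshake failed with peer 10.0.0.5",
]

new_in_cloud = set(baseline(cloud)) - set(baseline(on_prem))
print(new_in_cloud)  # only the tls handshake error is new after the move
```

Errors that show up in both counters are the known, possibly benign ones; anything in `new_in_cloud` is exactly the "is this application performing differently?" signal.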
I was going to say, a lot of it is, when you can aggregate logs, especially at a really high volume, it makes it easier when you don't know what you don't know. You can use it as a laboratory to look at lots and lots of data in aggregate and discover monitoring opportunities you hadn't considered before. Like this one. This is a view in Papertrail from part of the cloud demo for the other session. This thing is ingesting about 90,000 events an hour, aggregated from a Kubernetes cluster. I've got about 150 Docker containers spinning up and spinning down, and every time they do, they're running workloads and spitting out an awful lot of messages. I don't necessarily know, especially with containerized applications, what those events are going to look like. So when you stand up a new application, or use it in a new context, or package it differently than you would have on-prem, it may do something very different when it's actually running in the cloud. This is a chance to see that much quicker than you would have trying to log into it remotely, and also to query three or four days of data to find that one weird, novel exception you weren't seeing before. If you hadn't been watching, you wouldn't have caught it. It's in there somewhere, and you can actually test it out.
Would that be one of your success metrics — to identify, hey, nothing new came up in the logs, the process didn't generate any new problems?
It's probably not a success criterion for my stakeholder. My stakeholder is expecting lift and shift to just work. They're looking to provide value to the company. It's definitely a goal of mine, because if the application isn't running, I really can't move forward. So it's something I have to take into account, but it's not really what my stakeholder is after.
Makes sense, makes sense. What other tools? Is there anything else you leveraged to make this transition happen?
Well, we talked about the waterfall. I find WPM to be a huge value. It captures the waterfall when a transaction fails, and it captures the waterfall over time. Which allows me to see that it's the tracking pixel causing these slow load times, or that a certain region is slow, so we could stand up another data center in that region.
So it's interesting that you bring up both Pingdom and WPM, and the differences between the two products. WPM is going to allow you to have control over where that probe exists — you can place that probe very specifically where you want it. Versus Pingdom, where you can probe from everywhere and anywhere at any given time.
Yes and no, right. So Pingdom is designed to look at the publicly facing front of an application. And it does it from a number of different locations around the world.
Spread out very significantly, right.
But it's not running a probe inside your environment. So if you have an application, which maybe it's— you've migrated that service to the cloud, but it's still being delivered for a largely internal audience or maybe people who are connecting over VPN. You may need to be able to put probes somewhere in your environment on the inside of your firewall. And so WPM is a really great alternative for that.
That's a fair point. So, WPM gives you that ability: internal and external probe, if you need it.
Yeah, we actually run WPM internally, so I know if a specific transaction, our most important transactions, are failing. And then I know exactly which server it's failing on. If I run it from Pingdom, I know that the transaction may be failing but I'm not exactly sure which one. But both of them offer a great value. If there's a very important transaction and it's failing for the public, I want to know about that. But I also want to know if that transaction's failing on one of my servers in the cluster.
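A rough sketch of that internal-versus-external distinction: the same synthetic check can be pointed at each cluster node from inside the firewall (to learn *which* server is failing, as John uses WPM) or at the public endpoint (to confirm customers are affected, as with Pingdom). The URLs, port, and the injectable `fetch` stub are assumptions for illustration, not either product's API:

```python
import urllib.request

def check(url, timeout_s=5, fetch=None):
    """Probe one URL and return a small result dict. `fetch` returns an
    HTTP status code and can be stubbed out; by default it does a real GET."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=timeout_s).status
    try:
        status = fetch(url)
        return {"url": url, "up": 200 <= status < 400, "status": status}
    except Exception as exc:
        return {"url": url, "up": False, "error": str(exc)}

# Internal view (WPM-style): probe each cluster node directly, so a failure
# points at a specific server. Addresses are hypothetical; the fetch stubs
# stand in for real HTTP responses.
internal = [check(f"http://10.0.0.{node}:8080/health", fetch=lambda u: 200)
            for node in range(1, 4)]

# External view (Pingdom-style): one probe of the public endpoint tells you
# whether customers are affected at all.
external = check("https://demo.example.com/", fetch=lambda u: 503)

print(all(node["up"] for node in internal))  # internal nodes look healthy
print(external["up"])                        # but the public site is down
```

The stubbed scenario shows why both views matter: every node can answer its health check while something in between (load balancer, DNS, CDN) still takes the public site down.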
So it really comes down to the use case and which tool you need to leverage to actually measure that indicator, that key performance component.
So what other— any other tools that you've leveraged thus far, or any that you guys are looking to continue to leverage as you do the ongoing monitoring of the cloud environment?
TraceView is huge for a custom application. I really want to know, via an agent, how my application is working. It's showing me error rates and my load time across different regions. I use Papertrail for log consolidation — I've expanded to too many servers to comb through logs individually, so I needed consolidated logs. I use SAM for my internal clusters. I use WPM to evaluate my performance. I think that rounds out a core set of our products.
So it really takes a lot of products to A) understand what the kind of the baseline is to start with before you make your transition. B) What is actually happening as you make that transition to understand, did anything come up? And C) What is it going to take to actually continue to monitor the solution as you move forward, after you've made this transition.
Yeah, one of the things moving to the cloud brings up is that sometimes roles even shift. Where I wasn't previously exposed to network access lists, I'm now fully responsible for those network access controls. All of that new responsibility requires a lot of different monitoring from different angles, depending on what I'm trying to accomplish.
Well, it also requires us to learn new skills. You mentioned TraceView before. TraceView is a distributed tracing technology — it's actually watching real transactions go through all the layers of an application. Monitoring the infrastructure is the way we typically do it: we look at the whole stack and monitor all the different elements. But in this case, you're looking at it from the application-traffic-flow perspective, watching real transactions at really high volumes and asking, hey, of the 30,000 of these that happened in the last hour, what commonalities do we see in the performance of the interactions between the different layers? I was also going to say, if you haven't used it before, Pingdom Server Monitor is kind of great. If you don't have a lot of experience with Linux, it's a really handy way to deploy agents where the monitoring itself is hosted in the cloud. You just add that agent to your Chef recipe to make sure it's deployed at provision time. It's not typical infrastructure monitoring, and if you're used to Windows, it's an easy way to start looking at Linux systems, especially with a lot of plug-ins.
So would it be fair to say that SolarWinds has a portfolio of products that allows you to monitor all aspects of your application, whether it's on-prem or it's in the cloud?
I would say, don't think of it as what SolarWinds offers or doesn't. I think it's bigger than that. Whether you're using SolarWinds products or not, you need different perspectives on the applications you run. The answer with cloud, more than ever, is that there is no one perfect, magical thing you go buy that addresses everything in one place. Whether it's responsibilities to the business; or, as we talked about before, having metrics that mean something, that can actually justify a very large investment — in some cases with risk and a lot of question marks — to management; or grouping performance in new ways; or measuring user experience and incorporating that as one of the KPIs that IT is actually graded on. When you're setting those goals and establishing your plan for how you're going to migrate, the main thing is just: monitor. Monitor all of the things, and let the things you need to monitor drive the selection of tools — ours or anybody else's. Try out a lot of stuff and see what works for you.
I think you would agree, wouldn't you?
That's going to help drive those success metrics.
I think so.
You think that's enough for today?
I think we're doing good. So, I think we've talked about several things. I think we've given our audience a little bit to chew on. So from me, I'm Steven Hunt, this is John Martinich, Patrick Hubbard. We appreciate you guys joining.