SolarWinds hosted a Twitter chat using #syschat to talk about hybrid cloud, software-defined data centers, converged infrastructure, and container technologies. We invited three distinguished technology SMEs:
We discussed the tech constructs, how monitoring would be affected, how an IT pro's career path would be affected, and what resources are available to IT pros.
Below are some of the highlights of the #syschat. The complete #syschat can be viewed here: Hybrid Cloud, SDDC, Converged Infrastructure & Containers on Storify.
Please join Christopher, Dennis, John, and me on March 3rd at 11AM-12PM CT for the 2015 Converged, Cloud, and SDx Predictions for IT: Reality or Fiction? webcast hosted by SolarWinds. Look forward to the conversation and engagement.
Back in 1989, I received a copy of a program that promised to be groundbreaking. It was called “Forest-and-the-Trees.” The software, which came on a 3.5” floppy, proposed to scan my computer's hard drive (a whopping 80 MB!), detect patterns and trends in the data, and present them in such a way that I could have insight into my business, something impossible to imagine just a few short years before.
For the sake of this discussion, it doesn't matter that the software was less than impressive (or at least, it was less than impressive with the negligible amount of “business” data I had on my computer). What matters is that our desire to use all the computing power at our disposal to find connections we might not otherwise see is deeply-ingrained, and goes as far back as computers themselves.
We at SolarWinds have been spilling a lot of digital ink talking about “AppStack.”
As a member of the SolarWinds team, I think it's worth the time and space to discuss because it satisfies that same deep-seated desire to have the computer find the connections, and show us how everything fits together in ways we may not otherwise see.
On a personal level, I love it for the sheer cool factor and for the fact that it's effectively “free.” SolarWinds AppStack isn't as much a product as it is a philosophy and an overriding architectural goal. That's right, you can't run out and buy a box of AppStack. It's not available in stores at any price. Operators are not standing by to take your call. You “get” a bit of AppStack in every SolarWinds product you buy, and the insight grows with each solution you add to the pile (thus the “stack” of Apps).
You can get a glimpse of the power of AppStack in SolarWinds lab episode 25 (jump to the 11:35 minute mark for just AppStack, but the whole episode is worth watching).
But what about us network guys? Based on at least half of its name, you'd think there wouldn't be much for the average router jockey to care about, right?
There are a few reasons this attitude is wrong:
Breaking the silo mentality
We all have our niche, which I recently wrote about on my personal blog. In that post, I discussed how important it is to choose a specialty to avoid the special kind of hell called “IT Generalist.” Even so, IT professionals can no longer afford to get caught up in the “that's not my area” mentality. Sure, you have three routers and two core switches under your desk. But that doesn't mean you can't know or care what is running on your wires. AppStack lets you quickly familiarize yourself with how all the parts fit together, and in turn you never have to attend a DevOps meeting…ever!
Monitor! All! The! Things! We know it's coming. Wearable devices, warehouse geo tracking, real-time shipment data, and more. The Internet of Things is going to create pressure not just on applications, but on the networks that all that data rides on. Having a tool like AppStack will allow you to discern the pressures being placed on the wire from those being placed on the application infrastructure.
Which leads us to...
MTTI stands for “mean time to innocence”: the speed with which you can prove it's NOT the network's fault. AppStack allows you to show the entire connection chain of an application, from the customer-facing web server to the back-end database and even further back to the storage array, pinpointing the source of the problem. SolarWinds Lab Episode 25 provides another great example of what I’m talking about. It’s that good (jump to the 19:40 mark). In the case shown in the video, what are the odds that the network would have been blamed for slow data instead of the real culprit: an overworked RAID5 array connected via iSCSI to a demanding SQL server?
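To make the idea of walking the connection chain concrete, here is a minimal Python sketch. It is not how AppStack works internally; the component names, the dependency map, and the status values are all invented for illustration, loosely following the web server, SQL server, and RAID5 array example above. The function reports the unhealthy components that have no unhealthy dependencies of their own, which is roughly what "proving innocence" amounts to.

```python
# Hypothetical dependency map: each component lists what it relies on.
# Names and statuses are invented for illustration only.
DEPENDS_ON = {
    "web-frontend": ["sql-server", "core-switch"],
    "sql-server": ["raid5-array", "core-switch"],
    "raid5-array": [],
    "core-switch": [],
}

STATUS = {
    "web-frontend": "warning",   # users report slow pages
    "sql-server": "warning",     # queries are waiting on disk
    "raid5-array": "critical",   # overworked array
    "core-switch": "ok",         # the network is innocent
}


def root_causes(node, depends_on, status):
    """Return unhealthy components with no unhealthy dependencies beneath them.

    Assumes the dependency map has no cycles.
    """
    if status[node] == "ok":
        return set()
    causes = set()
    for dep in depends_on[node]:
        causes |= root_causes(dep, depends_on, status)
    # If nothing beneath this node is unhealthy, the node itself is the likely culprit.
    return causes or {node}


culprits = root_causes("web-frontend", DEPENDS_ON, STATUS)
print(sorted(culprits))
```

In this made-up data the switch is healthy, so the walk ends at the storage array rather than at the network.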
Back in 1989, the idea of software correlating insight from multiple systems was an enticing vision. It’s nice to know that if you stick around long enough, you get to see some of those visions become reality.
Edit 20150303: Adding links to other AppStack posts, for your reading pleasure
While perusing my virtualization blog feeds, I noticed a post from Amy Manley (@wyrdgirl) on her blog, Virtual Chick, entitled "Solarwinds AppStack with SRM Flavor and More!" Amy is a fellow VMware vExpert and a certified virtualization unicorn. She was also a delegate at Virtualization Field Day 4, so imagine how awesome it was to find out that she wrote about SolarWinds AppStack, especially after our launch announcement.
Amy walks through what AppStack consists of, which is Server & Application Monitor, Storage Resource Monitor, Virtualization Manager, and Web Performance Monitor. She also covers what I think will be the most useful view for IT admins: the AppStack view. And then, she summarizes AppStack perfectly by stating that it is "trying to make it easier to troubleshoot with all the information in one place" and "maps out the underlying dependencies and infrastructure to help bring about swift resolution." AppStack definitely puts the application front and center with all the contextual connections into the stack layers.
The part that I'm most ecstatic about is that our AppStack demo at VFD4 was memorable and worthy enough for Amy to write about it and share with her network.
There is much talk in the IT profession about automation. “Automate all the things” is written in some shape or fashion across a variety of blogs and social media platforms. I even briefly mentioned it in my last Geek Speak post about configuration management. You have read that already... Right?
I get the movement. Why do everything manually, wasting your time on tedious, trivial tasks when you could be working on the newest design for your data center or something better? And even though I could probably consider myself a new age networking professional, there’s still one task I enjoy doing the old-fashioned way: network discovery.
Call me crazy, but the task of learning a network for the first time in my opinion is best done manually. There are so many nuances that could be lost if this process is done automatically. Dissecting a network, one device at a time, port by port truly allows the ability to intimately understand the complexities of that network. Here are some tips and tricks that I have learned along the way and also seen other networking professionals speak of when discovering a network for the very first time:
What about you all? Do you prefer automated network discovery or would you rather do it manually? Have any tips for either method? I look forward to hearing from you all.
An IT debate has been brewing for years now: old-school IT management, built on experience and practice, versus new-school IT management, built on policy-based management and analytic engines that leverage vendor knowledge bases and learning algorithms.
This debate is reminiscent of politics. IT is viewed as a slow and inefficient bureaucracy by developers and end-users; and yet, IT needs to perform due diligence and impart rigor and processes for compliance, governance, and security. Policies are derived from best practices of well-understood problem-solutions, while experience hardens IT pros toward resolving issues quickly. Oh how IT weaves a twisted tale, since best practices come from IT experience and form the foundation of the policy and analytic engines. In the as-a-Service IT world, strategies are being built around automation and orchestration, which facilitate auto scaling and self-healing. And automation and orchestration imply heavy doses of policy-based management actions.
Disruptive innovation is one of the primary drivers for the adoption of new-school IT management. As the hardware stack converges towards the commodity limit, businesses view the software application as the layer where differentiation will be realized. This differentiation will come in the form of:
The real value of an IT pro shows in times of need, when things break and systems misbehave, and in how quickly the IT pro can perform root-cause analysis and bring things back to working order. The counterbalance to this is that there are specific breaks that always follow specific, well-known steps to remediate. These are the opportunities to automate and self-heal. There are also many tasks that may appear repetitive and that self-identify as automation candidates, but they require connected context of the stack and application. Otherwise, it’s a whole lot of wasted actions and time.
Where are you on this spectrum? Is your organization looking at fully automating IT management, taking a hybrid approach that mixes automation and orchestration with some manual intervention, or staying completely manual? And are you being represented in the policy-based IT management world?
I like cake. Most people do, I suppose. And that’s why I use cake as an analogy in my classes and presentations whenever I talk about users, data, and performance.
See, there are many, many layers of cake between users and their data. Some layers are thicker than others, some more delicious, and some more important. But together the layers make the cake complete. When done correctly, these layers form something that you want to consume more and more, just like a good cake should.
You’ve been hearing a lot about AppStack this month. Truthfully whenever I hear “AppStack” I just think of cake. It makes it easier for me to understand all the layers, and I think the analogy works for others as well. Why is that? It is because AppStack means different things to different people. Just look at the previous posts and you’ll get a sense for the differing points of view about what is important.
Today I want to talk about AppStack from the database administrator’s point of view. For many people a database is a black box; queries go in, data comes out. It’s a mystery as to what happens inside that magical box. And it’s not just a database that’s a mystery either. Admit it; you have no idea what a DBA does for work all day long. Besides a lot of Twitter and Facebook activity, we are often busy providing evidence as to why the database itself is not the bottleneck in whatever business process is being seen as “acting slow”.
Let’s break this down into some delicious layers of cake:
The backbone of any infrastructure, your data is stored somewhere (Earthed or Cloud) on something (traditional hard drives or flash drives). Data is read from, or written to, these storage devices. Often these devices are shared among many systems, making it difficult to know if your data requests are the cause, or victim, of performance issues. All of this disk activity ultimately needs to travel through the network, too.
It is no longer uncommon for database servers to be virtualized. Unfortunately, virtualized database servers are often improperly configured, leading to performance issues and concerns. Improper configuration choices at the host and guest layers are often the root cause for what is perceived to be a database issue. Knowing where to look (database, guest, host, storage) first is important for any administrator.
The database engine itself is often blamed as the source of performance issues, but the reality is that the true bottleneck exists elsewhere. In many cases, issues directly related to the database engine are caused by configuration options selected upon installation of the engine. I’ve lost track of the number of times I’ve tracked down the root cause of a database issue to be the fact that it’s so easy to click “Next, Next, Next” during an install.
The icing on our cake is, of course, the end-user. But let’s think about what is under that icing. The end-user is interacting with an application (may or may not be local to their PC), they send a request through the network, the application takes the request and it may (or may not) need to interact with a database, and the database may (or may not) need to interact with storage, and once the data is retrieved/written, back up the chain we go.
The end-user knows nothing of the series of tubes known as the internet. And what if, at the end of all this, the end-user needs to wait for a report to be rendered? They can be sitting there, watching the hourglass turn, silently muttering under their breath about how slow the database is behaving, when in fact the issue is a piece of reporting software that takes minutes to draw a pretty graph.
And that’s why AppStack is so important to someone like me. I get the opportunity to visually show someone where the bottleneck truly lies. This saves time and reduces frustration. And in the end it makes for a more stable infrastructure that a business can leverage.
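As a rough illustration of showing where the bottleneck lies, the Python sketch below takes per-layer timings for one slow request and reports which layer ate most of the wait. The layer names follow the chain described above; the numbers are hypothetical and exist only to mirror the slow-report scenario.

```python
# Hypothetical per-layer timings (seconds) for one slow request.
# The figures are invented; only the layer names come from the discussion above.
layer_seconds = {
    "network": 0.04,
    "application": 0.30,
    "database": 0.45,
    "storage": 0.20,
    "report rendering": 95.0,
}


def slowest_layer(timings):
    """Return (layer, share_of_total) for the layer that consumed the most time."""
    total = sum(timings.values())
    layer = max(timings, key=timings.get)
    return layer, timings[layer] / total


layer, share = slowest_layer(layer_seconds)
print(f"{layer} accounts for {share:.0%} of the wait")
```

With these made-up numbers the database is a rounding error, which is exactly the kind of evidence a DBA wants to be able to put on the screen.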
In my LinkedIn profile, I write that I’m a fan of “elaborate IT Metaphors,” yet, in a very literal way, I’ve never actually written a list of my favorites.
Listing out my favorite IT metaphors, sayings, aphorisms and such is risky. Too much pithiness, and I risk not being taken seriously. Too much cynicism and no one wants to talk to you.
And yet I must take that risk, because if you’re a practitioner of the IT arts as I am, then you’re used to engaging in these sorts of thoughtful/silly/humorous reflections.
Or maybe you aren't, but you will find humor and value in them nonetheless. Enjoy, and if you like, add your own!
| Metaphor | Meaning | Used in Sentence | Origin/Notes |
|---|---|---|---|
| Dark side of the moon | Waiting for a host or device to reply to pings after reload/reboot | Server's gone dark side of the moon, standby. | NASA obviously |
| Eat our own dogfood | Applying same policies/tech/experience to IT that apply to users | That dogfood tastes pretty nasty and we've been dishing it out to our users for years | Not sure but heard on TWiT |
| DNS is like a phonebook | Computers speak in numbers, humans speak in words | Look, it works like a phonebook, alright? Do you remember those things? | My own metaphor to explain DNS problems |
| Fat finger | A stupid mistake in perhaps an otherwise solid plan (e.g., IP address keyed incorrectly) | Jeff totally fat fingered it | Former boss/Homer Simpson? |
| Go FQDN or Go Home | Admonishment to correct lazy IT tendency to code/build with IP addresses rather than FQDN | There are to be no IP addresses listed on the support page. Go FQDN or Go Home, son. | My own |
| Garbage in Garbage Out | You get out of a system that which you put in | I can't make beautiful pivots with a GIGOed set | Unknown but s/he was brilliant |
| Debutante in the Datacenter | A server or service that is high-profile/important but prone to failure and drama without constant attention | Hyperion is doing its debutante thing again | Heard it from someone somewhere in time |
| Cadillac Solution | A high-priced solution to a problem that may require only ingenuity/diligence | Don't come to me with a Cadillac solution when a Kia will do | My own but really…Cadillac…I'm so old |
| The Sun Never Sets on Infrastructure | A reference to the 24/7 nature of Infrastructure stack demand by way of the British Empire | And the sun never sets on our XenApp farm, so schedule your maintenance | I used this metaphor extensively in last job |
| Infrastructure is Roads/Applications are cars/Users are drivers | Reference to the classic split in IT departments | See here | Former colleague |
| Two Houses Both Alike in Dignity | Another reference to AppDev & Infrastructure divide in IT | - | My own liberal abuse of Shakespeare's opening line in R&J |
| Child Partition/Parent Partition | Reference to me and my son in light of Hypervisor technology | Child partition is totally using up all my spare compute cycles | My own |
| Code is poetry | There is something more to technology than just making things work/be an artisan technologist | Just look at that script, this guy's a poet! | Google but adapted by me for scripting and configs |
| Going full Fibonacci | The joy & euphoria inherent in a well-designed subnetting or routing plan wherein octets and route summaries are harmonized & everything just fits | He went full Fibonacci to save memory on the router | My own abuse of the famed Fibonacci Sequence which honestly has nothing to do with IP subnetting and more to do with Dan Brown. Also applies to MAC address pools because encoding your MAC address pools is fun |
| Dystopian IT | Dysfunctional IT departments | I thought I was going to work in the future, not some dystopian nightmare IT group! | Not sure |
| When I was a Child I thought as a Child | How I defend poor technical decisions that haunt me years later | - | A (perhaps blasphemous) homage to St. Paul |
| There are three ____ and the greatest of these is ____ | Another St. Paul reference | And then there were three storage systems: file, block and object, and the greatest of these is file | Useful in IT Purchasing decisions |
| IT White Whale | Highly technical problems I haven't solved yet and obsess over | I've been hunting this white whale for what seems like forever | Borrowed from Herman Melville's Moby Dick |
| Servers are cattle, not pets | A pithy & memorable phrase to remind systems guys not to get attached to physical or virtual servers, to view them as cattle that are branded at birth, worked hard throughout life, then slaughtered without pomp or circumstance. | No more Star Wars references as server names, ok? It's a cow, not your pet labrador! | The guys who built Chef |
| Drawer of Tears | The drawer in your desk/lab where failed ideas -represented by working parts- go. My drawer of tears is filled with Raspberry Pis/Surface RTs etc | Yeah I tried that once, ended up in the drawer of tears | My own |
“Cyber terrorism could also become more attractive as the real and virtual worlds become more closely coupled, with automobiles, appliances, and other devices attached to the Internet.” -- Dorothy Denning
In my recent posts I have laid bare the landscape of cyber security, some of the risks, and some of the existing and possible solutions to the current, untenable state of our networks. One of the biggest risks to the consumer, to everyone really, I haven’t yet brought up. The so-called Internet of Things, or the Internet of Everything, is going to bring a level of risk an order of magnitude greater than anything we have seen to this point.
The Internet of Things is a marketing term for sure, but what it represents conceptually is very real: the interconnection of everything from light bulbs, to refrigerators, to door locks. All of these things and many others, all connected together via our home and public networks, all potential risks. When everything is an entry point to the network, and every device we own is connected, imagine the chaos of a large scale hack, virus, or distributed denial of service attack.
We’re not there yet, of course. These things are just beginning to become connected. Devices like the Nest thermostat, the Philips Hue light bulbs, and some of the remote door locks and security systems accessible from our phones are just the opening act, the beginning of a future where literally everything in our lives is connected. How does our standard model of perimeter security look now? Where exactly is the perimeter we’re defending?
I don’t have many of the answers, and I doubt a lot of other people do either. That’s not to say that there aren’t a lot of smarter people out there working on the problem, it’s just a very hard nut to crack. The security landscape is moving so quickly, the things we think of today are all but ineffectual tomorrow. All I know is that the more things we connect, and the more we depend on those things, the bigger the imperative is that we figure something out soon.
A couple of years ago at Cisco Live, I was chatting with another admin about the future of monitoring. It was relatively quiet in the expo hall and we’d shooed away a couple of Cisco PMs to play with the UCS configurator demo. Mike (not his real name) was primarily a network administrator, but was increasingly involved with very non-network engineering, like compute (in the guise of UCS) and storage performance, at least in the context of connectivity to VM clusters. He kept saying “App.. Stack,” with pejorative emphasis on “App.” I thought at the time, and have only become more convinced since, that perhaps his discomfort with the term actually stems from its historical incompleteness.
You might be thinking, “AppStack. Great, yet another thing I have to wrap my head around. All it needs is a trademark.” In past posts of this series, Joel Dolisy's A Lap around AppStack described related SolarWinds product evolution, LGarvin's A Look at Systems’ Role in the AppStack discussed systems considerations, and, most recently, kong.yang's AppStack Addresses the Dynamic Changes in the Virtualization and Storage Layers both broke out the role of virtualization and storage and set the record for longest article title. I expect that Leon Adato and Thomas LaRock will have more to say in upcoming posts as well, especially on the subject of databases. So, SolarWinds has been on a bit of a tear about AppStack recently, and you might think they have a new product, perhaps with AppStack in the title.
Nope. You’d be wrong. You and Mike already know what it is and it’s not necessary to reinvent anything. I think it’s simply an effort to draw attention to the inescapable and perhaps even unpleasant truth that we can’t think of just “our part” of IT, “our part” of the network, or even “our part” of the help desk. To win at IT, or maybe even just to win back our weekends, we have to keep the extensive tendrils of all elements of applications in the front of our minds AT ALL TIMES.
Way back in history, between the discovery of fire and invention of the wheel, applications were simple affairs. They sat on a mainframe, micro, or PC, with local storage. Yes, remote access was developed from green screen, to fat client, and ultimately to Web. But in most cases, the app below a minimal veneer of presentation infrastructure was unchanged. It’s that App-at-the-top context that, even with shared storage and server virtualization, belies a modern restatement of the meaning of AppStack. It’s A to Z, user to spindle (or EFD flash controller), and everything in between. And, with the complexity of today’s applications, “everything” includes quite a pile of… let’s just say bits.
With a broader definition of AppStack, I thought back to Mike’s solution for operating a hybrid-cloud hosted virtual classroom system. I white-boarded out the components we discussed, and then projected all the components we didn’t discuss. HTTP services for presentation and data tier- check. Storage and virtualization- check. Core, distribution, WAN, and firewall also check. But what about the VPN and firewall to the AWS VPC? Traditionally, that’s infrastructure, but from the context of AppStack, that’s a required component of his service.
What about security certificate expiration for the Web portal? Or, the integration of durable subscription endpoint services? Or, Exchange servers, which transmitted student notification emails and relayed student questions? None of those components seem like part of a traditional application, but are critical to Mike’s uptime. He should not only be monitoring all of these elements, but ideally connecting them via context(s) so that specialist admins across his teams may troubleshoot quickly. And therein lies the greater initiative we should lead as trusted administrators—education.
Don’t be afraid to open up the definition of AppStack in your environment, human and silicon. Encourage your team, from desktop support to the LUN master, to sit in the user’s chair. List out every element of the connectivity chain you can think of. Go to the white board and discover or imply some more. Lastly, identify linking context. I think you’ll find that with the inclusion of previously unmonitored services, the proactive AppStack approach can keep users happy and the help desk quiet. OK, quieter.
Engineers Unplugged is a Cisco production that has had many wonderful engineers enlighten and delight the audience with technical knowledge and a unicorn flair. At VMware PEX 2015, I was given an opportunity to join the ranks of “I Drew the Unicorn” on Engineers Unplugged.
My fellow white-boarding engineer was Mike Columbus, a Cisco Consulting Systems Engineer. Our topic was simple: the elusive IT Management Unicorn. What exactly is the IT Management Unicorn? Well it’s one-part monitoring, two-parts troubleshooting, three-parts reporting with a horn for more bacon. Trust me it will be EPIC.
Next, SolarWinds Head Geeks, patrick.hubbard and adatole are teaming up with Lauren on something fantastic for Cisco Live 2015. It has so much software-defined networking goodness. The SolarWinds Head Geeks and Cisco Engineers Unplugged crew will make magic when you see us at Cisco Live 2015. Register for Cisco Live 2015. And follow Cisco Live news on Twitter with #CLUS.
Finally, two more reasons to attend Cisco Live 2015:
“One hundred twenty countries currently have or are developing offensive cyber attack capabilities, which is now viewed as the fifth dimension of warfare after space, sea, land and air…” - Jamie Shea, NATO Director of Policy Planning
In cyber warfare, as in any kind of warfare, there are two types of players: those on offense and those on defense. Many countries have offensive capabilities, though none rival those of the United States. All we have to do to confirm this anecdotally is to review some of the past years’ headlines of hacking attacks against foreign nations, ostensibly by the NSA or NSA-affiliated entities. I would also encourage you to view the 2014 Data Breach Investigations Report compiled by Verizon for an extensive list of attack vectors from the past year.
While offensive capabilities are, if not easy to develop and execute, at least fairly ubiquitous, defense is another matter entirely. As a nation we are largely able to defend key strategic military assets, but not all of them. We are able to defend our financial sector somewhat, though not as successfully as with our military assets. Most private enterprises that fall into the “other” category are either undefended today, indefensible due to lack of knowledge, staffing, or willpower, or have already been compromised. We are not good at defense.
Two approaches to this are needed. The first is for entities like the NSA and private corporations to have the ability to share information (not a forced mandate) and react using the resources of each, while still maintaining privacy. Most large enterprises immediately call in the government once they’re aware of a hack, so why can’t the government work with them proactively to mitigate attacks in the early stages?
The second approach, taken by only the largest of companies, is to monitor everything on their networks at full wire speed (no small task), and then to feed that real-time information into a big-data engine. Run real-time analytics using something like SAP HANA, where the sheer volume of information can be analyzed as it arrives. This generates alerts based on real-time anomaly detection in a much more sophisticated way than any IDS/IPS ever could, but it’s still missing one piece: the ability to remediate in real time. This is one of the use cases for SDN, something that I may explore in another post.
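For a feel of what anomaly detection on a live stream means at its simplest, here is a small Python sketch. It is a basic rolling z-score check and a deliberate stand-in: it makes no claim about how SAP HANA or any commercial analytics engine works. The class name, window size, threshold, and sample readings are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flag values that sit far from the mean of a sliding window of recent values."""

    def __init__(self, window=30, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if value is anomalous relative to the current window."""
        anomalous = False
        if len(self.values) >= 5:  # wait for a little history before judging
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        if not anomalous:
            # Keep outliers out of the baseline so one spike does not mask the next.
            self.values.append(value)
        return anomalous


# Hypothetical stream of connections-per-second readings with one burst.
readings = [100, 104, 98, 101, 97, 103, 99, 102, 950, 100, 101]
detector = RollingAnomalyDetector(window=30, threshold=3.0)
alerts = [(i, r) for i, r in enumerate(readings) if detector.observe(r)]
print(alerts)
```

A detector like this only raises the alert. The missing piece described above, acting on the alert automatically, is a separate problem.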
What approaches are you taking today? What approach would you take if you had an unlimited budget to work with? What other suggestions do you have beyond the things we’re doing today?
On Friday the 13th, Kaspersky, a Russian anti-malware and research firm, released a report documenting a significant campaign to infiltrate banks worldwide to steal hard cash. Somewhere between $300 million and $1 billion is estimated to have been pilfered.
Attackers entered the banks' systems using phishing lures and then compromised systems that had known vulnerabilities. The actual monetary theft came from observing the banks' processes and compromising users that had the right access to the banks' financial applications.
This was an extensive and coordinated operation as the cybercriminals moved electronic money through SWIFT (an international interbank transfer program), and cash through reprogrammed ATMs – essentially turning your local ATM into a Las Vegas jackpot. Clearly creating so many fraudulent receiver accounts and spewing cash required an extensive money mule network.
Given that the actual theft involved deep understanding of the target banks' audit and reconciliation procedures, and actual understanding of banking software, this was a well-researched and staged attack – the essence of an Advanced Persistent Threat (APT). So if a sophisticated, regulated organization like a bank is vulnerable, are there any lessons for the rest of us?
Here are a few takeaways we can all apply in our own organizations.
1. Staff security awareness
Your staff is your front line infantry in the battle against cybercrime. Even small organizations can put together a meaningful security awareness program with open source tools.
2. Backup and Patch
If you have a good backup program, you can be more aggressive about patching. Depending on your organization size and how mission critical your systems are, backup – test – patch is a tried and true method for avoiding infections that do not use 0-days.
Use your logging, log management and Patch management systems to find:
For more information see:
Kaspersky report of Carbanak APT
Free Security Awareness
"Do you think the guys running Azure or AWS care if a server gets rebooted in the middle of the day?" I asked the Help Desk analyst when he protested my decision to reboot a VM just before lunch.
"Well, uhh. No. But we're not Azure," he replied.
"No we're not. But we're closer today than we have ever been before. Also, I don't like working evenings." I responded as I restarted the VM.
The help desk guy was startled, with more than a little fear in his voice, but I reassured him I'd take the blame if his queue was flooded with upset user calls.
Such are the battles one has to fight in IT Environments that are stuck in what I call the Old Ways of IT. If you're in IT, you know the Old Ways because you grew up with them like I did, or because you're still stuck in them and you know of no other way.
The Old Ways of doing IT go something like this:
I don't know about you, but working in environments stuck in the Old Ways of IT really sucks. Do you like working evenings & weekends? I sure don't. My evenings & weekends are dedicated to rearing the Child Partition and hanging out with the Family Cluster, not fixing broke old servers tied to RFC-1918 IP addresses.
As the VM rebooted, my help desk guy braced himself for a flood of calls. I was tempted to get all paternalistic with him, but I sat there, silent. 90 seconds went by, the VM came back online. The queue didn't fill up; the help desk guy looked at me a bit startled. "What?!? How did you...but you rebooted...I don't understand."
That's when I went to the whiteboard in our little work area. I wanted to impart The New Way of Doing IT upon him and his team while the benefits of the New Way were fresh in their minds.
"Last week, I pushed out a group policy that updated the url of Feature A on Service 1. Instead of our users accessing Service 1 via IP Address 192.168.200.35, they now access the load-balanced FQDN of that service. Beneath the FQDN are our four servers and their old IP addresses," I continued, drawing little arrows to the servers.
"Because the load balancer is hosting the name, we can reboot servers beneath it at will," the help desk guy said, a smile spreading across his face. "The load balancer maintains the user's session...wow." he continued.
"Exactly. Now you know why I always nag you to use FQDN rather than IP address. I never want to hear you give out an IP address over the phone again, ok?"
"Ok," he said, a big smile on his face.
I returned to automating & building out The Stack, getting it closer to Azure or AWS.
The help desk guy went back to his queue, but with something of a bounce in his step. He must have realized, the same way I realized it some years back, that the New Way of IT offered so much more than the Old Way. Instead of spending the next 90 minutes putting out fires with users, he could invest in himself and his career and study up a bit more on load balancers. Instead of rebooting the VM that evening (as I would have had him do it), he could spend that evening doing whatever he liked.
As cliche as it sounds, the new way of IT is about working smarter, not harder, and I think my help desk guy finally understood it that day.
A week or two later, I caught my converted help desk guy correcting one of his colleagues. "No, we never hand out the IP address, only the FQDN."
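For readers who prefer code to whiteboards, here is a minimal Python sketch of the idea from that afternoon. It is a toy round-robin balancer, not how any particular load balancer product works. The FQDN and three of the four backend addresses are made up for illustration, and session persistence is deliberately left out.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer: clients see one name, not the servers behind it."""

    def __init__(self, fqdn, backends):
        self.fqdn = fqdn
        self.healthy = {ip: True for ip in backends}  # ip -> is the server up?
        self._ring = cycle(backends)

    def set_health(self, ip, up):
        self.healthy[ip] = up

    def route(self):
        """Return the next healthy backend, skipping any that are down."""
        for _ in range(len(self.healthy)):
            ip = next(self._ring)
            if self.healthy[ip]:
                return ip
        raise RuntimeError("no healthy backends behind " + self.fqdn)


lb = LoadBalancer(
    "service1.example.internal",  # hypothetical FQDN
    ["192.168.200.35", "192.168.200.36", "192.168.200.37", "192.168.200.38"],
)
lb.set_health("192.168.200.35", False)      # reboot one VM in the middle of the day
served_by = {lb.route() for _ in range(8)}  # users keep hitting the FQDN
print(served_by)
```

Requests keep landing on the three servers that are still up, which is why the help desk queue stayed quiet. Anyone who had been handed the raw IP address of the rebooted server would have had a very different 90 seconds.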
You sit down at your desk. It's 9:10AM and your coffee is still warm. There is a smell of bacon in the air.
Suddenly your phone rings. The trading system is down. The time for quick thinking is now.
Where would you begin to troubleshoot this scenario?
A lot of people will reach for the biggest hammer they can find: a tool that will trace all activity as it hits the database instance. For SQL Server, that tool is typically SQL Profiler.
The trouble here is this: you are in a reactive mode right now. You have no idea as to the root cause of the issue. Therefore, you will configure your trace to capture as many details as possible. This is your reaction to make certain that when the time comes you are prepared to do as thorough a forensics job as possible in the hope that you can fix the issue in the shortest amount of time.
And this method of performance monitoring and troubleshooting is the least efficient way to get the job done.
When it comes to performance monitoring and troubleshooting you have two options: tracing or polling.
Tracing will track details and capture events as they happen. In an ironic twist this method can interfere with the performance of the queries you are trying to measure. Examples of tools that use the tracing method for SQL Server are Extended Events and SQL Profiler.
Polling, however, is also known by another name: sampling. A tool that utilizes polling will gather performance data at regular intervals. This is considered a lightweight alternative to tracing. Examples of tools that use this method are Performance Monitor (by default it samples once per second) and third-party applications like Database Performance Analyzer that query dynamic management objects (which are system views known as DMVs in SQL Server, and x$ and v$ objects in Oracle).
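To make the polling idea concrete, here is a small Python sketch. It is not how Performance Monitor or any commercial tool is implemented; `read_metric` is a hypothetical stand-in for whatever counter or DMV query you would actually sample, and the readings are invented.

```python
import time


def poll(read_metric, interval_s, samples, sleep=time.sleep):
    """Call read_metric() at a fixed interval and return the readings."""
    readings = []
    for i in range(samples):
        readings.append(read_metric())
        if i < samples - 1:
            sleep(interval_s)  # do nothing between samples: that is the low overhead
    return readings


def summarize(readings):
    """Boil a series of samples down to the numbers you would glance at first."""
    return {
        "min": min(readings),
        "max": max(readings),
        "avg": sum(readings) / len(readings),
    }


# Hypothetical CPU readings standing in for a real counter.
fake_cpu = iter([12.0, 15.0, 91.0, 14.0, 13.0])
readings = poll(lambda: next(fake_cpu), interval_s=0.01, samples=5)
print(summarize(readings))
```

The sampler only touches the system a handful of times, yet the one spike still shows up in the maximum. That is the hint that tells you where a short, targeted trace might be worth running.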
See, here's the secret about performance monitoring and troubleshooting that most people don't understand: when it comes to gathering performance metrics it's not the what you gather as much as it is the how you gather.
Knowing what to measure is an easy task. It really is. You can find lots of information on the series of tubes known as the internet that will list out all the metrics an administrator would want. Database size, free disk space, CPU utilization, page life expectancy, buffer cache hit ratio, etc. The list of available metrics seems endless and often overwhelming. Some of that information is even useful; a lot of it can be just noise, depending on the problem you are trying to solve.
So which method is right for you?
Let me explain.
Think of a surgeon that needs to operate on a patient. There's a good chance that before the surgeon cuts into healthy skin they will take an X-ray of the area. Once they examine the X-ray, they know more about what they need to do when they operate.
Polling tools are similar to X-rays. They help you understand more about what areas you need to investigate further. Then, when you need to take that deeper dive, that's where you are likely to use a tracing tool in order to return only the necessary information needed to solve the problem, and only for the shortest possible duration.
I find that many junior administrators (and developers with novice database performance troubleshooting skills) tend to rely on tracing tools for even the most routine tasks that can be done with a few simple queries against a DMV or two. I do my best to educate when I can, but it often is an uphill battle. I lost track of the number of times I've been thrown under the bus by someone saying that they can't fix an issue because I won't let them run Profiler against a production box as a first attempt at figuring out what’s going on. Rather than make people choose between one tool or the other I do my best to explain how they work well together.
I never use a tracing tool as a first option for performance monitoring and troubleshooting. I rely on polling in order to help me understand where to go next. Sometimes that next step requires a trace, but oftentimes I'm able to help make positive performance improvements without ever needing to run a trace. Then again, I'm lucky that I have some really good tools to use for monitoring database servers, even ones that are running on VMware, or Amazon RDS, or Microsoft Azure.
There's a lot for anyone to learn as an administrator, and it can be overwhelming for anyone, new or experienced.
So it wouldn't hurt to double check how you are currently monitoring right now, to make certain you are grabbing the right things and at the right frequency.
I’ll be honest, when I initially saw the words configuration management, I only thought of managing device configurations. You know, things like keeping backup copies of configurations in case a device bit the dust. However, the longer I’ve been in the IT field, the more I’ve learned how short-sighted I was in relation to what configuration management truly meant. Hopefully, by the end of this post, you will either nod and agree or thank me for opening your eyes to an aspect of IT that is typically misunderstood or severely neglected.
There are several components of configuration management that you, as an IT professional, should be aware of:
Let’s briefly go over some of these and why they are so integral to maintaining a healthy network.
Most (hopefully all) IT teams keep an inventory of hardware and software that they support. This is imperative for things like service contract renewals and support calls. But how you keep track of this information is usually open to question. Are you manually keeping track of this information using Excel spreadsheets or something similar? I would agree that it works, but in a world so hellbent on automation, why risk human error? What if you forget to add a device and it goes unnoticed? Wouldn’t it be easier to have software that automatically performs an inventory of all your devices?
One of my favorite components of configuration management is configuration backup and the ability to view those backups as well as compare them to previous backups. If your Core switch were to fail today, right now, are you prepared to replace it? I’m not talking about calling your vendor’s support to have them ship out a replacement. I’m talking about rebuilding that new shiny piece of hardware to its predecessor’s last working state. If you have backups, that process is made easy. Grab the latest backup and slap it on the new device when it arrives. This will drastically cut down the recovery time in a failure scenario. Need to know what’s changed between the current configuration and 6 months ago for audit purposes? Having those backups and a mechanism for comparing them goes a long way.
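As a rough illustration of comparing backups, the sketch below uses Python's standard `difflib` module to diff two saved configurations. The switch configuration lines and backup names are invented for the example; a real configuration management tool will do considerably more than this.

```python
import difflib


def config_diff(old_text, new_text, old_name="backup-old", new_name="backup-new"):
    """Return a unified diff between two configuration backups as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))


# Hypothetical backups of the same core switch, six months apart.
old_cfg = """hostname core-sw1
interface Gi1/0/1
 description uplink
 switchport mode trunk
"""
new_cfg = """hostname core-sw1
interface Gi1/0/1
 description uplink
 switchport mode access
"""

diff = config_diff(old_cfg, new_cfg)
print("\n".join(diff))
```

Lines prefixed with `-` exist only in the older backup and lines prefixed with `+` only in the newer one, which is usually all an auditor wants to see.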
There are a number of ways to know when an intruder’s been in your network. One of those methods is through the detection and alerting of changes made to your devices. If you don’t have something in place that can detect these changes in real-time, you’ll be in the dark in more ways than one. How about if a co-worker made an “innocent” change before going on vacation that starts to rear its ugly head? Being able to easily generate real-time alerts or reports will help pinpoint the changes and get your system purring like a kitten once again.
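One simple way to notice that something changed is to fingerprint each device's configuration and compare it with a known-good baseline. The Python sketch below assumes you already have the configurations as text; the device names and contents are hypothetical. Comparing snapshots like this is periodic rather than truly real-time, so treat it as a complement to whatever change notifications your devices can send.

```python
import hashlib


def fingerprint(config_text):
    """SHA-256 digest of a configuration, used as a cheap change marker."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def detect_changes(baseline, current):
    """Compare {device: config_text} maps and list devices that differ.

    A device counts as changed if its config text differs, or if it is
    present in only one of the two maps.
    """
    changed = []
    for device in sorted(set(baseline) | set(current)):
        old = baseline.get(device)
        new = current.get(device)
        if old is None or new is None or fingerprint(old) != fingerprint(new):
            changed.append(device)
    return changed


# Hypothetical snapshots taken before and after a co-worker's "innocent" change.
baseline = {"core-sw1": "vlan 10\nvlan 20\n", "edge-rtr1": "ip route 0.0.0.0 0.0.0.0 10.0.0.1\n"}
current = {"core-sw1": "vlan 10\nvlan 20\n", "edge-rtr1": "ip route 0.0.0.0 0.0.0.0 10.0.0.254\n"}

print(detect_changes(baseline, current))
```

Once a device is flagged, a diff against the last backup tells you exactly what changed.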
In conclusion, configuration management is not just about keeping backups of your devices on hand. It involves keeping inventories of those devices as well as being able to view, archive, and compare their configurations. It also includes being able to easily detect and alert on changes made to your devices for events like catching network intruders. Are you practicing good configuration management techniques?
Whether it be at work or in my personal life, I like to plan ahead and be readily prepared. Specifically, when it comes to allocating storage, you definitely need to strategically plan your allocation. This is where Thin Provisioning comes in—organizations can adopt this strategy to avoid costly problems and increase storage efficiency.
Efficiently optimizing available space in storage area networks (SAN) is known as Thin Provisioning. Thin Provisioning allocates disk storage space between multiple users based on the requirement by each user at a given time.
Days before Thin Provisioning:
Traditionally, admins allocated additional storage beyond their current need—anticipating future growth. In turn, admins would have a significant amount of unused space, directly resulting in a loss on capital spent on purchasing disks and storage arrays.
Applications require storage to function properly. In traditional provisioning, a Logical Unit Number (LUN) is created and each application is assigned to a LUN. Creating a LUN with the traditional method means a portion of empty physical storage space from the array is allocated up front and then assigned to the application so it can operate. At first, the application will not occupy the whole storage space allocated; however, the space will gradually be utilized.
How Thin Provisioning Works:
In Thin Provisioning, a LUN is created from a common pool of storage. A LUN in Thin Provisioning will be larger than the actual physical storage assigned to it. For example, if an application needs 50GB of storage to start working, 50GB of virtual storage is assigned to the application so that it can become operational. The application uses the LUN as it normally would. Initially, the assigned LUN will only have a portion of that storage physically backed (say 15GB), and the remaining 35GB will be virtual storage. As the actual utilization of storage grows, additional storage is automatically taken from the common pool of physical storage. The user can then add more physical storage (based on requirement) without disturbing the application or altering the LUN. This helps admins eliminate the initial physical disk capacity that goes unused.
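The behavior described above can be sketched as a toy model in Python. This is only an illustration of the bookkeeping, not how any storage array implements it; the class and method names are invented, and the 30GB pool size is an assumption chosen to show what happens when the pool runs dry.

```python
class ThinPool:
    """Toy model of a thin-provisioned pool. All sizes are in GB."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.luns = {}  # name -> {"virtual": promised size, "used": physically backed}

    def create_lun(self, name, virtual_gb):
        # Creating the LUN consumes no physical space yet.
        self.luns[name] = {"virtual": virtual_gb, "used": 0}

    def allocated_gb(self):
        return sum(lun["used"] for lun in self.luns.values())

    def write(self, name, gb):
        lun = self.luns[name]
        if lun["used"] + gb > lun["virtual"]:
            raise ValueError("LUN is full")
        if self.allocated_gb() + gb > self.physical_gb:
            raise RuntimeError("pool exhausted: add physical storage")
        lun["used"] += gb  # physical space is taken from the pool only on write

    def add_physical(self, gb):
        # Growing the pool does not touch the LUN or the application.
        self.physical_gb += gb


pool = ThinPool(physical_gb=30)        # assumed pool size
pool.create_lun("app", virtual_gb=50)  # the application sees 50GB
pool.write("app", 15)                  # only 15GB is physically backed so far

try:
    pool.write("app", 20)              # would need 35GB, the pool only has 30GB
except RuntimeError as err:
    print(err)

pool.add_physical(20)                  # add disks without altering the LUN
pool.write("app", 20)
print(pool.allocated_gb(), pool.physical_gb)
```

The failed write in the middle is the part worth remembering: a thin pool that nobody is watching can run out of physical space even though every LUN still reports free capacity.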
A use case:
Consider an organization that has 3 servers running different applications—a database, a file processing system, and email. All these applications need storage space to work and the organization has to consider storage space for future growth.
Suppose the organization uses traditional provisioning and each application needs 1 TB to operate. Out of each 1 TB, only 250 GB (25%) will be used initially, and the rest will be utilized gradually. With the whole 3 TB already allocated to the existing 3 applications, what happens if you need a new server/application in the organization? In this case, you will need more storage and, unfortunately, it won’t be cheap, so you will need to search for budget.
Now let’s see how thin provisioning can help with this situation. In this scenario, each server/application is provided with 1 TB of virtual storage, but the actual storage space provided is just 250 GB. Space from the array is only allocated when needed. When a new server/application is added, you can assign 250 GB from the physical storage space, while the server/application still sees a total of 1 TB of virtual storage. The organization can add the new server/application without purchasing additional storage, and can increase the physical storage as a whole when needed.
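The arithmetic behind this use case is easy to check with a few lines of Python. The sketch assumes a 3 TB array, decimal units (1 TB = 1000 GB, so that 250 GB is exactly 25% of 1 TB), and that every application starts at 250 GB; the function names are made up for the example.

```python
GB_PER_TB = 1000  # decimal units, so 250 GB is 25% of 1 TB


def thick_free_gb(apps, tb_each, physical_tb):
    """Physical space left when every application gets its full size up front."""
    return physical_tb * GB_PER_TB - apps * tb_each * GB_PER_TB


def thin_summary(apps, virtual_tb_each, used_gb_each, physical_tb):
    """Virtual versus physical picture when space is only taken as it is used."""
    virtual_gb = apps * virtual_tb_each * GB_PER_TB
    used_gb = apps * used_gb_each
    physical_gb = physical_tb * GB_PER_TB
    return {
        "virtual_gb": virtual_gb,
        "used_gb": used_gb,
        "free_physical_gb": physical_gb - used_gb,
        "oversubscription": virtual_gb / physical_gb,
    }


print(thick_free_gb(apps=3, tb_each=1, physical_tb=3))  # nothing left for a 4th app
print(thin_summary(apps=3, virtual_tb_each=1, used_gb_each=250, physical_tb=3))
print(thin_summary(apps=4, virtual_tb_each=1, used_gb_each=250, physical_tb=3))
```

With four applications, the array is promising 4 TB against 3 TB of disk. That oversubscription is fine as long as actual usage stays under the physical total, which is why the free physical figure is the one to watch.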
Thin provisioning has 2 advantages in this use case:
When to use thin provisioning:
The type of provisioning to use depends more on the use case than on the technology. Thin provisioning is beneficial in the following situations:
Thin provisioning is not the silver bullet in the virtualization world. It too has its limitations. For example:
Even though thin provisioning has drawbacks, all of these can be overcome by continuous storage monitoring. Now what you need to do is transform your ‘fat’ volumes to thin ones. But there are issues that can arise while doing so. Have you experienced any issues while moving your storage? If so, how did you resolve your issues?
In the first installment of the AppStack series, Joel Dolisy took you for A Lap around AppStack—providing a high level overview of the concept. Lawrence Garvin then connected the dots from a systems perspective in A Look at Systems’ Role in the AppStack. As Lawrence concluded his piece, he stated, “The complexity of the systems monitoring space is continuing to grow. Virtualization and shared storage was just the first step.” So, let’s take a look at how those and the application affect the AppStack.
Recent announcements from VMware (vSphere 6), Microsoft (Windows 10), and cloud service providers like Amazon Web Services highlight the advances made to accelerate rapid provisioning, dynamic resource scaling, and continuous application & services delivery. These capabilities extend IT consumption of anything-as-a-Service (XaaS) from on-premises to off-premises, from private cloud to public cloud, from physical to virtualization to cloud and back again.
Policy-based storage, aka software-defined storage, is the latest trend that abstracts storage constructs from the underlying storage hardware. The objective is to port the advantages inherent to virtualization over to storage for actions involving storage capacity, performance, and utilization in order to meet Quality-of-Service (QoS) service level agreements (SLAs).
The Application is What Matters
Constantly changing variables in each layer make it more complex to manage the entire environment. The bottlenecks and trouble spots can be either virtual or physical constructs. And the only thing that matters is the application delivery and consumption. IT management is needed to monitor, troubleshoot, and report on this complex and quickly changing environment. To adapt to that speed of IT and business, SolarWinds AppStack provides the context to connect these layers and quickly provide a single point of truth on any given application, as SolarWinds CTO/CIO Joel Dolisy pointed out. And as fellow Head Geek Lawrence Garvin pointed out, monitoring is converging towards consolidated monitoring and comprehensive awareness of the end-user experience.
Indispensable to Software-Defined IT Professionals
All of the above make AppStack indispensable to software-defined IT professionals that make their living in multiple clouds, driving multiple container vehicles, and engineering the automation & orchestration of self-healing and auto scaling policies in their ecosystem.
For the next installment of this series, Patrick Hubbard will share his insights and experience on the AppStack concept.
IT professionals are admittedly a prideful bunch. It comes with the territory when you have to constantly defend yourself, your decisions, and your infrastructure against people who don’t truly understand what you do. This is especially true for network administrators. “It’s always the network.” Ever heard that one before? Heck, there’s even a blog out there with that expression created by someone I respect, Colby Glass. My point is, as IT professionals, we have to be prepared at a moment’s notice to provide evidence that an issue is not related to the devices we manage. That's why it's imperative that we know our network inside and out.
With that being said, it should be no surprise to you that when I started my career in networking in 2010, I thought NMS platforms were pretty amazing. Pop some IP addresses in and you’re set.¹ The NMS goes about its duty, monitoring the kingdom and alerting you when things go awry. I could even log in and verify it for myself if I wanted to be certain. I could even dig in at the interface level and give you traffic statistics like discards and errors, utilization, etc. I had instant credibility at my fingertips. I could prove the network was in great shape at a moment's notice. Want to know if that interface to your server was congested yesterday evening at 7pm? It sure wasn't and I have the proof! Can’t get much better than that, right?
Then I saw netflow for the first time. Netflow has a way of really opening your eyes. “How did I ever think I knew my network so well in the past?”, I thought. I had no visibility into the traffic patterns flowing through my network. Sure, I could fire up a packet capture pretty easily, but that approach is reactive and time-consuming depending on your setup. What if that interface really WAS congested yesterday evening at 7pm? I have no data to reference because I wasn't running a packet capture at that exact time or for that particular traffic flow. It’s helpful to tell someone that the interface was congested, but how about taking it a step further with what was congesting it? What misbehaving application caused that link to be 90% utilized when traffic should have been relatively light at that time of the day? The important thing to realize is that I’m not just an advocate for netflow, I’m also a user!² Here’s a quick recap of an instance where netflow saved my team and me.
I recently encountered a situation where having netflow data was instrumental. One day at work, we received multiple calls, e-mails, and tickets about slow networks at our remote offices. They seemed to be related, but we weren't sure at first. The slowness complaints were sporadic in nature, which made us scratch our heads even more. After looking at our instance of NPM, we definitely saw high interface utilization at some, but not all, of our remote sites. We couldn't think of any application or traffic pattern that would cause this. Was our network under attack? We thought it might be prudent to involve the security team, in case it really was an attack, but before we sounded the alarm, we decided to check out our netflow data first. What we saw next really baffled us.
Large amounts of traffic (think GBs/hour) were coming from our Symantec Endpoint Protection (SEP) servers to clients at the remote offices over TCP port 8014. For those of you who have worked with Symantec before, you probably already know that this is the port that the SEP manager uses to manage its clients (e.g. virus definition updates). At some point, communication between the manager and most of its clients (especially in remote offices) had failed and the virus definitions on the clients became outdated. After a period of time, the clients would no longer request the incremental definition update; they wanted the whole enchilada. That’s okay if it’s a few clients and the download process ends in success the first time. This wasn't the case in our situation. There were hundreds of clients all trying to download this 400+MB file from one server over relatively small WAN links (avg. 10Mb/s). The result was constantly failing downloads, which triggered the process to start over again ad infinitum. As a quick workaround, we decided to apply QoS to the traffic based on the port number until the issue with the clients was resolved. We then brought this information to the security team to show them that their A/V system was not healthy. Armed with it, they were quickly able to identify several issues with the SEP manager and its clients and eventually resolve them, which included standing up a redundant SEP manager. Without netflow data, we would have had to set up SPAN ports on our switches and wait for a period of time before analyzing packet captures to determine what caused the congestion. With netflow, we could instantly look at specific times in the past to determine what was traversing our network when our users were complaining.
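To give a feel for what the flow data let us do, here is a small Python sketch. It is not how NPM or any netflow collector works internally. The flow records, addresses, and byte counts are invented, and the record layout (a source, a service port, and a byte count) is a simplification of what real flow exports contain. The second function is back-of-the-envelope math for the download itself, assuming a perfectly shared link and an arbitrary number of clients per site.

```python
from collections import defaultdict


def top_talkers(flows, n=3):
    """Sum bytes per (source, service port) and return the n largest pairs."""
    totals = defaultdict(int)
    for flow in flows:
        totals[(flow["src"], flow["port"])] += flow["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def transfer_seconds(size_mb, link_mbps, concurrent_clients=1):
    """Best-case seconds for every client to finish if the link is shared evenly."""
    return size_mb * 8 * concurrent_clients / link_mbps


# Hypothetical flow records for one remote site during the slowdown.
flows = [
    {"src": "10.0.0.10", "port": 8014, "bytes": 420_000_000},  # SEP manager
    {"src": "10.0.0.10", "port": 8014, "bytes": 395_000_000},
    {"src": "10.0.0.25", "port": 443, "bytes": 30_000_000},
    {"src": "10.0.0.40", "port": 25, "bytes": 5_000_000},
]

print(top_talkers(flows, n=2))
print(transfer_seconds(400, 10))      # one client with the link to itself
print(transfer_seconds(400, 10, 50))  # assumed 50 clients sharing the same link
```

A single 400MB download on a 10Mb/s link takes over five minutes in the best case. Multiply that by a few dozen clients that restart on every failure and it becomes clear why the links never recovered on their own.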
That’s just one problem netflow has solved for us. What if that port was TCP/6667 and it was coming from your CFO’s computer? Do you really think your CFO is on #packetpushers (irc.freenode.net) trying to learn more about networking? No, it’s more likely a command and control botnet obtaining its next instructions on how to make your life worse. From a security perspective, netflow is just one more tool to add in the never-ending fight against malware. So what are you waiting for? Get with the flow… with netflow!
1. Of course it's never quite that easy. You'll have to configure SNMP on all of your devices that you want to manage and/or monitor.
Most networks by now are slowly making the transition from IPv4 addresses to IPv6. This new availability and abundance of global IPv6 addresses will enable businesses to easily provide services to their customers and internal users. However, there are a few things to make note of when you have IPv6 running in your network.
Mitigate Risks from IPv6 in Your Network
Here are a few important tips to help you stay in control of your network, while maintaining optimum use of your IPv6 address space:
It’s important to know your IPv6 management needs and be aware of what is required to efficiently and securely manage your IPv6 address space. IPv6 addresses are longer, more complex, and harder to remember than IPv4 addresses. Also, although IPv6 addresses can be assigned statically, most hosts get theirs dynamically through SLAAC or DHCPv6. Furthermore, running IPv4 and IPv6 side by side increases the complexity of managing the entire IP space.
Spreadsheets simply won’t work for IPv6: the addresses are much longer, and the network boundaries are much harder to work out by hand. Further, dynamic assignment of addresses makes it very difficult to manually update spreadsheets and maintain up-to-date information.
In short, IPv6 networks need a comprehensive & automated IP Address Management solution. Utilizing automated software allows you to effectively track all IPv6 addresses in the network, manage IPv6 network boundaries, and track dynamic IPv6 assignments—while helping you ensure that the existence of IPv6 in the network is not causing a security threat.
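To see why hand-maintained spreadsheets struggle, here is a short Python sketch using the standard `ipaddress` module. The addresses come from the 2001:db8::/32 documentation range and are purely illustrative; this is a demonstration of the bookkeeping involved, not a substitute for an IP Address Management tool.

```python
import ipaddress

# The same address has more than one valid spelling.
addr = ipaddress.ip_address("2001:db8::8a2e:370:7334")
print(addr.exploded)
print(addr.compressed)

# Network boundaries: carve a /62 into /64 subnets.
block = ipaddress.ip_network("2001:db8:abcd:10::/62")
subnets = list(block.subnets(new_prefix=64))
for subnet in subnets:
    print(subnet)

# A single /64 holds far more addresses than any spreadsheet could list.
print(subnets[0].num_addresses)

# Membership checks are trivial for software and error-prone by eye.
host = ipaddress.ip_address("2001:db8:abcd:12::1")
print([str(s) for s in subnets if host in s])
```

Two spellings of one address, four subnets whose names differ by a single hex digit, and 2^64 addresses per subnet: that is the kind of detail that is easy for software to track and easy for a person to get wrong.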
There’s no way to write a good opening sentence for this post. Yesterday we lost Head Geek LGarvin, who passed away of natural causes at home. He’ll be missed by more than all the admins he interacted with on thwack, the huge TechNet community he touched in a decade as a devoted Microsoft MVP, the Houston Rodeo and other organizations where he volunteered, and of course most of all by his family. I will miss my friend.
Lawrence taught me more about how to be a productive technical creative, wonderfully curmudgeonly yet optimistic when published, opinionated yet receptive in debate, and above all how to dig in when you’re right, than anyone I’ve met. Yes he occasionally sent terrifically long emails, but they were always worth the read, each word with a purpose. The Exchange server won’t miss them, but the Geek team certainly will.
The next SolarWinds Lab will be his last (taped) appearance, and the prospect of his first absence from live chat is difficult to imagine. However, the video team is putting together a tribute and we’ll get the whole gang together for signoff, so if you’re a fan of Lawrence you’ll want to be there. I found a couple of fun Lab promos today and linked them below, as well as our first episode when it was just the two of us trying to figure out what the heck we were doing.
Larry, you’ll live on in thousands of posts on admin communities all over the world, your articles and among your friends here. Thanks for spending this time with us.
Classic YouTube Lawrence:
When it comes to the enterprise technology stack, nothing has captured my heart & imagination quite like enterprise storage systems.
Stephen Foskett once observed that all else is simply plumbing, and he’s right. Everything else in the stack exists merely to transport, secure, process, manipulate, organize, index or in some way serve & protect the bytes in your storage array.
But it's complex to manage, especially in small/medium enterprises where the storage spend is rare and there are no do-overs. If you're buying an array, you've got to get it right the first time, and that means you've got to figure out a way to forecast how much storage you actually need over time.
I've used Excel a few times to do just that: build a model for storage capacity. Details below!
Open up Excel or your favorite open-sourced alternative, and input your existing storage capacity in its entirety. Let’s say you have 75 terabytes of storage in your enterprise (direct, shared, converged, or otherwise).
Now calculate how much of that is available for use both in absolute terms (TBs!) and as a percentage of used, committed space. Of that 75TB, maybe you've got 13 terabytes left as usable capacity. Perhaps some of that 13TB is in a shared block or NFS storage array; perhaps all that remains for your use is direct-attached storage inside production servers. If the latter, you need to find a way to reflect the difficulty you’ll have in using the free capacity you have.
Now array that snapshot of your existing storage (75TB, 13TB free, 83% Committed) in separate columns, and in a new column, populate Month 1, 2, 3 all the way out to 36 or 72. What's your storage going to look like over time?
The reality is that unless you work in one of those rare IT environments that’s figured out the complex riddle of data retention, most IT environments will see storage demand grow over time. Use this to your advantage.
Sometimes that growth is predictable, other times it’s not. Reflect both scenarios in the same column. Assume that in months 1-12 there will be a 0.3% demand for more storage from your 13TB of remaining storage (about 39 GB every month).
In month 13, imagine that some big event happens (a new product launch, a merger/acquisition, etc.), and demand that month is extreme: the business needs about 15% of what (by then) is about 12.5TB of your storage. If you’re in a place that only has direct-attached storage, this might be the month your juggling act becomes especially acute; what if none of your servers has 2 terabytes of free space?
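If you would rather prototype the model before building the spreadsheet, the same columns can be sketched in a few lines of Python. This is one possible reading of the steps above, not the only one: it assumes the steady monthly draw is a flat 0.3% of the starting 13TB (39 GB), that the month-13 event takes 15% of whatever is free at that point, and that nothing is ever reclaimed. The function and parameter names are invented.

```python
def project_free_capacity(total_tb, free_tb, months, monthly_rate, events=None):
    """Return (month, free TB, committed %) rows for a simple capacity model.

    events maps a month number to the fraction of then-free capacity
    consumed in that month instead of the steady draw.
    """
    events = events or {}
    steady_tb = free_tb * monthly_rate  # flat draw based on the starting free space
    rows = []
    for month in range(1, months + 1):
        if month in events:
            draw = free_tb * events[month]
        else:
            draw = steady_tb
        free_tb = max(free_tb - draw, 0.0)
        committed_pct = 100 * (total_tb - free_tb) / total_tb
        rows.append((month, round(free_tb, 3), round(committed_pct, 1)))
    return rows


rows = project_free_capacity(
    total_tb=75, free_tb=13, months=36, monthly_rate=0.003, events={13: 0.15}
)
for month, free, committed in rows[10:14]:
    print(month, free, committed)
```

Under these assumptions the free capacity at the end of month 12 lands at roughly 12.5TB, close to the figure quoted above, and the month-13 event removes a little under 2TB in one go. Changing `events` or `monthly_rate` is the code equivalent of editing a column in the spreadsheet.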
By now your storage capacity model should be taking shape. You can plug in different categories of storage (scale up your S3 or Azure Blob storage to meet sudden spikes in demand, for instance), and you can contemplate disposal of some of your existing legacy storage too.
Once your model is finished, you should be able to make a nice chart to present to your CIO, a model that says you've done your homework answering the question "How much storage do we need?"
“If someone steals your password, you can change it. But if someone steals your thumbprint, you can’t get a new thumb. The failure modes are very different.” –Bruce Schneier
In my last post I talked about how the traditional security model is dead, and that companies have to start thinking in terms of “we’ve already been hacked” and move into a mitigation and awareness strategy. The temptation to put a set of really big, expensive, name brand firewalls at the edge of your network, monitor known vulnerabilities, and then walk away smug in the knowledge that you’ve not only checked a box on your next audit, but done all you can to protect your valuable assets is a strong one. But that temptation would be shortsighted and wrong.
Since I wrote last, one of the largest security breaches ever—and possibly the most damaging—was reported by the insurance giant, Anthem BlueCross BlueShield. Over 80 million accounts were compromised, and what makes this hack worse than most is that it included names, addresses, social security numbers, income, and some other stuff—pretty much everything that makes up your identity. In other words, you just got stolen. A credit card can be shut down and replaced, but it’s not so easy when it’s your whole identity.
Anthem is using wording suggesting that the company was the victim of “a very sophisticated external cyber attack” which, while plausible and largely face-saving, is almost guaranteed to not be the case. While the attack was probably perpetrated by an external entity, the sophistication of said attack is probably not high. In most of these cases it’s as simple as getting one employee inside the company to open the wrong file, click the wrong link, reveal the wrong thing, etc. The days of poking holes in firewalls and perpetrating truly sophisticated attacks from the outside in are largely gone, reserved for movies and nation-state cyber warfare.
The one thing we can take from this attack, absent of any further details, is that the company self-reported. They discovered the problem and responded immediately. What isn’t known is how long the attackers had access to the system before the company’s security team discovered and closed the breach. Hopefully we’ll get more information in the coming days and will get a better picture of the scope and attack vector used.
So, what do you think of the Anthem attack? Do you have processes in place today to respond to this sort of breach? Would you even know if you’d been breached?
If you are a security practitioner and haven’t heard about the 80 million personal records lifted from Anthem’s database yesterday, you missed some exciting news, both good and bad. Clearly the loss of so many records is bad news and very troubling. However, the good news was that Anthem identified the breach themselves. Even though they caught the breach at the end of the kill chain (see below), they still did catch the breach before the records were exploited or showed up on a cyber underground sale site.
Targeted breaches such as Anthem are notoriously difficult to identify and contain, in part because the trade craft for such attacks is specifically designed to avoid traditional detection solutions such as anti-virus and intrusion detection. So as the FBI tries to determine who hijacked these records, the rest of us are trying to figure out why. Although motive, like attribution, is difficult to nail down, motive is a useful data point if you are trying to predict whether your organization is at risk.
In the absence of your own security analyst or FBI task force to determine motive or attribution, what can an ordinary practitioner do to lower organizational risk?
First – Determine if your organization is a possible target
Don’t think that because you are smaller or less well known you are not a target. Cyber thieves not only desire data they can sell, they need compute power to launch their attacks from, and they need identities they can use to trick their ultimate target into allowing a malicious link or payload into their environment.
Who has not recently noticed a strange email from a colleague or friend that upon further inspection is not their legitimate email address?
Second – Learn the kill chain and use it to validate your security strategy
Do you collect information from available sources across the kill chain into your SIEM? The earlier in the kill chain you identify a potential attack, the lower the risk, and the simpler the mitigation. For example:
Collecting and reporting on unusual email activity may allow you to catch a recon attempt. An identification of such behavior might lead you to increase logging on high value targets such as privileged accounts, domain controllers, or database servers.
Another useful indicator is spikes in network traffic on sensitive segments, or increases in authorized traffic exiting the organization.
In the worst case, by evaluating all log sources and ensuring you are collecting across the kill chain – you will empower your IT or security team to conduct forensics or a post incident analysis effectively.
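One rough way to check whether you are "collecting across the kill chain" is to list your log sources against the stages and look for gaps. The sketch below is only an illustration: the stage names follow the Lockheed Martin model, but the log sources and the stages assigned to each are hypothetical and would need to reflect your own environment.

```python
# Stages of the Lockheed Martin kill chain, in order.
KILL_CHAIN = [
    "reconnaissance",
    "weaponization",
    "delivery",
    "exploitation",
    "installation",
    "command_and_control",
    "actions_on_objectives",
]

# Hypothetical log sources and the stages each one can shed light on.
LOG_SOURCES = {
    "mail_gateway": {"reconnaissance", "delivery"},
    "endpoint_av": {"exploitation", "installation"},
    "netflow": {"command_and_control", "actions_on_objectives"},
}


def coverage_gaps(sources):
    """Return the kill chain stages that no log source covers."""
    covered = set().union(*sources.values())
    return [stage for stage in KILL_CHAIN if stage not in covered]


print(coverage_gaps(LOG_SOURCES))  # ['weaponization']
```

A gap is not automatically a problem (weaponization largely happens on the attacker's side), but seeing the list makes it easier to decide where additional logging is worth the effort.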
Finally – Have an incident response plan
It does not need to be elaborate, but executives, marketing, and IT should all know who is going to be the team coordinator, who is going to be the communicator, and who is going to be the decision maker.
By following these guidelines you are doing your part to leverage the value in your security investment, and reduce organizational risk.
About the kill chain.
The kill chain was originally conceptualized and codified by Lockheed Martin. Today it is used by cyber security professionals in many roles to communicate, plan and strategize how to effectively protect their organization.
As Joel Dolisy described in A Lap around Appstack, the first installment of the AppStack series, there are many components in the application stack including networking, storage, and computing resources. The computing resources include hypervisors, virtual machines, and physical machines that provide applications, infrastructure support services, and in some cases storage. Collectively, we refer to these resources as systems.
Systems really are the root of the application space. In the earliest days of computing, an application ran on a single machine, the user went to that machine to use the application, and the machine had no connectivity to anything (save perhaps a printer). Today systems offer a myriad of contributions to the application space, and each of those contributions has its own monitoring needs.
Historically, with the one-service/one-machine approach, a typical server ran at only ten to twenty percent of capacity. As long as the LAN connection between the desktop and the server was working, it was highly unlikely that the server was ever going to be part of a performance problem. Today, it is critical that servers behave well and share resource responsibility with others. (Other servers, that is!) As a result, server monitoring is now a critical component of an application-centric monitoring solution.
One component that is often overlooked in the monitoring process is the set of systems used directly by the end-user. The typical user may have two or three different devices all accessing the network simultaneously, and sometimes multiple devices accessing the same application simultaneously. Tracking what devices are being used, who is using those devices, how those devices are impacting other applications, and ensuring that end-users get the optimal application experience on whatever device they’re using is also part of this effort.
The benefit of monitoring the entire application stack as a consolidated effort is a comprehensive awareness of how the end-user is experiencing their interaction with the application and an understanding of how the various shared components of an application are co-existing with one another.
By being aware of where resources are shared, for example LUNs in a storage array sharing disk IOPS or virtual machines on a hypervisor sharing CPU cycles, performance issues affecting one or more applications can be more rapidly diagnosed and remediated. It’s not unusual at all for an application to negatively impact another application, without displaying any performance degradation itself.
The last thing to be aware of is that the complexity of the systems monitoring space is continuing to grow. Virtualization and shared storage were just the first step. For the next blog in this series, Kong Yang will discuss how that impacts the AppStack.
(and how monitoring can solve them)
I spend a lot of time talking about the value that monitoring can bring an organization, and helping IT professionals make a compelling case for expanding or creating a monitoring environment. One of the traps I fall into is talking about the functions and features that monitoring tools provide while believing that the problems they solve are self-evident.
While this is often not true when speaking to non-technical decision makers, it can come as a surprise that it’s sometimes not obvious even to a technical audience!
So I have found it helpful to describe the problem first, so that the listener understands and buys into the fact that a challenge exists. Once that’s done, talking about solutions becomes much easier.
With that in mind, here are the top 5 issues I see in companies today, along with ways that sophisticated monitoring addresses them.
Ubiquitous wireless has directly influenced the decision to embrace BYOD programs, which has in turn created an explosion of devices on the network. It’s not uncommon for a single employee to have 3, 4, or even 5 devices.
This spike in device density has put an unanticipated strain on wireless networks. In addition to the sheer load, there are issues with the type of connections, mobility, and device proximity.
The need to know how many users are on each wireless AP, how much data they are pulling, and how devices move around the office has far outstripped the built-in options that come with the equipment.
Monitoring Can Help!
Wireless monitoring solutions tell you more than when an AP is down. They can alert you when an AP is over-subscribed, or when an individual device is consuming larger-than-expected amounts of data.
In addition, sophisticated monitoring tools now include wireless heat maps – which take the feedback from client devices and generate displays showing where signal strength is best (and worst) and the movement of devices in the environment.
We work hard to provision systems appropriately, and to keep tabs on how those systems are performing under load. But this remains a largely manual process. Even with monitoring tools in place, capacity planning (knowing how far into the future a resource such as CPU, RAM, disk, or bandwidth will last given current usage patterns) is something that humans do, often with a lot of guesswork. And all too often, resources still reach capacity without anyone noticing until it is far too late.
Monitoring Can Help!
This is a math problem, pure and simple. Sophisticated monitoring tools now have the logic built-in to consider both trending and usage patterns day-by-day and week-by-week in order to come up with a more accurate estimate of when a resource will run out. With this feature in place, alerts can be triggered so that staff can act proactively to do the higher-level analysis and act accordingly.
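To make the "math problem" concrete, here is a minimal sketch that fits a straight line to daily usage samples and estimates how many days remain before a resource fills up. It is an assumption-laden illustration, not how any particular product does it: the function name is hypothetical, it uses a plain least-squares trend, and it ignores the day-by-day and week-by-week usage patterns that a real tool would also weigh.

```python
def days_until_full(samples, capacity):
    """Estimate days until usage reaches capacity.

    samples: usage readings, one per day, oldest first (same unit as capacity).
    Returns the number of days after the last sample at which the fitted
    linear trend hits capacity, or None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None

    # Ordinary least-squares fit of usage against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - (n - 1))


# Disk usage in GB over five days on a 600 GB volume.
print(days_until_full([500, 510, 520, 530, 540], 600))  # 6.0
```

An alert could then fire when the estimate drops below some threshold, say 30 days, giving staff time to do the higher-level analysis.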
We’ve gotten very good at monitoring the bits on a network – how many bits per second in and out; the number of errored bits; the number of discarded bits. But knowing how much is only half the story. Where those bits are going and how fast they are traveling is now just as crucial. User experience is now as important as network provisioning. As the saying goes: “Slow is the new down.” In addition, knowing where those packets are going is the first step to catching data breaches before they hit the front page of your favourite Internet news site.
Monitoring Can Help!
A new breed of monitoring tools includes the ability to read data as it crosses the wire and track source, destination, and timing. Thus you can get a listing of internal systems and who they are connecting to (and how much data is being transferred) as well as whether slowness is caused by network congestion or an anaemic application server.
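As a rough illustration of the "who is talking to whom, and how much" listing, the sketch below totals bytes per source and destination pair. The flow records and the `top_talkers` name are made up for the example; in practice the records would come from whatever flow or packet data your tooling exports.

```python
from collections import defaultdict

# Hypothetical flow records: (source address, destination address, bytes).
FLOWS = [
    ("10.0.0.5", "203.0.113.9", 1200),
    ("10.0.0.5", "203.0.113.9", 800),
    ("10.0.0.7", "198.51.100.2", 50000),
    ("10.0.0.5", "10.0.0.20", 300),
]


def top_talkers(flows, limit=10):
    """Total bytes per (source, destination) pair, largest first."""
    totals = defaultdict(int)
    for src, dst, nbytes in flows:
        totals[(src, dst)] += nbytes
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


for (src, dst), nbytes in top_talkers(FLOWS, limit=2):
    print(f"{src} -> {dst}: {nbytes} bytes")
```

An internal system that suddenly tops this list while talking to an unfamiliar external address is exactly the kind of thing worth a second look.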
“Slow is the new down”, but down is still down, too! The problem is that knowing something is down gets more complicated as systems evolve. Also, it would be nice to alert when a system is on its way down, so that the problem could be addressed before it impacts users.
Monitoring Can Help!
Monitoring tools have come a long way since the days of “ping failure” notifications. Alert logic can now take into account multiple elements simultaneously such as CPU, interface, and application metrics so that alerts are incredibly specific. Alert logic also now allows for de-duplication, delay based on time or number of occurrences, and more. Finally, the increased automation built into target systems allows monitoring tools to take action and then re-test at the next cycle to see if that automatic action fixed the situation.
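The alert logic described above can be sketched in a few lines. This is a simplified, hypothetical model (the `Alert` class and the metric names are invented for illustration): it requires several conditions to be true at once, waits for a number of consecutive occurrences, and only notifies once per incident rather than on every poll.

```python
class Alert:
    """Fire once when all conditions hold for N consecutive polls."""

    def __init__(self, conditions, required_consecutive=3):
        self.conditions = conditions  # callables taking a metrics dict
        self.required = required_consecutive
        self.streak = 0      # consecutive polls where all conditions held
        self.active = False  # True while an incident is already open

    def evaluate(self, metrics):
        """Return True only on the poll where the alert first fires."""
        if not all(check(metrics) for check in self.conditions):
            # Condition cleared: reset so a later incident can fire again.
            self.streak = 0
            self.active = False
            return False
        self.streak += 1
        if self.streak >= self.required and not self.active:
            self.active = True
            return True
        return False  # still building up, or a duplicate of an open alert


# Alert only when CPU is high AND the application is slow, twice in a row.
alert = Alert(
    [lambda m: m["cpu_pct"] > 90, lambda m: m["response_s"] > 2.0],
    required_consecutive=2,
)
bad = {"cpu_pct": 97, "response_s": 3.1}
good = {"cpu_pct": 40, "response_s": 0.2}
print([alert.evaluate(m) for m in (bad, bad, bad, good, bad, bad)])
# [False, True, False, False, False, True]
```

The third poll is suppressed as a duplicate, and the alert re-arms only after the condition clears.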
Automatic Dependency Mapping
One device going down should not create 30 tickets. But it often does. This is because testing upstream/downstream devices requires knowing which devices those are, and how each depends on the other. This is either costly in terms of processing power, difficult given complex environments, time-consuming for staff to configure and maintain, or all three.
Monitoring Can Help!
Sophisticated monitoring tools now collect topology information using devices’ built-in commands, and then use that to build automatic dependency maps. These parent-child lists can be reviewed by staff and adjusted as needed, but they represent a huge leap ahead in terms of reducing “noise” alerts. And by reducing the noise, you increase the credibility of every remaining alert so that staff responds faster and with more trust in the system.
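Once a parent-child list exists, suppressing the noise is straightforward. The sketch below assumes a hypothetical child-to-parent map (the device names are invented) and reports only the down devices whose parent is still up, which is where a ticket actually belongs.

```python
# Hypothetical dependency map: each device points at its upstream parent.
PARENT = {
    "core-sw": None,
    "dist-sw-1": "core-sw",
    "access-sw-1": "dist-sw-1",
    "server-a": "access-sw-1",
    "server-b": "access-sw-1",
}


def root_causes(down, parent=PARENT):
    """Return down devices whose parent is not also down.

    Devices behind a failed parent are unreachable rather than
    independently broken, so they are left out of the result.
    """
    down = set(down)
    return sorted(node for node in down if parent.get(node) not in down)


# One switch fails and takes two servers with it: one ticket, not three.
print(root_causes(["access-sw-1", "server-a", "server-b"]))
# ['access-sw-1']
```

Staff can still review and adjust the map, but even a simple rule like this keeps one outage from turning into dozens of alerts.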
So, what are you waiting for?
At this point, the discussion doesn’t have to spiral around whether a particular feature is meaningful or not. The audience only needs to agree that they don’t want to find out what happens when everyone piles into conference room 4, phones, pads, and laptops in tow; or when the “free” movie streaming site starts pulling data out of your drive; or when the CEO finds out that the customer site crashed because a disk filled, but had been steadily filling up for weeks.
As long as everyone agrees that those are really problems, the discussion on features more or less runs itself.
As I was catching up on the latest IT industry news, I landed on Amit’s Technology Blog. Amit (@amitpanchal76) was a delegate at this year’s Virtualization Field Day 4. I found it very cool that Amit was highlighting his visit to SolarWinds and providing his view on AppStack in his blog post, “How SolarWinds Aims to Offer a Simple Perspective – VFD4.”
In his blog, Amit says, “At the end of the day, application owners and end users only care if their application is working and healthy and don’t want to know about the many cogs and wheels that make up the health.” I’d have to say, Amit’s premise is spot on. And SolarWinds shares this vision and delivers this clear, concise value to IT admins and their end-users. The SolarWinds AppStack removes all the noise from information overload and quickly surfaces the root cause of trouble with an application from a single point of truth.
Amit also shares an AppStack use case. He believes that “AppStack will come in useful as you could simply provide a link to a custom dashboard for a particular application and let the application owner have this as their monitoring dashboard.” This customization of the monitoring, troubleshooting, and reporting dashboards to the application owner implies that the platform needs to be easy to use and easy to consume. Because, again, the app owner only cares about whether the app is working and healthy. Ease of use and ease of consumption are core tenets of the products that form AppStack. So AppStack inherits those properties by default.
As SolarWinds CTO/CIO Joel Dolisy states, “AppStack is related to the products…such as Server & Application Monitor (SAM), Virtualization Manager (VMan), and Storage Resource Monitor (SRM).” AppStack is a natural extension of the app-centric view with connected context to all the major subsystems like compute, memory, network, and disk across the physical and virtual layers. Amit thinks this would be useful. Do you?
For background on AppStack, check out Joel’s AppStack blog post.
Robert Mueller, former Director of the FBI, has said of security that “there are only two types of companies: those that have been hacked, and those that will be”. From Home Depot and Target to Skype and Neiman Marcus, it often seems as if nobody is safe any more. What’s worse is that most of these attacks have come from within the security perimeter and were undetected for long periods of time, leaving the attackers plenty of time to do what they came to do.
According to a Mandiant M-Trends report from 2012 and 2013, the median length of time an attacker went undetected in a system after compromise was 243 days with an average of 43 systems accessed. What’s worse is that in 100% of those cases valid credentials were used to access the system, and 63% of victims were notified of the breach by an external entity. Those are certainly not promising statistics for those of us trying to manage IT operations for a large enterprise. It’s even worse for smaller companies who can’t staff quality security personnel.
While some companies might believe they are immune, or not a high value target based on any number of factors, they couldn’t be more wrong. Hackers these days are not the script-kiddies of the last 20 years, but rather nation states or organized collectives with a variety of motivations. Sometimes the attackers are looking for money and target credit cards, other times they are looking for identities—social security numbers tied to names and addresses—with which they can take a more advanced and longer term view of the value of their attack.
The other threat that a lot of companies fail to plan for, however, is reputation damage. Even if you have nothing of value that a hacker might want access to, you likely have a brand value that can be severely and systemically damaged. Consider the most recent attacks against Yahoo, Sony Corporate, and both the PlayStation and Xbox networks. These attacks may not cause direct financial damage, but the lasting brand damage can cost millions more.
Given this current state of security, I’m curious what you do to secure your network and monitor for advanced persistent threats against your infrastructure. Are you relying on logging and firewalls alone, or have you moved into a more advanced monitoring model?
"And real estate applications don't work well when they're virtualized," she insisted, face lowered, eyes peering directly at me over the rims of her Warby Parkers.
For a good 5-8 seconds, all you could hear was the whirring of the projector's fan as the Infrastructure team soaked in the magnitude of the statement.
I had come prepared for a lot of things in this meeting. I was asking for a couple hundred large, and I had spreadsheets, timelines, budgets, and a project plan. Hell, I even had an Excel document showing which switch port each new compute node would plug into, and whether that port would be trunked, access, routed, a member of a port-channel, and whether it got a plus-sized MTU value. Yeah! I even had my jumbo frames all planned & mapped out, that's how I roll into meetings where the ask is a 1.x multiple of my salary!
But I had nothing for this...this...whatever it was...challenge to my professional credibility? An admission of ignorance? Earnest doubt & fear? How to proceed?
It was, after all, 2014 when this happened, and the last time I had seen someone in an IT Department resist virtualization was back when the glow of Obama was starting to wear off on me... probably 2011. In any case, that guy no longer worked in IT (not Obama, the vResistor!), yet here I was facing the same resistance long, long, long after the debate over virtualization had been settled (in my opinion anyway).
Before I could get in a chirpy, smart-ass "That sounds like a wager" or even a sincere "What's so special about your IIS/SQL application that it alone resolutely stands as the last physical box in my datacenter?" my boss leapt to my defense and, well, words were exchanged between BAs, devs, and Infrastructure team members. My Russian dev friend and I glanced at each other as order broke down...he had a huge Cheshire cat grin, and I bet the bastard had put her up to it. I'd have to remember to dial the performance on his QA VMs back to dev levels if I ever got to build the new stack.
The CIO called for a timeout, order was restored, and both sides were dressed down as appropriate.
It was decided then to regroup one week hence. The direction from my boss & the CIO was that my presentation, while thorough, was at 11 on the Propellerhead scale and needed to answer some basic questions like, "What is virtualization? What is the cloud?"
You know, the basics.
Somewhat wounded, I realized my failure was even more elemental than that. I had forgotten something a mentor taught me about IT, something he told me to keep in mind before showing my hand in group meetings: "The way to win in IT is to understand which Microsoft Office application each of your teammates would have been born as if they had been conceived by the Office team. For example, you're definitely a Visio & Excel guy, and that's great, but only if you're in a meeting with other engineers."
Some people, he told me, are amazing Outlookers. "They email like it's going out of style; they want checklists, bullet points, workflows and read receipts for everything. Create lots of forms & checklists for them as part of your pitch."
"Others need to read in-depth prose, to see & click on footnotes, and jot notes in the paper's margin; make a nice .docx the focus for them."
And still others -perhaps the majority- would have been born as a Powerpoint, for such was their way of viewing the world. Powerpoint contains elements of all other Office apps, but mostly, .pptx staff wanted pictures drawn for them.
So I went home that evening and I got up into my Powerpoint like never before. I built an 8 page slide deck using blank white pages. I drew shapes, copied some .pngs from the internet, and made bullet points. I wanted to introduce a concept to that skeptical Business Analyst who nearly snuffed out my project, a concept I think is very important in small to medium enterprises considering virtualization.
I wanted her to reconsider The Stack (In Light of Some Really Bad Visualizations).
So I made these. And I warn you they are very bad, amateur drawings, created by a desperate virtualization engineer who sucks at powerpoint, who had lost his stencils & shapes, and who was born a cell within a certain column on a certain row and thought that that was the way the world worked.
The Stack as a Transportation Metaphor
Slide 2: What is the Core Infrastructure Stack? It's a Pyramid, with people like me at the bottom, people like my Russian dev friend in the middle, and people like you, Ms. Business Analyst, closer to the top. And we all play a part in building a transportation system, which exists in the meatspace (that particular word was not in the slide, and was added by me, tonight). I build the roads, tunnels & bridges, the dev builds the car based on the requirements you give him, and the business? They drive the car the devs built to travel on the road I built.
Also, the pyramid signifies nothing meaningful. A square, cylinder, or trapezoid would work here too. I picked a pyramid or triangle because my boy would say "guuhhhl" and point at triangles when he saw them.
I gotta say, this slide really impressed my .pptx colleagues and later became something of an underground hit. Truth be told, inasmuch as anything created in Powerpoint can go viral, this did. Why?
I'd argue this model works, at least in smaller enterprises. No one can dispute that we serve the business, or driver. I build roads when & where the business tells me to build them. Devs follow, building cars that travel down my roads.
But if one of us isn't very good at his/her job, it reflects poorly on all of IT, for the driver can't really discern the difference between a bad car & a bad road, can they?
What's our current stack?
"Our current stack is not a single stack at all, but a series of vulnerable, highly-disorganized disjointed stacks that don't share resources and are prone to failure," I told the same group the next week, using transitions to introduce each isolated, vulnerable stack by words that BAs would comprehend:
My smart ass side wanted to say "1997 called, they want their server room back," but I wisely held back.
"This isn't an efficient way to do things anymore," I said, confidence building. No one fought me on this point, no one argued.
What's so great about virtualizing the stack?
None of my slides were all that original, but I take some credit for getting a bit creative with this one. How do you explain redundancy & HA to people woefully unprepared for it? Build upon your previous slides, and draw more boxes. The Redundant Stack within a Stack:
The dark grey highlighted section is -notwithstanding the non-HA SQL DB oversight- a redundant Application Stack, spread across an HA Platform, itself built across two or more VMs, which live on separate physical hosts connecting through redundant core switching to Active/Active or A/P storage controllers & spindles.
I don't like to brag (much), but with this slide, I had them at "redundant." Slack-jawed they were as I closed up the presentation, all but certain I'd get to build my new stack and win #InfrastructureGlory once more.
And that Cloud Thing?
Fuzzy white things wouldn't do it for this .pptx crowd. I struggled, but to keep things consistent, I built a 3D cube that was fairly technical, yet in line with the previous slides. I also got preachy, using this soapbox to remind my colleagues why coding against anything other than the Fully Qualified Domain Name was a mortal sin in an age when our AppStack absolutely required being addressed at all times by a proper FQDN in order to be redundant across datacenters, countries, even continents.
There are glaring inaccuracies in this Hybrid Cloud Stack, some of which make my use of Word Art acceptable, but as a visualization, it worked. Two sides to the App Stack, Private Cloud (the end-state of my particular refresh project), and the Public Cloud. Each has its strengths & weaknesses, each can be used by savvy Technology teams to build better application stacks, to build better roads & cars for drivers in the business.
About six weeks (and multiple shares of this .pptx) later, my new stack complete with 80 cores (with two empty sockets per node for future-proofing!), about 2TB of RAM, 40TB of shared storage, and a pair of Nexus switches with Layer 3 licensing arrived.
And yes, a few weeks after that, a certain stubborn real estate application was successfully made virtual. Sweet.
I’ve worked with a few different network management systems in my career, some commercial and some open source. Based on my experience with each one of them, I’ve developed certain qualities that I look for when deciding on which product to recommend or use.
One common theme that always seems to come up when comparing experiences with my peers is how easy it is to implement and operate product $A vs product $B?
In my opinion, implementation and operation of a product are critical when assessing any product, not just an NMS. If it’s not easy to implement, how are you ever going to get it off the ground? Will training be necessary just to install it? If you ever do get it off the ground and running, will it take a small army to keep it going? Will you have to dedicate time and resources each day just to cultivate that product? How can you trust a product with the management and monitoring of your network environment, if it crashes all the time or if you have to be a Jedi to unlock all of its mystical bells and whistles?
With that being said, what do you look for in a network management system? Easy to install? Intuitive interface? All the features you could wish for? Is cost the ultimate factor? I’d love to hear what you all think.