NPM 12.3 is available today, May 31st, on the Customer Portal! The release notes are a great place to get a broad overview of everything in the release. Here, I'd like to go into greater depth on Network Insight for Cisco Nexus including why we built it and how it works. Knowing that should help you get the most out of the new tech!
What's all this "Network Insight" talk? If you haven't heard of this big theme we've been building on for a few years, start here. If you know the story, skip ahead to the Network Insight for Cisco Nexus section.
We live in amazing times. Every day new technologies are invented that change how we interact, how we build things, how we learn, how we live. Many (most?) of these technologies are only possible because of the relatively new ability for endpoints to talk to each other over a network. Networking is a key enabling technology today like electricity was in the 1800s and 1900s, paving the way for a whole wave of new technologies to be built. The better we build the networks, the more we enable this technological evolution. That's why we believe in building great networks.
A great network does exactly one thing well: connects endpoints. The definition of "well" has evolved through the years, but essentially it means enabling two endpoints to talk in a way that is high performance, reliable, and secure. Turns out this is not an easy thing to do, particularly at scale. When I first started maintaining, and later building, networks, I discovered that monitoring was one of the most effective tools I could use to build better networks. Monitoring tells you how the network is performing so you can improve it. Monitoring tells you when things are heading south so you can get ahead of the problem. Monitoring tells you if there is an outage so you can fix it, sometimes even before users notice. Monitoring reassures you when there is not an outage so you can sleep at night.
Over the past two decades, we believe as a company and as an industry we have done a good job of building monitoring to cover routers, switches, and wireless gear. That's great, but virtually every network today includes a sprinkling of firewalls, load balancers, chassis switches, and maybe some web proxies or WAN optimizers. These devices are few in number, but absolutely critical. They're not simple devices either. Monitoring tools have not done a great job with these other devices. The problem is that we mostly treat them like just another router or switch. Sure, there are often a few token extra metrics like connection counts, but that doesn't really represent the device properly, does it? The data that you need to understand the health and performance of a firewall or a load balancer is just not the same as the data you need for a switch. This is a huge visibility gap.
Network Insight is designed to fill that gap by finally treating these other devices as first-class citizens, acquiring and displaying exactly the right data set to understand the health and performance of these critical devices.
Network Insight for Cisco Nexus
Network Insight for Cisco Nexus is our third installment in the Network Insight story, following Network Insight for F5 and Network Insight for ASA. Nexus chassis switches are used to build high performance, scalable, and virtually indestructible data center networks. That's why Nexus are at the heart of many of the largest data centers. Nexus are switches, so our traditional switching data is still important, but a $300k chassis switch has a lot of additional capabilities that a $5k switch does not. As you saw with F5 and ASA, Network Insight for Cisco Nexus takes a clean slate approach. We asked ourselves (and many of you) questions like:
- What role does this device play in connecting endpoints?
- How can you measure the quality with which the device is performing that role?
- What is the right way to visualize that data to make it easiest to understand?
- What are the most common problems that occur with this device? What are the most severe?
- Can we detect those problems? Can we predict them?
With these learnings in hand, we built the best monitoring we could from the ground up.
Similar to ASAs, Nexus can be split into virtual instances. Nexus calls them Virtual Device Contexts (VDCs) while ASA calls them Contexts. VDCs are to Nexus what VMs are to servers, allowing a single piece of hardware to be split into several logical nodes. Each logical node, or VDC, is configured separately and provides a full set of technology services. All of the features you read about below discover complete information about each VDC.
Adding the Admin VDC for a Nexus to monitoring lets NPM map out all of the VDCs, which will then appear on the Node Details screen:
Anytime you go to Node Details for any of the VDCs, you'll get this new resource so it's easy to navigate between them. NCM users will also find it easier than ever to make sure all of their VDCs are backed up. If you're well set up for catastrophic failures, they're less likely to occur, right? More info on what NCM is doing for VDCs can be found here.
So Many Interfaces
The first big difference between Cisco Nexus and most other devices is simple interface count. Thanks to the distributed nature of a Nexus deployment, particularly Fabric Extenders, a single Nexus 7k is likely to have hundreds or even thousands of ports. Dealing with thousands of ports on a single device is different than dealing with the usual couple dozen, and we wanted to make sure this fundamental part of Nexus monitoring was done right.
First, the Node Details page now contains a simple summary of all of the interfaces:
Like Network Insight for ASA, we have a new sub-view for each major technology service provided by the device. Clicking on Interfaces, in the above resource or on the sub-view tabs on the left, will bring you to the Interfaces sub-view showing all interfaces. Clicking on any of the status icons or numbers will bring you to a list of only those interfaces.
This is built on the relatively new List View that's part of our Unified Interface Framework (UIF). UIF is an important component for making sure all Orion Platform based tools from SolarWinds offer a consistent UI experience, so when you learn how to do something in one tool, you know how to do it in all of them. The List View is made for management of large lists, including:
- Multi-level filtering, for example, interfaces with status Up AND (utilization Warning OR Critical).
- Colored highlighting of values over your thresholds for that specific entity.
- Pagination control with up to 100 items per page.
I particularly like the search function for looking up ports on a certain module. Entering a "1/" in the search field will show you all the ports on slot 1. Easy.
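To make the filter and search behavior concrete, here is a minimal Python sketch of the two ideas: the multi-level filter from the list above and the "1/" search. It is not how the List View is implemented. The `Interface` class, its field values, and the plain substring match used for search are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interface:
    # Hypothetical record; the field names and values are illustrative only.
    name: str         # e.g. "Ethernet1/2"
    status: str       # "Up" or "Down"
    utilization: str  # "Normal", "Warning", or "Critical"


def up_and_busy(iface: Interface) -> bool:
    """Multi-level filter: status Up AND (utilization Warning OR Critical)."""
    return iface.status == "Up" and iface.utilization in ("Warning", "Critical")


def search(interfaces: List[Interface], text: str) -> List[Interface]:
    """Assumed search behavior: case-insensitive substring match on the name."""
    needle = text.lower()
    return [i for i in interfaces if needle in i.name.lower()]


interfaces = [
    Interface("Ethernet1/1", "Up", "Normal"),
    Interface("Ethernet1/2", "Up", "Critical"),
    Interface("Ethernet2/1", "Up", "Warning"),
    Interface("Ethernet2/2", "Down", "Critical"),
]

busy = [i.name for i in interfaces if up_and_busy(i)]
slot1 = [i.name for i in search(interfaces, "1/")]

print(busy)   # ['Ethernet1/2', 'Ethernet2/1']
print(slot1)  # ['Ethernet1/1', 'Ethernet1/2']
```

Note that under a pure substring match, "1/" would also catch ports on slot 11 or 21, so on a large chassis a more specific search string may be needed.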
These are straightforward improvements, but I think you'll find it much more pleasant dealing with the large interface counts on your Nexus devices. And good news: we extended this sub-view to all nodes so you have a super polished interface interaction model on your smaller switches too.
Virtual Port Channels
A big part of why people are willing to shell out for the huge cost of a Nexus is more reliable connectivity to endpoints like servers. Nexus should provide an order of magnitude more reliable connectivity to servers. Cisco accomplishes this with vPCs, a Multi-Chassis EtherChannel (MCEC) technology that allows a single endpoint to uplink to multiple switches. Traditional port channels can only connect to a single upstream switch, resulting in a single point of failure.
Believe it or not, vPCs are a serious departure from how networking works. In fact, a pair of Nexus have to "conspire" (a fancy word for lie) to present themselves as a single switch to the endpoint they're connected to. Cisco has a bunch of technology to make it work, and in our research we found this was making it hard for administrators to understand, monitor, and troubleshoot their vPCs. When we dug into this, we found that expert administrators will spend several minutes to understand the health of a single vPC. They do things like:
- Log in to the Nexus
- "show vpc"
- "show interface port-channel..."
- "show interface...", repeat 2-4 times
- "show run interface...", repeat 2-4 times
- Find the peer switch, log in, and run all the commands again.
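To get a feel for how much manual work that is, the checklist above can be sketched as a small Python helper that builds the command list for one switch. The port-channel number and member port names are made-up examples, and `vpc_check_commands` is a hypothetical helper for illustration, not part of NPM or NX-OS.

```python
from typing import List


def vpc_check_commands(port_channel: int, member_ports: List[str]) -> List[str]:
    """Build the manual CLI checklist for one side of a vPC.

    port_channel and member_ports are illustrative inputs; on a real
    switch you would read them from the "show vpc" output first.
    """
    commands = ["show vpc", f"show interface port-channel {port_channel}"]
    # One status check and one config check per member port.
    commands += [f"show interface {port}" for port in member_ports]
    commands += [f"show run interface {port}" for port in member_ports]
    return commands


# Example values only: port-channel 10 with two member ports on each switch.
local = vpc_check_commands(10, ["Ethernet1/1", "Ethernet1/2"])
peer = vpc_check_commands(10, ["Ethernet1/1", "Ethernet1/2"])

for command in local:
    print(command)
print(len(local) + len(peer))  # 12 commands for a single vPC
```

Even in this small two-port example, that is a dozen commands across two logins before you can say whether one vPC is healthy.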
When all is said and done, they've mapped 5, 7, 9, or even more different components, each with its own status, performance, and config. Our goal was to have this expert level data set available to experts and non-expert users in seconds. The vPC tab accomplishes that:
On the left we see the vPCs. Each vPC is mapped to the local port-channel. We find the peer switch and map the vPC to the port-channel on the peer. Mousing over allows you to see the member ports of each port-channel and navigate to them:
Again we're using the List View, so you have filtering, sorting, searching, pagination, and so forth as expected. Click to drill into any interface for all the details we have about that interface. Of course, all of this can be alerted on and reported on to keep you ahead of problems without staring at monitoring all day. There's some really cool additional stuff you can do with NCM specific to vPCs. If you're interested, check out their upcoming post.
During beta and RC we found environments where folks had spent hundreds of thousands to more than a million dollars and countless hours setting up high resiliency. Once they pointed NPM at their Nexus, they found that resiliency had deteriorated over time. They had failures and the redundancy saved them, but it also meant they didn't know the problem existed so they never restored redundancy. This leaves them one failure away from a catastrophe in a multi-million dollar high redundancy environment.
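The silent loss of redundancy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not NPM's actual logic: the `PortChannel` and `Vpc` classes, the state names, and the sample data are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PortChannel:
    # Hypothetical model: member port name -> "Up" or "Down".
    switch: str
    members: Dict[str, str]

    def up_count(self) -> int:
        return sum(1 for status in self.members.values() if status == "Up")


@dataclass
class Vpc:
    vpc_id: int
    local: PortChannel  # port-channel on the switch being viewed
    peer: PortChannel   # matching port-channel on the peer switch


def redundancy_state(vpc: Vpc) -> str:
    """Classify a vPC by how much of its redundancy is still intact."""
    local_up, peer_up = vpc.local.up_count(), vpc.peer.up_count()
    total = len(vpc.local.members) + len(vpc.peer.members)
    if local_up == 0 and peer_up == 0:
        return "down"
    if local_up == 0 or peer_up == 0:
        # Traffic still flows, but only through one switch.
        return "at risk"
    if local_up + peer_up < total:
        return "degraded"
    return "healthy"


vpcs = [
    Vpc(10, PortChannel("nexus-a", {"Ethernet1/1": "Up"}),
            PortChannel("nexus-b", {"Ethernet1/1": "Up"})),
    Vpc(20, PortChannel("nexus-a", {"Ethernet1/2": "Up"}),
            PortChannel("nexus-b", {"Ethernet1/2": "Down"})),
    Vpc(30, PortChannel("nexus-a", {"Ethernet1/3": "Up", "Ethernet1/4": "Down"}),
            PortChannel("nexus-b", {"Ethernet1/3": "Up", "Ethernet1/4": "Up"})),
]

states = {v.vpc_id: redundancy_state(v) for v in vpcs}
print(states)  # {10: 'healthy', 20: 'at risk', 30: 'degraded'}
```

In this sketch, vPC 20 is the case from the field: the endpoint is still connected, so nobody notices, but it now depends on a single switch.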
If you're in IT, you're strapped for time. Our monitoring tools have to help us do better here. I'm happy that NPM will now help you keep your vPCs running clean!
One thing that surprised me is how many of you are running ACLs on your Nexus. There's a trend of moving security closer to the endpoint, and Nexus devices are the access layer for many data center environments. This results in lots of Port Access Control Lists (PACLs) and VLAN Access Control Lists (VACLs). Fortunately, we recently worked on this for Cisco ASA. The latest NCM release extends and enhances the ACL backup and analysis capability, including new support for MAC ACLs and non-contiguous subnet masks. All of the Access List functionality is based on pulling and analyzing configs, so you'll need the NCM tool to get this feature. Check out NCM's post - and also, bonus, my favorite part: Interface Config Snippets!
Traditional Routing and Switching
While working on the enhanced capabilities, we also revisited some core technology of ours to make sure it was covering Nexus well. Things like routing protocol monitoring and hardware health should work better than ever. We think we've got everything covered, but there's a huge number of combinations of hardware (platform and modules) and software (trains and versions). If you notice any gaps, please shoot me a private message with the data that's not showing up for you and an SNMP walk of your device.
I would have started this guide with setup if not for the fact that setup is so darn easy. To get this feature working, add a node as usual and you'll notice a new check box on the last step of the Add Node Wizard:
Check that box, enter your CLI creds (read only is fine), and you're good to go. If you have existing Nexus under monitoring and you'd like to get the enhanced monitoring, head over to Manage Nodes. You can edit an individual node and check this box, or you can find all of them with Machine Type and/or search and enable all at once.
There's nothing else you need to configure or define. Simple, right?
Other Deep Dives
We've got a couple other deep dives for new Orion Platform features included in NPM 12.3. Check 'em out!
That does it for now. You'll be able to click through the functionality yourself in our online demo starting around June 6th. If you're on active maintenance for NPM, head over to the Customer Portal to get your upgrade now. I'd love to hear your feedback once you have it running in your environment!