Updates to the Network Troubleshooting View

Version 18

    We recently updated our Network Troubleshooting View in the online demo.  If you missed the first part where we go over building the initial view, you can catch up with Creating the Network Troubleshooting View.

     

    We've updated the map and added two new resources.  The new resources include a Quality of Experience (QoE) resource and a custom table resource for CBQoS Drops (emphasis added below).  The map's been updated to show some servers and clients which we deem critical for this particular Network Path.  I'll go over each of the new resources and how we configured them.

    NTBv2_Demo.png



    Extending the Group

    In the previous post we discussed how we chose what to include in the view.  Our logic hasn't changed at all, except that we've added two clients and three servers to the list.  If you want to update your group, go to the main settings page and click on Manage Groups.  Check the box for your group and click "Add/Remove Objects."  Add the new devices to buff up your group, submit it, and you are done.


    Updating the Map

    Updating your Orion maps is super simple.  Open Network Atlas, drag the new devices to the page.  Click "Connect Now" and boom! the links are built.  You can customize your maps as much as you like, but I kept them pretty simple for this view because it's all concentrated on troubleshooting performance issues between the branch office and the headquarters.


    Quality of Experience Statistics

    Leveraging the Quality of Experience (QoE) statistics gives you a quick view into whether degraded performance is coming from the Application or from the Network.  It does this by analyzing incoming packets at either the server or the network device level.  Configuring the QoE Agent to perform packet analysis is beyond the scope of this post, but if you want some more information, I'd suggest that you check out the SolarWinds Packet Analysis Sensor Deployment Guide.

     

    Just like whenever you want to add new resources to any page, start by clicking on Customize Page in the top-right corner.  HINT: If you don't see "Customize Page," then you don't have sufficient permissions and you should speak with your Orion Admin.

     

    From the Customization Page, click the green plus sign in any column to get the Add Resource box.  Search for QoE Statistics, check the box and click "Add Selected Resources."

    Add_Resource_QoEStatistics.png

    Click Done on the Customization Page and you've got the first part done.  This resource is already filtered for our group, so you should only see devices in the list which are in your group.  If that's all that you want, feel free to stop.  However, since this page is mostly about watching out for critical problems (à la a NOC view), I'd suggest tweaking the display and filter slightly.  I've elected to apply a filter to only show threshold violations within the last 24 hours (essentially, "only show me potential problems").  Actually, it's pretty accurate in displaying problems because it auto-baselines and "knows" your environment.  Enough on that - let's change the display of the resource.

     

    To do this, click on the Edit button in the resource, check the box for "Show values exceeding thresholds only," and (optionally) update the title or subtitle to better describe what you are seeing.  I decided to update the subtitle to "Only Values Exceeding Thresholds (Last 24 Hours)" so that anyone who happens upon this resource knows exactly what it's displaying.  Hit Submit to lock in these selections.

    QoE_Resource_Customization.png

    And that's it.  The QoE Resource is configured in about 7 clicks (not that I counted or anything) and some typing.  Congratulations, you've got even more insight into the traffic and performance of all of your network connected gear.  Now you can dig directly to a problematic QoE Application or Node.


    Custom Table of CBQoS Drops

    The first iteration of the view included an embedded Orion Report showing the Class-based Quality of Service (CBQoS) drops.  In an environment where CBQoS is critical (I'm looking at you, VoIP and Video-On-Net environments), keeping track of your QoS policies and any associated drops is a must-have for understanding slow or poor network performance.  As network engineers, we all know that slow application performance is synonymous with the network being down.  ("Slow is the new down.")

     

    Although the embedded report showed the metrics we wanted, I felt that it could be better.  With that stuck in my head, I took some time and came up with what I feel is a better solution.  Here's an example of the results (in a plain table):

    NODE   INTERFACE   POLICY NAME       CLASS                             DIRECTION  TOTAL BYTES  BITRATE
    HQWAN  Gi0/1.2021  QoS-WAN-Ethernet  CIFS_CM                           Egress     3.97 GB      2.52 Mbps
    BOWAN  Gi0/1.2022  QoS-WAN-Ethernet  CIFS_CM                           Egress     4.34 GB      1.17 Mbps
    HQWAN  Gi0/1.2021  QoS-WAN-Ethernet  WebTraffic_CM                     Egress     2.04 GB      316.88 Kbps
    BOWAN  Gi0/1.2022  QoS-WAN-Ethernet  WebTraffic_CM                     Egress     404.49 MB    45.24 Kbps
    HQWAN  Gi0/1.2021  QoS-WAN-Ethernet  class-default (QoS-WAN-Ethernet)  Egress     14.15 MB     21.98 Kbps
    BOWAN  Gi0/1.2022  QoS-WAN-Ethernet  class-default (QoS-WAN-Ethernet)  Egress     23.02 MB     1.12 Kbps

     

    This was created as a Custom Table in Orion with hyperlinks to the NetFlow resources for the elements in the Node, Interface, and Class columns.  This way you can just "jump" right to the view that will provide you the best information.  I could go over how I developed this query step-by-step, but we'd both be very bored while I go over the details.  Instead, let me do the heavy lifting and you can skip to the fun part!

     

    Like before, let's update your existing view by clicking on Customize Page in the upper-right and adding a new resource to any of your columns.  Search for the resource type "Custom Table," check the box, and click Add Selected Resources.

    Add_Resource.png

    Once that's in your list, don't click "Done" yet!  Scroll down and select the Group Name from the View Limitation area and copy it to your clipboard.  You'll need it for the next part.  Got that copied?  OK, now you can click "Done."

     

    View_Limitation.png

     

    Now you've got a new blank resource on your page.  It's brilliant, but... blank.  Don't fret, that's next.

     

    Scroll to your new resource and click on either the Edit button or the "Configure this resource" link.  They both take you to the same place.

     

    Empty_Custom_Resource.png

    Let's start with a new name.  "Custom Table" is nice, but not very descriptive.  I've elected to go with "QoS Drops" and added a subtitle.  Your mileage may vary.

     

    Now, let's create the Datasource.  Click on the button and let's define the details.  Start by changing the selection method to "Advanced DataBase Query (SQL, SWQL)," set the Query Type selector to SQL, and give it a descriptive name in the Selection Name field.  Finally, copy the query below into the large field.

     

    DECLARE @GroupName VARCHAR(255);
    SET @GroupName = 'Branch Office ↔ Headquarters';
    
    SELECT  '<a href="/Orion/TrafficAnalysis/NetflowCBQoSDetails.aspx?NetObject=NN:'
            + CAST(NetFlow_CBQoSAlertsView.NodeID AS VARCHAR(25))
            + ';T:Last%2024%20Hours;FD:' + NetFlow_CBQoSAlertsView.Direction
            + '">' + NetFlow_CBQoSAlertsView.NodeCaption + '</a>' AS [Node] ,
            '<a href="/Orion/TrafficAnalysis/NetflowCBQoSDetails.aspx?NetObject=NI:'
            + CAST(NetFlow_CBQoSAlertsView.InterfaceID AS VARCHAR(25))
            + ';T:Last%2024%20Hours;FD:' + NetFlow_CBQoSAlertsView.Direction
            + '">' + Interfaces.Caption + '</a>' AS [Interface] ,
            NetFlow_CBQoSAlertsView.PolicyName AS [Policy Name] ,
            '<a href="/Orion/TrafficAnalysis/NetflowCBQoSDetails.aspx?NetObject=CCM:'
            + CAST(NetFlow_CBQoSAlertsView.[PolicyID] AS VARCHAR(25)) + ';I:'
            + CAST(NetFlow_CBQoSAlertsView.InterfaceID AS VARCHAR(25))
            + ';T:Last%2024%20Hours;FD:' + NetFlow_CBQoSAlertsView.Direction
            + '">' + NetFlow_CBQoSAlertsView.ClassName + '</a>' AS [Class] ,
            NetFlow_CBQoSAlertsView.Direction ,
            NetFlow_CBQoSAlertsView.TotalBytes AS [Total Bytes] ,
            NetFlow_CBQoSAlertsView.Bitrate
    FROM    NetFlow_CBQoSAlertsView
            INNER JOIN Interfaces ON NetFlow_CBQoSAlertsView.InterfaceID = Interfaces.InterfaceID
    WHERE  ( NetFlow_CBQoSAlertsView.Bitrate > 0 )
            AND ( NetFlow_CBQoSAlertsView.TotalBytes > 0 )
            AND ( NetFlow_CBQoSAlertsView.StatsName = 'Drops' )
            AND NetFlow_CBQoSAlertsView.NodeID IN (
            SELECT  [EntityID] AS NodeID
            FROM    [Containers_AlertsAndReportsData]
            WHERE  CAST(GroupName AS VARCHAR(255)) = @GroupName
                    AND EntityType = 'Orion.Nodes' )
    ORDER BY NetFlow_CBQoSAlertsView.Bitrate DESC; /*${FromTime}*/

    Download the script here: QoS_Drops_Report.sql

     

    When that's done, it should look something like this:

    Datasource.png

    Before you leave this screen, you need to change the variable near the top to be the group name in your environment.  If it's still on your clipboard, paste it to replace 'Branch Office ↔ Headquarters' in your query (Line 2), but be sure to keep the single quotes surrounding the name.  Once done, click Update Data Source to move on to the next step.
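
    If you are not sure of the exact spelling of your group name, one way to check is to list the names the query will compare against.  This is only a sketch: it assumes you can run ad-hoc SQL against your Orion database, and it reuses the same Containers_AlertsAndReportsData source and GroupName cast that the query above already relies on.

```sql
-- List each distinct group name as the report query will see it.
-- Assumption: run against the Orion database with read access.
SELECT DISTINCT
        CAST(GroupName AS VARCHAR(255)) AS [Group Name]
FROM    [Containers_AlertsAndReportsData]
ORDER BY [Group Name];
```

    Copy the matching value into the SET @GroupName line, keeping the single quotes around it.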

     

    Caveat: Please note that this particular query works fantastically for my lab because it references a static group (a group with only nodes and no dynamic objects).  It may not work 100% in your environment depending on how your group is defined.  If you run into problems, I'd encourage you to post questions to the NTA Forum and feel free to reference this page.  Also, I'm not a DBA - I'm more of an Accidental DBA so the query may not be as optimized as it can be.  But I'm always learning, so if you have suggestions, please contact me.  Enough stalling, moving on...
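
    To get a feel for whether this caveat applies to you, a quick check is to count the group's members by entity type.  This is a sketch that uses only the table and columns already referenced in the query above; replace the sample group name with your own.

```sql
DECLARE @GroupName VARCHAR(255);
SET @GroupName = 'Branch Office ↔ Headquarters';  -- replace with your group name

-- Count the group's members by entity type.
SELECT  EntityType ,
        COUNT(*) AS [Members]
FROM    [Containers_AlertsAndReportsData]
WHERE   CAST(GroupName AS VARCHAR(255)) = @GroupName
GROUP BY EntityType;
```

    The report's subquery only keeps rows where EntityType is 'Orion.Nodes', so members of any other entity type listed here are not used to pick nodes for the table.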

     

    Next we'll need to select the columns that we want to show in this resource.  Click "Add Column..." to select the columns.

    Select_Columns.png

    The important part in selecting the Columns is the order (this is how they will appear in the resource).

    Add_Column.png

    For my uses, I checked them in this order:

    Selected_Columns.png

    Then click "Add Columns" to add them to the resource.

     

    Three of the columns are actually links, so we need to allow HTML tag support.  You need to do this for Node, Interface, and Class.  Click on the "+" next to Advanced and check the "Allow HTML tags" box.  It should look like this:

    Allow_HTML_Tags.png
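
    To see what one of those link cells contains before the resource renders it, here is a minimal sketch that builds a Node link for a single made-up row.  The NodeID, caption, and direction values are hypothetical placeholders; the URL pattern is copied from the query above.

```sql
-- Hypothetical sample values standing in for one row of NetFlow_CBQoSAlertsView.
DECLARE @NodeID INT;
DECLARE @NodeCaption VARCHAR(255);
DECLARE @Direction VARCHAR(25);
SET @NodeID = 1;
SET @NodeCaption = 'HQWAN';
SET @Direction = 'Egress';

-- Follows the concatenation pattern used for the link columns in the report query.
SELECT  '<a href="/Orion/TrafficAnalysis/NetflowCBQoSDetails.aspx?NetObject=NN:'
        + CAST(@NodeID AS VARCHAR(25))
        + ';T:Last%2024%20Hours;FD:' + @Direction
        + '">' + @NodeCaption + '</a>' AS [Node];
```

    The result is a plain string of anchor markup, which is why the column needs Allow HTML tags checked; without it, you would expect to see the markup as text instead of a clickable link.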

    Now I come from an old school mentality when it comes to displaying numbers in a table.  (I blame too many years using Excel.)  I always like numbers aligned right, so that's what I did.  We'll also change the display settings to automatically convert the values for bits/second and bytes.  I went to the advanced settings for the Bitrate and Total Bytes and tweaked a few settings for them as well.  Start by selecting "Data Unit" from the "Add Display Setting."

     

    Total Bytes      Bitrate
    TotalBytes.png   BitRate.png

     

    That's it!  Hit Submit, and you're golden!


    Using the View to Troubleshoot

    Since this view is all about troubleshooting, I should give you some insight into how to use the view to accomplish some simple (and advanced) troubleshooting.

     

    Application Response Time (ART) Threshold Violation

    If I find a violated threshold due to Application Response Time (ART), like we do for the MS SQL QoE Application, then I click directly on the node name so that I can get access to the AppStack Environment for that Node (Mini-Stack).  This lets me pinpoint problems to the Transaction, Application, Virtualization, or Storage layer in one shot (for the lab-dem-sql-02.demo.lab server it looks like a problem with the Read-Latency on the LUN which is affecting the Datastore).  Were this not a demo, I'd go immediately to the storage team and ask, "Why is a critical SQL Server sharing the same storage resources as multiple virtual disks on a hypervisor?"  Since this is my own lab, and I built all of this, I only have myself to blame.

     

    Network Response Time (NRT) Troubleshooting

    If I find a threshold violation due to Network Response Time (NRT), like we do for the HTTP Application, then I'd start by looking at the network as a whole.  Since it looks like we have a QoS Policy which is stopping web traffic on the HQWAN router, it's probably related to that.  Taking a look at the NCM Events, we can see that there was a recent change.  Examining the Config Change Report, we see that a QoS Policy was applied to the device that wasn't there before.  We can leverage NCM's advanced configuration management to revert this device to a previous, known-working config file.


    Extending it even further!

    If there's one thing that I absolutely love about the Orion platform, it's that it is crazy-customizable.  There are so many resources available out-of-the-box, but on the off-chance that you are missing something that you really, really want, you can always create your own!  This is just one more example.  Please leave comments and share some of your own customizations!  I'm always on the lookout for new ideas!