User: "The Network is so slow"
In most cases, it's a bad idea to leap to the conclusion that more bandwidth is required without a thorough investigation into how bandwidth is currently being consumed.
Tell us about your techniques and experiences tracking down the elusive "bandwidth beast", and earn Thwack points in the process!
Working with low-level tools gives you only one small piece of the puzzle at a time, and you then have to keep a mental model of how everything fits together. Unfortunately, the human brain doesn't scale to encompass large data sets.
We all know that monitoring tools let you visualize the flow of bandwidth and spot bottlenecks. But how you get that job done is the heart of the matter. Do you start with a "top 10" list and drill in? Do you wait for an alert? Do you review NetFlow reports every week? Check for significant deltas? Just dig into specific wheels when they (or their users) start to squeak?
What do you think are the best ways to see what’s really going on with network bandwidth? Are there any stories you can share about what was causing slow performance or acting as a bandwidth hog?
We've included a few true stories and best practices in a new whitepaper “Is It Really the Bandwidth? Three Steps to Diagnose Bandwidth Complaints.”
What other stories do you have?
1. Edit web.config in C:\inetpub\SolarWinds and change
<add key="HubbleActive" value="False" />
When Hubble is active, it calculates and counts a number of performance related statistics on every page load.
The first value lists how long in milliseconds it took to load this page.
The second value lists how many SWIS queries had to run in order to load the page.
The third value shows how many business layer calls took place.
The warnings typically indicate duplicated queries or long-running queries.
The last figure is the viewstate size - the larger the viewstate, the more will be stored in memory both client and server side, and the slower the page will load.
Click on the "Details" link to drill down and view more statistics about this page load.
This page will explain in full detail how to read and interpret the detailed statistics: Hubble
2. To help with slow DB connectivity, edit SWNetPerfMon.DB in C:\Program Files (x86)\SolarWinds\Orion and change the following (then save the file):
! Connection timeout in seconds
! Database Command timeout in seconds
! SqlCommand.CommandTimeout in seconds
3. Disable unused NICs
Please exclude the following from AntiVirus software:
“C:\Documents and Settings\All Users\Application Data\Microsoft\Crypto” folder.
Please see the link below for SQL File Segregation
Separating the data files from the log files in storage can boost performance significantly. The SolarWinds Orion database consists of data files with .mdf or .ndf file extensions, and log files with an .ldf extension. These files exist for the main database and for the temporary (temp) database SQL uses for moving data.
File Segregation --> http://www.solarwinds.com/documentation/Orion/docs/OriondbBestPractices.pdf
++ Basically we would like to see your SQL database on a RAID 1 for better IO performance ++ The better your SQL performs the better the overall Orion products will do.
How-to Measure Database File I/O Performance and Configure Cloud DB Monitoring
Database Performance Analyzer Guided Tour
In a previous life we had a unique challenge to monitor a client environment to try and determine under-provisioned circuits. We used custom bandwidth settings on all of the external interfaces and then built an alert that checked for sustained latency without either upload or download utilization > 80%. This wasn't a fool-proof alert and still required some triage on our side to see if the problem was legitimate but it worked. Having those alerts and being able to back up our claims with data definitely helped when we had to go to the service providers and tell them that they had under-provisioned a circuit.
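The alert condition described above can be sketched roughly as follows. This is only an illustration of the logic, not the actual NPM alert definition: the `Sample` structure, the 150 ms latency threshold, and the five-poll window are assumptions, and only the 80% utilization figure comes from the post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One polling interval for a circuit (hypothetical structure)."""
    latency_ms: float
    rx_util_pct: float  # download utilization vs. the custom bandwidth setting
    tx_util_pct: float  # upload utilization vs. the custom bandwidth setting


def suspect_underprovisioned(samples: List[Sample],
                             latency_threshold_ms: float = 150.0,  # assumed value
                             util_threshold_pct: float = 80.0,     # from the post
                             sustained_polls: int = 5) -> bool:    # assumed value
    """Return True if latency stays high for `sustained_polls` consecutive
    polls while neither direction exceeds the utilization threshold."""
    run = 0
    for s in samples:
        high_latency = s.latency_ms > latency_threshold_ms
        link_not_busy = (s.rx_util_pct <= util_threshold_pct
                         and s.tx_util_pct <= util_threshold_pct)
        # Count consecutive matching polls; any non-matching poll resets the run.
        run = run + 1 if (high_latency and link_not_busy) else 0
        if run >= sustained_polls:
            return True
    return False
```

As in the original setup, a hit is only a candidate for triage, not proof that the circuit is under-provisioned.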
He who has the most convincing data wins!
Of course, now that NPM has QoE via DPI that changes the game, especially for companies that host custom applications on a Windows platform. We can get super-specific about whether that squeaky wheel needs to have network or application resources applied instead of hauling in a room full of people.
In my career, I have spent a lot of time on this issue. Most recently, I had the opportunity to review the "Squeal" method. I used NPM to look over the overall connectivity and then drilled into the actual interface of the user who was squealing. 99% of the time it is not a bandwidth issue and not a latency issue but something in the application the user was using or how they were requesting data. I am a big believer in having objective data available versus a subjective point of view of a situation. Sometimes I have resorted to using tools like MRTG against a particular remote switch to gather individual switch port statistics over time, which I can then use to show the user(s) what is really occurring with their connections versus what they "think" is occurring.
These investigations usually end up finding that the way the application was being used caused the issue, like a blanket SQL request returning millions of records because of a typo when it should have returned 10 records.
In a previous life, I worked at a company that had over 800 users pulling data across two bonded T1 links, and I used tools like NPM and MRTG to prove that bandwidth was indeed the cause of the problems.
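For a sense of scale, here is a back-of-the-envelope calculation for that link. It is a rough sketch that assumes every user is active at once and that capacity is split evenly, and it ignores protocol overhead. The only hard number added is the standard T1 line rate of 1.544 Mbps.

```python
T1_MBPS = 1.544  # line rate of a single T1


def per_user_kbps(links: int, users: int, link_mbps: float = T1_MBPS) -> float:
    """Even share of the bonded capacity per user, in kbps,
    assuming all users are active at the same time."""
    return links * link_mbps * 1000 / users


share = per_user_kbps(links=2, users=800)
print(f"{share:.2f} kbps per user")  # prints "3.86 kbps per user"
```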
Mostly I wait for that squeaky wheel.
I always make sure to monitor all interfaces for statistics. I configure most ports as "Un-pluggable" so they don't alert, but I still retain all statistics, so that when the squeaky wheel shows up we have the historical statistics to look back on.
I can't say how many times a company has wanted to upgrade to 10 Gb for all users when they have never come even close to 10 Gb aggregate across the trunk uplinks.
It used to be the same argument in the old days when folks wanted to go from 100 Mb to 1 Gb. It is really surprising to see how little the total aggregate is most of the time.
(Of course this changes a lot in the new landscape of things with so much multimedia going across the internet.)
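One way to put a number on "total aggregate across the trunk uplinks" is to sum the per-interval readings from every uplink and compare the peak against installed capacity. The sketch below assumes the readings have already been exported as lists of Mbps values aligned by polling interval. The uplink names and figures are made up for illustration.

```python
from typing import Dict, List


def peak_aggregate_mbps(uplinks: Dict[str, List[float]]) -> float:
    """Highest combined throughput across all uplinks in any one polling
    interval. Each list holds Mbps readings aligned by index."""
    return max(sum(interval) for interval in zip(*uplinks.values()))


def headroom_pct(peak_mbps: float, capacity_mbps: float) -> float:
    """Share of total capacity left unused at the observed peak."""
    return 100.0 * (1 - peak_mbps / capacity_mbps)


# Hypothetical readings for two 1 Gb trunk uplinks.
history = {
    "uplink-1": [120.0, 310.0, 95.0],
    "uplink-2": [80.0, 240.0, 60.0],
}
peak = peak_aggregate_mbps(history)
print(f"peak aggregate: {peak:.0f} Mbps, "
      f"headroom: {headroom_pct(peak, capacity_mbps=2000.0):.1f}%")
```

Peak rather than average is used here because averages hide the short bursts users actually complain about.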
On critical WAN ports I do keep an alert turned on for sustained throughput so that I am notified when it happens.
I also like to group high-bandwidth interfaces and use a diagram to depict them with the Atlas program in a specific custom view. For example, I have the iSCSI storage interfaces in a single picture for our ESXi farms so we can easily verify that bandwidth load is evenly distributed and spot high spikes and such.
What do you recommend, Leon? Especially if someone has NTA, what additional reports or alerts would be most useful?