
As we wrap up the fourth installment of my Help Desk Adventure Series (I'm going to trademark that), I've described the journey from building out a help desk to defining SLAs and incorporating workflow automation. Looking back, I think the one resource that was left out of this discussion was time. This resource is finite, difficult to find, and often buried under other daily tasks. Ever hear of the 80/20 rule? It's a slightly modified idea taken from the Pareto Principle: that we as IT professionals spend 80% of our time on trivial tasks, and 20% innovating (and providing some real impact). In some of my positions, it might as well be the 99/1 rule (or the 100/20 rule, in which I spent nights and weekends on the innovation part).

 

How did my team and I get the time to build out a new help desk system? I made a few "bold statements" to management to help get this off the ground and created some daily team exercises. Again - support from the upper echelons is critical!

 

  • The help desk, and related SLA and workflow creation, was considered an IT project with time allotments. We could budget team time towards the help desk and could give it priority in the queue.
  • The team would cover for one another when other issues arose. For example, I would spend time working issues on the call center floor so that a member of the team could focus on building a workflow. Broken concentration kills innovation.
  • I used the term "technical debt" quite often: if we put off building a help desk now, we pay for it with more work later. We wanted to pay off our operational debt quickly and efficiently.
  • A morning "red zone" meeting would be held at the start of the day. We'd review the backlog of work to complete and determine what we wanted to get done that day. It was also a great time to figure out how we could best help each other with various daily tasks, and communicate progress.

 

Knowing that it's very difficult to carve out time for any new work, I'm curious if you have any other tips to add to my list? How have you managed to free up time for your help desk creation, updates, workflows, or just general tasks that make your help desk better?

In my last post, I took a look at the DNS protocol with tcpdump, and as it turns out you can do some really useful stuff with the embedded protocol decoders. So, how far can we take troubleshooting with tcpdump? Well, pretty far, but in troubleshooting you have to decide whether the fastest resolution will come from the tools you have to hand, or from grabbing the capture and using something better. Whilst you *can* definitely do some analysis of HTTP, as we'll see, once the initial handshake is complete, it gets messy, real quick.


ASCII - Still a thing


The most useful switch for debugging HTTP is -A, which decodes the traffic in ASCII format, which kinda makes it human readable. To kick off a capture on our server we run:


[~] # tcpdump -i eth0 -pnA port 80


Capture 0.PNG


For sanity's sake, I've snipped out the initial handshake. After a few packets we can see the client's request, the ACK, and the server's response. Probably the most interesting parts are highlighted in yellow:


  • The HTTP GET request from the client (GET / HTTP/1.1)
  • The server HTTP 200 return code (HTTP/1.1 200 OK)
  • The Content-encoding (gzip)
  • The content type returned (text/html)


Sorry, I've got no HEAD


Various other headers are displayed, but they're not usually that useful. Beyond that, it's HTML:


Capture 1.PNG


But that doesn't look anything like HTML. There are no lovely <HEAD> or <HTML> tags. The clue is in the client and server headers: whilst the session is not encrypted, with gzip compression enabled, for a human it may as well be. You can't see the conversation between the client and server once the TCP and HTTP parameters are established. However, we can divine the following:


  1. The URL the client requested
  2. The server was happy to accept the request
  3. The parameters/session features enabled (such as gzip compression)
  4. But not much else


Somewhere in this packet exchange, a redirect sends the client to a different TCP port. However, from the tcpdump alone we can't see that. There *may* be an arcane way of getting tcpdump to decompress gzip on the fly, but I'll be darned if I can figure it out. As a workaround, you could disable compression on your browser, or use a CLI web tool such as cURL. However, changing things just to troubleshoot is never a good idea, and that wouldn't help if your problem is with gzip compression itself.
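If you do go down the cURL route, note that curl only asks for gzip when told to, which makes a quick side-by-side comparison easy. A minimal sketch, with the address as a placeholder for whatever server you're testing:

[~] # curl -v http://192.0.2.10/ -o /dev/null

[~] # curl -v --compressed http://192.0.2.10/ -o /dev/null

The first request should come back as plain text; the second adds an Accept-Encoding header and mimics what the browser was doing.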


404 - Not Found


Another example shows a less than healthy connection:


Capture 2.PNG


This time, the client is requesting reallybadfile.exe, so the server returns a 404 Not Found error. Random clients attempting to request executables is, of course, a sign of virus or other malicious activity. Many firewalls can filter this stuff out at the edge, but this is a job best suited to a load balancer or Application Delivery Controller (posh load balancer).


If you are just interested in the negative status codes, you can of course just pipe the output to grep:


[~] # tcpdump -i eth0 -pnA port 80 | grep '404\|GET'


Capture 3.PNG


This is an especially quick and dirty method; of course you could pipe in multiple status codes, or use egrep and a regex, but from the CLI you run a pretty big risk of missing something.
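If you do want to cast the net a little wider from the CLI, here is a rough sketch along those lines; the -l switch keeps tcpdump's output line-buffered so grep sees it promptly, and the status-code range is just an example to adjust to taste:

[~] # tcpdump -i eth0 -pnAl port 80 | egrep 'GET |HTTP/1\.[01] [45][0-9][0-9]'

That catches the requests plus any 4xx or 5xx responses, but the same caveat applies: eyeballing a live stream is no substitute for a proper capture.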


(insert Wireshark pun)


Sometimes it's best to admit defeat and make a permanent capture for analysis elsewhere. To do this we use the -w switch to write the packets to disk. The verbose switch is also helpful here, as it reports the number of packets received during the capture, so you know you've actually caught something.


[~] # tcpdump -i eth0 -pnAv -w httpcapture.pcap port 80


Capture 3.PNG
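Before hauling the file anywhere, it's worth checking that the capture actually contains what you expect; tcpdump can read its own files back with -r. A small sketch using the filename from above (the user and host in the copy step are placeholders for your own):

[~] # tcpdump -nA -r httpcapture.pcap | less

and then, from your workstation:

$ scp admin@192.0.2.10:httpcapture.pcap .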


Then the session can be analysed in Wireshark with minimal effort. Just grab the file any way you can, and open it; the built-in decoders will work their magic. Once loaded, go to Analyze and Follow TCP Stream. This will show you a cleaned-up version of the capture we took in the beginning, but with the payload still encoded.


Capture5.PNG


No problem: just go and find the HTTP response packet, dig through it, and the payload should be in the clear:


Capture 6.PNG

And here, now highlighted in yellow, we can see a tiny piece of JavaScript that redirects my HTTP session to another location. Because the redirect was not done with an HTTP/1.1 3xx status code, it became much harder (if not impossible) to realise what was going on using tcpdump alone. With some elite-ninja regex skills, and a perfect recollection of the HTTP protocol, maybe you could have figured this out without bringing in the Wireshark heavy artillery. However, for mere mortals such as myself, it's just a case of knowing when to bring in the really big hammer.


So, that's my little treatise on deep packet analysis, with some practical applications. Please let me know your thoughts in the comments, along with any tips you can share for tcpdump, Wireshark, or any of the other excellent tools out there.


Day in, day out, network engineers and admins tackle many complex issues. Often, these issues stem directly from an IP conflict. While IP conflicts might seem like one of the simplest problems that occur, they can actually be quite troublesome. How often do you find yourself spending an excessive amount of time troubleshooting a network issue, only to find the reason to be an IP conflict?

 

We all know what an IP conflict is and the ways it can be triggered. The simplest is the manual assignment of static IP addresses. A misconfigured DHCP server is another cause, where more than one similarly configured DHCP server hands out overlapping IP addresses. IP conflicts can also occur if you have multiple wireless access points or other network devices with an embedded DHCP server turned on by default. DHCP servers also cause conflicts when a statically assigned IP coincides with an address that's already been assigned by the server. And the most frequent cause nowadays, with devices constantly entering and leaving the network: when a node such as a virtual machine reconnects after an extended period in stand-by or hibernate mode, a conflict is likely if the address configured on that system has since been assigned to another system that is already active on the network.

 

With IP conflicts occurring in so many different ways across the network, the difficulty lies in recognizing that the cause of a particular network issue is an IP conflict in the first place. Moreover, it's even more difficult and time consuming to identify and physically locate the devices in conflict.
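One cheap sanity check when you suspect a conflict, or before assigning a static address, is to probe the address from any Linux box on the same segment using arping from the iputils package. A minimal sketch; the interface and address are placeholders for your own:

[~] # arping -D -I eth0 -c 2 192.168.1.50

In duplicate address detection mode (-D), a reply means another host already owns that address; silence suggests it is free.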

 

The Pain in Troubleshooting IP Conflicts


Recent survey results revealed that IT professionals spend, on average, 16% of their total time troubleshooting IP problems. One basic and important criterion for efficient IP address management is to avoid IP duplication. IP conflicts pose a threat because they cause connectivity issues for the systems in conflict. These connection issues are difficult to troubleshoot, as the systems experience erratic connectivity; the administrator often ends up looking in a variety of places before finally identifying the culprit to be IP duplication. Again, the seriousness of the issue differs depending on whether the conflicting system is a simple workstation or a critical application server.

 

For example, say someone brings a notebook into work and plugs it into the network. Soon thereafter, all remote connections to the accounting server go down. Not knowing why, IT starts to investigate the problem. First they reboot the remote access server. Next, they change the switch port and network cables. Then, as a last resort, they try unplugging all devices from the switch. After that, the problem just goes away. Problem solved: another device connected to the switch was causing the issue. In this scenario, IT only started looking for an IP conflict late in the process, and eventually found the outside computer plugged into the network to be the problem. Consider all the time spent troubleshooting this issue. In the end, many hours are lost and the business likely suffers downtime.

 

As a network engineer or admin, you’re often under the gun to resolve issues with the least amount of downtime. But hey, you don’t have to be…there’s always a peaceful resolution. So, how do you avoid excessive troubleshooting time or at least limit it? Do you use spreadsheets or an IP management tool?

Note: This post originally appeared in Information Week: Network Computing

Why do we hear of new security breaches so frequently? Make sure your organization follows these best-practices and considers these important questions to protect itself.


Three big things have been happening with great frequency of late: earthquakes, volcanoes, and data breaches, most of the latter involving point-of-sale (PoS) systems and credit card information. While I'm certainly curious about the increase in earthquakes and volcanic activity, I simply do not understand the plethora of PoS breaches.

The nature and extent of the breach at Target a year ago should have been a wake-up call to all retailers and online stores that accept credit card payments. I get the feeling that it was not, but I'm not here to point fingers in hindsight. I do, however, want to call your attention to what you are, or are not, learning from these incidents, and how those lessons are being applied and leveraged within your own organization.

Lessons from Target, et al.
Let's revisit the Target breach. In short, it happened because vendor credentials were compromised and subsequently used to inject malware onto Target's systems. At the time, a number of security professionals also suggested that the retailer was likely not the only target (no pun intended).

As a result, three actions should have occurred immediately in every organization around the globe:

  • An audit of every accounts repository throughout every organization to disable/eliminate unused accounts, ensure active accounts were properly secured, and determine if any existing accounts showed any evidence of compromise
  • A full malware scan on every system, including explicit checks for the specific malware identified on the Target systems
  • A reevaluation of network connectivity, with these questions in mind:
    • How could a service vendor's credentials be used to access our PoS network?
    • Which of our networks are connected to which networks?
    • How are they connected?
    • Do firewalls exist where they should?

And yet, in the weeks and months after the Target announcement, a litany of big-name retailers, including Neiman Marcus, Michaels, Sally Beauty Supply, P.F. Chang's, Goodwill Industries, and Home Depot, reported breaches that occurred around the same time as, or after, the Target breach was disclosed.

If you haven't done the three things listed above in your organization, go do them right now!

Patching is a no-brainer
Then there was Heartbleed, perhaps the most saturated vulnerability threat in the history of network computing. Who hasn't heard about Heartbleed? It was a threat with an immediately available and simple to deploy patch. Most organizations deployed the patch immediately (or at least took their OpenSSL devices off the Internet).

And yet, despite this, Community Health Systems managed to give up 4.5 million customer healthcare records to Chinese hackers in an attack that started a week after the Heartbleed announcement. Now, while we might forgive the April attack, this theft actually continued through June! To date, this is the only known major exploit of that vulnerability. (And yet, there are still a quarter-million unpatched devices on the Internet!)

What is your plan for ensuring highly critical security updates are deployed to your devices as soon as possible -- and if not, protecting those devices from known threats?

When is compliance not compliant?
The final aspect of all of this is the alleged value of our compliance regulations, which raises some interesting questions. For example, what good comes from the PCI-DSS regulations in the face of so many breaches? Is this a failure of the compliance standards to actually define things that should be compliant? Is this a case of businesses fudging the compliance audits? Finally, where's the meat in PCI-DSS for organizations failing to be compliant?

And how responsible is management? Perhaps the most infuriating thing about the Home Depot incident is the recent report that management had been warned for years that there were known vulnerabilities, and yet did nothing.

Is your management resistant to acting responsibly about data security? Do you have a plan for changing this resistance?

The bottom line is this: Don't be the next story in this long train of disasters. Go check your systems, networks, accounts, and employees. Most of all, learn from the tribulations of others.

In my last post I looked at how flags can pull useful information out of packets that we might otherwise struggle to see. This time, we're going to use tcpdump to look into the actual applications.


The first application I'm going to look at is the humble Domain Name Service (DNS), the thing that needs to work flawlessly before any other application can get out of bed. Because DNS lookups are typically embedded in the OS IP stack, a packet capture is often the only way to get any visibility.


The NS in a Haystack


In my scenario, I suspect that something is amiss with DNS, but I'm not sure what. To pick up just the DNS traffic, we only need to capture with a simple filter:


[~] # tcpdump -i eth0 -pn port 53


Capture.PNG


In the above example, even with minimal options selected, we can see some really useful information. The built-in decoder pulls out the transaction ID from the client (equivalent to a session ID), the query type (A record) and the FQDN we are looking for. What is unusual in this example is that we can see not one, but two queries, about 5 seconds apart. Given that we are filtering on all port 53 traffic, we should have seen a reply. It would appear that my local DNS proxy (172.16.10.1) for some reason failed to respond. The client-side resolver timed out and tried the Google Public DNS. This may be a one-time event, but it certainly bears monitoring. If the client configuration has an unresponsive or unreliable DNS server as its first port of call, at the very least this will manifest as a frustrating browsing experience.
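If you suspect the first resolver is the flaky one, you can take the OS resolver out of the picture and query each server directly with dig (assuming it is installed); a quick sketch using the addresses from this example:

[~] # dig @172.16.10.1 google.co.uk A +time=2 +tries=1

[~] # dig @8.8.8.8 google.co.uk A +time=2 +tries=1

A timeout from the first command but a prompt answer from the second would confirm what the capture is hinting at.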


Selection of the Fittest (and Fastest)


Selection of DNS servers is pretty important; I hadn't realised that my test Linux box was using Google as a secondary resolver. Whilst it is reliable, it's actually four hops and a dozen milliseconds further away than my ISP's service. When your broadband is as crappy as mine, every millisecond counts.


Anyway, as you can see, Google returns eight A records for google.co.uk; any of them should be fine.


Another thing to look for is what happens when we make an invalid query, or there is no valid response:


Capture2.PNG


In this case we get an NXDomain (non-existent domain) error. This one is an obvious typo on my part, but if we turn up the logging with the very verbose (-vv) switch the response is still interesting:


[~] # tcpdump -i eth0 -pnvv port 53


Capture3.PNG


Highlighted above is the SOA (start of authority) record for the domain .ac.uk; this is as far as the server was able to chase the referral before it got the NXDomain response.


Edit - Contributor JSwan pointed out a small mistake; I've fixed it in the version below.


Whilst a bunch of stuff is revealed with very verbose enabled, not all of it is useful. One thing to look at is the IP time to live (TTL); it is decremented at every hop, so a low value suggests the packet has crossed a lot of routers since leaving the source, which can be an indicator of routing problems or high latency (I did say it wasn't very useful!).


Furthermore, the DNS protocol-specific TTL can be seen highlighted in yellow, after the serial number in date format. The DNS TTL specifies how long the client (or referring server) should cache the record before checking again. For static services such as mail, TTLs can be 24 hours or more; for dynamic web services this can be as low as 1 second. TTLs that low are not a great idea; they generate HUGE amounts of DNS traffic, which can snowball out of control. The moral is: make sure the TTLs you are getting (or setting) are appropriate to your use case. If you fail over to your backup data centre with a DNS TTL of a week, it will be a long time before all the caches are flushed.
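A quick way to see the TTL a record is actually being served with, without reaching for a packet capture at all, is dig's answer section; a small sketch, using any record you care about:

[~] # dig google.co.uk A +noall +answer

The second column of each answer line is the remaining TTL in seconds, counting down if you are asking a caching resolver rather than the authoritative server.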


As JSwan points out in the comments, if you use the very very verbose switch (-vvv), for A records tcpdump will display the DNS TTL in hours, minutes and seconds:


Capture4.PNG


Apparently Google has a very short TTL. Interestingly, tcpdump doesn't print the DNS TTL for an NXDomain result, although it is still visible in the capture.

 

Why is capturing in context important?


Imagine trying to troubleshoot connectivity for a network appliance. You have configured IP addressing, routing and DNS, yet it cannot dial home to its cloud service. Unless the vendor has been really thorough in documenting their error messages, a simple DNS fault can leave you stumped. Once again, tcpdump saves the day, even on non-TCP traffic. The built-in protocol decoder gives vital clues as to what may be borking an apparently simple communication.


In my next and final blog of this series, I’m going to look at another common protocol, HTTP.





Now that the IT team had established a help desk and had SLAs defined for various requests and workflows, it was time to get proactive. Perhaps the data stored in the help desk could be used to build a sort of analytics engine for making better decisions? This idea came from one of the IT team members, as he saw certain trends that none of the rest of us really could.

 

For example:

  • Which activities are the most frequent, or the most time consuming, and is there a correlation?
  • Could we find ways to proactively automate a number of frequently occurring requests, such as password resets?
  • Was there a way to extrapolate our hardware replacement rates to build a reasonably accurate forecast for next quarter (or next year) budget numbers?

 

It turned out that a number of workflows could be created to solve or remediate requests, with our help desk system acting as the front-end interface. Password requests ended up being the most common, and so we gave non-IT supervisor staff the ability to issue time-sensitive password resets for their pods (their teams) once per day. The resets were still audited to ensure no funny business took place, but this alleviated a fair bit of pain when a call staff member forgot his or her password right as their shift was about to begin. Interestingly enough, we found out that many employees had been getting around this by sharing a password or having a supervisor override the login for our VoIP system. As such, our delegation of password resets closed this loop and gave us further visibility into the problem.

 

What sort of workflows have you built, or wish you could build, around the help desk system in place? How would you use workflows to offload some of the "manual, heavy lifting" off you or your team?

A Guide to Navigating Around the Deadly Icebergs that Threaten Virtualized Databases


When it comes to virtualized workloads, databases can be in a size class all to themselves. This and other factors can lead to several unique challenges that on the surface might not seem all that significant, but in reality can quickly sink a project or implementation if not given proper consideration. If you're put in charge of navigating such a virtual ocean liner-sized workload through these iceberg-infested waters, you'll want to make sure you're captain of the Queen Mary and not the Titanic. How do you do that? The first step is understanding where the icebergs are.

I recently had a conversation with Thomas LaRock, president of the Professional Association for SQL Server (PASS), who also happens to be one of our Head Geeks here at SolarWinds, to get his input on this very topic. Here's what I learned:

 

CPU & Memory Allocation

First, don't treat a database like a file server when it comes to configuration and virtual resource allocation. Typical configurations allow over-allocation of both memory and CPU. However, CPU allocation shouldn't be more than 1.5-2 times the number of logical cores you have. When it comes to memory, don't over-allocate at all if possible; stay at or below 80 percent. As memory utilization gets near 100 percent, you may not have enough resources left to even reboot the host. If you do push your systems, make sure that you not only have a virtualization monitoring tool, but that you actively use it.

 

High Availability Strategy

Most virtualization admins use snapshots and vMotion (Thomas' preferred option) as a primary approach to address high availability concerns. On the Windows side specifically, clustering and availability groups are also common. While either technology can be effective, they probably shouldn't be used together. An example of why not: a database VM gets vMotioned to another host as a result of a performance problem, but the availability group just sees that as a server instance no longer responding. If you do use both, make sure that you don't allow automatic vMotion, so there is always an operator in the mix to (hopefully) prevent problems; otherwise, bad things can happen.

 

To Monster VM or Not?

You might wonder if an easy way to overcome the challenges of virtualized databases is simply to allocate one of VMware's or Hyper-V's "monster VMs" to the database instance and just solve problems by throwing hardware at them. However, a better approach is to put database VMs on a mixed-use host that includes a range of production, development, test and other workload types. The rationale is that if you have a host with nothing but one or more database servers running on it, you have no options if you accidentally run out of resources. With a typical mixed-use host, you're less likely to be simultaneously hammering one resource type, and if you do start to hit a resource bottleneck, the impact of turning off a development or test VM to provide short-term resources will typically be less than shutting down a production database.

 

Taking these considerations and tips into account can help make sure your virtualized databases stay afloat a long time rather than being lost due to a potentially avoidable iceberg on the maiden voyage.

If you're looking for additional information on virtualizing databases, SolarWinds has a number of great white papers available online, including "5 Risks for Databases on VMware" and "Monitoring Database Performance on VMware."

 

Note: This post originally appeared in VMBlog at http://vmblog.com/archive/2014/10/23/would-you-rather-captain-the-queen-mary-or-the-titanic.aspx#.VGUUxfnF8uK

 

In an earlier post, I discussed the path to successfully implementing a help desk system for a mid-sized organization that was new to the idea. Luckily, the hard work of my team and positive reinforcement received from management helped sell the idea further. In fact, I selected a handful of power users - you know, the folks who are "friends of IT" and are out there trying to get their work done - to test the help desk system. I also let it leak that they were using some very cool new system that gave them additional access to IT resources. Those not using the system were curious about this new system and actively sought us out to put in those "cool new tickets" instead of the old email system. It was a type of reverse psychology exercise. :-)

 

Now that users were actively engaged and entering data into the system via tickets, we had some more tough decisions to make around Service Level Agreements or SLAs. The top brass wanted to see results, after all, and those are easiest to swallow if wrapped in some simple numbers. It is hard, however, to define SLAs when all of your data is new and there is not much historical information to work with.

 

A few things we learned along the way:

 

  1. Allowing user-defined ticket priorities to influence which SLAs we could be held against was a mistake. Everyone feels like their ticket is high priority and requires the highest SLA. In the end, our internal IT team learned how to prioritize tickets as they entered the queue. The user-defined priority was left in to make people feel better. :-)
  2. We ended up mimicking some SLAs from our vendors; if our server vendor offered a 4 hour replacement window, we would match that value (or go slightly above it, to allow for break/fix time after the replacement arrived).
  3. Having historical metadata wrapped around ticket objects - such as knowing how many times a specific phone had stopped working - gave us increased confidence to make actionable decisions. This took us far beyond our "standard" 4-hour SLA because we could quickly pitch old hardware into a replacement queue, discard queue, or "fix it" queue. Hardware now told us stories about itself.
  4. Being able to show our SLAs in hard numbers provided valuable protection against fallible human memory. It also pointed out our weak spots that needed improvement, training, or additional skill sets.

 

With that said, I'm curious how you offer Service Level Agreements to your help desk users. Is it a general tiered system (high, medium, low) or something more granular? How did you pick your SLA time tables?

In my last post I talked through some of the reasons why mastering tcpdump is useful. Building on that example, in this post I'll focus on using TCP flags for troubleshooting.


Even with our cleaned-up filter, we can still see quite a lot of traffic that we don't care about. When troubleshooting connectivity issues, the first packet is the hardest, especially when you start involving firewalls. As I'm sure you recall, a TCP packet flagged with SYN is the first one sent when a client tries to establish a connection to a remote host. On a LAN, this is simple, but when the destination is on a different network, there is more to go wrong with the inevitable NAT and routing.


Troubleshooting approaches differ, but I prefer to jump on the target console and work backwards, as traffic is usually only policed on ingress. We need to know whether our packet is reaching the target. That way we can look down the stack (at routing and firewalling) or up the stack to the application itself.


This is where a working knowledge of TCP flags and a copy of tcpdump are helpful.


Signalling with Flags


Each TCP packet contains the source and destination port, the sequence and acknowledgement numbers, as well as a series of flags (or control bits) which indicate one or more properties of the packet. Each flag is a single bit flipped on or off. The diagram below shows the fields in a TCP packet. The source and destination ports are 16-bit integers (which is where the maximum of 65535 comes from), but if you look at bit offset 96, bits 8 through 15 are the individual flags: SYN, ACK, RST, and so on.



Capture 0.PNG


When a connection is set up between a client and server, the first three packets will have the SYN and ACK flags set. If you look at them in any packet analyser, it'll look something like this:


Client -> Server (SYN bit set)

Server -> Client (SYN and ACK bits set)

Client -> Server (ACK bit set)


To make sure we are actually getting the traffic, we want to see the first packet in the connection handshake. To capture packets with the SYN flag set, we use the following command from the server console:


[~] # tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-syn) != 0 and port not 22'


The above tcpflags option filters on specific packet fields. In this case we are using tcpdump's built-in shorthand to look at the bit associated with TCP SYN (the 'Flags [S]' in yellow).


The other thing we are introducing is using ‘single’ quotes on the filter. This prevents the shell from trying to interpret anything within brackets.
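Note that this shorthand matches any packet with SYN set, which includes the SYN+ACK replies. If you only want the opening packet of each handshake, a stricter comparison against the flag bits does the trick; a sketch along the same lines:

[~] # tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn and port not 22'

In this particular scenario it makes little difference, but on a busy host it keeps the noise down.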


In the example below we can see three packets sent ~500ms apart (also highlighted in yellow). Bear in mind that almost everything on the wire is being filtered out; we are only hearing one side of the conversation.


Capture1.PNG


Three packets with the same source and destination ports transmitted 500ms apart give us a clue to what is happening. This is the typical behaviour of a client connection that received no response, assumed the packet was lost in transit, and tried twice more.


What does the flag say?


Having taken this capture from the server, we know the inbound communication is arriving, so it's unlikely that an intermediate firewall is causing a problem. My hunch is the server is not completing the connection handshake for some reason. A quick and dirty check is to look for TCP reset packets: the host's blunt way of refusing a connection attempt. Hosts will respond with a TCP reset when there is no application listening; the lights are on, but no one is home.


[~] # tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-rst) != 0 and port not 22'


Capture2.PNG


I took the above captures a few minutes apart, but for every TCP SYN, there is a TCP reset from the server. Whether the server is actually listening on that destination port (3333) on any interface is easily confirmed with:


[~] # netstat -na | grep 3333


If no results are returned, the service ain't running. When taking captures from a firewall, you should expect different behaviour. In 99% of cases, if a packet doesn't match a policy, it will be dropped without an acknowledgement (ACK) or reset (RST) packet.
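As an aside, newer distributions sometimes ship without netstat at all; ss from the iproute2 package answers the same question. A rough equivalent:

[~] # ss -ltn | grep 3333

As before, no output means nothing is listening on TCP 3333.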


These are not the flags you are looking for


With the tcpflags option we can pipe in additional matches. For example, we can look for all traffic where the tcp-syn and tcp-ack flags are not both set to 0.


[~] # tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-syn|tcp-ack) != 0 and port not 22'


However, everything with SYN or ACK set doesn't constitute much of a filter; you are going to be picking up a lot of traffic.


Rather than just filtering on source/destination ports, why do I care about TCP flags? I'm looking for behaviour rather than specific sources or destinations. If you are troubleshooting back-to-back LAN PCs, you wouldn't bother with all the vexillology. However, with hosts on either side of a firewall, you can't take much for granted. When firewalls and dynamic routing are involved, traffic may cross NAT zones or enter on an unexpected interface.


It’s easy to switch to source/destination filtering once you've “found” the flows you are looking for; I try and avoid making assumptions when troubleshooting.


In my next post I’ll dig into the payload of two common protocols to see what we can learn about layer 7 using only tcpdump.





It's the best of times. It's the worst of times. Well... not quite! However, it certainly is the age of technology offering immediate ROI, sky-high cost savings, and even magic that can help add to an organization's bottom line. It's also a time when new technology wreaks havoc on data delivery if it is implemented without considering the additional traffic load it adds to the network. Consider that global IP traffic is expected to increase 8-fold before the end of 2015. All of this is making it trickier to deliver data to the cloud, to a remote site, or even just out of the edge router.

 

When network engineers need to police and drop unwanted traffic, prioritize business traffic, and ensure data delivery, the answer is QoS or Quality of Service. QoS can provide preferential treatment to desired traffic within your LAN, at the network edge, and even over the WAN if the ISP respects your QoS markings. ISPs have always used QoS to support their own (preferred) services or to offer better chances of delivery at a premium. While ‘end-to-end QoS’ in its real sense (from a system in your LAN, over the WAN, peered links and multiple Autonomous Systems to an end-point sitting thousands of miles away) is challenging, it’s wise to use QoS to ensure that your data at least reaches the PE device without packet loss, jitter, and errors.

 

Alright, now comes the fun part: implementing Cisco QoS! Some network engineers and SMBs are wary of implementing QoS for fear of breaking something that already works. But fear not, here is some help for beginners to get started with Cisco QoS and its design and implementation strategies.

 

QoS Design and Implementation:

QoS design consists of 3 strategies:

  • Best Effort: Default design with no differentiation or priority for any traffic. All traffic works under the best effort.
  • IntServ: A signaling protocol such as RSVP is used to signal to routers along a path about an application or service that needs QoS. This reserves bandwidth for the application; the reservation cannot be re-allocated even when the application is not in use.
  • DiffServ: The most widely used option. Allows a user to group traffic packets into classes and provide a desired level of service.

 

The choices for QoS implementation range from traditional CLI and MQC to AutoQoS. For a beginner, the easiest would be to start with a DiffServ design strategy and use Cisco’s MQC (Modular QoS CLI) for implementation. MQC based QoS configuration involves:

  • Class-Maps: Used to match and classify your traffic into groups, say web, peer-to-peer, business-critical, or however you think it should be classified. Traffic is classified into class-maps using match statements.
  • Policy-Maps: Describes the action to be taken on the traffic classified using class-maps. Actions can be to limit the bandwidth used by a class, queue the traffic, drop it, set a QoS value, and so forth.
  • Service-Policy: The last stage is to attach the policy-map to an interface on whose traffic you wish to perform the QoS actions defined earlier. The actions can be set to act on either Ingress or Egress traffic.

MQC QoS structure.png

Now, I would like to show you a sample configuration that puts unwanted traffic and a business app in two different classes and sets their priorities using IP precedence.

 

Creating class-maps to group traffic based on user requirements:

Rtr(config)#class-map match-any unwanted

Rtr(config-cmap)#match protocol ftp

Rtr(config-cmap)#match protocol gnutella

Rtr(config-cmap)#match protocol kazaa2

 

Rtr(config)#class-map videoconf

Rtr(config-cmap)#match protocol rtp

 

Associating the class-map to a policy and defining the action to be taken:

Rtr(config)#policy-map business

Rtr(config-pmap)#class unwanted

Rtr(config-pmap-c)#set precedence 0

Rtr(config-pmap)#class videoconf

Rtr(config-pmap-c)#set precedence 5

 

Assigning the policy to an interface:

Rtr(config)#interface Se0/0

Rtr(config-if)#service-policy output business

 

QoS Validation:

The next thought after implementation should be how to make sure the QoS policies you created are working: are they dropping the traffic they are supposed to, or are the QoS policies affecting the performance of your business applications?

 

This is where Cisco's Class-Based QoS MIB, better known as CBQoS, steps in. SNMP-capable monitoring tools can collect information from the CBQoS MIB to report on the pre- and post-policy statistics for every QoS policy on a device. CBQoS reports help determine the volume of traffic dropped or queued, and confirm that the classifying and marking of traffic is working as expected.
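If you want to eyeball the raw counters yourself before pointing a monitoring tool at them, a plain snmpwalk of the CBQoS subtree works; a rough sketch, assuming SNMP is enabled on the router and that the community string, router address, and OID root (1.3.6.1.4.1.9.9.166 for ciscoCBQosMIB) match your platform:

[~] # snmpwalk -v2c -c public 192.0.2.1 1.3.6.1.4.1.9.9.166

Expect a lot of output; the pre/post-policy byte and drop counters are buried in there, which is exactly why a tool that understands the MIB is worth having.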

 

Well, that completes the basics of QoS using MQC and some implementation ideas for your network. While we talked about QoS configuration using classification and marking in this blog, there are more options, such as congestion management, congestion avoidance, and shaping, which we have not explored because they can be complex when starting out. If you have got the hang of QoS configuration using MQC, be sure to explore all the options for classifying and marking traffic before your first QoS implementation.

 

Good luck creating a better network!

 

I've held a number of different positions at companies of varying size, but one instance clearly stands out in my mind. Several years ago, I had the distinct pleasure of managing a very sharp IT team for a mid-sized call center. This was the first time I had ever officially managed a team, being traditionally a very hands-on tech guy.

 

When I began my adventures in this role, the company had rapidly grown from a small operation into an environment with hundreds of staff members handling calls each shift. That also meant the current status quo for reporting and tracking issues (sending emails to a distribution list) would have to go; it simply didn't scale, and had no way of providing the rich sets of metadata that a help desk ticket provides when handling problem resolution.

 

I was challenged with fighting a system that was very simple for the end users but mostly worthless for IT. Broaching the subject of a ticketing system was met with tales of woe and claims that it "wouldn't work" based on past attempts. I felt, however, that introducing a help desk ticketing system simply required a few core principles to be successful:

 

  1. A simple process with as much abstraction away from "technical jargon" as possible.
  2. Buy-in and participation from the top echelons of the company; if the top brass were on board, their subordinates would follow suit.
  3. An empowered IT staff that could influence the look, feel, selection, and direction of the help desk system without fear of retribution or being iced out as wrong or "stupid."
  4. And, probably most important of all, faster resolution times on issues that went through the help desk, thanks to the related metadata (tracking hardware, software, and historical trends).

These were my four ideas, and I'll share how things went in my next blog post.

 

What about your ideas? Do you have a story to share that explains how you got your team on-board a new help desk system, and if so - how did you do it? :-)

Can something exist without being perceived?


No, this blog hasn't taken a turn into the maddening world of metaphysics. Not yet.

 

I'm talking about event and performance logging, naturally. In the infrastructure racket (OK, profession; well, I think it's a vocation, but I'll get to that in a later post), we're conditioned to set up logging for all of our systems. It's usually for the following reasons:

  1. You were told to do it.
  2. SECURITY!
  3. Security told you to do it.

 

So you dutifully, begrudgingly configure your remote log hosts, or you deploy your logging agents, or you do some other manner of configuration to enable logging to satisfy a requirement. And then you go about your job. Easy.

But what about that data? Where does it go? And what happens to it once it's there? Do you use any tools to exploit that data? Or does it just consume blocks on a spinning piece of rust in your data center? Have I asked enough rhetorical questions yet?
*  *  *
The pragmatic engineer seeks to acquire knowledge of all of her or his systems, and in times of service degradation or outage, such knowledge can reduce downtime. But knowledge of a system typically requires an understanding of "normal" performance. And that understanding can only come from the analysis of collected events and performance data.

If you send your performance data, for example, to a logging system that is incapable of presenting and analyzing that data, then what's the point of logging in the first place? If you can't put that data to work, and exploit the data to make informed decisions about your infrastructure, what's the point? Why collect data if you have no intent (or capacity) to use it?


Dashboarding for Fun and Profit (but mostly for Fun)

One great way to make your data meaningful is to present it in the only way that those management-types know: dashboards. It's okay if you just rolled your eyes. The word "dashboard" was murdered by marketing in the last 10 years. And what a shame, because we all stare at a dashboard while we're driving to and from work, and we likely don't realize how powerful it is to have all of the information we need to make decisions about how we drive right in front of us. The same should be true for your dashboards at work.

So here are a few tips for you, dear readers, to guide you in the creation of meaningful dashboards:


  1. Present the data you need, not the data you want. It's easy to start throwing every metric you have available at your dashboard. And most tools will allow you to do so. You certainly won't get an error that says, "dude, lay off the metrics." But just because you can display certain metrics, doesn't mean you should. For example, CPU and memory % utilization are dashboard stalwarts. Use them whenever you need a quick sense of health for a device. But do you really need to display your disk queue length for every system on the main dashboard? No.
  2. Less is more. Be selective not only in the types of data you present, but also in the quantity of data you present. Avoid filling every pixel with a gauge or bar chart; these aren't Victorian works, and horror vacui does not apply here. When you develop a dashboard, you're crossing into the realm of information architecture and design. Build your spaces carefully.
  3. Know your audience. You'll recall that I called out the "management-types" when talking about the intended audience for your dashboards. That was intentional. Hard-nosed engineers are often content with function over form; personally, I'll take a shell with grep, sed, and awk and I can make /var/log beg for mercy. But The Suits want form over function. So make the data work for them.
  4. Think services, not servers. When you spend your 8 hours a days managing hosts and devices, you tend to think about the infrastructure as a collection of servers, switches, storage, and software. But dashboards should focus on the services that these devices, when cooperating, provide. Again, The Suits don't care if srvw2k8r2xcmlbx01 is running at 100% CPU; they care that email for the Director's office just went down.

 

Don't ignore the dashboard functionality of your monitoring solution just because you're tired of hearing your account rep say "dashboard" so many times that the word loses all meaning. When used properly, and with a little bit of work on your part, a dashboard can put all of that event and performance data to work.

 

Note: This post originally appeared at eager0.com.

 

Well hello there; returning like a bad penny, I am here to talk again about deep packet analysis. In my last series of blogs I talked about the use cases for deep packet analysis, but conspicuous by their absence were any real-world applications. This time I thought I would dust off my old-timey packet analysis skills and share some practical applications. I'll focus on troubleshooting, but no doubt we'll wander into security and performance as well.


Whilst SolarWinds has some excellent tools for network performance management, there will be occasions where they won't be available; for example, when troubleshooting remote systems without a full desktop, or with limited privileges. We'd like to think that those expensive "AcmeFoo Widget 5000" appliances use a custom-built operating system. However, instead of an OS written in assembly by virgin albino programmers, it's usually a headless Linux distribution with a fancy web GUI. As a result, the tools are pretty universal. Wireshark is the kind of tool that most administrator-types would have on their desktop. It has no end of fancy GUI knobs; click randomly for long enough and you are bound to find something noteworthy. However, when you don't have access to a desktop or can't export a complete dump, working with the available tools may be your only option. One of the most basic, and most powerful, is of course tcpdump. Available on most platforms, many vendors use it for native packet capture with a CLI or GUI wrapper.


How does Packet Capturing work?


A packet capture grabs packets from a network interface card (NIC) so they can be reviewed, either in real time or dumped to a file. The analysis is usually performed after the event in something like Wireshark. By default, all traffic entering or leaving the NICs of your host will be captured. That sounds useful, right? Well, SSH or Telnet (you fool) onto a handy Linux box and run:


[~] # tcpdump

or if you are not root:

[~] # sudo tcpdump


And BWAAA. You get a visit from the packet inception chipmunk.

 

tcpdump 1.PNG


Filling your screen is a summary of all the traffic flowing into the host (press Ctrl+C to stop this, BTW). This mostly consists of the SSH traffic from your workstation, which carries the output of the capture, which contains the SSH traffic, and so on. In the days of multicore everything, this is not so much of a problem, but on a feeble or marginal box an unfettered tcpdump can gobble cycles like some sort of Dutch personal transport recycling machine. To make what is flying past your screen readable, and to stop your CPU from getting caned, a filter is needed.
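Before we get to filters, it's also worth knowing that tcpdump can stop itself after a fixed number of packets with the -c switch, which takes the edge off a runaway capture on a really feeble box; a minimal sketch, where 100 is an arbitrary count:

[~] # tcpdump -c 100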


So, to do a capture of everything except the SSH traffic, use the following:


[~] # tcpdump not port 22

tcpdump 2.PNG

And BWAAA. Your second visit from the packet inception chipmunk (it's slightly smaller than the first; to simulate this, just turn down the volume).


This time, you can see all the traffic heading into the NIC, and there is probably more than you thought. What's not obvious is that tcpdump and other packet interception technologies operate in promiscuous mode. This is a feature of the NIC designed to assist in troubleshooting. A tonne of the network traffic arriving at the NIC is not destined for the host; to save host CPU cycles it is silently ignored and is not passed up the stack. If your NIC were connected to a hub (or an Ethernet bridge) there would be a lot of ignoring. Even on a switched network there is a lot of broadcast noise from protocols such as ARP, NetBIOS, Bonjour, uPnP, etc. Promiscuous mode picks up everything on the wire and sends it up the stack to be processed, or in our case, captured.


Troubleshooting with all this rubbish floating around is difficult, but not impossible. If you intend analysing the traffic in Wireshark, filtering after the event is easy. However, in our pretend scenario we can't export files; what we can do is turn promiscuous mode off at capture time with the --no-promiscuous-mode (or -p) option.


[~] # tcpdump -p not port 22


tcpdump 3.PNG


You'll still see a lot of traffic from ARP and the like, but it should be much cleaner. Still firmly in the realm of Marmot-family filmic tropes, the process of packet capturing actually generates a lot of traffic of its own. By default, tcpdump will try to reverse map the IPs in the traffic it collects to hostnames. You could take the not port filter a bit further and exclude your DNS servers, or the DNS protocol entirely, with:


[~] # tcpdump -p not port 22 and not port 53

or

[~] # tcpdump -p not port 22 and not host 8.8.8.8


But both of those are a bit clumsy, as whatever we are trying to fix may be DNS related. When tcpdump attempts this resolution, a lot of secondary traffic can be generated. This may alarm a firewall administrator, as the host may not normally generate outbound traffic. Furthermore, DNS is sometimes used for data exfiltration; a host suddenly generating a lot of queries (caused by an "innocent" tcpdump) could cause unnecessary panic. DNS resolution can also play tricks; it's showing you what the DNS server thinks the source or destination is, not what the packet headers actually say. The better option is to just turn the darn thing off with the -n option:

 

[~] # tcpdump -pn not port 22

 

tcpdump 4.PNG


So there we have a nice clean real-time output of what is going on, but it’s still a bit vague. Don’t worry if you don’t understand what you are actually seeing here, we will come to that.


Again, the default behaviour is not that helpful. Unless you tell it otherwise, tcpdump will pick a NIC to capture on, usually defaulting to eth0. If you are dealing with a host with a single NIC, this is a good guess. However, on a server or firewall, the traffic direction and source NIC matter.


The final option I shall mention is to specify the interface on which to capture. The command:


[~] # ifconfig

eth0      Link encap:Ethernet  HWaddr 00:08:9B:BD:CC:9F 

          inet addr:172.16.10.220  Bcast:172.16.10.255  Mask:255.255.255.0

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:15590165 errors:0 dropped:104804 overruns:0 frame:0

          TX packets:14783138 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:532

          RX bytes:1908396023 (1.7 GiB)  TX bytes:906356383 (864.3 MiB)

          Interrupt:11


lo        Link encap:Local Loopback 

          inet addr:127.0.0.1  Mask:255.0.0.0

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:28772 errors:0 dropped:0 overruns:0 frame:0

          TX packets:28772 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:14490908 (13.8 MiB)  TX bytes:14490908 (13.8 MiB)



will show you a list of configured interfaces. To tell tcpdump to capture on only one, just use the -i switch followed by the logical interface name.

 


[~] # tcpdump -i eth0 -pn not port 22


tcpdump 5.PNG


If you don't know which interface the traffic is appearing on, use the any pseudo-interface to pick them all. This disables promiscuous mode at the same time, so you don't need the -p option. However, I've found this to be less than universal; it's not supported by every OS/build of tcpdump.


[~] # tcpdump -i any -n not port 22
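If the any pseudo-interface isn't available on your build, tcpdump can at least tell you which interfaces it can see, and ip addr is the usual stand-in for ifconfig on distributions that no longer ship it; a quick sketch:

[~] # tcpdump -D

[~] # ip addr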


So, there we have it. A clean-ish packet capture which shows us what's going on with the upper-layer protocols. In my next post I'll dig a bit deeper, with some filters and advanced switches for "on the spot" troubleshooting.




This month's installment of our IT blogger spotlight series shines on Scott McDermott, who runs Mostly Networks. Scott can also be followed on Twitter, where he goes by @scottm32768.

 

Check out the Q&A to not only get to know Scott a little better, but also hear his thoughts on everything from SolarWinds Network Performance Monitor (NPM) to the impact of major trends such as IT convergence and SDN.

 

SW: So Scott, tell us about yourself.

 

SM: Well, I’m a network engineer in the public sector. I manage the networks for a small data center and 50 sites. We have public Wi-Fi at all of our locations and that has been a focus of my organization’s services for the last few years. It’s been a good excuse to dive into Wi-Fi more deeply, which I think is my favorite technology to study and work with right now. That said, I started my career as a system administrator and did that for a long time; sometimes I’m also still called on to wear my SysAdmin hat.

 

SW: How’d you get involved with IT in the first place?

 

SM: My mother has been training people to use computers for most of her career, so we always had computers in the house. The first computer we had was a TRS-80 Model 1 with a tape drive. It even had the 16KB RAM upgrade! My father is also very technical and has worked with RF and two-way radio communications systems for most of his career. I like that with my Wi-Fi work, I’m sort of combining knowledge I picked up from both parents. All my friends through school were geeks, so obviously we were always playing with computers. In college, it was natural to get a job in the computer lab. I guess it was really just a natural progression for me to end up in IT.

 

SW: So as a seasoned IT pro with a passion for tech literally flowing through your veins, what are some of the tools you can’t live without?

 

SM: I have three favorites that pop into mind right away. First is my MacBook, because I really think having a Mac as my primary platform makes me more efficient for the kind of work I’m doing. My favorite hardware is the Fluke OneTouch AT because it can do in-line packet capture with PoE. I’ve found that to be really useful for troubleshooting. It also has some nice features for testing Wi-Fi and wired connections. My current favorite bit of software is Ekahau Site Survey. I’ve been doing a lot of predictive site surveys and it’s really a pleasure to use.

 

Speaking of things that are a pleasure to use, I like the ease of use of SolarWinds NPM and we use it as our primary monitoring tool. We’ve tried a number of other specialized products for monitoring various components of our IT infrastructure, but we almost always end up adding another SolarWinds product to the underlying backbone platform. SolarWinds just does what we really need without the management overhead.

 

SW: That's fantastic! We're thrilled to have you as a fan. Diverging from IT for a moment, what about when you're not in front of a computer… what are some of your other interests?

 

SM: My wife and I are big fans of NASCAR, so following the races is one of our favorite things. We also enjoy geocaching, which often results in camping and/or hiking. The kids are sometimes less into the hiking bit, but we've found going for a geocache turns it into an adventure. It's a good excuse to get outside and away from the computers!

 

SW: I guess that brings us to Mostly Networks. How did it come about?

 

SM: I had thought about blogging for a while, but didn’t think I had anything to add. I finally started Mostly Networks after becoming involved in the network engineering community on Twitter. Many of the others there were blogging and encouraging others to do so as well. It seemed like a good way to give back to the community that I had found helpful. With that in mind, I most enjoy writing about the things I’ve been working on at the office, and the most rewarding posts are those where I solved a problem for myself and it ended up being useful to others.

 

SW: Outside of Mostly Networks, what other blogs or sites do you keep up with?


SM: Since I’ve been doing a lot of wireless work, WirelessLAN Professionals and No Strings Attached Show are a couple I follow closely. Packet Pushers is a site every network engineer should be following. I also enjoy Tom Hollingsworth’s posts at Networking Nerd.

 

SW: OK, time to put on your philosopher’s hat for our last question. What are the most significant trends you’re seeing right now and what are the implications for IT as a whole?

 

SM: The breaking down of silos due to the convergence of, well, everything is huge. The system, network and storage teams really need to start communicating. If yours haven’t already, your organization is behind. IT workers who are in their own silos need to start branching out to have at least some familiarity with the other infrastructure components. The days of being a pure specialist are going away and we will all be expected to be generalists with specialties.

 

Specifically in the networking space, SDN is picking up steam and looks to be the driver that will get networking to the same level of automation that the system teams already have. Networking has always been slow to automate, for a variety of both good and bad reasons, but automation is coming and we will all be better off for it!

We know there are many organizations out there that do no asset inventory at all, beyond maybe slapping an organizational serial number tag on a notebook and noting where it went.

          

But think about the next time you have to replace legacy systems. Say, for example, you need to replace highly inefficient power supplies that are drawing hundreds of kilowatts of power across your fleet. With new systems, you could achieve 10x the capacity while drawing only a quarter of the power. This could potentially be a self-funding project on the electrical and cooling savings alone! But not so fast, says your budget approver: without a proper inventory of those legacy systems, alongside your hardware warranty data, it’s nearly impossible to make a tangible case for the upgrade.
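
To see why the math can work out, here is a back-of-the-envelope sketch in Python. Every number in it is an assumption purely for illustration, so plug in your own inventory and utility figures.

# Back-of-the-envelope sketch of the "self-funding" claim above.
# All numbers are illustrative assumptions, not real measurements.
old_draw_kw = 400                 # assumed draw of the legacy systems
new_draw_kw = old_draw_kw / 4     # new gear pulls a quarter of the power
rate_per_kwh = 0.10               # assumed electricity rate in $/kWh
cooling_overhead = 0.5            # assumed extra cooling cost per watt of IT load

hours_per_year = 24 * 365
saved_kwh = (old_draw_kw - new_draw_kw) * hours_per_year
annual_savings = saved_kwh * rate_per_kwh * (1 + cooling_overhead)
print(f"Annual electrical + cooling savings: ${annual_savings:,.0f}")
# With these assumptions, roughly $394,200 a year to put against the new hardware.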

         

OK, so hopefully your systems already have energy-efficient power supplies, but you get the point: server and IT asset management is a must. It provides you the means to achieve complete visibility into your infrastructure inventory, helping you gain an in-depth understanding of the following (a sketch of what such a record might look like follows the list):

  • Where servers and other hardware exist
  • Where components reside
  • How they are used
  • What they cost
  • When they were added to the inventory
  • When warranties and upgrade entitlements expire
  • How they impact IT and business services
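
As a concrete illustration, here is a minimal Python sketch of the kind of record that answers those questions. The field names and the sample entry are assumptions for illustration, not the schema of any particular product.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Asset:
    asset_tag: str                  # the organizational serial number tag
    model: str
    location: str                   # where the hardware physically lives
    owner: str                      # team or business service it supports
    purchase_cost: float            # what it cost
    added_on: date                  # when it entered the inventory
    warranty_expires: date          # expiry date for warranty and upgrades
    components: list[str] = field(default_factory=list)

    def warranty_active(self, today: date | None = None) -> bool:
        return (today or date.today()) <= self.warranty_expires

# Example: flag anything whose warranty has lapsed before budgeting season.
inventory = [
    Asset("SVR-0042", "Dell R730", "DC1, rack 12", "ERP", 8200.0,
          date(2013, 6, 1), date(2016, 6, 1), ["2x PSU", "128 GB RAM"]),
]
print([a.asset_tag for a in inventory if not a.warranty_active()])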

         

Having this level of visibility into your server estate helps SysAdmins improve infrastructure efficiency and performance, complete year-end budgeting, show how existing server hardware assets yielded a strong ROI, and plan and forecast with confidence. That conversation about budgeting for the next hardware upgrade just got a lot easier.

        

But let’s say you’re one of the few organizations that has an extensive Excel worksheet capturing all the relevant information as each system is first unboxed. You’re not encountering any pain points so far, but make no mistake, they’re on the way.

        

In the beginning, it’s a cheap and straightforward way to get visibility into your inventory. But then more people are hired, or you’re implementing cloud-based services, or department ‘X’ wants to upgrade or replace its legacy business application; all of this requires additional equipment or the upgrade or replacement of existing equipment. As your IT organization grows, the spreadsheet grows into a big, hairy mess of manual tracking, taking more and more effort to maintain and falling further and further down your list of priorities.

     

So we want to know: where do you fall on the spectrum of asset management? At what point is a tool for automated asset management warranted? When does a simple spreadsheet do the trick? And at what point is that spreadsheet doomed to inefficiency?

Metrics and measurements are incredibly important to businesses these days. Every second matters, and every dollar matters. Metrics have always been a topic of discussion, debate, frustration, and conversation in the world of IT and technical support, but shifts in the worlds of technology and business are now changing what we measure and how we measure it.

For years, help desks and service desks measured themselves based on the historical method of contact: phone. Metrics such as speed to answer, average handle time, time in queue, agent utilization rate, and abandon rate have been at the top of most support center managers’ lists for years. Technology made these measures increasingly easy to collect. Automatic call distributors logged the number of calls, the wait time, the number of abandoned calls, and the time agents/analysts spent ready to pick up the phone, and produced reports to help managers calculate the operational metrics they needed to properly staff and run support centers. First call resolution (FCR; resolving a ticket on the first call, even if there was a “warm transfer”) became king of the hill. “One and done” has been spoken millions of times by thousands of support managers. As HDI’s in-house subject matter expert, I’m asked about FCR more than any other single topic.

But there’s trouble here.

First of all, phone is slowly declining as a contact method as other channels have come into play. For the HDI 2014 Support Center Practices & Salary Report (due out in October 2014), we asked about phone, chat, email, web request (tickets submitted directly by end users), autologging (tickets created without human intervention), social media, walk-up, text (SMS), mobile app, and fax. (Yes, fax! It’s still supported as a channel in more than 8% of organizations.)

This channel explosion has created puzzles for support center managers. The ACD is no longer providing enough information for staffing decisions, and many managers are scrambling to fit channels like email into the old telephone mold: “What’s the email equivalent of first call resolution?”

From my vantage point, I can see their frustration, but I also think some serious adjustment is needed. Metrics that were very useful in the past are no longer the keys to effective, efficient support.

Instead of focusing metrics on ourselves, we need to be looking at our customers and determining what is valuable to them. In another recent report, HDI found that 85% of IT departments—and 87% of support centers—are feeling pressure from their businesses to show value, not just completion of work and efficiency.

Let’s take a look at that formerly paramount metric, first call resolution, and see what it tells us about business value.

  • Most commonly, “one and done” calls relate to issues that are known
    • User calls
    • Analyst checks knowledge base
    • Analyst tells user the solution
    • Incident is resolved
  • The most common FCR call is password reset
    • 30-35% of all calls to the support center are password-related
    • Putting in a self-service password reset tool may drive these calls up because the tool doesn’t work with all passwords users need

So here’s what the support industry has been hanging its hat on: Incidents that have known solutions and password resets that are required because the IT environment is too complex. It’s not really any wonder that many organizations are trying to figure out whether they need a support center at all. (They still do—in most cases—by the way.)

Solutions:

  1. Push as much repetitive work out to self-service as possible. Provide the solutions to common issues in a good, easy-to-understand self-service system. And yes, your customers will use it, and yes, they will thank you for getting them off the phone queue.
  2. Move more technical work to the front line. This is commonly called “Shift-Left,” and it works. Use humans for problem solving and assistance, not for reading answers to end users.
  3. Start measuring things that show value to your business, such as interrupted user minutes (IUM: number of minutes of interruption X number of users affected; see the sketch after this list).
  4. Start using solutions that are as simple as possible. Software that does not integrate with your organization’s other tools (password reset, for example) does not fit your basic requirements and should not be considered.
  5. Use good knowledge management practice. Share what you know and keep it up to date. Everyone benefits.
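
For illustration, here is a minimal Python sketch of the IUM calculation from item 3. The incidents below are made-up examples, not real data.

def interrupted_user_minutes(minutes_of_interruption, users_affected):
    # IUM = number of minutes of interruption x number of users affected
    return minutes_of_interruption * users_affected

incidents = [
    ("Email outage", 45, 600),     # 45 minutes, 600 users affected
    ("VPN degradation", 120, 80),  # 120 minutes, 80 users affected
]
for name, minutes, users in incidents:
    print(f"{name}: {interrupted_user_minutes(minutes, users):,} IUM")
# Ranking by IUM (27,000 vs. 9,600 here) points at business impact rather than ticket counts.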

You can help that confused support center manager by working together to cut complexity and provide solutions to the customer: Your business.

Back when I was an on-call DBA, I got paged one morning about a database server having high CPU utilization. After I punched the guy who set up that alert, I brought it up in a team meeting: is this something we should even be reporting on, much less alerting on? Queries and other processes in our databases use CPU cycles, but as a production DBA you are frequently at the mercy of some third-party application's "interesting" coding decisions causing more CPU cycles than is optimal.

 

Some things in queries that can really hammer CPUs are (a rough sketch for spotting the worst offenders follows the list):

  • Data type conversions
  • Overuse of functions—or using them in a row by row fashion
  • Fragmented indexes or file systems
  • Out of date database statistics
  • Poor use of parallelism
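
As one way to find those offenders, here is a minimal Python sketch that pulls the top CPU-consuming statements from SQL Server's plan cache. It assumes SQL Server, the pyodbc package, and a hypothetical server name; other engines expose similar views under different names.

import pyodbc

# Hypothetical connection details; adjust driver, server, and auth to your environment.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=db01;DATABASE=master;Trusted_Connection=yes;"
)

# total_worker_time is reported in microseconds, hence the /1000 to get milliseconds.
TOP_CPU_QUERIES = """
SELECT TOP (10)
       qs.total_worker_time / 1000                        AS total_cpu_ms,
       qs.execution_count,
       qs.total_worker_time / qs.execution_count / 1000   AS avg_cpu_ms,
       SUBSTRING(st.text, 1, 200)                         AS query_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
"""

with pyodbc.connect(CONN_STR) as conn:
    for row in conn.cursor().execute(TOP_CPU_QUERIES):
        print(f"{row.total_cpu_ms:>12} ms total | {row.avg_cpu_ms:>8} ms avg | "
              f"{row.execution_count} runs | {row.query_text}")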

 

Most commercial databases are licensed by the core, so we are talking about real money here. Virtualization also gives us more options for easily changing CPU configurations, but remember that overallocating vCPUs on a virtual machine leads to less-than-optimal performance. At the same time, CPUs are a server's ultimate limiter on throughput: if your CPUs are pegged, you are not going to get any more work done.

 

The other angle to this is that since you are paying for your databases by the core, you want to actually use those cores. So there is a happy medium to find through adjusting and tuning.

Do you capture CPU usage over time? What have you done to tune queries for CPU use?
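
If you want to start capturing that history, here is a minimal Python sketch that samples server-wide CPU utilization once a minute and appends it to a CSV for trending. It assumes the psutil package is installed; the interval and output path are placeholders.

import csv
from datetime import datetime

import psutil

INTERVAL_SECONDS = 60              # sampling interval; adjust to taste
OUTPUT_FILE = "cpu_history.csv"    # hypothetical output path

with open(OUTPUT_FILE, "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        # cpu_percent(interval=...) blocks for the interval and returns the
        # average utilization across all cores for that window.
        pct = psutil.cpu_percent(interval=INTERVAL_SECONDS)
        writer.writerow([datetime.now().isoformat(timespec="seconds"), pct])
        f.flush()                  # make sure each sample hits disk promptly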


Log aggregation

Posted by mellowd Nov 2, 2014

Way back in the past, I used to view logs only after an event had happened. This was painfully slow, especially when viewing the logs of many systems at the same time.

 

Recently I've been a big fan of log aggregators. On the backend it's a standard log server, while all the new intelligence is on the front end.

 

One of the best uses of this in my experience is seeing what events have occurred and which users made changes just before. Most errors I've seen are human error: someone has either fat-fingered something or failed to take into account all the variables or effects their change could have. The aggregator can very quickly show you that x routers have OSPF flapping, and that user y made a change 5 minutes ago.
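
Here is a minimal Python sketch of that kind of correlation, assuming plain syslog text with hypothetical message and timestamp formats (adjust the patterns to match your own devices): for every OSPF flap, it lists configuration changes seen in the previous five minutes.

import re
from datetime import datetime, timedelta

# Hypothetical log formats; tune these regexes to your devices' actual syslog output.
FLAP_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<host>\S+) .*OSPF.*Neighbor Down")
CHANGE_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<host>\S+) .*Configured from .* by (?P<user>\S+)")
WINDOW = timedelta(minutes=5)

def parse_ts(raw):
    # assumes a "2014-11-02 10:15:30" style timestamp
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

changes, flaps = [], []
with open("aggregated.log") as f:          # hypothetical aggregated log file
    for line in f:
        if m := CHANGE_RE.match(line):
            changes.append((parse_ts(m["ts"]), m["host"], m["user"]))
        elif m := FLAP_RE.match(line):
            flaps.append((parse_ts(m["ts"]), m["host"]))

for flap_ts, flap_host in flaps:
    suspects = [c for c in changes if flap_ts - WINDOW <= c[0] <= flap_ts]
    if suspects:
        print(f"{flap_ts} OSPF flap on {flap_host}; changes in the prior 5 minutes:")
        for ts, host, user in suspects:
            print(f"    {ts} {user} changed {host}")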

 

What kind of intelligent systems are you using on your logs? Do you use external tools, or perhaps home-grown tools, to run through your logs, pull out the relevant information, and inform you? Or do you simply keep logs running in the background and only go through them when something goes wrong?
