
I wrote about visibility via monitoring being the first step in successful IT change management. As an IT pro’s career progresses, they will encounter many breaks and failures in their IT infrastructure. The only guarantee in IT is that something will break, and IT pros have to be able to fix it ASAP. Experience and a solid process framework, coupled with visibility, are key to successfully troubleshooting IT issues.

 

Troubleshooting is a skill that consists of two parts: root-cause analysis and taking corrective measures. In the past, troubleshooting would include:

  1. Reading the fabulous manual (RTFM)
  2. Working with wonderful vendors… post sales
  3. Patching together a keep-the-lights-on solution with putty and duct tape
  4. Or rebooting all systems and leaving it in the hands of FM – fate & magic

Fast forward to today, and troubleshooting is all about collaboration, i.e. someone has probably already run into this issue and has blogged about it or shared the knowledge on an IT community website like thwack. So troubleshooting becomes as simple as Googling it (or Binging it) – winner, winner, chicken dinner.

 

But what if you are the first to encounter a problem? Then you’ll need a framework to troubleshoot issues. If you don’t have one, here’s a template framework that you can leverage. Within that framework, root-cause analysis begins with what is happening (a real-time dashboard) and what has happened (logs). Once the problem is identified and cause and effect are understood, corrective measures can be determined, tested, verified as viable fixes, and deployed into production. Troubleshooting success is measured by the efficiency and effectiveness of the resolution.

 

In closing, troubleshooting is a constantly evolving skill for an IT pro. When you think you’ve mastered your environment, new technology always intervenes. So learn the art of troubleshooting like your career depends on it.


Let me know what you think in the comment section below. Also feel free to share your troubleshooting process or tips below.

As 2014 draws to a close, I’ve been reflecting on the year and finishing up my planning for 2015. I started thinking about how we could all better measure our alignment with business goals.

 

I struck up a conversation with Karen Lopez datachick and we both agreed that scorecards seem to be all the rage these days. We are inundated with widgets, KPIs, and dashboards. With each passing day we have access to more data, which means we need to continue building ways to interpret that data.

 

But what we don't see happening enough these days is some good old-fashioned introspection, specifically with regard to our careers as data professionals. For all of the data we can collect, we continue to overlook the pieces that are most critical.

 

Let's change that. I want to build a scorecard that we can use as a way to evaluate ourselves and our teams.

 

Looking at the traditional IT Balanced Scorecard, we find four areas defined:

  1. Corporate contribution
  2. Customer (User) orientation
  3. Operational excellence
  4. Future orientation

 

I believe we can find the right questions and measurements for a balanced scorecard for a DBA. For example, one question for future orientation might be "how often are you learning new things?"
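
To make the idea concrete, here's a minimal sketch in Python of what such a scorecard might look like as a data structure. The four areas come from the IT Balanced Scorecard above; every question and the 1-5 scoring scheme are placeholders of my own, meant only to show the shape of the thing, not a recommended set of measures.

    # A minimal, hypothetical DBA scorecard: four areas from the IT Balanced
    # Scorecard, each with example questions scored 1 (poor) to 5 (excellent).
    # The questions here are placeholders, not a recommended set.

    SCORECARD = {
        "Corporate contribution": [
            "How directly does your work map to a stated business goal?",
        ],
        "Customer (User) orientation": [
            "How satisfied are the application teams you support?",
        ],
        "Operational excellence": [
            "How often do routine tasks (backups, index maintenance) run without manual intervention?",
        ],
        "Future orientation": [
            "How often are you learning new things?",
        ],
    }

    def area_scores(responses):
        """Average the 1-5 responses per area; responses maps question -> score."""
        results = {}
        for area, questions in SCORECARD.items():
            scores = [responses[q] for q in questions if q in responses]
            results[area] = sum(scores) / len(scores) if scores else None
        return results

    if __name__ == "__main__":
        sample = {q: 3 for qs in SCORECARD.values() for q in qs}  # neutral self-review
        for area, score in area_scores(sample).items():
            print(f"{area}: {score}")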


What questions would you want to include in such a scorecard? How would you want to build out the right measures? For example, should the measure be based upon observations or perhaps a peer review?


Leave your thoughts in the comments.

While support teams religiously measure the timeliness and quality of the support they deliver, it is also important to institute a mechanism to gauge customer satisfaction and understand what impact tech support has made on your customers. Impressing and delighting customers is becoming a significant focus for all types of support organizations. Knowing where customer satisfaction stands today can help you win customer loyalty in the future. In the long term, this can contribute to business growth and make your customers recommend your services to others.

 

SolarWinds®, in conjunction with HDI®, has developed this Infographic to help you understand what technical support teams are doing to improve their customer engagements, and in the process, delight them.

 

The Five Ways:

  1. Knowing your customers & their preferences
  2. Meeting/Exceeding customer expectations
  3. Being proactive in service management
  4. WOW’ing your customers with timely resolution
  5. Showing your customers you care

 

Infographic Highlights:

  • 90% of support centers measure customer satisfaction with ticket resolution
  • 68% of organizations use service management solutions
  • 43% of organizations are resolving more than half of their desktop support tickets through remote support

 

Check Out the Full Infographic:

5_Ways_to_Impress_final.jpg

 

Learn more about delighting your customers and winning their loyalty »

Download the full Infographic here »


thwackCamp Wrap-Up

Posted by Leon Adato, Dec 18, 2014

I'll admit it: we got a late start on planning the 2014 thwackCamp experience. A host of factors - major product releases, competing schedules, and several new employees (including me!) - vied for attention, and before we knew it the calendar was running short on pages.

 

With weeks, rather than months, left to plan we wondered if anyone was interested, and if anyone would even bother to come.

 

Our concerns couldn't have been more unfounded. When we put the question "Should we have another thwackCamp?" to the community, the answer was a resounding "Please Say Yes!!"

 

And so, like geeky little elves in a Fry's workshop, we set to work. In order to create the best possible experience, we gave ourselves as much time as we could spare by setting thwackCamp 2014 right before the holidays - thus the thwackCamp Holiday thwack-tacular was born.

 

One thing was clear from the outset, a lesson driven home from thwackCamp 2013: while in-depth product reviews are nice, what the thwack community craved were discussions about solving real-world problems. Those problems might center on the best way to implement SolarWinds tools, but just as often on taking an issue IT pros face "in the field" and showing how to bring SolarWinds products to bear.

 

Between takes as we were taping episodes 17 and 18 of SolarWinds lab, we began a series of "water cooler conversations" - the usual "I remember this one time" stories that geeks who've been around the block tend to have, but with the goal of asking ourselves how SolarWinds could have helped.

 

In the "Fill Up Your IT Toybox" session, Patrick Hubbard helped me articulate one of my ongoing concerns: helping IT professionals get better and translating technology issues into management buy-in. Then Patrick Hubbard, Mav Turner, and Francois Caron more or less sat down, turned on the microphone, and had the same type of discussion they have ANY time they end up in the same room talking about ways to perform an end of year "Inventory and Security Clean-Up". Finally, my experiences being saddled with supporting SQL servers led to the "Accidental Admin" session with "SQLRockstar" Thomas LaRock.

 

After putting hours of work into each session - hours often sandwiched in between flights to conventions, hosting webinars, filming SolarWinds lab episodes, and getting up to speed on the amazing features baked into upcoming releases of our products (which my NDA forbids me from discussing. But by Grabthar's Hammer, I gotta tell you guys, some of the stuff we got in the pipeline is SUH-weeeet! If you aren't already subscribed to the "What We're Working On" thread on thwack, do it NOW!) - we knew we had some killer sessions.

 

So where we stood as thwackCamp loomed nearer was that we knew the thwack community wanted it. We knew we had sessions that would hold people's interest. But we wanted it to be an EVENT. Further, we wanted people to tell their non-thwack friends about it.

 

That's when the SolarWinds management team delivered a true holiday miracle, in the form of a $5,000 budget to spend on prizes and give-aways. That gave us all the excuse we needed to let our childlike, creative, holiday-fueled minds run wild. What would thwackCamp attendees want? What else? The same cool stuff WE'D want!! Lego gift cards, Herman Miller chairs, espresso machines, kegerators, and OMG A TETRIS DESK LAMP I MUST HAVE ONE NOW!!

 

But that's all run-up; preparation; prologue. You may be asking, how did it actually GO?

 

It went beyond anything we could have dreamed. Over 1,100 people signed up for one or more of the sessions, with several hundred attendees making time for all 3 sessions. Both during and between sessions, the chat box was spinning with ideas, jokes, questions, and comments to the tune of almost 8,000 messages during camp. At its peak, we were seeing over 10 messages per second. That's an incredible amount of interaction (even if a lot of it focused on beer, cake, and flatulence).

 

On top of that, new heroes were found - the Rob Boss video was so popular that Rob made unplanned appearances both on Vine and during the live session. Our video team - themselves heroes both behind and in front of the camera - went nuts creating spontaneous vines mid-session as well as running some of their best creative videos from the past. The term "AppFart" has now taken on a life of its own.

 

It was a free-for-all of communication, creativity, and community appreciation.

 

Our feeling at the end of the sessions - 6 hours of constant typing, sharing, advising, joking, and very fast bathroom breaks - was not "What a long day, I need a drink." It was,

 

"How soon can we start planning the next one?!"

In my last post, we discussed the various data breaches that occurred this past year, and how using firewalls, intrusion detection, anti-virus, patch management, and related technologies may not be sufficient unless they’re used with the necessary operational controls. Here are a few best practices that can help you consistently achieve compliance while saving valuable time and money.

 

Network Segmentation: The best way to manage and monitor security for the entire network is to isolate sensitive areas that handle confidential data and control access to those sections. This can be achieved with internal partitioning using firewalls and routers. The same goes for the parts of the network that deal with cardholder data: seal them off so that only authorized users have access. The administrator can confine all cardholder data to one network segment and restrict access with perimeter routers and firewalls. These network segments can then be easily monitored and audited regularly to check for compliance with established security policies. Ultimately, the scope of the audit becomes smaller, which means less effort, documentation, time, resources, and money are required to complete the audit process.

 

Network Security Basics: Eliminate fundamental network security weaknesses by ensuring the use of the right protocols and basic best practices. Some of these include:

  • Use of secure protocols, such as Secure Shell (SSH) and SNMPv3, as they come with built-in security measures. When SSH is used, all communication between the client and server systems is encrypted. Similarly, by introducing proper message security, SNMPv3 provides the confidentiality, integrity, and authentication required to perform network management operations securely.
  • Logging ACLs, which provide insight into network traffic as it traverses the network or is dropped by network devices. Additionally, ACL logs help detect anomalies in network traffic and determine whether there has been an attack.
  • Don't leave network devices with default settings and services enabled. If left that way, a probing hacker might find them and gain access to the network. Therefore, make sure to change all default passwords and logins to prevent unwarranted access.
  • Finally, regularly back up device configuration files. This ensures data is archived for disaster recovery purposes, especially for critical network devices (a short backup sketch follows this list).
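
To illustrate that last bullet, here's a hedged sketch of a scripted config backup using Python and the third-party paramiko library. The device list, credentials, and the 'show running-config' command are placeholders for a Cisco-style device; note that some network operating systems only support an interactive shell over SSH rather than a single exec command, so treat this as a starting point rather than a drop-in tool.

    import datetime
    import pathlib

    import paramiko  # third-party: pip install paramiko

    # Hypothetical device list; replace with your own inventory source.
    DEVICES = [
        {"host": "10.0.0.1", "username": "backup", "password": "changeme"},
    ]

    def backup_config(device, command="show running-config"):
        """Pull a device's running config over SSH and write it to a dated file."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(device["host"], username=device["username"],
                       password=device["password"], timeout=10)
        try:
            _, stdout, _ = client.exec_command(command)
            config = stdout.read().decode()
        finally:
            client.close()
        stamp = datetime.date.today().isoformat()
        path = pathlib.Path(f"{device['host']}_{stamp}.cfg")
        path.write_text(config)
        return path

    if __name__ == "__main__":
        for dev in DEVICES:
            print("Saved", backup_config(dev))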


Business-as-usual: Meeting PCI DSS is an ongoing process. Technical controls will eventually lose their effect as human errors occur, new vulnerabilities are discovered, and networks evolve. In order to ensure that technical controls remain effective, it’s important to implement supporting operational controls. Therefore, the following PCI DSS processes should be adopted:

  • Inventory and manage network device lifecycles, especially those of critical routers and switches. Keep IOS and firmware updated, periodically review device configurations for compliance, and create configuration baselines you can compare against to ensure that regulatory standards are met (see the drift-check sketch after this list). Maintain up-to-date device data and confirm that there are no obsolete devices that potentially open up security vulnerabilities.
  • Properly configure and test new devices prior to deployment. Have the ‘last known good configuration’ ready so that, in case of an issue, you can restore the network to its previously stable state.
  • All configuration changes must go through approval so that any missed or overlooked aspect can be corrected. Reviewing device configurations ensures the necessary policy controls are met - controls which otherwise stand the chance of being bypassed intentionally or through negligence.
  • Automate repetitive tasks to save time and increase accuracy for tasks such as bulk password changes, SNMP community string changes, VLAN changes, etc. In turn, administrators can invest time in other network management activities.
  • Have all of your device configurations stored, catalogued, and backed up. In the event of a hardware failure or a bad configuration, you will be able to recover quickly.
  • Compliance with internal and external controls should be assessed and monitored continuously. To stay protected, it’s mandatory to assess policy effectiveness and compliance with security controls regularly and frequently.
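
As a companion to the baseline point above, here's a small, hypothetical sketch of a configuration drift check in Python: it compares a backed-up config file against a list of required and forbidden lines. The example rules are mine, not PCI DSS requirements; substitute the controls your own policy actually mandates.

    # Hypothetical compliance spot-check: compare a backed-up config against a
    # baseline of required and forbidden lines. The rules below are examples
    # only; substitute your own policy's controls.

    REQUIRED_LINES = [
        "service password-encryption",
        "no ip http server",
    ]
    FORBIDDEN_SUBSTRINGS = [
        "snmp-server community public",
        "snmp-server community private",
    ]

    def check_config(path):
        text = open(path).read()
        lines = {line.strip() for line in text.splitlines()}
        missing = [r for r in REQUIRED_LINES if r not in lines]
        violations = [f for f in FORBIDDEN_SUBSTRINGS if f in text]
        return missing, violations

    if __name__ == "__main__":
        missing, violations = check_config("10.0.0.1_2014-12-18.cfg")
        for item in missing:
            print("MISSING required line:", item)
        for item in violations:
            print("FORBIDDEN entry found:", item)
        if not missing and not violations:
            print("Config matches baseline rules.")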

 

To learn more best practices for building PCI DSS compliant networks, watch this Webcast featuring Eric Hodeen of CourtesyIT and SolarWinds® Technical Product Marketing Manager, Rob Johnson.

 

Well, these are just a few tips that can be utilized to save valuable time and money. However, to achieve optimal performance, it’s a good idea to use an automated solution to help meet the aforementioned best practices. Check out this whitepaper for some recommendations to effectively manage device configurations in your network.

Whitepaper-Banner.png

My last couple of posts have focused more on the people and process side of IT Operations Management, but the right tools are also very important. Choosing the right tool for IT Operations Management might be one of the hardest things to do in IT. You have a lot of different requirements from a lot of different teams (Virtualization, Storage, Network, Apps, Business Management, etc.), which makes it very difficult to match them all to a single tool. Notice I said "right tool" and not "right tools" - we are usually looking for that one tool that serves all our needs, and this is one of the areas where I think IT Operations limits its decision making when evaluating tools. Chances are you are not going to find one tool that meets all the requirements of multiple teams.

I've seen it too many times: the virtualization team looks for an ITOM tool but wants it to report on compute, hypervisor, storage, and network. They then limit their search to a tool that can perform all those functions, even if the tool is only sub-par at each of them. They end up with a sub-par tool because they wanted "one tool to rule them all." If they had been open to selecting multiple tools to get the job done, they might have ended up with tools that each perform their functions very well.

 

One thing that is important when selecting multiple tools for IT Operations Management is that the tools provide a path to integration. I'm not saying that the tools have to integrate directly, but you need to be able to integrate them into your own process. The tools you select should have APIs or SDKs that allow you to extract needed information programmatically. This lets you feed information to other groups' tools even when there is no direct integration, and to start aggregating information so you get more of an end-to-end view from apps to infrastructure.

 

I would love to hear from others on what they find important when selecting tools and how you managed integrating multiple tools into your process.

In previous discussions about increasing the effectiveness of monitoring, it has been pointed out that having more eyes on the data will yield more insight into the problem. While this is true, part of the point of SIEM is to have more automated processes correlating the data so that the expense of additional human observation can be avoided. Still, no automated process quite measures up to the more flexible and intuitive observations of human beings. What if we looked at a hybrid approach that didn’t require hiring additional security analysts?

 

In addition to the SIEM’s analytics, what if we took the access particulars of each user in a group and published to their immediate team and management a daily summary of what (generally) was accessed, from where, when, and for how long? Such a summary could have a mechanism to anonymously flag access anomalies. The flagging system could also allow an optional comment on why this is seen as an abnormal event, e.g., "John was with me at lunch at the time and couldn’t have accessed anything from his workstation."
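
To show roughly what such a summary could look like, here's a hedged Python sketch that builds a per-user digest from access events. It assumes the events have already been exported from the SIEM as a CSV with user, timestamp, resource, and source_ip columns (and ISO 8601 timestamps, so string comparison orders them correctly); the column names and file layout are assumptions, not any particular product's format.

    import csv
    from collections import defaultdict

    # Assumed input: a CSV export with columns user, timestamp, resource,
    # source_ip, and ISO 8601 timestamps - adjust to your tool's schema.

    def daily_summary(csv_path):
        per_user = defaultdict(lambda: {"resources": set(), "sources": set(),
                                        "first": None, "last": None})
        with open(csv_path, newline="") as handle:
            for row in csv.DictReader(handle):
                entry = per_user[row["user"]]
                entry["resources"].add(row["resource"])
                entry["sources"].add(row["source_ip"])
                ts = row["timestamp"]
                entry["first"] = ts if entry["first"] is None else min(entry["first"], ts)
                entry["last"] = ts if entry["last"] is None else max(entry["last"], ts)
        return per_user

    if __name__ == "__main__":
        for user, info in sorted(daily_summary("access_events.csv").items()):
            print(f"{user}: {len(info['resources'])} resources from "
                  f"{', '.join(sorted(info['sources']))} between "
                  f"{info['first']} and {info['last']}")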

 

Would something like this make the security analysis easier by having eyes with a vested interest in the security of their work examining the summaries? Would we be revealing too much about the target systems and data? Are we assuming that there is sufficient interest on the part of the team to even bother reading such a summary?

 

Thinking more darkly, is this a step onto a slippery slope of creating an Orwellian work environment? Or… is this just one more metre down a slope we’ve been sliding down for a long time?

Before we get into my list of patch management tools: we have all used WSUS, and some of us have become proficient at SCCM, but those tools aren't in my top 3 list... they don't even crack my top 10!  However, from an enterprise point of view - an enterprise that is primarily Windows - those tools are great and they get the job done.  I want to talk about 3 tools that are easy to set up, easy to use, and provide good value to the admin team when it comes to managing updates and patches.  Administrators that have to manage patches (which is just about all of us) want an easy solution that's not going to require a ton of overhead.  I feel like SCCM is a monster when it comes to management and overhead; maybe that's not your experience.  The end result we all desire is to move away from manual patching and find a solution that will do that work for us.  My list is not by any means definitive; these are tools that I've actually interacted with in the past and that I've found to be helpful and easy to use.  Without further ado, here's my top 3 list of patch management tools (in no particular order), each with an accompanying video:

 

LANDesk

 

 

 

GFI LanGuard

 

 

 

SolarWinds Patch Manager

 

 

 

What do you think?  Am I way off?  Did I leave off any good tools that some of you are using out there?  I'd love to hear from you.

If you have been troubleshooting networks for any length of time (or hanging out on the SolarWinds thwack GeekSpeak forum), it should be obvious that packet inspection is a technique well worth learning. The depth of insight that packet capture tools like Wireshark provide is hard to overstate.

 

It's also hard to learn how to do.

 

(Although galactic props to glenkemp for walking through some implementation basics recently on GeekSpeak.)

 

But unless you have a specific use case, or there's a crisis and your boss is breathing down your neck, it's not easy to find the motivation to actually PERFORM a packet capture and analyze the results. Finding the right data sources, identifying the protocol and port, and calculating the time-to-first-byte or the TCP three-way handshake are - for all but the geekiest of the geeks - simply not things we do for kicks and giggles.
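
For the curious, here's a rough Python sketch, using the third-party scapy library, of the TCP three-way handshake calculation mentioned above. It naively grabs the first matching SYN / SYN-ACK / ACK sequence for a given server and port, so it's an illustration of the idea rather than a robust analyzer; the capture file, server IP, and port are placeholders.

    from scapy.all import rdpcap, IP, TCP  # third-party: pip install scapy

    SYN, ACK = 0x02, 0x10

    def handshake_time(pcap_path, server_ip, server_port):
        """Rough three-way handshake timing for the first matching connection."""
        syn_ts = synack_ts = None
        for pkt in rdpcap(pcap_path):
            if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
                continue
            flags = int(pkt[TCP].flags)
            if (syn_ts is None and pkt[IP].dst == server_ip
                    and pkt[TCP].dport == server_port
                    and flags & SYN and not flags & ACK):
                syn_ts = float(pkt.time)          # client SYN
            elif (syn_ts is not None and synack_ts is None
                    and pkt[IP].src == server_ip
                    and pkt[TCP].sport == server_port
                    and flags & SYN and flags & ACK):
                synack_ts = float(pkt.time)       # server SYN/ACK
            elif (synack_ts is not None
                    and pkt[IP].dst == server_ip
                    and pkt[TCP].dport == server_port
                    and flags & ACK and not flags & SYN):
                return float(pkt.time) - syn_ts   # final ACK completes handshake
        return None

    if __name__ == "__main__":
        delta = handshake_time("capture.pcap", "192.0.2.10", 80)
        if delta is None:
            print("No complete handshake found for that server/port.")
        else:
            print(f"Three-way handshake took {delta:.4f} seconds")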

 

That was the driving motivation for us to include Deep Packet Inspection in our latest version of SolarWinds NPM. But even though NPM 11 does a lot of the heavy lifting, as an IT Pro you still need to know what you are looking at and why you would look there versus any of the other amazing data displays in the tool.

 

Which is why we have created a FREE email course on Deep Packet Inspection for Quality of Experience monitoring. Detailed lessons explain not only HOW to perform various tasks, but WHY you should do them. Lessons are self-contained - no cliffhangers or "please sign up for our next course to find out more" - and broken down into manageable chunks so you aren't overwhelmed. Finally, delivery to your inbox means you don't have to remember to go to some website and open a course, and you can work on each lesson at your own pace and on your own schedule.

 

Monitoring tools have advanced past ping and simple SNMP and can now perform packet inspection. Shouldn't you?

 

Use this link to find out more and sign up: DPI Online Course & Study Group

IT admins are constantly being challenged to do more with less in their production environments. An example is performance optimization of an n-tier virtualized application stack while minimizing cost. IT admins need to deliver the best possible quality of service (QoS) and return on investment (ROI) with limited time and resources.

 

Plan on Optimal Performance: A 6-Step Framework

It starts with a plan that needs to produce consistent and repeatable results while maintaining cost-efficiency at scale. One approach is to:

    1. Establish a baseline measurement for the application’s performance with accompanying performance data.
    2. Monitor and log any changes and key performance counters over time.
    3. Define data-driven criteria for good as well as bad, worse, and critical.
    4. Create alerts for bad, worse, and critical.
    5. Integrate a feedback loop with fixes for the degraded states.
    6. Repeat steps 1-5.

This 6-step framework provides a disciplined methodology to troubleshoot performance bottlenecks like disk contention and noisy-neighbors.
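
Here's a small, illustrative Python sketch of steps 3 and 4: deriving "bad", "worse", and "critical" thresholds from a baseline and classifying a new measurement against them. The multipliers and the sample latency numbers are arbitrary placeholders; the real criteria should come from your own baseline data and application tolerances.

    import statistics

    # Illustrative only: classify a current measurement against a baseline using
    # arbitrary multipliers for "bad", "worse", and "critical".

    def classify(current, baseline_samples, bad=1.5, worse=2.0, critical=3.0):
        baseline = statistics.mean(baseline_samples)
        ratio = current / baseline if baseline else float("inf")
        if ratio >= critical:
            return "critical"
        if ratio >= worse:
            return "worse"
        if ratio >= bad:
            return "bad"
        return "good"

    if __name__ == "__main__":
        disk_latency_ms_baseline = [8, 9, 10, 11, 9]   # collected during step 1
        print(classify(24, disk_latency_ms_baseline))  # -> "worse" (24 / 9.4 is about 2.6)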


Back to Good From Contention and Noisy-Neighbors

Virtualization performance issues usually involve the storage infrastructure and announce their presence in the form of degraded application performance. Storage performance issues could stem from disk contention and noisy-neighbor virtual machines (VMs). Disk contention occurs when multiple VMs try to simultaneously access the same disk, leading to high response times and potential application timeouts. Noisy-neighbor VMs, on the other hand, periodically monopolize storage IO resources to the detriment of the other VMs on the shared storage.

 

Leveraging the framework, any application slowness or abnormality should generate an alert based on triggers like high disk latency (average seconds per read/write), high application response times (milliseconds), low disk throughput (bytes per second), and high CPU and memory utilization. Next, the environment should be examined from a current top-down view of the resources and applications so that the degraded state can be compared to the known good state. Afterward, drill down on the specific application or storage subsystem.

 

If the bottleneck is disk contention, there will be high IOPS and high response times on a disk. If the bottleneck is from noisy-neighbor VMs, those VMs will show high IO metrics (IOPS, bandwidth) while other VMs on the shared storage will be starved with low IO metrics. Once the issue is identified, countermeasures and preventative measures can be taken.


Three Tips for Contention and Noisy-Neighbors

Tip #1: As a general rule of thumb, RAID 5 can sustain 150 IOPS per spindle in its group and RAID 10 can sustain 200 IOPS per spindle in its group. So distribute all of the VMs’ IOPS across the RAID groups according to these rules to avoid disk contention.
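
A quick worked example of Tip #1, using the rule-of-thumb numbers above. This deliberately ignores read/write mix and RAID write penalties, which a real sizing exercise would factor in; the 2,400 IOPS demand figure is just an illustration.

    import math

    # Rule-of-thumb per-spindle IOPS from Tip #1; real sizing should also account
    # for read/write mix and RAID write penalties.
    IOPS_PER_SPINDLE = {"RAID5": 150, "RAID10": 200}

    def spindles_needed(total_vm_iops, raid_level):
        return math.ceil(total_vm_iops / IOPS_PER_SPINDLE[raid_level])

    if __name__ == "__main__":
        demand = 2400  # sum of the steady-state IOPS of all VMs on the datastore
        print("RAID 5 :", spindles_needed(demand, "RAID5"), "spindles")   # 16
        print("RAID 10:", spindles_needed(demand, "RAID10"), "spindles")  # 12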

Tip #2: If the disk contention is occurring on a VMFS datastore, an IT admin can adjust the disk shares for all VMs accessing the datastore from the same ESXi host and/or move some of the VMs from the VMFS datastore to another datastore on a different LUN to resolve the contention.

Tip #3: To address noisy-neighbor VMs, IOPS or bandwidth restrictions can be applied to the VMs via features like VMware’s SDRS and SIOC, or Storage Quality of Service for Hyper-V.


Closing

A plan in hand coupled with tools like SolarWinds Server & Application Monitor and Virtualization Manager provides an IT admin a complete solution to manage, optimize, and troubleshoot virtualized application performance through its entire lifecycle. Let me know what you think in the Comments section. Plus, join the Community conversations on thwack.

This past year, American retailers have seen their fair share of data breaches - Target®, Michaels®, Neiman Marcus®, Goodwill®, and Home Depot®, to name a few. What stands out is that there have been so many breaches in such a short period of time!

 

Such incidents not only put a serious dent in a company’s revenue, but also in its reputation. It is obvious that there has been an increase in cybercriminal activity and that these lawbreakers are becoming more and more aggressive. So, with the holiday season underway, are you worried that more data breaches may occur, and are companies prepared?

 

This blog will focus on data breaches, PCI DSS, the challenges retailers face with compliance, and finally some useful tips to achieve PCI compliance.

 

Data Breach – What possibly could have gone wrong?

Some vendors have flawed or outdated security systems that allow customer information to be stolen. Often, little attention is paid to ensuring that all devices are updated and patched. Moreover, administrators have limited means to monitor for suspicious behavior and fail to take the necessary steps to check for existing security holes by performing regular vulnerability scans. To top it off, there may be minimal to no documentation of network changes, or simply poor communication between the various IT departments.

 

PCI DSS

To help companies that deal with financial information protect their customer data, the Payment Card Industry Data Security Standard (PCI DSS) defines a set of security controls for processing, storing, or transmitting credit card information in a secure environment. To help network administrators maintain such a network, PCI DSS v3.0 broadly defines the following controls, specifically as they apply to network routers and switches:

  1. Build and maintain a secure network and systems
  2. Protect cardholder data
  3. Maintain a vulnerability management program
  4. Implement strong access control measures
  5. Regularly monitor and test networks
  6. Maintain an information security policy


Challenge with Compliance

PCI DSS defines security-specific objectives, but doesn’t lay down specific security controls or a method for these controls to be implemented. Simply using firewalls, intrusion detection, anti-virus, patch management, and related technologies may not be sufficient unless they are used with necessary operational controls specified by PCI DSS policies. Provided below are a few challenges administrators face when trying to implement and maintain compliance:

  • Uncertainty about what’s on the network
  • Insufficient mechanisms for vulnerability assessment and immediate remediation
  • Absence of compliance reporting and continuous monitoring
  • Not just implementing PCI DSS 3.0, but also continuously maintaining compliance


Best Practices to Achieve PCI Compliance

PCI DSS objectives are satisfied when firewalls, intrusion detection, anti-virus, and patch management are used together with the necessary operational controls. Here are a few best practices to achieve compliance while saving valuable time:

  • Segment specific parts of your network to define controls and protection where sensitive data resides
  • Ensure the use of right protocols and security best practices to plug possible network vulnerabilities
  • Implement and follow supporting operational controls like device inventory management, configuration change approvals, regular backups, automation of tasks in addition to compliance with internal and external standards

These practices become even more important in networks where hundreds of multi-vendor devices and device types operate across many locations. The overall time, effort, and cost involved in achieving compliance are high, but the cost of being non-compliant cannot be ignored. Stay tuned for my next post, where I will provide a detailed list of tips to achieve PCI compliance while saving time and money.

It goes without saying that patching and updating your systems is a necessity.  No one wants to deal with the aftermath of a security breach because you forgot to manually patch your servers over the weekend, or your SCCM/WSUS/YUM solution wasn't configured correctly.  So how do you craft a solid plan of attack for patching?  There are many different ways you can approach patching - in previous posts I talked about what you are patching and how to patch Linux systems - but we need to discuss creating a strategic plan to ensure patch and update management don't let you down.  What I've done is lay out a step-by-step process in which you will learn how to create a Patching Plan of Attack, or PPoA (not really an acronym, but it looks like one).

 

Step 1: Do you even know what needs to be patched?

The first step in our PPoA is to do an assessment or inventory to see what is out there in your environment that needs to be patched: servers, networking gear, firewalls, desktop systems, etc.  If you don't know what's out there in your environment, how can you be confident in creating a PPoA??  You can't!  For some this might be easy due to the smaller size of their environment, but for others who work in a large enterprise with hundreds of devices it can get tricky.  Thankfully, tools like SolarWinds LAN Surveyor and SNMP v3 can help you map out your network and see what's out there.  Hopefully you are already doing regular datacenter health checks, where you actually set your Cheetos and Mt. Dew aside, get out of your chair, and walk to the actual datacenter (please clean the orange dust off your fingers first!).
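
If you have nothing else to start from, even a crude ping sweep gives you a first-pass inventory to check against what you think you have. Here's a hedged Python sketch; the subnet is a placeholder, the ping flags assume Linux/macOS syntax (Windows uses -n and -w), and anything that blocks ICMP will be missed - which is exactly why purpose-built discovery tools like the ones mentioned above do a better job.

    import ipaddress
    import subprocess

    # Crude starting point only: a ping sweep misses devices that block ICMP.
    # The subnet is a placeholder; the ping flags assume Linux/macOS.

    def sweep(subnet="192.168.1.0/24"):
        alive = []
        for host in ipaddress.ip_network(subnet).hosts():
            result = subprocess.run(["ping", "-c", "1", "-W", "1", str(host)],
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
            if result.returncode == 0:
                alive.append(str(host))
        return alive

    if __name__ == "__main__":
        for host in sweep():
            print(host)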

 

Step 2:  Being like everyone else is sometimes easier!

How many flavors of Linux are in your environment?  How many different versions are you supporting?  Do you have Win7, XP, and Win8 all in your environment?  It can get tricky if you have a bunch of different operating systems out there, and even trickier if they are all at different service pack levels.  Keep everything the same; if everything is the same, you'll have an easier time putting together your PPoA and streamlining the process of patching.  Patching is mind-numbing and painful enough - you don't want to add complexity to it if you can avoid it.

 

Step 3:  Beep, beep, beep.... Back it up!  Please!

Before you even think about applying any patches, your PPoA must include a process for backing up all of your systems prior to and after patching.  The last thing anyone wants is an RGE (resume-generating event) on their hands!  We shouldn't even have to talk about this; if you aren't backing up your systems, run and hide and don't tell anyone else (I'll keep your secret).  If you don't have the storage space to back up your systems, find it.  If you are already backing up your systems, good for you - here's a virtual pat on the back!

 

Step 4:  Assess, Mitigate, Allow

I'm sure I've got you all out there reading this super excited and jonesing to go out and patch away.  Calm down - I know it's exciting, but let me ask you a question first.  Do you need to apply every patch that comes out?  Are all of your systems "mission critical"?  Before applying patches and creating an elaborate PPoA, do a risk assessment to see if you really need to patch everything that you have.  The overhead that comes with patching can get out of hand if you apply every available patch to every system you have.  For some (e.g., federal environments), you have to apply them all, but for others it might not be necessary.  Can you mitigate the risk before patching it?  Are there things you can do ahead of time to reduce the risk or exposure of a certain system or group of systems?  Finally, what kinds of risks are you going to allow in your environment?  These are all aspects of good risk management that you can apply to your planning.

 

Step 5:  Patch away!

Now that you have your PPoA and you are ready to get patching, go for it.  If you have a good plan of attack and you feel confident that everything has been backed up and all risks have been assessed and mitigated, then have at it.  Occasionally you are going to run into a patch that your systems aren't going to like, and they will stop working.  Hopefully you've backed up your systems - or better yet, you are working with VMs and can revert to an earlier snapshot.  Keep these 5 steps in mind when building out your PPoA so you can feel confident tackling probably the most annoying task in all of IT.

The technology industry is changing at a rapid pace, but let me put that into perspective. It took radio 38 years to reach 50 million users; it took television 13 years, the Internet 4 years, Facebook 1,096 days, and Google+ only 88 days. So what does this rapid change in how business is done mean for us in the IT industry? It means our skill sets are going to need to change rapidly as well. It can no longer be "I only do storage, so that's not my problem" - there has to be a fundamental shift in how we move forward in the IT industry. We have to be in tune with others outside our group and be more aligned with the goals of the business. I've felt in some of my previous roles that the infrastructure and operations teams gained little visibility into the goals of the business. This is not to say that we don't want to know, or that we have no clue what they are, but a lot of context is lost by the time it makes it to these teams. For example, you may have management ask you to go research this or go do that, but with little context around how it affects the business or other teams.

 

So how do we change this? Well, part of it is a culture change, but a big part is changing some of our skills. Below I've listed a few skills I think are important to IT Operations Management.

 

  • Understands business goals
  • Problem-solving skills oriented toward organizational challenges
  • Effective communicator across teams
  • Holistic view across various technology silos

 

I would love to hear what others think about the skills needed for this role and how you've changed some of your skill set in this rapidly changing IT industry.

CK.jpg

A quick Google image search of Christopher Kusek will show you the most important thing you need to know about him: He’s yet to find a set of fake ears he doesn’t look great in! Of course, you might also be able to discern a few other arguably less impressive things about him: CISSP, vExpert, VCP, BDA, EMCTA, EMCCA, CCNP, CCNA, MCSE, MCTS, MCITP, NCIE, NCSA, NACE, SCP.

 

He’s also the proud owner of PKGuild.com, which is why he’s the focus of this month’s IT blogger spotlight. And in case you’re one of the two people left on the Internet not yet following him on Twitter, where he goes by @cxi, you should be!

 

Read on to learn a little more about Christopher, including his affinity for unique headwear, what it’s like to be an IT pro in the middle of Afghanistan and his thoughts on the most significant trends in IT right now, including SDS, SDN and more.

 

Also, if you have your own questions for Christopher, feel free to leave a comment.

 

SW: OK, so I’ve got to ask, what’s up with the fake ears?

 

CK: There’s a whole other backstory to that, but this section would end up being far too long if I were to tell it, so I’ll stick to the one that most closely relates to the images you see in the aforementioned Google image search. It all began in May 2011 when I was hosting a party in Las Vegas for EMC World. The invitation page for the event had a section requesting a “Logo Image.” I sat there thinking to myself, “Doh, what would be a good image?” So, I scoured through my hard drive and found some pictures I thought were just ridiculous enough. They happened to be of me wearing cat ears. You see, when I was writing my first VMware book back in 2011, I would go to a Starbucks on North Avenue in downtown Chicago and just sit there pegging away at the pages and chapters. I thought, what better way to both get my work done and bring joy to people who would pass through over the course of the hours I’d sit working there than by wearing some cat ears. I mean, everyone loves cats, right?! Eventually the whole idea evolved into brainwave controlled cat ears.

 

SW: Brainwave controlled cat ears, huh? I don’t think there’s enough space here to cover such an in-depth and socially important topic. So, let’s talk about PKGuild. What are you writing about over there?

 

CK: For the most part I tend to write about things that I’m either really passionate about—something that solves a problem, is just absolutely awesome or will benefit other people. It turns out that a lot of the time that tends to fall into the realms of virtualization, storage, security, cloud, certification, education and things that broach realms of Innovation. I’m not limited exclusively to writing about those subjects, but a majority of the stuff I write about tends to cross those spectrums.

 

SW: OK, shifting gears again—outside of the blog, how do you pay the bills?

 

CK: I just recently returned from a two year stint in Afghanistan and am now in a new role as CTO at Xiologix, an IT solutions provider headquartered in Portland. I’m responsible for the technical direction and engineering of the business and for helping customers solve complex technology and IT problems.

 

SW: What was the two year stint in Afghanistan all about?

 

CK: I was the senior technical director for datacenter, storage and virtualization for Combined Joint Operations-Afghanistan. Honestly—and this is something covered at length in various blog posts I’ve written—it was a unique opportunity to do something I enjoy and that I do very well as a way of serving my country. While I may not be able to pick up a gun or run down a group of insurgents, I was able to build some of the most comprehensive, resilient and versatile networks in the world, and help lead others to achieve those same results.

 

SW: So, what was it like being an IT pro in the middle of a warzone?

 

CK: The first thought that comes to mind is, “It sucked!” Because, quite frankly, it did. I mean the living accommodations and food were horrible, there was risk at every avenue and the chances for you to be hurt, maimed or worse were all very real. But let’s consider the facts, I didn’t go there because I was expecting there to be good food, living quarters or for it to be relatively safe. Once you get past all that and realize you have a job to do, it was pretty much just that. Go in and try to make everything and anything you do better than you found it, and better for the person who comes after you. I found lots of decisions were made on 6, 9 or 12 month “plans,” as in someone rotated in and would be there for a certain duration and would “do stuff,” whether right or wrong, and then rotate out. This was true whether it came to licensing software, attempting to procure something to solve a problem or maintaining operational environments for an enduring environment that had been there for 10 years prior to them and would continue to be there long after they were gone. This differs greatly from how corporations or nearly any other mission critical environment is run.

 

SW: Based on your impressive collection of certifications, which includes SolarWinds Certified Professional, I‘m guessing this whole IT gig isn’t new to you.

 

CK: Not exactly, no. I’ve been working in IT for over 20 years. Back in the early 1990s, I was a security researcher. During that time, I would also build and simulate large corporate networks—yes, for fun…and to assist friends who worked at consulting companies. After I returned from a memory buying trip to Japan in 1996 to support my home labs, I decided to get a job at a consultancy in the Chicagoland area, where I went on to work for 13 years before moving onto the vendor life at NetApp and EMC.

 

SW: OK, so when you’re not working or blogging—or keeping our armed forces digital backbone up and running—what are some of your hobbies?

 

CK: When I’m not working or blogging, I’m usually working or blogging! But seriously, I enjoy reading. I even write a book on occasion. I spend a lot of time with my family, and as a vegan foodie I also enjoy discovering new food options. I also enjoy the occasional trip to Las Vegas because I love applying the principle of chaos theory to games of chance. Being that I now live in the Pacific Northwest, I also look forward to the opportunity to get out and explore nature. Finally, I really enjoy getting out there in the community, working with others and helping them grow themselves and their careers, whether that be through mentorship, presenting at conferences and user groups or other kinds of involvement and engagement.

 

SW: OK, for my last question, I want you to really put your thinking cap on—what are the most significant trends you’re seeing in the IT industry?

 

CK: With the maturity and wide scale adoption of virtualization, there are related changes happening in the IT landscape that we’re only beginning to realize the benefits of. This includes software defined storage and software defined networking. SDS and SDN provide such potential benefits that the market hasn’t been ready for them up until this point, but eventually we’ll get there. Cloud is another, though the term is so often repeated, it really isn’t worth talking about outside of the further extension of internal datacenters into public-side datacenters with hybrid cloud services. Lastly, the further commoditization of flash Storage, which is driving prices down significantly, is increasingly making “speeds and feeds” a problem of the past; in turn making the value of data far exceed the speed of data access on disk.

IP address conflicts are usually temporary, but you can’t always expect them to resolve themselves. In my previous blog, we looked at the various causes of IP conflicts and the difficulties administrators face when determining the source of a network issue and whether it’s actually an IP conflict. In this post, I would like to walk through troubleshooting IP conflicts and the fastest methods of resolution to minimize network downtime.

 

So what happens when you see that blatant message staring at you from the screen: “There is an IP address conflict with another system on the network”? Network administrators typically want to know, as quickly as possible, what system owns that address and where it is located. A relatively easy way to find the MAC address of an IP address within the same network or subnet is to ping the IP address and then immediately inspect the local ARP table. If you use a Windows PC, the following steps will guide you through this search:

 

  • Click on the Windows ‘Start’ menu, type ‘cmd,’ and press ‘Enter’ to open the command prompt
  • At the command prompt, ping the IP address you want to locate to check that it is reachable
    • For example, ping xxx.xxx.xx.xx. If the ping is successful, you should see a reply from the remote device. If the ping request doesn’t reach the host, you won’t be able to proceed with the next step
  • Now, at the command prompt, type arp –a. The command returns a table listing the IP addresses your PC has recently contacted. Locate the IP address you’re looking for in this table, and the corresponding column will show you its MAC address (the short script after this list automates the ping-and-ARP lookup)
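
Here's a small Python sketch that automates the ping-and-ARP steps above on a Windows PC. The target IP is a placeholder, and the parsing of arp -a output is best-effort, since its formatting varies by OS version and locale.

    import re
    import subprocess

    def find_mac(ip):
        """Ping an IPv4 address, then look up its MAC in the local ARP cache.

        Assumes a Windows host ("ping -n", "arp -a"); parsing is best-effort."""
        subprocess.run(["ping", "-n", "1", ip], stdout=subprocess.DEVNULL)
        arp_output = subprocess.run(["arp", "-a"], capture_output=True,
                                    text=True).stdout
        for line in arp_output.splitlines():
            if line.strip().startswith(ip + " "):
                match = re.search(r"([0-9a-fA-F]{2}[-:]){5}[0-9a-fA-F]{2}", line)
                if match:
                    return match.group(0)
        return None

    if __name__ == "__main__":
        print(find_mac("192.168.1.25") or "No ARP entry found - check reachability.")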

 

This method for finding the MAC address with ‘ping’ and ‘arp’ typically works. However, if it does not, then you will have to take more time and effort to locate the offending MAC address. If you do not find what you are looking for on the first attempt, you will need to repeat this process on all routers until you find the offending IP and MAC address. Once you are successful in locating the MAC address, you need to find the switch and switch port that the offending IP address/device is connected to. Knowing this will help you to disconnect the device from the network. The following steps help locate the MAC addresses connected to a switch.

 

  • Issue this command on each switch in your network: ‘show mac-address-table’ (this is for Cisco IOS or compatible switches).
  • The command returns a list of MAC addresses associated with each active switch port. Check whether this table contains the MAC address that you are looking for.
  • If you find the MAC address, then immediately consider creating new ACL rules or temporarily blocking the MAC address. In critical cases, you might want to shut down the switch port and physically disconnect the offending device from the network.
  • If you do not find the MAC address, repeat the command on the next switch until you find your device (a small script after this list automates the search loop).
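
And here's a hedged sketch of the switch-by-switch search, using Python and the third-party netmiko library against Cisco IOS-style switches. The switch list, credentials, and exact command syntax ('show mac address-table' vs. 'show mac-address-table') vary by platform and IOS version, and the MAC address format on Cisco gear (xxxx.xxxx.xxxx) differs from what Windows arp shows, so you may need to convert it first.

    from netmiko import ConnectHandler  # third-party: pip install netmiko

    # Placeholder switch inventory and credentials - replace with your own.
    SWITCHES = ["10.0.0.2", "10.0.0.3"]
    CREDENTIALS = {"device_type": "cisco_ios", "username": "admin",
                   "password": "changeme"}

    def locate_mac(mac):
        """Return (switch, matching line) for the first switch whose MAC table
        contains the address, or None. Note the Cisco-style MAC format."""
        for switch in SWITCHES:
            conn = ConnectHandler(host=switch, **CREDENTIALS)
            try:
                output = conn.send_command("show mac address-table")
            finally:
                conn.disconnect()
            for line in output.splitlines():
                if mac.lower() in line.lower():
                    return switch, line.strip()
        return None

    if __name__ == "__main__":
        hit = locate_mac("001a.2b3c.4d5e")
        print(hit if hit else "MAC not found on any listed switch.")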

 

While these procedures help you locate a device on the network, they can be very time-consuming, and they require technical expertise as well as login access to network switches and routers.

There are two factors that complicate the effort of locating a device on the network: network complexity and the availability of historical data. The above technique relies heavily on ARP caches. Unfortunately, these caches are cleared from time to time, and if that data is not available, it is impossible to determine the location of a system. During a crisis, you want a system that can help you locate issues quickly and easily. Being alerted about an IP conflict before users start complaining or a critical application goes down is important to network reliability. Being able to quickly search for a device by its IP address or MAC address and locate it on the network reduces the time and effort involved in troubleshooting and eliminating issues caused by IP conflicts.

 

Today, many IT solutions are available that aid in the effective monitoring and resolution of problems like IP conflicts. These are much faster than manually searching for offending devices. Such solutions should offer the ability to:

 

  • Constantly monitor the network for IP Conflicts by setting up alert mechanisms
  • Quickly search, identify, and verify details of the offending device
  • Locate the offending device and immediately issue remediation measures to prevent further problems

 

So, what method do you find to be the most effective for troubleshooting IP conflicts? If it’s an automated solution, which do you use?

Let's talk about patching for our good friend Tux the Linux penguin (if you don't know about Tux, click here).  How many of us out there work in a Linux-heavy environment?  In the past it might have been a much smaller number; however, with the emergence of virtualization and the ability to run Linux and Windows VMs on the same hardware, supporting both OS platforms has become a common occurrence.  Today I thought we'd talk about patching techniques and methods specifically related to Linux systems.  Below I've compiled a list of the 3 most common methods I've used for patching Linux systems.  After reading the list you may have a dozen methods that are more successful and easier to use than the ones I've listed here - I encourage you to share your list with the forum so we get the best possible coverage of methods for patching Linux systems.

 

Open Source Patching Tools

There are a few good open source tools out there for patching your Linux systems.  One tool that I've tested in the past is called Spacewalk.  Spacewalk is used to patch systems that are derivatives of Red Hat, such as Fedora and CentOS.  Most federal government Linux systems are running Red Hat Enterprise Linux; in that case, you would be better off utilizing the Red Hat Satellite suite of tools to manage patches and updates for your Red Hat systems.  In the case where your government or commercial client allows Fedora/CentOS as well as open source tools for managing updates, Spacewalk is a viable option.  For a decent tutorial and article on Spacewalk and its capabilities, click here.

 

 

YUMmy for my tummy!

No, this has nothing to do with Cheetos - everybody calm down.  Configuring a YUM repository is another good method for managing patches in a Linux environment.  If you have the space - or even if you don't, you should make the space - configure a YUM repository.  Once you have the repository created, you can build some of your own scripts to pull down patches and apply them on demand or on a configured schedule.  It's easy to set up a YUM repository, especially when utilizing the createrepo tool.  For a great tutorial on setting up a YUM repository, check out this video.
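
As a flavor of the kind of script mentioned above, here's a hedged Python sketch that checks a host for pending updates with yum and optionally applies them. It assumes it runs as root on the Linux host itself and relies on yum check-update's documented exit codes (100 when updates are available); output parsing is best-effort, and scheduling and reporting are left to you.

    import subprocess
    import sys

    def pending_updates():
        """yum check-update exits 0 (none), 100 (updates available), 1 (error)."""
        result = subprocess.run(["yum", "-q", "check-update"],
                                capture_output=True, text=True)
        if result.returncode == 100:
            # Best-effort parse: first column of each non-blank line is the package.
            return [line.split()[0] for line in result.stdout.splitlines()
                    if line.strip()]
        if result.returncode == 0:
            return []
        raise RuntimeError(result.stderr)

    def apply_updates():
        subprocess.run(["yum", "-y", "update"], check=True)

    if __name__ == "__main__":
        packages = pending_updates()
        print(f"{len(packages)} package(s) have pending updates")
        if packages and "--apply" in sys.argv:
            apply_updates()  # run from cron for a "configured schedule"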

 

 

Manual Patching from Vendor Sites

Obviously, the last method I'm going to talk about is manual patching.  For the record, I abhor manual patching; it's a long process and it can become quite tedious if you have a large environment.  I will preface this section by stating that if you can test a scripted/automated process for patching and it's successful enough to deploy, then please, by all means, go that route.  If you simply don't have the time or aptitude for scripting, then manual patching it is.  The most important thing to remember when you are downloading patches from an FTP site is to ensure that it's a trustworthy site.  With Red Hat and SUSE, you're going to get their trusted and secured FTP site to download your patches; however, with other distros of Linux such as Ubuntu (Debian-based) or CentOS, you're going to have to find a trustworthy mirror site that won't introduce a Trojan to your network.  The major drawback with manual patching is security - unfortunately, there are a ton of bad sites out there that will help you introduce malware into your systems and corrupt your network.  Be careful!



That's all, folks!  Does any of this seem familiar to you?  What do you use to patch your Linux systems?  If you've set up an elaborate YUM or apt-get repository, please share the love!


tux.jpg Tux out!!

When implementing a SIEM infrastructure, we’re very careful to inventory all of the possible vectors of attack for our critical systems, but how carefully do we consider the SIEM itself and its logging mechanisms in that list?

 

For routine intrusions, this isn’t really a consideration. The average individual doesn’t consider the possibility of being watched unless there is physical evidence (security cameras, &c) to remind them, so few steps are taken to hide their activities… if any.

 

For more serious efforts, someone wearing a black hat is going to do their homework and attempt to mitigate any mechanisms that will provide evidence of their activities. This can range from simple things like…

 

  • adding a static route on the monitored system to direct log aggregator traffic to a null destination
  • adding an outbound filter on the monitored system or access switch that blocks syslog and SNMP traffic

 

… to more advanced mechanisms like …

 

  • installing a filtering tap to block or filter syslog, SNMP and related traffic
  • filtering syslog messages to hide specific activity

 

Admittedly, these things require administrator-level or physical access to the systems in question, which is likely to trigger an event in the first place, but we also can’t dismiss the idea that some of the most significant security threats originate internally. I also look back to my first post about logging sources and wonder if devices like L2 access switches are being considered as potential vectors. They're not in the routing path, but they can certainly have ACLs applied to them.
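
One compensating control worth considering, sketched below in Python: watch the log pipeline itself and flag any source that goes quiet for longer than expected. The 'last seen' table here is a made-up stand-in for whatever your collector actually tracks, and an hour is an arbitrary window; a chatty device that suddenly falls silent is exactly the symptom the attacks listed above would produce.

    import time

    # Hypothetical "last seen" table, e.g. maintained by your syslog collector;
    # values are UNIX timestamps of each source's most recent message.
    LAST_SEEN = {
        "core-sw-01": time.time() - 120,
        "edge-fw-01": time.time() - 5400,
    }

    def silent_sources(last_seen, max_quiet_seconds=3600):
        """Return sources that have not logged anything within the window.
        A chatty device going silent can mean a dead link - or a filter
        someone added on purpose."""
        now = time.time()
        return [src for src, ts in last_seen.items()
                if now - ts > max_quiet_seconds]

    if __name__ == "__main__":
        for source in silent_sources(LAST_SEEN):
            print("WARNING: no log traffic from", source, "in over an hour")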

 

I don’t wear a black hat, and I’m certain that the things I can think of are only scratching the surface of possible internal attacks on the SIEM infrastructure.

 

So, before I keep following this train of thought and start wearing a tin foil hat, let me ask these questions:

 

Are we adequately securing and monitoring the security system and supporting infrastructure?

If so, what steps are we taking to do so?

How far do we take this?

One of the most difficult things storage admins face on a day-to-day basis is that "it's always the storage's fault." You have virtualization admins constantly calling to say there's something wrong with the storage, and you have application owners telling you their apps are slow because of the storage. It's a never-ending fight to prove that it's not a storage issue, which leads to a lot of wasted time in the work week.

 

Why is it always a storage issue? Could it possibly be an application or compute issue? Absolutely, but the reason these teams start pointing fingers is that they don't have insight into each other's IT Operations Management tools. In a lot of environments, an application team doesn't have insight into IOPS, latency, and throughput metrics for the storage supporting their application. On the other hand, the storage team doesn't have insight into application metrics such as paging, TTL, memory consumption, etc.

 

So, for example, let's look at the scenario below:

 

The application team starts noticing their database is running slow, so what comes to mind? We'd better call the storage team, as there must be a storage issue. The storage team looks into the issue; it doesn't find anything unusual, and they've verified they haven't made any changes recently. So hours go by, then a couple of days go by, and they still haven't gotten to the bottom of the issue. Both teams keep finger-pointing, have lost trust in each other, and just decide they must need more spindles to increase the performance of the application. A couple more days go by and the virtualization admin comes to the application team and says, "Do you know you're over-allocated on memory on your SQL server?" So what happened here? An exorbitant amount of time was spent troubleshooting the wrong issue. Why? Because each of these teams had no insight into the other teams' operations management tools. This type of scenario is not uncommon and happens more often than we would ever like; it causes disruption to the business and wastes a lot of time that could have been spent on valuable activities.

 

So the point is, when looking at operations management tools or processes, you must ensure that these tools are transparent across the various infrastructure groups and application teams. By doing this we can provide better time-to-resolution, which allows us to reduce the impact to the business.

 

I would love to hear if other users in the community have these types of scenarios and how they have changed their processes to avoid these issues.

Hello thwack! My name is Kong Yang and I recently joined SolarWinds as the Virtualization Head Geek aka vHead Geek. I am super stoked to join my fellow Head Geeks and THE community for IT pros – thwack.

 

A little background - I earned my BS in EE and MS in ECE from UNR and UIUC, respectively. After that, I spent 13 years at Dell grinding out experience in performance tuning & troubleshooting of enterprise stacks, virtualization sizing & capacity planning best practices, tech community management, and tech evangelism. For the last 14 months, I was the Cloud Practice Leader at Gravitant, a hybrid cloud software startup.

 

I am passionate about understanding the behavior of the entire application ecosystem – the analytics of the interdependencies as well as qualifying & quantifying the results to the business bottom line. This encompasses: 

  • Virtualization & Cloud technologies.
    • VMware vSphere and vCloud Air.
    • Microsoft Hyper-V and Azure.
    • Amazon Web Services (AWS).
    • IBM SoftLayer.
    • Google Compute Engine (GCE).
  • Application performance.
    • Tier 1 Application performance best practices – bridging what’s done in ideal lab environments, i.e. vendor reports & benchmark results, and real-world IT environments.

    • Best practices for proactive performance optimization and reactive performance troubleshooting.
  • Hybrid Cloud best practices – on-premises, off-premises, private & public cloud services.
    • How do I efficiently and effectively monitor & optimize my application assets across hybrid cloud ecosystems?

    • What skills do IT pros need to add in order to not only survive but thrive?
  • Containers, hypervisors, cloud native best practices – vehicles for IT application stacks.
  • DevOps conversations. Gene Kim co-wrote an awesome book entitled The Phoenix Project that articulates DevOps well.
  • Converged Infrastructure technologies.
    • Nutanix, SimpliVity, Scale Computing.
    • VMware EVO family.
    • VSPEX, FlexPod.
    • VCE Vblocks.
    • Microsoft Cloud Platform System.
  • Data analytics – asking the right questions, finding the pivot points, and correctly interpreting & applying results.

  

Rather than continuing to bore you with my CV, I will leave you with my seven tips for a long and prosperous IT career:

 

  1. Do what you love and love what you do – be passionate about IT, technologies, and people.
  2. Know your IT and do IT – there is no substitute for experience and know-how.
  3. Don't be afraid to fail. My greatest successes have followed failures. Character is built from failures so always learn & keep moving forward.
  4. Don't strive for perfection. Perfection limits innovation by setting an arbitrary & unnecessary ceiling. Innovation is unbounded!
  5. Build your network of trusted advisors - techie friends, peers, professional mentors, colleagues and resources – know whose info you can trust. Return that trust by continually earning & maintaining their trust.
  6. Strength and honor – policies, processes & people-in-charge change; but your principles should never waver.
  7. Remember those who have helped you grow and those who have stood in your way. Be thankful for both of them.

  

I look forward to the opportunity to make your acquaintance and earn your trust. I am @KongYang on Twitter and Kong.Yang on thwack.

 

Let’s close with some fun, because IT work can be a real PITA at times. Below is a picture of me with two of my friends - @virtualTodd, who is a Sr. Staff Engineer on VMware’s Perf R&D team, and @CiscoServerGeek, who is a Cisco Consulting Systems Engineer. I’m wearing the green & yellow jester’s hat with the green feather boa and throwing the peace sign, while Todd is posing as Captain America and Scott is sporting Wolverine’s claws and the red & black top hat.

kongyang-vmworld2013.png
Last month I took part in our regular #datachat on Twitter. The topic was “Rolling SQL Server® Upgrades”, and my guest was Argenis Fernandez (blog | @DBArgenis) from SurveyMonkey. I’ve enjoyed doing the #datachat for the past year and I’m excited that they will be continuing in 2015.

 

The discussion that night was wonderful. Lots of data professionals talking about their experiences with upgrades, both good and bad. And the discussion wasn’t just one-way, either. We took the time to field questions from anyone participating in the #datachat hashtag.

 

When the night was done, and I reviewed the tweets that following day, I found myself grouping many of the tweets into some common thoughts regarding upgrades. Here’s what I saw:

 

  1. Have a plan
  2. Test the plan, repeatedly
  3. Have a plan for rollbacks
  4. Understand complexities
  5. Involve the business stakeholders

 

Let’s break those down a bit.

 

Have a plan

That goes without saying…or does it? You’d be surprised at the lack of planning when it comes to upgrades. Many, many times I have seen upgrades go wrong, and because there is no actual plan in place, the upgrade continues forward anyway. This is just madness, really, as changes are now being hastily applied in production in an effort to get things working as expected.

 

Test the plan, repeatedly

I’ve seen situations where plans were developed, but not thoroughly tested. Sometimes, the plans weren’t tested at all. The end result is similar to not having any plan in place (see above).

 

Have a plan for rollbacks

Rolling back is something that must be considered. If your company is unwilling to roll back when things go wrong, then you might as well not have any plan in place (see above). The idea that the changes MUST be deployed at all costs is the wrong mentality to have. You might think it is driving the business forward, but the reality is that you are letting chaos rule your change deployment process.

 

Understand complexities

As a server or database administrator you need to understand the complexities of the technologies you are using. The easiest way I have found to get this done is to ask yourself “what if?” at every stage of your plan. What if the scripts fail? What if we upgrade the wrong instance first? What if mirroring breaks while the upgrade is in progress? Answering these questions helps everyone to understand what actions may be needed.

 

Involve the business stakeholders

I’m kinda surprised at this, but apparently there are shops out there performing upgrades without notifying the business end-users. Perhaps my experience in a regulated industry like financial services means I have some blinders on here, but I cannot imagine that the business is not involved in signing off on the changes when they are completed. You simply must have them involved in some way throughout the upgrade process, if nothing else, to serve as reassurance that things are working correctly.

 

Thanks again to everyone who participated in the #datachat, we always have fun putting these together and I’m looking forward to many more!

One of the roles in my IT career was managing a large IT Operations Management platform. This was probably the most challenging role I have had in IT, as I quickly found out it was a thankless job. The majority of the role was focused on providing forecasting, log management, alert management, problem management, and service-level management. These tasks all rolled up to what I called "The Thankless Engineer." This was not because the job wasn't important, but because it needed to satisfy many different technology silos. IT Operations Management needs to satisfy not only the operations teams, but also the requirements and workflows of infrastructure, security, and application teams. This becomes very tricky business when trying to satisfy multiple IT silos' workflows. The role becomes even more of a pain when ops and apps teams start receiving false-positive alerts, as we all know how much fun it is to be paged in the middle of the night for a non-issue. The biggest issue I see with traditional IT Operations Management is that it tends to fall on a general operations group to set requirements and needs. This method doesn't always allow a lot of insight into the needs and requirements of infrastructure and application owners.

 

So is it possible to take such a "thankless" role and convert it into one that provides business value? Does "cloud" change the way we need to think about operations management? Does this thingy called "DevOps" change operations management? I would say "yes" to all of these, and we need to change quickly how we think about IT Operations Management or we are going to fail to innovate. Efficiency and agility are two key traits companies need in order to drive innovation. IT Operations Management is a key part of allowing companies to deliver services to their organization and to their customers.

 

When changing the IT Operations Management process there are a few concepts that I think we should practice, so we can move from "thankless" to "IT Superhero:"

 

  • Utilize operation-aware tools for application teams
  • Provide application teams insight into the infrastructure
  • Provide infrastructure teams insight into applications
  • Utilize tools that are heterogeneous across private/public cloud infrastructures
  • Utilize application analytics to gain insight into the end-user experience
  • One tool does not rule all

 

I would love to hear from the community on what patterns they think need to change in IT Operations Management and any thoughts you have on "The Thankless Engineer".

 

 

 

 

A system administrator’s roles and responsibilities span various dimensions of an IT organization. As a result, keeping tabs on what’s going on in the world of technology (vendors and their products, the latest product releases, end-user experiences) and troubleshooting performance issues are just some of their areas of focus. Over time, system administrators turn into thought leaders due to the technology, industry, and domain experience they gain and use. They pass on their knowledge to colleagues and technology aficionados. Even organizations turn to such experts to hear what they have to say about where IT is headed.

      

On that note, we at SolarWinds® are glad to have brought together IT gurus, geeks, and fellow system administrators to share their thoughts on system and application performance management. This event took place recently in the form of a #syschat on Twitter. For those who didn’t get a chance to tune in, here are some highlights:

   

Application monitoring: Generally, there is a consensus that application downtime affects business performance. Given that businesses are paranoid about this, why hasn’t the adoption of application monitoring in some organizations taken off like it should? Experts like @standaloneSA and @LawrenceGarvin feel that, “Some of it has to do with need.” Or, as @patcable points out, “Admins don’t know what to monitor, and apps don’t provide the right data.” This is true for various reasons. Often, IT pros are given a mandate by business groups saying that all apps are critical. Therefore, they have to watch apps closely for performance issues. Before answering the “what to monitor” question, IT pros need to ask, “Why should I monitor these apps, are they really that critical?” Knowing the answer to this question eliminates the additional noise, and you can focus only on what to do with the really critical apps and ensure that you’re monitoring the right metrics.

           

Apps in the cloud: Monitoring the performance of apps in the cloud is, again, not a direct solution to solving a performance problem that can arise from your apps running in the cloud. As more applications are being deployed in the cloud, the level of difficulty in monitoring those apps gets higher. IT pros have to really get down to understanding the “how,” which takes time. For example, @vitroth said, “Ops finds it hard to monitor what engineering doesn't instrument. Give me counters, categories and severities!” When IT pros have difficulties managing apps running on a physical server, the cloud layer is certainly going to be an unfamiliar place, and new complications will arise.

          

Skill sets for SysAdmins: A lot of buzz is going around about whether SysAdmins will need to have coding skills one day. It may not be mandatory for IT pros to have programming skills, but they might want to develop these skills so they can create and automate tools. While this was only one opinion, others like @patcable suggested that “sysadmins are going to have to become more comfortable writing stuff in some language other than shell.” Learning and understanding your IT infrastructure and environment are essential. IT pros should be willing to learn and learn quickly because ‘things aren’t slowing down.’ Where gaining technical knowledge and skills is concerned, it always helps to “learn a programming language, version control w/git, config management, and keep an eye on Hadoop,” as recommended by @standaloneSA.

        

What are your thoughts on these topics? Where do you stand with application monitoring in your organization? What difficulties do you see with monitoring apps in the cloud? Do you see DevOps improving the adoption of application monitoring? We’re happy to hear your views and opinions. Follow us on @SWI_Systems to learn more.

The modern-day network handles a higher volume of data and more applications than ever before. Many of these applications are sensitive to delay and latency. Under such conditions, network engineers need QoS to prioritize delay-sensitive business apps over others or to drop non-business traffic.

 

A QoS implementation method used to classify and mark applications or protocols in the network is the Modular QoS CLI (MQC). With MQC QoS, the traffic you need to prioritize or drop is grouped into a class-map. The class-map is then assigned to a policy-map to perform QoS actions. If you are not familiar with QoS, check out this blog for getting started with MQC QoS.

 

An option available under MQC QoS to group traffic into a class-map is the “match protocol” statement. This statement allows users to match a desired application or protocol, such as FTP or HTTP, into a class-map and then perform QoS actions on it. Here, the ‘protocol’ keyword can refer either to regular protocols like bgp, citrix, dhcp, etc., or to Network Based Application Recognition (NBAR) recognized protocols.
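
To make that concrete, here is a minimal MQC sketch. The class-map, policy-map, and interface names, along with the bandwidth value, are hypothetical placeholders; your protocols and actions will differ:

! Classify business traffic using NBAR match protocol statements
class-map match-any BUSINESS-APPS
 match protocol http
 match protocol ftp
!
! Reserve bandwidth for the classified traffic (example value only)
policy-map WAN-EDGE-QOS
 class BUSINESS-APPS
  bandwidth percent 30
 class class-default
  fair-queue
!
! Apply the policy outbound on the WAN-facing interface
interface GigabitEthernet0/1
 service-policy output WAN-EDGE-QOS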


What is NBAR?

 

NBAR is a classification technology from Cisco that can identify and classify applications and protocols, including those that use dynamic port numbers. NBAR goes beyond TCP/UDP port numbers and can inspect the payload to identify a protocol. NBAR classifies applications using the default Packet Description Language Modules (PDLM) available in the IOS.

 

Cisco also has NBAR2, which is the next generation version of NBAR that enhances the existing NBAR functionality to classify even more applications. It also provides additional classification capabilities, such as field extraction and attributes-based categorization. Cisco routinely releases updated protocol packs for NBAR2, which can be accessed from the NBAR protocol library for new signatures, signature updates, and bug fixes.

 

Conveniently, Cisco NBAR is supported on most Cisco IOS devices and NBAR2 is supported on devices such as ISR-G2, ASR1K, ASA-CX, and Wireless LAN controllers. And to make it easy, NBAR2 configuration is exactly the same as NBAR.


Why NBAR

 

Many network engineers use Access Control Lists (ACL) for application classification when defining their QoS policies. But sometimes, NBAR is a better choice than ACLs because of NBAR’s ability to automatically recognize applications and protocols that would otherwise have to be defined manually.

 

NBAR is also easier to configure than ACLs, and it provides statistics (if you need them) via the NBAR protocol discovery MIB for each application it identifies.

Finally, the biggest advantage of NBAR is that it can be used for custom protocol identification.
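
As a quick sketch of what protocol discovery looks like (the interface name is just an example), you enable it per interface and can then view the per-protocol statistics from the CLI or poll them via the protocol discovery MIB mentioned above:

! Enable NBAR protocol discovery on the interface you want to profile
interface GigabitEthernet0/1
 ip nbar protocol-discovery
!
! From exec mode, view per-protocol packet, byte, and bit-rate statistics
show ip nbar protocol-discovery interface GigabitEthernet0/1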


Custom Protocols with NBAR

 

There are many applications that are designed to use dynamic port numbers. Such a dynamic change in port numbers can make it difficult to identify applications using regular monitoring tools, and sometimes even with NBAR. While NBAR2 does have signatures for many applications, there is a chance you are using an internally built application that is not defined in NBAR2, which is a good reason to define your own custom protocol for NBAR.

 

NBAR custom protocol support is quite extensive, too. You can define custom protocols to be identified by the NBAR engine based on IP address, port, transport protocol, and even by inspecting specific bytes of the payload for keywords.

 

Another is the HTTP advantage. Every network allows ingress and egress HTTP, which also makes it the protocol used by many non-business applications, rogue applications, and even malware to gain access to the enterprise. With custom protocol matching, NBAR can classify HTTP traffic based on URL, host, MIME type, or even HTTP header fields. So imagine the possibilities: allow HTTP traffic from specific sources and block everything else, stop unwanted HTTP traffic and allow all business applications, block only YouTube but not Salesforce, or allow only Salesforce and block everything else, and many more permutations.
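
As a rough illustration of that HTTP sub-classification (the host pattern, class, policy, and interface names are placeholders, and the exact match criteria available depend on your IOS/NBAR2 version):

! Match HTTP traffic whose Host header contains "youtube"
class-map match-any SOCIAL-VIDEO
 match protocol http host *youtube*
!
! Drop the matched traffic while leaving other HTTP flows untouched
policy-map INTERNET-EDGE
 class SOCIAL-VIDEO
  drop
!
interface GigabitEthernet0/2
 service-policy output INTERNET-EDGE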

 

So, here it is. You do not have to explicitly enable NBAR on your device to use it in QoS policies unless you need either NBAR protocol discovery or NBAR custom protocol identification. There are two options that Cisco reference sites mention for defining a custom NBAR protocol, depending on your IOS version: ip nbar custom and ip nbar custom name transport. The syntax for both is provided below:

 

ip nbar custom name [offset [format value]] [variable field-name field-length] [source | destination] [tcp | udp ] [range start end | port-number]

 

In the above command, offset refers to the byte location in the payload for inspection. The format and its value can be a term (when used with ascii format), a hexadecimal value (used with hex format), or a decimal value (used with decimal format). For complete information on what each option refers to, check this link:

http://www.cisco.com/c/en/us/td/docs/ios/qos/command/reference/qos_book/qos_i1.html#wp1022849
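
As a hypothetical example of this first form, the following defines a custom protocol named MYAPP that inspects the payload for the ASCII string APP01 starting at byte offset 5 on TCP port 4500 (the name, keyword, offset, and port are all invented for illustration):

! Custom protocol: look for "APP01" at payload offset 5 on TCP port 4500
ip nbar custom MYAPP 5 ascii APP01 tcp 4500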

 

Another command, mostly referenced for NBAR2 or newer IOS versions, is:

 

ip nbar custom name transport {tcp | udp} {id id} ip {address ip-address | subnet subnet-ip subnet-mask} | ipv6 {address ipv6-address | subnet subnet-ipv6 ipv6-prefix} | port {port-number | range start-range end-range} | direction {any | destination | source}

 

Check the link below for a reference on the above command:

http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/qos/command/qos-cr-book/qos-i1.html#wp1207545360
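
And here is a hypothetical example of the second, transport-based form, matching the same internal application by TCP port range (the name, selector id, and ports are again purely illustrative):

! Custom protocol defined by transport and port range (NBAR2 / newer IOS)
ip nbar custom MYAPP-XPORT transport tcp id 14
 port range 4500 4510
 direction any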

 

Once you have your custom protocol defined with NBAR, create a class-map and use the match protocol statement with your custom protocol name to classify the matching traffic into the class-map. You can then prioritize, drop, or police the traffic based on your requirements.
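
Continuing the hypothetical MYAPP definition from above, the classification and policing might look like this (names and the police rate are placeholders):

! Classify traffic that NBAR identifies as the custom protocol MYAPP
class-map match-any CUSTOM-APP
 match protocol MYAPP
!
! Police the custom application to roughly 2 Mbps (example rate)
policy-map CUSTOM-EDGE-QOS
 class CUSTOM-APP
  police 2000000 conform-action transmit exceed-action drop
!
interface GigabitEthernet0/1
 service-policy output CUSTOM-EDGE-QOS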

 

Well, I hope this information eases your implementation of NBAR. More importantly, I hope you enjoy the many benefits of NBAR and a trouble-free network!

All of us who have had any experience in the IT field have had to deal with patching at some point in time.  It's a necessary evil. Why an evil?  Well, if you've had to deal with patches, then you know it can be a major pain.  When I hear words like SCCM or Patch Tuesday, I cringe, especially if I'm in charge of patch management.  We all love Microsoft (ahem), but let's be honest, they have more patches than any other software vendor in this galaxy!  VMware has its patching, Linux machines are patched, but Windows Servers involve some heavy lifting when it comes to patching.  Most of my memories of staying up past 12 am to do IT work have revolved around patching, and again, it's not something that everybody jumps to volunteer for.  While it's definitely not riveting work, it is crucial to the security of your servers, network devices, desktops, <plug in system here>.  Most software vendors, such as Microsoft, are good about pushing out up-to-date patches to their systems; however, there are other types of systems whose patches we as IT staff have to go out and pull down from the vendor's site, which adds more complexity to patching.

 

My question is, what are you doing to manage your organization's patching?  Are you using SCCM, WSUS, or some other type of patch management?  Or are you still out there banging away at manually patching your systems?  Hopefully not, but maybe you aren't a full-blown enterprise.  I'm curious, because to me patching is the most mundane and painful process out there, especially if you are doing it manually.

Security management and response systems are often high-profile investments that occur only when the impact of IT threats to the business is fully appreciated by management. At least in the small and midmarket space, this understanding only rarely happens before the pain of a security breach, and even then enlightenment comes only after repeated exposure. When it does, it's amazing how seriously the matter is taken and how quickly a budget is established. Until this occurs, however, the system is often seen as a commodity purchase rather than an investment in an ongoing business-critical process.

 

Unfortunately, before the need is realized, there is often little will on the part of the business to take action. In many cases, organizations are highly resistant to even a commodity approach because they haven't yet suffered a breach. One might think that such cases are in the minority, but as many as 60% of businesses either have an outdated "We have a firewall, so we're safe!" security strategy or no security strategy at all.
[Source: Cisco Press Release: New Cisco Security Study Shows Canadian Businesses Not Prepared For Security Threats - December 2014]

 

Obviously, different clients will be at varying stages of security self-awareness, with some a bit further along than others. Those that have nothing need to be convinced that a security strategy is necessary. Others need to be persuaded that a firewall or other security appliance is only a part of the necessary plan and not the entirety of it. No matter where they stand, the challenge is in convincing them of the need for a comprehensive policy and management process before they are burned by an intrusion, and without appearing to use scare tactics.

 

What approaches have you taken to ensure that the influencers and decision makers appreciate the requirements before they feel the pain?

Good morning, Thwack!

 

I'm Jody Lemoine. I'm a network architect specializing in the small and mid-market space... and for December 2014, I'm also a Thwack Ambassador.

 

While researching the ideal sweet spot for SIEM log sources, I found myself wondering where and how far one should go for an effective analysis. I've seen logging depth discussed a great deal, but where are we with sources?

 

The beginning of a SIEM system's value is its ability to collect logs from multiple systems into a single view. Once this is combined with an analysis engine that can correlate these and provide a contextual view, the system can theoretically pinpoint security concerns that would otherwise go undetected. This, of course, assumes that the system is looking in all of the right places.

 

A few years ago, the top sources for event data were firewalls, application servers, and database servers. Client computers weren't high on the list, presumably (and understandably) because of the much larger amount of aggregate data that would need to be collected and analyzed. Surprisingly, IDS/IPS and NAS/SAN logs were even lower on the scale. [Source: Information Week Reports - IT Pro Ranking: SIEM - June 2012]

 

These priorities suggested a focus on detecting incidents that involve standard access via established methods: the user interface via the firewall, the APIs via the application server, and the query interface via the database server. Admittedly, these were the most probable sources for incidents, but the picture was hardly complete. Without the IDS/IPS and NAS/SAN logs, any intrusion outside of the common methods wouldn't even be a factor in the SIEM system's analysis.

 

We've now reached the close of 2014, two and a half years later. Have we evolved in our approach to SIEM data sources, or have the assumptions of 2012 stood the test of years? If they have, is it because these sources have been sufficient or are there other factors preventing a deeper look?
