Over the last three posts, we’ve looked at Microsoft event logging use cases and identified a set of must-have event IDs. Now we’re ready to put our security policy in place. This blog will walk you through configuring event logging on client workstations, and creating a subscription on a central log collection device.

Centralizing log collection removes the burden of having to log in to individual workstations during investigations. It also provides a way to archive log data for incident response or compliance requirements. Remember: being able to easily correlate activities across multiple hosts is a powerful threat detection and mitigation tool.

 

Configuring computers in a domain to forward and collect events

All source devices and the collector should be registered in the domain.

 

1. Enable the Windows Remote Management service on each source computer by typing the following at an administrative command prompt (select Run as Administrator from the Start menu or use the Runas command):

     winrm quickconfig

 

    Note:  It is a best practice to use a domain account with administrative privileges.
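     Note:  If you want to confirm that quickconfig created a listener, you can enumerate the WinRM configuration (an optional check, not required by these steps):

     winrm enumerate winrm/config/listener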

 

 

 

     Note:  Winrm 2.x uses default HTTP port 5985 and default HTTPS port 5986. If you already have a listener but you want to change the port, run this command:

    Winrm set winrm/config/listener?Address=*+Transport=HTTP @{Port="5985"}

     Then change your Windows firewall policy accordingly.
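     For example, a matching inbound rule can be added from an administrative command prompt. This is only a sketch: the rule name is arbitrary, and the port should match whatever you configured on the listener.

     rem "WinRM HTTP-In" is an arbitrary rule name; change localport to match your listener
     netsh advfirewall firewall add rule name="WinRM HTTP-In" dir=in action=allow protocol=TCP localport=5985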

 

2. Enable the Windows Event Collector service on the collector computer by typing the following at an administrative command prompt (select Run as Administrator from the Start menu or use the Runas command):

     wecutil qc
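     Note:  To confirm that the Windows Event Collector service (wecsvc) is running afterward, you can query it:

     sc query wecsvc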

 

 

3. Configure the Event Log Readers Group
Once the commands have run successfully, go back to the event source computer and open the Computer Management console:

Click Start
Right-click Computer
Select Manage

 

Expand Local Users and Groups in the navigation pane and select the Groups folder. Right-click the “Event Log Readers” group, select Add to Group, and then click Add.

 

    

 

In the “Select Users, Computers, Service Accounts, or Groups” dialog box, click the “Object Types…” button, select the checkbox for “Computers,” and click OK.

 

 

Type the name of the collector computer and click the “Check Names” button. If the computer account is found, it will be confirmed with an underline.
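If you prefer the command line, the same group membership can be added on the source computer with net localgroup. This is just a sketch: CONTOSO and COLLECTOR are placeholder names for your domain and your collector, so substitute your own.

     rem CONTOSO\COLLECTOR$ is a placeholder for the collector's domain computer account
     net localgroup "Event Log Readers" CONTOSO\COLLECTOR$ /add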

The computers are now configured to forward and collect events.

 

4. Create a Subscription

A subscription will allow you to specify the events you want to have forwarded to the collector.

In Event Viewer on the collector server, select Subscriptions, then choose “Create Subscription…” from the Actions pane on the right.

 

 

In the Subscription Properties dialog box:

a.  Provide a name and description for the subscription.

b. Leave the “Destination log” field set to the default value of Forwarded Events.

c. Choose the first option (“Collector initiated”) for subscription type and then click on Select Computers.

d. Click “Add Domain Computers…” in the pop-up dialog box.

e. Type the names of the source computers whose events you want to collect and verify them. Click OK twice to come back to the Subscription Properties main dialog box.

f. In the Events to Collect section, click on the “Select Events…” button to bring up the Query Filter window.

g. Select a time period from the “Logged” drop-down list. For client workstations, events may be collected on a daily basis; for critical servers, a more frequent schedule should be used.

h. Select the types of events (Warning, Error, Critical, Information, and Verbose) by event ID, or pick the event sources you require, but remember to be selective to avoid losing visibility into important events due to excessive “noise” (see the example query filter after this list).

i.  Click OK to come back to the Subscription Properties main dialog box again.

j. Click on the “Advanced…” button and then in the Advanced Subscription Settings dialog box select the option for “Machine Account” if it’s not already selected.

 

k. Change the “Event Delivery Optimization” option to “Minimize Latency.”

l. Verify the protocol and port; ideally, keep the default values of HTTP and port 5985.

m. Click OK to go back to the Subscription Properties dialog box and then click OK to close it.
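As an illustration of step h, the XML tab of the Query Filter window accepts a query list like the one below. Treat it as a sketch only: the event IDs shown here (4624 and 4625 for logons, 4688 for process creation) are common examples, so substitute the must-have IDs identified earlier in this series.

     <!-- Example only: replace the event IDs with the ones your policy requires -->
     <QueryList>
       <Query Id="0" Path="Security">
         <Select Path="Security">*[System[(EventID=4624 or EventID=4625 or EventID=4688)]]</Select>
       </Query>
     </QueryList>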

 

The Subscriptions option in the event viewer should now show the subscription we just created.
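Subscriptions can also be listed and created from the command line with wecutil, which is handy if you need to roll the same configuration out to several collectors. A minimal sketch, where subscription.xml is a hypothetical file containing a subscription definition:

     rem subscription.xml is a placeholder filename for an exported subscription definition
     wecutil es
     wecutil cs subscription.xml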

 

5. Verify Events on Collector Computer

Select the Forwarded Events option under Windows Logs in the Event Viewer.
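If you would rather check from a command prompt, wevtutil can query the ForwardedEvents log directly. For example, to show the five most recent forwarded events in text form:

     wevtutil qe ForwardedEvents /c:5 /rd:true /f:text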

 

 

Notes for Workgroups

If you want to set up log forwarding within a workgroup rather than a domain, you will need to perform the following tasks in addition to those defined for domains:

  • Add an account with administrator privileges to the Event Log Readers group on each source computer. You must specify this account in the dialog when creating a subscription on the collector computer. Select Specific User instead of Machine Account (see step 4j). You must also ensure the account is a member of the local Administrators group on each of the source computers.

 

  • Type winrm set winrm/config/client @{TrustedHosts="<sources>"} on the collector computer, where <sources> is a comma-separated list of source computer names or a matching wildcard (for example, winrm set winrm/config/client @{TrustedHosts="msft*"}). To learn more about this command, type winrm help config.

 

Hopefully you have now built a working security policy using Windows Events. In the last blog of this series we will look at combining these events with other telemetry sources in a network by forwarding them to a syslog server or SIEM tool.

I'm getting ready for my trip to Germany and SQL Konferenz next week. If you are near Darmstadt, I hope to see you there. I'm going to be talking about data and database migrations. I'll also make an effort to eat my weight in speck.

 

As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!

 

Amazon overtakes Microsoft in market capitalisation, thanks to booming AWS revenue

This post had me thinking how overvalued Apple is right now. It's as if the stock price of a company has no relation to the actual value of a company, its people, or products.

 

Apple takes more than half of all smartphone revenues in Q4 2017

Then again, maybe Wall Street does know a few things about Apple. Despite a declining market share in total phones sold (only 18% now), Apple drives huge revenues because their phones cost so much. And Wall Street loves revenues. As long as the iPhoneX doesn't catch fire, Apple stock will continue to be a safe bet.

 

Intel facing 32 lawsuits over Meltdown and Spectre CPU security flaws

You knew the sharks would come out for this one. Between the lawsuits and the insider trading, this could be the end of Intel.

 

US Senator demands review of loot box policies, citing potential harm

This is a topic of discussion in our home, as my children ask for money to purchase "loot boxes" so they can win a prize that allows them to compete in games such as Overwatch. They don't understand the gambling aspect of a loot box, and I suspect we now have a generation of kids that won't think twice about spending extra cash for an instant reward. It won't be long before we pay an extra $3 at Starbucks for a chance that our drink is made correctly.

 

This electric jet can take off vertically and travel almost 190 miles per hour - and it's already being prototyped

Finally! I was promised flying cars decades ago! This would make commuting to work so much more fun.

 

Hacker Group Makes $3 Million by Installing Monero Miners on Jenkins Servers

This is why we can't have nice things.

 

Americans used to eat pigeon all the time-and it could be making a comeback

Wrap it in bacon and we'll never know.

 

Ah, Germany. I came for the speck, but I stayed for the schweinshaxe:

All too often, especially if disaster recovery (DR) is driven and pushed by the IT department, organizations can fall into the common trap of assuming that they are “good to go” in the event disaster hits. While IT departments can certainly handle the technical side of things, ensuring services are up and running if production goes down, they are not necessarily the key stakeholder in ensuring that business processes and services can also be maintained. These business processes and activities can really be summed up in one key term that goes hand in hand with DR - business continuity (BC). Essentially, business continuity oversees the processes and procedures that are carried out in the event of a disaster to help ensure that business functions continue to operate as normal – the key here being business functions. Sure, following the procedures in our disaster recovery plan is a very big piece of our business continuity plan (BCP), but a true BCP will encompass much more in terms of dealing with a disaster.

 

BCP: Just a bunch of little DR plans!

 

When organizations embark on tackling business continuity, it's sometimes easier to break it all down into a bunch of little disaster recovery plans – think DR for IT, DR for accounting, DR for human resources, DR for payroll, etc. The whole point of business continuity is to keep the business running. Sometimes, if it is IT pushing for this, we fall into the trap of just looking at the technical aspects, when really it needs to involve the whole organization! So, with that said, what should really be included in a BCP? Below, we will look at what I feel are four major components that a solid BCP should consider.

 

Where to go?

 

Our DR plan does a great job of ensuring that our data and services are up and running in the event disaster hits. However, what we often don’t consider is how employees will access that data. Our employees are used to coming in, sitting down, and logging into a secure internal network. Now that we have restored operations, does a secondary location offer the same benefit to our end-users? Are there enough seats, DHCP leases, and switch ports to handle all of this? Or, if we have utilized some sort of DRaaS, does the provider offer seats or labs in the event we need them? Furthermore, depending on the type of disaster incurred, say it was a flood, will our employees even be able to travel to alternate locations at all?

 

Essential Equipment

 

We know we need to get our servers back up and running. That’s a no brainer! But what about everything else our organization uses to carry out its day-to-day business? It’s the items we take for granted that tend to be forgotten. Photocopiers, fax machines, desks, chairs, etc. Can ALL essential departments maintain their “business as usual” at our secondary site, either fully or in some sort of limited fashion? And aside from equipment, do we need to think of the infrastructure within our secondary site, as well? Are there phone lines installed? And can that be expanded in the event of long-term use of the facility? Even if these items are not readily available, having a plan on how to obtain them will save valuable time in the restoration process. Have a look around you at all the things on your desk and ask yourself if the same is available at your designated DR facility.

 

Communication

 

Here’s the reality: your building is gone, along with everything that was inside of it! Do you have plans on how to keep in touch with key stakeholders during this time? A good BCP will have lists upon lists of key employees with their contact information, both current and emergency. This can be as simple as having employees’ home/cell phone numbers listed and, if you host your own email servers, alternate email addresses that are checked on a regular basis. The last thing you want is a delay in executing your BCP because you can’t get the go-ahead from someone you are simply unable to contact.

 

Updated Organizational Charts

 

While having an updated org chart is great to include within a BCP, it is equally, or perhaps even more, important to have alternate versions of these charts in the event that someone is not available. We may not want to think about it, but the possibility of losing someone in the disaster itself is not far-fetched. And since the key function of the BCP is to maintain business processes, we will need to know exactly who to contact if someone else is unavailable. The last thing we need at times like these is staff arguing, or worse, not knowing who will make certain key decisions. Having alternate org charts prepared and ready is critical to ensuring that recovery personnel have the information they need to proceed.

 

These four items are just the tip of the iceberg when it comes to properly crafting a BCP; there is much more out there that needs to be considered. Paper records, back-up locations, insurance contacts, emergency contacts, vendor contacts, payroll, banking; essentially every single aspect of our business needs to have a Plan B to ensure that you have an effective, holistic, and more importantly, successful Business Continuity Plan in place. While we as IT professionals might not find these things as “sexy” as implementing SAN replication and metro clusters, the fact of the matter is that we are often called upon when businesses begin their planning around BC and DR. That’s not to say that BC is an IT-related function, because it most certainly is not. But due to our major role in the technical portion of it, we really need to be able to push BC back onto other departments within the organization to ensure that the lights aren’t just on, but that there are people working below them as well.

 

I’d love to hear from some of you that do have a successful BCP in place. Was it driven by IT to begin with, or was IT just called upon as a portion of it? How detailed (or not) is your plan? Is it simply, “Employees shall report to a certain location,” or does it go as far as prioritizing the employees who gain access? What else might you have inside your plan that isn’t covered here? If you don’t have a plan, why not? Budget? Time? Resources?

 

Thank you so much for all of the recent comments on the first two articles. Let's keep this conversation going!

No, it’s not the latest culinary invention from a famous Italian chef: spaghetti cabling (a nice wording for cabling inferno) is a sour dish we’d rather not eat. Beyond this unsavory term hides the complexity of many environments that have grown organically, where “quick fixes” have crystallized into permanent solutions, and where data center racks are entangled in cables, as if they had become a modern version of Shelob’s Lair.

 

These cabling horrors are not works of art. Instead, they prosaically connect systems together to form the backbone of infrastructures that support many organizations. Having had experience in the past with spaghetti cabling, I can very vividly remember the endless back-and-forth discussions with my colleagues. This usually happened when one of us was trying to identify the switch port to patch panel connectivity while the other was checking whether the system network interface was up or down. That then resulted in trying to figure out if patch panel ports were correctly mapped to wall outlet identifiers. All of this to troubleshoot a problem that would have been trivial if it weren’t for careless and unprofessional cabling.

 

The analogy with other infrastructure assets is very similar: it can be very difficult for administrators to find a needle in the haystack, especially when the asset is not physical and the infrastructure is large. Multi-tiered architectures, or daisy-chained business processes relying on multiple sources of data, increase potential failure points in the data processing stream. This sometimes makes troubleshooting a far more complex endeavor than it used to be due to upstream or downstream dependencies.

 

One would expect that upstream dependencies would impact a system in such a way that it is no longer able to process data, and thus come to a halt without impact to downstream systems. While this can be a safe assumption, there are also cases where the issue isn’t a hard stop. Instead, the issue becomes data corruption. Either by handing over incorrect data or by handing over only fragments of usable data. In such occurrences, it is also necessary to identify the downstream systems and stop them to avoid further damage until the core issue has been investigated and fixed.

 

Thus, there is a real need for mapping the upstream and downstream dependencies of an application. There are cases in which it’s preferable to bring an entire system to a halt rather than risk financial losses (and eventually litigation, not to mention sanctions), if incorrect data makes its way into production systems. In that case, it would ultimately impact the quality of a manufactured product (think critical products, such as medicines, food, etc.) or a data batch meant for further consumption by a third party (financial reconciliation data, credit ratings, etc.).

 

Beyond troubleshooting, it’s crucial for organizations to have an end-to-end vision of their systems and assets, preferably into a System of Record. This could be for inventory purposes or for management processes, whether based on ITIL or not. The IT view is not always the same as the business view, however both are bound by the same common goal: to help the organization deliver on its business objectives. The service owner will focus on the business and process outcomes, while the IT organization will usually focus on uptime and quality of service. Understanding how assets are grouped and interact together is key in maintaining fast reaction capabilities, if not acting proactively to avoid outages.

 

There is no magical recipe to untangle the webs of spaghetti cabling, however advanced our detection and mapping capabilities may be. Still, existing information in the organization, combined with a little detective work, should help IT and the business obtain a precise map of existing systems and understand how data flows in and out of them.

 

In our view, the following activities are key enablers to obtain full-view clarity on the infrastructure:

  • Business service view: the business service view is essential in understanding the dependencies between assets, systems, and processes. Existing service maps and documentation, such as business impact assessments, should ideally contain enough information to capture the process view and system dependencies.

 

  • Infrastructure view: it is advisable to rely on infrastructure monitoring tools with advanced mapping / relationship / traffic-flow analysis capabilities. These can be used to complement/validate existing business service views listed above (for lucky administrators / IT departments), or as a starting point to map traffic flows first, then reach out to business stakeholders to formalize the views and system relationships.

 

  • Impact conditions and parent-child relationships: these usually would be captured in a System of Record, such as a CMDB, but might eventually be also available on a monitoring system. An event impacting a parent asset would usually cascade down to child assets.

 

  • Finally, regular service mapping review sessions between IT and business stakeholders are advised to assert any changes.

 

Taken in its tightest interpretation, the inner circle of handling “spaghetti cabling” problems should remain within the sphere of IT Operations Management. However, professional and conscientious system administrators will always be looking at how to improve things, and will likely expand into the other activities described above.

 

In our view, it is an excellent way to further develop one’s skills. First, by going above and beyond one’s scope of activities, it can help us build a track record of dependability and reliability. Second, engaging with the business can help us foster our communication skills and move from a sometimes tense and frail relationship to building bridges of trust. And finally, the ability to understand how IT can contribute to the resolution of business challenges can help us move our vision from a purely IT-centric view to a more holistic understanding of how organizations work, and how our prioritization of certain actions can help better grease the wheels.

By Paul Parker, SolarWinds Federal & National Government Chief Technologist

 

I like the idea of taking a holistic view of the user experience. Here's an interesting article from my colleague Joe Kim, where he introduces and discusses Digital Experience Monitoring (DEM).

 

Agencies are moving quickly from paper processes to digital services, providing critical information more efficiently online rather than through paper-based forms and physical distribution methods. As a result, about 30% of global enterprises will implement DEM technologies or services by 2020—up from fewer than 5% today, according to market research firm Gartner®.

 

What, exactly, is Digital Experience Monitoring? In a nutshell, it’s understanding and maximizing each individual user’s online experience.

 

DEM looks at the entire user experience: how fast did the home page load? Once it loaded, how much time did the user spend on the site? Where did they go? What did they do? Taking DEM even further, many agencies will gather information about the user’s device to help further understand the user experience: was the user on a smartphone or on a laptop? What browser?

 

Maximizing the user experience requires an incredible amount of data. This brings its own challenge: all that data can make relevant information difficult to find. Additionally, federal IT pros must be able to understand how the IT infrastructure impacts service delivery and the citizen experience.

 

Luckily, there are an increasing number of new tools available that help give context to the data and help the federal IT pro make highly informed decisions to maximize each citizen’s digital experience.

 

DEM tool benefits

 

DEM-specific tools provide a range of benefits that other tools do not. Specifically, because DEM inherently works with lots of data, these DEM tools are designed to help solve what have historically been thought of as big-data challenges.

 

For example, DEM tools have the ability to recognize patterns within large amounts of data. Let’s say a specific cluster of users is having a sub-optimal experience. Automatic pattern recognition will help the federal IT pro understand if, say, all these users are taking a particular route that is having bandwidth issues. Or, perhaps all these users are trying to access a particular page, form, or application on the site. Without the ability to recognize patterns among users, it would be far more difficult to find the root of the problem and provide a quick solution.

 

A DEM-specific tool can also identify anomalies, a historically difficult challenge to find and fix.

First, the federal IT pro must create a baseline to understand ordinary network behavior. With that in place, an anomaly is easier to identify. Add in the ability to apply pattern recognition—what happens before the anomaly each time it appears—and the problem and solution are far easier to find and implement.

 

And finally, because they can provide a historic perspective, DEM-specific tools can help the federal IT pro forecast infrastructure changes before implementation. Let’s say an agency is undergoing a modernization effort. Many DEM tools provide the ability to forecast based on the baseline and historic information already collected. A solid DEM tool will allow the federal IT pro to mimic changes, and understand the result of those changes throughout the infrastructure, in advance. The reality is, any infrastructure change can impact user experience, so being able to understand the impact in advance is critical.

 

Conclusion

 

Federal IT pros have been using performance monitoring tools for years. That said, the landscape is changing. Using ordinary tools—or, ordinary tools alone—may no longer be an option. It is important to understand the role DEM plays within agency IT departments. In turn, this allows you to recognize the value in bringing in the right tools to help perform this new, critical role.

 

Find the full article on our partner DLT’s blog Technically Speaking.

It was a very full week at CiscoLive--not to mention an additional full week in Spain, which I'll get to in a minute--and I have a lot to share.

 

First and foremost, and this is not meant to be a slam on Munich, I had an amazing time just BEING in Barcelona. Sure it was a little warmer. Sure, I speak a little Spanish as opposed to zero German. And sure, there were three kosher restaurants instead of the one in Munich. But even beyond that, the pace, the layout, and even the FEEL of the place was different for me in a very enjoyable way. I was incredibly happy to hear that CLEUR will be in Barcelona again next year, and hope that I get to be part of the "away team" again.

 

The Big Ideas

At every convention, I try to suss out the big themes, ideas, and even products that make a splash at the show. Here's what I found this time:

 

DevNet! DevNet! DevNet!
I think I talk about DevNet after every CiscoLive, but gosh darn if it's not noteworthy each time. This year, my fellow Head Geek Patrick Hubbard rightly called out the announcement about IBN. No, read it again: NOT big blue. Intent-Based Networking: https://blogs.cisco.com/datacenter/introducing-the-cisco-network-assurance-engine. The upshot of this announcement is that the network is about to get smarter than ever, using data, modeling, and (of course) built-in tools to understand and then ensure the "intent" of the networking you have in place. And how will you interact with this brave new intent-based world? Code.

This leads me to my second big observation:
The time for SDN has come

Every year (since 2014) I've been trying to figure out how SDN fits into the enterprise. Usually when I talk to a group, I give it a shot:

    • "How many of you are thinking about SDN" (usually, most of the hands go up)
    • "How many are using SDN in the lab?" (in most cases, one-half to two-thirds of the hands go down)
    • "How many are using it in prod?" (typically all but three hands go down, leaving just the folks who work for ISPs)

 

This time I had a ton of people--enterprise folks--coming and asking about SDN and Cisco ACI support, which tells me that we have hit a tipping point. I have a theory why (grist for another article), but it boils down to two main things. First, Cisco has done a kick-ass job pushing "DevNet" and teaching network folks of all stripes not to fear the code. People came to the booth asking "does this support python scripting?" Scripting wasn't an afterthought; it was a key feature they needed. Second, SDN experience has filtered down from networking engineers at ISPs to mid-level technicians, and companies are now able to enumerate the value of this technology both on a technical and business level. Thus, the great corporate adoption of SDN is now starting.

 

Being a NetVet is every bit as cool as I thought it would be
Besides causing vendors to stare at your badge for an extra two seconds, the biggest benefit of being a NetVet is the lounge. It is quiet. It has comfy couches. It has its own coffee machine. It. Has. My. Name. On. It.

 

The View from the Booth

So that sums up the major things I saw at the show. But what about the interactions in the SolarWinds booth? SO MUCH was packed into the three days that it's hard to pick just a few, but here goes.

 

TNG, and I don't mean Star Trek
One of the fun things about a show like CiscoLive is getting to show off new features and even whole new solutions. Three years ago I got to stand on stage with Chris O'Brien and show off "something we've been playing with in the lab," which turned out to be NetPath. This time, we had a chance to get initial reactions to a new command line tool that would perform traceroute-like functions, but without ICMP's annoying habit of being blocked by... well, just about everything. While we're still putting on the final coat of paint, the forthcoming free "Traceroute NG" tool will perform route analysis via TCP or traditional ICMP, show you route changes if the path changes during scanning, support IPv4 and IPv6 networks, and more. Attendees who saw it were blown away.

 

Hands Up for BackUp!

We also got to take the lid off an entirely new offering: cloud-based backup for your important systems. (https://www.solarwinds.com/backup) This isn't some "xcopy my files to the cloud" kludge. It uses block-based backup techniques for screaming fast (and bandwidth-friendly) results, offers a simple deployment strategy that supports Windows and Linux-based systems, provides granular permissions, and includes a dashboard that lets you know the disposition of every system, regardless of the size of your deployment.

 

Survey Says?
A great part of booth conversations is comparing experiences and discovering how frequently they match up. This frequently comes out as a kind of IT version of Mad Libs.

  • I was discussing alerts and alert actions with an attendee who was clearly part of "Team Linux." After pointing out that alerts should extend far beyond emails or opening tickets, I threw out, "If your IIS-based website is having problems, what's the first thing you do?" Without even a pause they said, "You restart the app pool." That's when I showed SAM's built-in alert actions. (Afterward we both agreed that "install Apache" was an equally viable answer.)
  • When Patrick asked a group of four longtime SolarWinds users to guess the most downloaded SolarWinds product, the response was immediate and emphatic: "TFTP Server." I could only laugh at how well our customers know us.

 

"I'm here to ask question and chew bubblegum (and it doesn't look like you're giving out bubblegum)"
As I have noted in the past, CiscoLive Europe may be smaller (14k attendees versus ~27k in the United States), but the demos go longer and the questions are far more intense. There is a much stronger sense of purpose when someone comes to our booth. They have things they need to find out, design choices they want to confirm, and they don't need another T-shirt, thank you very much. Which isn't to say we had swag left at the end. It was all gone. But it took until the last day.

 

More Parselmouths than at a Slytherin Convention
This year I was surprised by how often someone opened their questions with, "Do these solutions support Python?" (For the record, the answer is yes: https://github.com/solarwinds/orionsdk-python) Not that I was surprised to be asked about language support in general. What got me was how often this happened to be the opening question. As I said earlier, Cisco's DevNet has done an incredible job of encouraging the leap to code, and it is now framing many networking professionals' design choices and world views. I see this as a good thing.

 

La Vida Barcelona

Outside of the hustle and bustle of the convention center, a whole world awaited us. To a polyglot wannabe like me, the blend of languages was multicultural music to my ears. But there wasn't much time to really see the sights or soak up the Spanish culture because the convention was demanding so much of my day.

 

Which is why I decided to spend an extra week in-country. My wife and I traveled from Barcelona to Madrid, and even spent a day in Seville to visit the apartment where she was born and spent the first few months of her life.

 

We saw some amazing sights:

 

Including some views that GoT fans like jennebarbour will find familiar:

 

 

 

Ate some incredible food:

 

And generally just enjoyed all that Spain had to offer. The only hiccough was the weather. It was kind of like this.

 

For Clevelanders like us, it's pretty normal. But I'm pretty sure the locals suspected we brought our weather with us, and were glad to see the back of me when we finally packed up and headed back home.

 

Until next year (which will be in Barcelona again), and until the next trip.

(pictured: patrick.hubbard, ding, andre.domingues, and the inimitable Silvia Siva.)

Most network engineers enter the profession because we enjoy fixing things.  We like to understand how technology works.  We thrive when digging into a subject matter with focus and intensity.  We love protocols, acronyms, features, and esoteric details that make sense only to a small group of peers.  Within the ranks of fellow geeks, our technical vernacular becomes a badge of honor.  However, outside of our technical peers, we struggle to communicate effectively.

 

Our organizations rely on technology to make the business run.  But the deeper we get technically, the wider the communication gap between IT and business leadership becomes.  We must learn to bridge the gap between technology and business to deliver the right solutions.

 

I was reminded of this communication disparity when working on a circuit outage recently.  While I was combing through logs and reviewing interface statistics, a senior director asked for a status update.  I told him, “We lost BGP on our Internet circuits.”  He responded with a blank stare.  I had failed to shift my communication style and provided zero helpful information to my leadership.  I changed my approach and summarized that we had lost the logical peering with our provider.  Although the physical circuit appeared to be up, we could not send Internet traffic because we were no longer receiving routing information from them.  My second response, though less precise, provided an understandable picture to my senior leadership and satisfied his question.  He had more confidence that I knew where the problem was, and it helped him understand what the escalation point should be.

 

When communicating with leadership about technical projects and problems, remember these things.

 

  1. Leadership doesn’t understand your jargon, and they don’t need to.  I’ve heard many network engineers decry the intelligence of their leadership because management doesn't know the difference between an ARP table and a MAC address table.  This line of thinking is silly.  Management’s role is to understand the business, build a healthy organization, manage teams effectively, and provide resources to accomplish business goals.  Some degree of technical knowledge is helpful for front-line management, but the higher in the organization an individual is, the less detail they will need to know about each technical arena.  This is as it should be.  It’s your job to know the difference between an ARP table and a MAC address table and to summarize technical detail into actionable information.
  2. Management doesn’t always know the exact right question to ask.  I once had a manager describe a colleague as an individual who would provide only data, never analysis.  My boss felt as though he had to ask 30 questions to get a handle on the technical situation.  My colleague thought his sole responsibility was to answer the question precisely as asked — regardless of the value of that answer.  Don’t be that guy or gal.  Listen carefully and try to understand what your manager wants to know instead of parsing their words precisely.  Answer their question, then offer insight that you believe will help them do their job better.  Be brief, summarize, and don’t include so much technical detail that they check out before you get to the punchline.
  3. Effective communication is an art, more than a science.  At the end of the day, great communication happens in the context of strong professional relationships.  You don’t have to be best friends with your manager and you don’t need to spend time with them outside of the office.  However, you should work hard — as much as it depends on you — to build trust and direct, respectful communication channels with your leadership.  Don’t dig in your heels unnecessarily.  Give when you can and hold firm when you must.  If you develop a reputation as a team player, your objections will be taken more seriously when you must voice them.

 

Strong communication skills are the secret weapon of truly effective network engineers.  If you want to grow in influence within your organization, and you want to truly effect change, you’ll need to sharpen your soft skills along with your technical chops.

It’s a common story. Your team has many times more work than you have man-hours to accomplish. Complexity is increasing, demands are rising, acceptable delivery times are dropping, and your team isn’t getting money for more people. What are you supposed to do? Traditionally, the management answer to this question is outsourcing, but that word comes with many connotations and many definitions. It’s a tricky word that often instills unfounded fear in the hearts of operations staff, unfounded hope in IT management, and sometimes (often?) works out far better for the company providing the outsourcing than the company receiving the services. If you’ve been in technology for any amount of time, you’re likely nodding your head right now. Like I said, it’s a common story.

 

I want to take a practical look at outsourcing and, more specifically, what outsourcing will never solve for you. We’ll get to that in a second though. All the old forms of outsourcing are still there and we should do our best to define and understand them.

 

Professional outsourcing is when your company pays someone else to execute services for you and is usually because you have too many tasks to complete with too few people to accomplish them. This type of outsourcing solves for the problem of staffing size/scaling. We often see this for help desks, admin, and operational tasks. Sometimes it’s augmentative and sometimes it’s a means to replace a whole team. Either way I’ve rarely seen it be something that works all that well. My theory on this is that a monetary motivation will never instill the same sense of ownership that is found in someone who is a native employee. That being said, teams don’t usually use this to augment technical capacity. Rather, they use it to increase/replace the technical staff they currently have.

 

Outside of the staff augmentation style of outsourcing, and a form that usually finds more success, is process-specific outsourcing. This is where you hire experts to provide an application that doesn’t make sense for you to build, or to perform a specific service that is beyond reasonable expectation of handling yourself. This has had many forms and names over the years, but some examples might be credit card processing, application service providers, electronic health record software, etc. Common modern names for this type of outsourcing are SaaS (Software-as-a-Service) and PaaS (Platform-as-a-Service). I say this works better because its purpose is augmenting your staff's technical capacity, leaving your internal staff available to manage the product/service.

 

The final and newest iteration of outsourcing I want to quickly define is IaaS (Infrastructure-as-a-Service), or public cloud. The running joke is that running in the cloud is simply running your software on someone else’s server, and there is a lot of truth in that. Where it isn’t true is that the cloud providers have mastered automation, orchestration, and scaling in the deployment of their servers. This makes IaaS a form of outsourcing that is less about staffing or niche expertise, and more about solving the complexity and flexibility requirements facing modern business. You are essentially outsourcing complexity rather than tackling it yourself.

 

As you may have noticed, in identifying the forms of outsourcing above I’ve also identified what they truly provide from a value perspective. There is one key piece missing, though, and that brings me to the point of this post. It doesn’t matter how much you outsource, what type of outsourcing you use, or how you outsource it; the one thing that you can’t outsource is responsibility.

 

There is no easy button when it comes to designing infrastructure and none of these services provide you with a get out of jail free card if their service fails. None of these services know your network, requirements, outage tolerance, or user requirements as well as you do. They are simply tools in your toolbox and whether you’re augmenting staff to meet project demands, or building cloud infrastructure to outsource your complexity, you still need people inside your organization making sure your requirements are being met and your business is covered if anything goes wrong. Design, resiliency, disaster recovery, and business continuity, regardless of how difficult it is, will always be something a company will be responsible for themselves.

 

100% uptime is a fallacy, even for highly redundant infrastructures run by competent engineering staffs, so you need to plan for such failures. This might mean multiple outsourcing strategies or a hybrid approach to what is outsourced and what you keep in house. It might mean using multiple providers, or multiple regions within a single provider, to provide as much redundancy as possible.

 

I’ll say it again, because I don’t believe it can be said enough. You can outsource many things, but you cannot outsource responsibility. That ultimately is yours to own.

This edition of the Actuator lands on Valentine's Day, a holiday many believe to be invented by the greeting card industry. The true origins go back centuries; the Romans celebrated the middle of February as the start of spring. I'm kinda surprised that there isn't a cable network showing a Roman wrestling with the Groundhog over the official start date of spring. I'm not saying I would watch, but if I were flipping channels and found that show... Anyway.

 

As always, here are some links from the Intertubz that I hope will hold your interest. Enjoy!

 

SpaceX's Falcon Heavy launch was (mostly) a success

I just want to remind you that in 1971 NASA launched a car into space. And it landed on the Moon. And then two men got in the car and drove it around the Moon. But hey, nice job Elon. Keep your chin up, kid.

 

Winter Olympics was hit by cyber-attack, officials confirm

Because shutting down a website will show the world... something. I really don't know what the point of this attack would be. Why waste the time and chance you get caught? My only guess is that they used this as a test. Either a test of their method, or a test of the defense. Probably both.

 

Customer Satisfaction at the Push of a Button

Sometimes, the simplest questions can lead to the best insights. You don't need to collect 1,000 different metrics to know if there's a problem (or not).

 

We Clearly Need to Discuss the Stock Market

"The stock market has been a little runaway recently, and this recent movement is more or less just putting down the whiskey and taking a good, hard look at its life." Brilliant.

 

Would You Have Spotted This Skimmer?

I want to say "yes," but the answer is "no" because I rarely check for such things. Maybe once every ten times do I pull on the reader at a gas station or checkout counter. Let's be careful out there.

 

The House That Spied on Me

A bit long, but worth the time if you want to become paranoid about all the ways you are giving up privacy in your own home. I think we need a show called "Are You Smarter Than Your Smart Device?" to find out if people understand the basics of IoT security.

 

UK ICO, USCourts.gov... Thousands of websites hijacked by hidden crypto-mining code after popular plugin pwned

Everything is terrible.

 

Finally! I was promised flying cars YEARS ago:

This January I was invited by SAP to co-present as a keynote speaker at the yearly ASUG (Americas SAP User Group) volunteer meeting in Nashville, TN. I was asked to offer my perspective as an SAP customer for their business transformation service, the SAP Optimization and Pathfinder Report. This report takes a 30-day capture of the usage of your SAP® ECC, analyzes it, comes up with recommendations on how to: take advantage of SAP functional enhancements, move to the cloud, implement the SAP award-winning UX “Fiori”, and migrate to the SAP in-memory database HANA. I just so happened to be one of the first SAP customers to run the report when it was released last year, and I am fairly active on the SAP and ASUG community forums (although nowhere near as active as I am on THWACK®). So, I naturally drew the attention of the SAP Services and Support teams, thus my invitation.

 

One of the many hats I wear in my company’s IT department is that of the SAP Basis, Security and Access Manager. For those of you not familiar with it, SAP Basis is the equivalent of Microsoft® Windows® and Active Directory® administration. I take to this role with all the zest and enthusiasm that I take to my role as the only monitoring resource for my company (hence my heavy involvement with the SolarWinds® product suite, my SCP certification, and my MVP status on THWACK). I am always on the lookout for any and all services and tools that are part of my company’s SAP Enterprise Support contract, because if you are an SAP customer, the one constant is that you are overwhelmed and confused by the arsenal of services and tools available to you from SAP.

 

Okay. Okay! So why am I talking about SAP so much in a THWACK blog? Great question! Here is where the roads intersect. The audience for my presentation was a mix of ASUG volunteers representing special interest groups (SIGs) for Human Resources, Supply Chain, Sales & Distribution, Finance, IT, Warehouse Management, and others. Additionally, these SIGs comprised VPs, C-level executives, directors, and managers from all industries. My presentation about business transformation had to reach all of them. SAP, SolarWinds… it didn’t matter the technology. My message had to be that business transformation traversed all departments and required stakeholder buy-in from each.

 

Fortunately for me I work for a wine and spirits distributor. Alcohol is a great icebreaker. Everyone it seems has a great story to tell involving alcohol.  So as I went through my presentation outlining the business transformation opportunities of this report I periodically stopped and would query the audience: “Show of hands! How many of you subscribe to the SAP Enterprise Support newsletter?” Out of 450 only about 30 raised their hands. “Show of hands! How many of you are SAP Certified Center of Excellence?” Maybe 10. “Show of hands! How many of us regularly initiate the SAP Continuous Quality Checks?” About 15. “Show of hands! How many of us are aware of openSAP? SAP’s free training available to everyone?” Maybe 10. “WOW! You people got some homework!” Laughter ensued.

 

Before I finished my presentation, it dawned on me that technology in the business landscape is still expected to be championed by IT. These are ASUG volunteers I was presenting to, and even they aren’t taking advantage of SAP’s services and tools. These are meant for the entire business to innovate, transform, and grow. But my “show of hands” exercise revealed not only that the services and tools are not being utilized, but also that the business isn’t tuning in to their strategic vendor’s communications. The realization of the value of enterprise services is lost.

 

So that leads me to you, my fellow THWACKsters. Obviously, you are tapping into the awesome value of THWACK. But what about the enterprise services of your other vendors? Microsoft, VMware®, Cisco®, Verizon®, and so many others. Show of hands…

By Paul Parker, SolarWinds Federal & National Government Chief Technologist

 

There may still be a few skeptics out there, but cloud adoption is getting commonplace. Here's an interesting article from my colleague Joe Kim, where he offers suggestions on simplifying the complexity.

 

The idea of moving IT infrastructure to the cloud may have initially sounded interesting, but this optimism was quickly followed by, “Hold on a minute! What about security? What about compliance?”

 

Administrators quickly realized that the cloud may not be a panacea, and a complete migration to the cloud may not be the best idea. Organizations still needed to keep at least some things on-premises, while also taking advantage of the benefits of the cloud.

 

A complex hybrid IT world

 

Thus, the concept of hybrid IT was born. In a hybrid IT environment, some infrastructure is migrated to the cloud, while other components remain onsite. Agencies can gain the economic and agile benefits of a cloud environment while still keeping a tight rein on security.

 

However, hybrid IT has introduced a slew of challenges, especially in terms of network complexity. Indeed, respondents to a recent SolarWinds survey of public-sector IT professionals listed increased network complexity as the top challenge created by hybrid IT infrastructures. That survey discovered that nearly two-thirds of IT professionals said their organizations currently use up to three cloud provider environments, and 10% use 10 or more.

 

Compounding this challenge is the fact that hybrid IT environments are becoming increasingly distributed. Agencies can have multiple applications hosted in different data centers—all managed by separate service providers. Even applications that are managed on-premises will often be located in different offices.

 

Connecting to these various applications and services requires multiple network paths, which can be difficult to monitor and manage. Even a simple Google® search requires many paths and hops to monitor. Traditional network monitoring tools designed for on-premises monitoring are not built for this complexity.

 

A single-path approach to simplicity

 

While administrators cannot actually combine all of their network paths into one, they can—from a monitoring perspective—adopt a single-path analysis approach. This form of monitoring effectively takes those multiple paths and creates a single-path view of the activity taking place across the hybrid IT network. Formulating a single path allows administrators to get a much better perspective on the performance, traffic, and configuration details of devices and applications across hybrid networks. This, in turn, makes it easier to ascertain, pinpoint, and rectify issues.

 

Single-path analysis can help managers quickly identify issues that can adversely impact quality of service, allowing them to more easily track network outages and slowdowns and tackle these problems before users experience the deleterious effects. Managers can also gain better visibility into connections between end-users and services, as single-path analysis provides a clear view of any network infrastructure that might be in the path and could potentially impede QoS.

 

IT professionals can take solace in the fact that there is a simpler way to manage complex hybrid IT infrastructures. By following a single-path analysis strategy, managers can greatly reduce the headaches and challenges associated with managing and monitoring the many different applications and services within their hybrid IT infrastructures.

 

Find the full article on Government Computer News.

Back in the first post of this series we covered off the planning portion with regards to implementing Office 365. Knowing what you are aiming to achieve is critical to measuring the value and success of a project. Assuming the math checks out, now it is time to pull the trigger and start migrating to Office 365. A botched migration can easily compromise the value of a project, so planning should be done with care.

 

UNDERSTAND HYBRID VS. CLOUD

 

Office 365 offers multiple options for deployment. You can run fully in the cloud, which means you'll be able to remove large parts of your infrastructure. This can be especially appealing if you are nearing the end of life for some equipment. Saving on a large capital expenditure and moving it to an operating expense can be ideal at times.

 

Another option might be a hybrid approach. This approach is a mix of using your on-premises infrastructure and Office 365's infrastructure. This is commonly used as a way to get your mailboxes and data up to the cloud. It allows for administrators to choose which mailboxes run where. It can also be used for security or compliance measures: maybe you want all C-level executives to keep their mailboxes on-premises.

 

ROLLING OUT NEW SERVICES

 

Have you opted to roll out any other Office 365 / Azure services into the mix? Maybe you are interested in using Azure Multifactor Authentication (MFA), or maybe Azure Active Directory to allow for single sign-on (SSO). How does that fit into the process? You'll need to see how any new service fits into your current environment. Using MFA as an example, will this be displacing an existing service, or is it a new offering for the environment? Either way, there will be additional work to be done.

 

ALWAYS HAVE A PLAN B

 

As with any IT project, what will you do if you have a failure along the way? This isn't to say that Office 365 as a whole isn't working out but think about the smaller things. What if you move your CEO's mailbox and then something goes awry? How do you move it back? Identifying what needs to be migrated, upgraded, or installed based on the above gives you a list. You can use that list to start forming failback planning for each of those components.

 

A good tip for planning failback is don't just focus on the tech. Make sure you know what people need to be involved. Do you need specific folks from your team or from other teams? Make sure that is part of the plan so if you do need rollback, those folks can be available, just in case.

 

COORDINATING THE MIGRATION

 

When it comes time to start moving data, make sure you don't blindside your users. You'll want to identify key individuals within your organization to liaise with (e.g. department managers). The goal here is to ensure that you minimize disruptions. The last thing you want to do is have an outage period overlap with the closing of a large sales deal.

 

Kicking off a platform migration can be a stressful event, but proper planning and communication can go a long way. I would love to hear any comments or experiences others have had when migrating to a cloud service such as Office 365. Did everything go smooth? If there were hiccups, what were they, and how were they handled?

 

Getting right into the technical nitty gritty of the Disaster Recovery (DR) plan is probably my favorite part of the whole process. I mean, as an IT Professional this is our specialty – developing requirements, evaluating solutions, and implementing products. And while this basic process of deploying software and solutions may work great for single task-oriented, department type applications, we will find that in terms of DR there are many more road blocks and challenges that seem to pop up along the way. And if we don’t properly analyze and dissect our existing production environments, or we fail to involve many of the key stakeholders at play, our DR plan will inevitably fail – and failure during a disaster event could be catastrophic to our organizations and, quite possibly, our careers.

 

So how do we get started?

 

Before even looking at software and solutions, we really should have a solid handle on the requirements and expectations of our key stakeholders. If your organization already has Service Level Agreements (SLAs), then you are well on your way to completing this first step. However, if you don’t, then you have a lot of work and conversations ahead of you. In terms of disaster recovery, SLAs will drive both the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). An RPO essentially dictates the maximum amount of data loss, measured in time, that an organization can incur. For instance, if a service has an RPO of 4 hours, we would need to ensure that, no matter what, we can always restore our service with no more than 4 hours of data loss, meaning we would have to ensure that restore points are created on a 4-hour (or smaller) interval. An RTO dictates the amount of time it takes to get our service restored and running after a failure. Thus, an RTO of 4 hours would essentially mean we have 4 hours to get the service up and running after the notification of a failure before we begin to massively impact our business objectives.

 

Determining both RTO and RPO can become a very challenging process and really needs to involve all key stakeholders within the business. Our application owners and users will certainly always demand lower RPO and RTO values, however IT departments may inject a bit of realization into the process when a dollar value is placed on meeting those low RPO/RTOs. The point of the exercise though is to really define what’s right for the organization, what can be afforded, and create formal expectations for the organization.

 

Once our SLAs, RTOs, and RPOs have been determined, IT can get started on a technical solution that ensures these requirements are met. Hopefully we can begin to see the importance of setting expectations beforehand. For instance, if we had a mission-critical business service with an RTO of 10 minutes, we would most likely not rely on tape backup to protect it, since restoring from tape would take much longer than that; instead, we would most likely implement some form of replication. On the flip side, a file server, or more specifically the data on the file server, may have an RTO of, say, 10 hours, at which point it could be cost-effective to rely on backup. My point is that having RTO and RPO set before beginning any technical discovery is key to arriving at a proper, cost-effective solution.
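
As a rough illustration of that reasoning, here is a hypothetical sketch that maps RTO/RPO targets to a protection approach. The tier names and thresholds are assumptions made up for the example; a real decision would also weigh cost, data change rate, and the infrastructure you already own.

    from datetime import timedelta

    def suggest_protection(rto, rpo):
        # Illustrative thresholds only; tune them to your own SLAs and budget.
        if rto <= timedelta(minutes=15) or rpo <= timedelta(minutes=15):
            return "synchronous or near-synchronous replication"
        if rto <= timedelta(hours=4):
            return "asynchronous replication or frequent snapshot copies"
        return "scheduled backup (disk or tape) may be sufficient"

    print(suggest_protection(timedelta(minutes=10), timedelta(minutes=5)))  # replication
    print(suggest_protection(timedelta(hours=10), timedelta(hours=12)))     # backup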

 

What else is there to consider?

 

Ten years ago, we would have been pretty much done with our preliminary DR work after determining RTO and RPO, and could have begun investigating solutions. In today's modern datacenters that's simply not the case; we have a lot more at play. What about cloud? What about SaaS? What about remote workers? Today's IT deployments don't just operate within the four walls of our datacenters; they are often stretched into all corners of the world, and we need to protect them and adhere to our SLA policies no matter where the workload runs. What if Office 365 suddenly had an outage for 3 hours? Is that acceptable to your organization? Do we need to archive mail somewhere else so that, at the very least, the CEO can get that important message? The same goes for workloads running in public clouds like Amazon or Azure: we need to ensure we are doing all we can to protect and restore those workloads.

 

The upfront work of looking at our environments holistically, determining our SLAs, and developing RTOs and RPOs really does set IT up for success when it comes time to evaluate a technical solution. Quite often we won't find just one solution that fits our needs; in most deployments, several solutions are deployed to satisfy a well-built DR plan. We may have one solution that handles backup of cloud workloads and another for on-premises workloads, or one solution that replicates to the cloud and another that moves workloads to our designated DR site. The point is that focusing most of our time on developing RPOs, RTOs, and business practices lets the organization, and not IT, drive the disaster recovery process, which in turn lets IT focus on the technical deployment and the solutions built around it.

 

Thus far we have had two posts on developing our DR plan, both of which call for taking a step back and having conversations with our organizations before evaluating or implementing anything technical. I'd love to hear feedback on this. How do you begin your DR plans? Have you had those conversations with your organization around developing SLAs? If so, what challenges presented themselves? Quite often organizations will look to IT for answers that should really be dictated by business requirements and processes. What are your feelings on this? Leave me a comment below with your thoughts. Thanks for reading!

I love watching those modern movies where IT works magically. In these movies, any average Joe with access to a computer or terminal can instantly access anything with seamless effort. Everything is clear and neat: we're presented with impeccably clean data centers, long alleys of servers with no spaghetti cables lying around, and operators who know pretty much every single system, address, and login.

 

For a sizeable portion of the IT workforce, though, this idyllic vision of the IT world is pure fiction, and there is an interesting parallel with the IT world we actually know. No matter how well intentioned we are, or how meticulous we are about tracking our infrastructure and virtual machines, we will eventually lose track at some point, especially if we don't have some kind of automated solution that, at the very least, regularly scans our infrastructure.

 

In my previous post, I discussed business services and their challenges in complex environments, and explained how critical it is to properly map business service dependencies through a Business Impact Analysis (BIA), among other tools. To look at systems from a business service perspective, we have to readjust our vision to see the hidden wavelengths of the business services spectrum. Let's put on our X-ray business vision glasses and look beyond the visible IT infrastructure spectrum.

 

Why is it so complicated to keep track of business systems? Shouldn't we just have a map of everything, automatically generated and up to date? And shouldn't each business service be accountable for its own service/system map?

 

In an ideal world, the answer would be yes. However, with our busy lives and a focus on delivering, attention sometimes slips and priorities get adjusted accordingly, especially in reactive environments. Lack of documentation, high personnel turnover, and missing handover or knowledge-transfer sessions can all contribute to losing track of existing systems. But even with the best intentions, we may have some misses. It could be that new project your deputy forgot to tell you about while you were on holiday, where a couple dozen systems were on-boarded into IT support because everyone was too busy firefighting. Or the recently purchased smaller company, where nobody informed IT that some new systems must be supported. Or the division that now runs its entire order processing system directly on a public cloud provider's infrastructure.

 

Finger-pointing aside, and looking at the broader scope, there has to be a better way to identify those unknown systems, especially in an IT organization oriented towards supporting business services. Mapping business service dependencies should be a collaborative effort between IT and business stakeholders, with the goal of covering, end to end, all of the IT systems that participate in a given business process or function. Ideally, this mapping activity is conducted through interviews with individual stakeholders, service line owners, and key contributors to a given process.

 

It is, however, difficult to achieve 100% success in such activities. A real-life example I encountered during my career was the infamous hidden, innocuous desktop computer under a table: a regular-looking machine running a crazy Excel macro that pumped multicast Reuters financial feeds into a SQL server, which was then queried by an entire business division. This innocuous desktop was a critical component of a business activity with a turnover of roughly USD 500 million per day. With this computer off, the risks were regulatory fines, loss of reputation, and loss of business from disgruntled customers. The component was eventually virtualized once we figured out that it existed. But for every one found, how many are still lying around in cabinets and under tables?

 

Organizations have evolved, and long gone is the time when a monolithic core IT system was used across the company. Nowadays, a single business service may rely on multiple applications, systems, and processes, and the opposite is also true: one application may serve multiple business services. Similarly, the traditional boundaries between on-premises and off-premises systems have been obliterated. Multi-tiered applications may run on different infrastructures, with portions sometimes out of sight from an IT infrastructure perspective.
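
One simple way to picture this many-to-many relationship is as a dependency map you can invert to answer, "if this application fails, which business services feel it?" The sketch below uses invented service and application names purely for illustration; a real map would come from discovery tooling and stakeholder interviews.

    from collections import defaultdict

    # Which applications does each business service depend on? (example data)
    service_to_apps = {
        "Order Processing": {"web-frontend", "order-db", "payment-gateway"},
        "Regulatory Reporting": {"order-db", "reporting-etl"},
    }

    # Invert the map: for each application, which services are impacted by its failure?
    app_to_services = defaultdict(set)
    for service, apps in service_to_apps.items():
        for app in apps:
            app_to_services[app].add(service)

    print(sorted(app_to_services["order-db"]))
    # ['Order Processing', 'Regulatory Reporting'] -> one app, multiple impacted services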

 

In this complex, entangled, and dynamic world, documenting and maintaining the relationships between systems by hand takes exponentially more time and is at best a difficult endeavor. So how do we cope with this issue, and where do we go next?

 

Not all business services and processes are created equal; some are critical to the organization and others are secondary. A process that collates and analyzes data once every two weeks for regulatory compliance has a lower priority than one that handles manufacturing and shipping. This classification is essential because it is fundamentally different from the IT infrastructure view. In IT operations, a P1 incident often indicates downtime and service unavailability for multiple stakeholders, and that classification already derives from inputs such as the number of affected users and whether the environment is production. With the additional context of business service priority, it becomes easier to assess impact and to better manage expectations, resource assignment, and resolution.
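
As a purely hypothetical sketch of that idea, the snippet below folds business service criticality into an incident priority alongside the usual inputs. The weights, thresholds, and priority labels are assumptions for illustration, not any standard scheme.

    def incident_priority(affected_users, is_production, service_criticality):
        # service_criticality: 1 = critical business service ... 3 = secondary (assumed scale)
        score = 0
        score += 2 if affected_users > 100 else 1 if affected_users > 10 else 0
        score += 2 if is_production else 0
        score += {1: 3, 2: 1, 3: 0}[service_criticality]
        if score >= 6:
            return "P1"
        if score >= 4:
            return "P2"
        return "P3"

    print(incident_priority(affected_users=500, is_production=True, service_criticality=1))  # P1
    print(incident_priority(affected_users=5, is_production=False, service_criticality=3))   # P3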

 

Automated application mapping and traffic-flow identification capabilities in monitoring systems are essential for mapping system dependencies (and thus business service dependencies) and for avoiding situations like the ones described above. Moreover, tools that allow the organization to break away from the classical IT infrastructure view and incorporate a business services view are the most likely to succeed.

I remember the simpler days, back when our infrastructure all lived in one place, usually just one room, and monitoring it could be as simple as walking in to see if everything looked OK. Today's environments are very different, with infrastructure distributed all over the planet and much of it not even being something you can touch with your hands. So, with the ever-increasing levels of abstraction introduced by virtualization, cloud infrastructure, and overlays, how do you really know that everything you're running is performing the way you need it to? In networking this can be a big challenge, as we often solve technical problems by abstracting the physical path away from the routing and forwarding logic. Sometimes we do this multiple times, with overlays existing within overlays, all running over the same underlay. How do you maintain visibility when your network infrastructure is a bunch of abstractions? It's definitely a difficult challenge, but I have a few tips that should help if you find yourself in this situation.

 

Know Your Underlay - While all the fancy and interesting stuff is happening in the overlay, your underlay acts much like the foundation of a house: if it isn't solid, there is no hope that everything built on top of it will run the way you want. Traditionally this has been done with polling and traps, but the networking world is evolving, and newer systems enable real-time information gathering (streaming telemetry). Collecting both old and new styles of telemetry and looking for anomalies will give you a picture of the performance of the individual components that make up your physical infrastructure. Problems in the underlay affect everything, so this should be the first step you take, and the one you're most likely already familiar with, to ensure your operations run smoothly.
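
As a simple illustration of baselining that telemetry, here is a small Python sketch that flags an interface error-counter sample deviating sharply from its recent history. The sample values are fabricated; in practice they would come from your SNMP polling or streaming telemetry pipeline.

    from statistics import mean, pstdev

    def is_anomalous(history, latest, sigmas=3.0):
        # Flag the latest sample if it sits well above the rolling baseline.
        if len(history) < 10:
            return False  # not enough data for a baseline yet
        baseline, spread = mean(history), pstdev(history)
        return latest > baseline + sigmas * max(spread, 1e-9)

    samples = [2, 1, 3, 2, 2, 1, 2, 3, 2, 2]   # errors per polling interval (example)
    print(is_anomalous(samples, 2))             # False -> within normal variation
    print(is_anomalous(samples, 40))            # True  -> investigate the underlay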

 

Monitor Reality - Polling and traps are good tools, but they don't tell us everything we need to know. Discarded frames and interface errors may give us concrete proof of an issue, but they give no context for how that issue impacts the services running on your network. Additionally, with more and more services moving to IaaS and SaaS, you don't necessarily have access to operational data on third-party devices. Synthetic transactions are the key here. While it may sound obvious to server administrators, it might be a bit foreign to network practitioners: monitor the very things your users are trying to do. Are you supporting a web application? Regularly send an HTTP request to the site and measure the response time to completion. Measure the amount of data returned. Look for web server status codes and anomalies in the transaction. Do the same for database systems, collaboration systems, file servers... you get the idea. This is the proverbial canary in the coal mine, and it lets you know something is up before users end up standing next to your desk. Network problems ultimately manifest as system issues to end users, so you can't ignore this component of your network.
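
Here is a minimal sketch of such a synthetic check using the widely available Python requests library. The URL and the fields recorded are placeholders for whatever service and thresholds actually matter to you.

    import time
    import requests

    def http_check(url, timeout=10.0):
        # Fetch the URL the way a user would; record latency, status, and payload size.
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=timeout)
            return {
                "url": url,
                "ok": resp.status_code == 200,
                "status": resp.status_code,
                "latency_s": round(time.monotonic() - start, 3),
                "bytes": len(resp.content),
            }
        except requests.RequestException as exc:
            return {"url": url, "ok": False, "error": str(exc),
                    "latency_s": round(time.monotonic() - start, 3)}

    print(http_check("https://example.com/"))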

 

Numbers Can Lie - One of the unwritten rules of visibility tooling is to use the IP address, not the DNS name, when setting up pollers and monitors. I mean, we're networkers, right? IP addresses are the real source of truth when it comes to path selection and performance. While there is some wisdom to this, it omits part of the bigger picture and can lead you astray. Administrators may regularly use IP addresses to connect to the systems we run, but that is rarely true for our users, and DNS is often a contributing cause of outages and performance issues. For services that reside far outside our physical premises, the DNS picture can get even more complicated depending on the perspective and path you use to reach them. Keep that in mind: use synthetic transactions to query your significant name entries, but also set up some pollers that use DNS to resolve the address of target hosts, to ensure both name resolution and direct IP traffic show similar performance characteristics.
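
A small sketch of timing name resolution separately from the poll itself, using only the Python standard library. The hostname is a placeholder; comparing this measurement with a poll of the resolved IP shows whether the problem lies in DNS or in the service.

    import socket
    import time

    def resolve_timed(hostname):
        # Time the DNS lookup on its own so a slow or broken resolver gets its own metric.
        start = time.monotonic()
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        return addresses, round(time.monotonic() - start, 3)

    addresses, resolve_seconds = resolve_timed("example.com")
    print(f"resolved to {addresses} in {resolve_seconds}s")
    # Poll both by name and by one of the returned addresses; if the name-based check is
    # slow while the IP-based check is fast, the problem is in DNS, not the service.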

 

Perspective Matters - It's always been true that where you test from is often just as important as what you test. Traditionally our polling systems are centrally located and close to the things they monitor; proverbially, they act as the administrator walking into the room to check on things, except they live there all the time. This makes a lot of sense in a hub-style topology, where many offices come back to a handful of regional hubs for computing resources. But, yet again, cloud infrastructure is changing this in a big way. Many organizations offload Internet traffic at the branch, meaning access to some of your resources may happen over the Internet and some over your WAN. If this is the case, it makes far more sense to monitor from the user's perspective rather than from your data centers. There are some neat tools out there for placing small, inexpensive sensors all over your network, giving you the opportunity to see the network from many different perspectives and a broader view of network performance.
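
To illustrate, here is a tiny sketch that compares the same synthetic measurement taken from several probe sites against the median, so an outlier branch stands out. The site names and latencies are fabricated examples, not real data.

    from statistics import median

    # Latency (seconds) for the same synthetic check, as measured from each probe site.
    latency_by_site = {
        "hq-datacenter": 0.045,
        "branch-east": 0.052,
        "branch-west": 0.310,   # local Internet breakout, but something looks off
    }

    typical = median(latency_by_site.values())
    for site, latency in latency_by_site.items():
        flag = "  <-- investigate" if latency > 3 * typical else ""
        print(f"{site:15s} {latency * 1000:6.1f} ms{flag}")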

 

Final Thoughts

 

While the tactics and targets may change over time, the same rules that have always applied still apply. Our visibility systems are only as good as our foresight into what could possibly go wrong. With virtualization, cloud, and abstraction playing larger roles in our networks, it's more important than ever to have a clear picture of what in our infrastructure we should be looking for. Abstraction reduces the complexity presented to the end user, but in the end it is only hiding complexity that has always existed; typically, abstraction actually increases overall system complexity. And as our networks and systems become ever more complex, it takes more thought and insight into how things really work to make sure we are looking for the right things, in the right places, so we can confidently know the state of the systems we are responsible for.
