
Geek Speak


If you follow Geek Speak, you're probably aware of the annual "Head Geek IT Pro-Dictions." With that in mind, we thought it would be fun to have the Head Geeks revisit last year's predictions. Last year certainly offered countless technology trends and topics to discuss, deliberate, dissect, and debate.


But were the Geeks right or wrong with their predictions? Check out the video below as the Head Geeks discuss whether they deserve cheers for their tech predictions or an unfortunate series of jeers. And if you haven't seen this year's IT Pro-dictions, be sure to check them out at the end of the video to find out what the Geeks believe will be the top ten industry drivers of 2016.



As always, your comments are welcomed, so feel free to chime in and have some fun with the Head Geeks!


kong.yang

patrick.hubbard

adatole

sqlrockstar

 

 

January 28 is Data Privacy Day (DPD). Observed since 2008, DPD brings awareness to the importance of data privacy and protection. According to the Verizon 2015 Data Breach Investigations Report, about 60% of cyber attackers are able to compromise an organization's data within minutes. 2016 is going to be no different from a threat perspective, and data thefts are bound to happen. However, you can minimize the possibility of a cyberattack or data privacy incident by strengthening network security and following some simple security tips.

 

Centralize monitoring and control: Continuously monitor your network and get a centralized view of the hundreds of security incidents happening in real time. This is one of the most basic requirements if your organization must comply with industry-standard regulations like HIPAA, PCI DSS, etc.


Embrace data-driven forensics: Data-driven analysis of a suspicious event will result in better root cause analysis and forensics. A suspicious event can be as trivial as an increase in Web traffic from a known host during specific non-business hours over the last seven days, or repeat connection requests to critical assets (servers, databases, etc.) from an unknown host outside the network. Considering the worst case scenario that an attack has happened, you must be able to trace it back to the source, establish an audit trail, and document the findings and the action taken.
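To make that concrete, here is a minimal sketch of the kind of data-driven check described above: totaling off-hours web traffic per host over the last seven days and flagging anything above a baseline. The record layout, business hours, and threshold are illustrative assumptions, not part of any particular product.

```python
# Flag hosts with unusual off-hours web traffic over the last seven days.
# Record layout, business hours, and threshold are assumptions for the sketch.
from collections import defaultdict
from datetime import datetime, timedelta

BUSINESS_HOURS = range(8, 18)          # assumed 08:00-18:00 local business hours
LOOKBACK = timedelta(days=7)
THRESHOLD_BYTES = 500 * 1024 * 1024    # assumed per-host off-hours ceiling

def flag_off_hours_traffic(records, now=None):
    """records: iterable of dicts like {'ts': datetime, 'host': str, 'bytes': int}."""
    now = now or datetime.utcnow()
    per_host = defaultdict(int)
    for r in records:
        if now - r['ts'] > LOOKBACK:
            continue                              # outside the lookback window
        if r['ts'].hour not in BUSINESS_HOURS:    # off-hours traffic only
            per_host[r['host']] += r['bytes']
    return {h: b for h, b in per_host.items() if b > THRESHOLD_BYTES}
```

Anything this returns is a starting point for the root cause analysis and audit trail described above, not proof of an attack.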


Watch out for malicious software: A term we may see more often in 2016 is ransomware. Sensitive data is the main driver behind these types of malicious software penetrating the network, and a regular user can become an unsuspecting victim of an attack, spreading it to other computers and applications inside the network. Though anti-virus and anti-malware software can be installed to protect systems, you should also put processes in place that alert you to suspicious application and file activities. Remember that subtle file and registry changes are hard to detect without file integrity monitoring tools, and zero-day malware attacks count on that blind spot.
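As a rough illustration of what file integrity monitoring boils down to, the sketch below hashes a set of files and reports anything that no longer matches a stored baseline. The baseline format and file paths are assumptions; a real FIM tool also watches the registry, permissions, and change timing.

```python
# Compare current file hashes against a stored baseline to spot changes.
import hashlib
import json
from pathlib import Path

def sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def check_integrity(paths, baseline_file='baseline.json'):
    # baseline.json is an assumed format: {"path": "expected sha256 hex digest"}
    baseline = json.loads(Path(baseline_file).read_text())
    changed = []
    for p in paths:
        if baseline.get(str(p)) != sha256(p):
            changed.append(str(p))   # new, modified, or untracked file
    return changed
```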


Educate your users/colleagues: Patient records and credit card information are clearly critical data. However, other data, such as social security numbers, ATM passcodes, and bank account names stored on an unprotected desktop or document, also creates a prime opportunity for private data leaks. Periodic mailers and knowledge sharing among peers and with users can go a long way toward improving your organization's security.

 

You can learn more about Data Privacy Day here.

 

Do you think it’s time to stop, think, and formulate an effective data privacy policy for your organization? What plans do you have to improve data privacy in your organization in 2016? What roadblocks do you foresee that will stop or slow you down from implementing some right away? Write in and let me know.

I read an interesting thread the other day about a network engineer who tried to use an automated tool to schedule a rolling switch upgrade. Only after the job completed and the switches restarted did he realize he had pushed the wrong image for the devices, and they weren't coming back up. About fifty switches were affected, which resulted in a major outage for his organization.

What struck me about the discussion thread was, first, that he wondered why the tool didn't stop him from making the mistake. The second was that the commenters responded that it wasn't the tool's job to sanity check his inputs. The end result was likely a serious disciplinary conversation for the original engineer.

Tools are critical in network infrastructure today. Things have become far too complicated for us to manage on our own. SolarWinds makes a large number of tools to help us keep our sanity. But is it the fault of the tool when it is used incorrectly?

Tools are only as good as their users. If you smash your fingers over and over again with a hammer, does that mean the hammer is defective? Or is the problem that you're holding it wrong? Tools do their jobs whether you're using them correctly or incorrectly.

2016 is the year when we should all stop blaming the tools for our problems. We need to reassess our policies and procedures and find out how we can use tools effectively. We need to stop pushing all of the strange coincidences and problems off onto software that only does what it's told to do. Software tools should be treated like the tools we use to build a house. They are only useful if used properly.

Best practices should include sanity checking things before letting the tool do the job. A test run of an upgrade on one device before proceeding to do twenty more. A last-minute write-up of the procedure before implementing it. Checking the inputs for a monitoring tool before swamping the system with data. Tapping the nail a couple of times with the hammer to make sure you can pull your fingers away before you start striking.
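Here is a hedged sketch of what such a pre-flight check might look like for the rolling-upgrade scenario above: confirm that the image staged on each switch actually matches its hardware model before the job is allowed to run. The model-to-image table and device record layout are made up for illustration.

```python
# Verify staged firmware images against expected images per hardware model
# before any bulk upgrade job runs. Table contents are illustrative only.
EXPECTED_IMAGE = {
    'WS-C3850': 'cat3k_caa-universalk9.16.06.08.SPA.bin',
    'WS-C2960X': 'c2960x-universalk9-mz.152-7.E4.bin',
}

def preflight(devices):
    """devices: list of dicts like {'name': str, 'model': str, 'staged_image': str}."""
    failures = []
    for d in devices:
        expected = EXPECTED_IMAGE.get(d['model'])
        if expected is None or d['staged_image'] != expected:
            failures.append(f"{d['name']}: staged {d['staged_image']!r}, "
                            f"expected {expected!r} for {d['model']}")
    return failures

# If preflight() returns anything, abort the job; otherwise upgrade one canary
# device and verify it comes back before touching the other forty-nine.
```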

It's a new year full of new resolutions. Let's make sure that we stop blaming our tools and use them correctly. What do you think? Will you resolve to stop blaming the tools? Leave me a comment with some of your biggest tool mishaps!

Will this be the year Infrastructure-As-Code (Infra-As-Code or IAC) becomes more mainstream? Or is this just wishful thinking that will never catch on? Obviously, this is not a new idea, but how many companies have adopted the methodology? Better yet, how many companies have even begun discussing adopting it? Do you currently write scripts, save them somewhere, and think, "Hey, I/we are doing Infra-As-Code already"? If so, not quite. "But why?" you might be thinking. Infra-As-Code is much larger and more dynamic than writing scripts in a traditional, static way.

That said, if you do currently use scripts for infrastructure-related tasks and configurations, you are much better off than those who have not begun this journey at all. An automated, programmatic approach to configuring your infrastructure is far more predictable and consistent than a manual, error-prone one. The components involved can be of numerous types: servers, network routers and switches, storage, and much more. For this series of posts we will focus on the network components and how we can begin the journey toward Infra-As-Code.

 

Below are some of the topics that we will be covering over the series of posts.

 

 

So what does Infra-As-Code really mean? Let's address that here and lay a good foundation for the rest of the series.

 

If you begin to treat your infrastructure as code, you can develop processes that let you deliver a fully tested, repeatable, consistent, and deliverable methodology for configuration state in your environment. In doing this you can start looking at these items as a pipeline of continuous delivery, and allow automation tooling to consistently bring your infrastructure to the desired state you have defined. As you begin this journey you will start with a baseline and then a desired state. Adopting automation tooling suited to your environment, along with version control, code review, and peer review, will allow for much more stable and speedy deployments, as well as easier roll-backs in the off chance that something does not go as planned. The chance of having to roll back should be minimal, assuming proper testing of configurations has been followed throughout your delivery pipeline.
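As a toy illustration of the desired-state idea (not any particular automation product), the sketch below diffs a declared desired state against a device's running configuration and emits the changes a tool would need to apply. The setting names are invented for the example.

```python
# Diff a declared desired state against running configuration and produce a
# change plan; a real IaC tool would then apply and verify these changes.
DESIRED = {'ntp_server': '10.0.0.1', 'snmp_community': 'ops-ro', 'vlan_10': 'users'}

def plan_changes(running, desired=DESIRED):
    plan = []
    for key, want in desired.items():
        have = running.get(key)
        if have != want:
            plan.append((key, have, want))   # (setting, current value, target value)
    return plan

# Example:
# plan_changes({'ntp_server': '10.9.9.9', 'vlan_10': 'users'})
# -> [('ntp_server', '10.9.9.9', '10.0.0.1'), ('snmp_community', None, 'ops-ro')]
```

The point is that the desired state lives in version control and can be reviewed, tested, and rolled back, while the tooling does the comparing and applying.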

 

I know this all sounds great (on paper) and feels a little unrealistic in many respects, but in the next post we will begin to discover how we can get started. Hopefully, by the end of this series you will have a much better understanding and a realistic view of how you too can begin the Infra-As-Code journey!

Remember, back in the day, when you'd go to a website and it was down? Yes, down. We've come a long way in a short time. Today it's not just downtime that is unacceptable; users get frustrated if they have to wait more than three seconds for a website to load.

 

In today's computing environments, slow is the new down. A slow application in a civilian agency means lost productivity, but a slow military application in theater can mean the difference between life and death. Due to a constantly increasing reliance on mission-critical applications, the government must now meet, and in most cases surpass, the high performance standards set by commercial industry, and the stakes continue to get higher.

 

Most IT teams focus on the hardware, after blaming and ruling out the network, of course. If an application is slow, the first thought is to add hardware to combat the problem: more memory, faster processors, an upgrade to SSD storage, etc. Agencies have spent millions of dollars throwing hardware at application performance issues without a good understanding of the bottleneck actually slowing down the application.

However, according to a recent survey on application performance management by research firm Gleanster, LLC, the database is the number one source of application performance issues; in fact, 88 percent of respondents cited the database as the most common challenge or issue with application performance.

 

Trying to identify database performance issues poses several unique challenges:

  • Databases are complex. Most people think of a database as a mysterious black box of secret information and are wary of digging too deep.
  • There are a limited number of tools that assess database performance. Tools normally assess the health of a database (is it working, or is it broken?) and don’t identify and help remediate specific database performance issues.
  • Database monitoring tools that do provide more information don’t go that much deeper. Most tools send information in and collect information from the database, with little to no insight about what happens inside the database that can impact performance.

To successfully assess database performance and uncover the root cause of application performance issues, IT pros must look at database performance from an end-to-end perspective.

 

In a best-practices scenario, the application performance team should be performing wait-time analysis as part of their regular application and database maintenance. A thorough wait-time analysis looks at every level of the database—from individual SQL statements to overall database capacity—and breaks down each step to the millisecond.
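For a sense of what the raw input to wait-time analysis can look like, here is a rough sketch that snapshots the top wait types, assuming the database is SQL Server and that sys.dm_os_wait_stats is accessible; other engines expose equivalent views. The connection string is a placeholder.

```python
# Snapshot the top SQL Server wait types as a starting point for wait-time
# analysis. Assumes SQL Server and the pyodbc driver; connection string is a
# placeholder you would replace.
import pyodbc

QUERY = """
SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
"""

def top_waits(conn_str):
    with pyodbc.connect(conn_str) as conn:
        rows = conn.cursor().execute(QUERY).fetchall()
    # Each row shows where the engine spent time waiting (disk, locks, memory...).
    return [(r.wait_type, r.waiting_tasks_count, r.wait_time_ms) for r in rows]
```

Comparing snapshots taken before and after a slow period is what turns these counters into the "where did the time go" answer described above.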

 

The next step is to look at the results, then correlate the information and compare. Maybe the database spends the most time writing to disk; maybe it spends more time reading memory.

 

Ideally, all federal IT shops should implement regular wait-time analysis as a baseline of optimized performance. Knowing how to optimize performance—and understanding that it may have nothing to do with hardware—is a great first step toward staying ahead of the growing need for instantaneous access to information.


Read an extended version of this article on GCN

With its ongoing effort toward a Joint Information Environment, the Defense Department is experiencing something that's extremely familiar to the enterprise world: a merger. The ambitious effort to consolidate communications, services, computing and enterprise services into a single platform is very similar to businesses coming together and integrating disparate divisions into a cohesive whole. Unlike a business merger, however, JIE will have a major impact on the way DOD IT is run, ultimately providing better flow of and access to information that can be leveraged throughout all aspects of the department.

 

When JIE is complete, DOD will have a single network that will be much more efficient, secure and easier to maintain. IT administrators will have a holistic view of everything that's happening on the network, allowing them to pinpoint not only how an issue is detrimental to one portion of the network, but also how it impacts other areas.

 

The JIE’s standard security architecture also means that IT managers will be able to more easily monitor and corner potential security threats and respond to them more rapidly. The ability to do so is becoming increasingly important, as is evidenced by our recent survey, which illustrated the rise of cybersecurity threats.

 

As DOD kicks the JIE process into high gear, they are establishing Joint Regional Security Stacks (JRSS) which are intended to increase security and improve effectiveness and efficiency of the network. However, the network will still be handling data from all DOD agencies and catering to thousands of users, making manual network monitoring and management of JRSS unfeasible. As such, IT pros will want to implement Network Operations (NetOps) processes and solutions that help support the efforts toward greater efficiency and security.

 

The process should begin with an assessment of the current NetOps environment. IT pros must take inventory of the monitoring and management NetOps tools that are currently in use and determine if they are the correct solutions to help with deploying and managing the JIE.

 

Network managers should then explore the development of a continuous monitoring strategy, which can directly address DOD’s goals regarding efficiency and security.

 

Three key requirements to take into account in planning for continuous monitoring in JIE are:

 

  • Optimization for dual use. Continuous network monitoring tools, or NetOps tools, can deliver different views of the same IT data while providing insight and visibility into the health and performance of the network. When continuous monitoring is implemented with "dual use" tools, they can serve two audiences simultaneously. 
  • Understanding who changed what. With the implementation of JIE, DOD IT pros will be responsible for an ever-expanding number of devices connected to the network. Configuration and change monitoring tools track who changed what and when, and enable bulk change deployment to thousands of devices.
  • Tracking the who, what, when and where of security events. Security information and event management (SIEM) tools are another particularly effective component of continuous monitoring, and their emphasis on security could make them an integral part of monitoring JRSSs. SIEM capabilities enable IT pros to gain valuable insight into who is logging onto DOD's network and the devices they might be using, as well as who is trying to log in but being denied access (a bare-bones version of that last check is sketched below).
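As a trivial example of that last point, the sketch below counts failed logon messages per source address from a syslog export. The regular expression matches common sshd "Failed password" lines and is only illustrative; a real SIEM correlates many formats and event sources.

```python
# Count failed logon attempts per source address from a syslog export.
import re
from collections import Counter

FAILED = re.compile(r'Failed password for (?:invalid user )?(\S+) from (\S+)')

def failed_logons(lines, threshold=5):
    by_source = Counter()
    for line in lines:
        m = FAILED.search(line)
        if m:
            by_source[m.group(2)] += 1       # key on source IP address
    return {ip: n for ip, n in by_source.items() if n >= threshold}

# Usage: failed_logons(open('/var/log/auth.log'))  -> {'203.0.113.9': 42, ...}
```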

 

Like any merger, there are going to be stumbling blocks along the way to the JIE’s completion, but the end result will benefit many – including overworked IT pros desperate for greater efficiency. Because while there’s no doubt the JIE is a massive undertaking, managing the network that it creates does not have to be.

 

To read an extended version of this article, visit Defense Systems

Well, here we are again, at the start of a New Year. This is the time for everyone to list out his or her goals and resolutions, and be reminded of how miserably they failed at such things over the previous year.


Here are my resolutions for 2016. You're welcome.


Lose wait

We all hate to wait.


We hate waiting for traffic, for the next episode of Sherlock, or for Cleveland to have a winning football team.


You know what else we hate waiting for? Data. We hate waiting for a database to return results from simple queries and reports. We hate not knowing why a report has not finished. And we hate waiting for the database administrator to fix things.


So this year I’m going to take the time to understand more about what my queries are waiting for. I’m going to take the time to learn about wait events, resource bottlenecks, and the options available to help tune the queries as needed.


Sort out my inbox

Application systems become more complex with each passing day, and they require additional monitoring along the way. The end result is an inbox full of alerts.


I’m tired of the clutter, and so are you.


So in 2016, I will find a way to make sure that I am only being alerted for things that require action. I will start by digging into the alert system to see if I can find out why the alert in question was generated. Then I can start being proactive in my work so that the amount of time needed to react to alerts is minimized.


I know that one hour of time spent weekly in a proactive way can save me up to three hours of time I now spend weekly in a reactive mode. That's a lot of extra time I can better spend looking at pictures of cats and arguing with strangers on the Internet.


Be nicer to my coworkers

Just putting the word DevOps and emojis into emails isn’t enough. In 2016, I’m going to find a way to use monitoring tools to help facilitate communications. I will start to communicate ideas using reports based on the data our tools are collecting.


And by doing so I will help reduce the number of “blamestorming” meetings that happen frequently right now.


Now, these are my resolutions for 2016; feel free to use them for yourself if you want. But I'd encourage you to think about your own resolutions for the upcoming year. Think about the ways you can make things better for yourself and for others, and then put those resolutions to work. How? By entering them in the 2016 IT Resolutions contest.


Meanwhile, I want to hear your thoughts about my list, your list, or your IT plans for 2016 in general, in the comments below.


Here in these early days of January, it feels the same way weekends did on Saturday mornings when I was 8 years old—a giant bowl of Sugar Frosted Choco-Bombs in my lap, cartoons on TV, and hour after hour of joyful opportunity spread out in front of me.

 

However, in the years since I was 8, I have learned a few things:

  • Don't turn up the volume on the TV before it turns on or it wakes up your parents.
  • 2 bowls of cereal is awesome; 4 is too many.
  • Carry the milk with both hands even when you are sure you can do it with one.
  • Make plans or all those hours disappear before you know it. Then it's Monday again and you are explaining to Mrs. Tabatchnik why the answer for all your math homework problems is 12.

 

In the spirit of that last point, making plans, now is the perfect time to set some goals. One might even call them "resolutions" for things that should be on your 2016 bucket list. Here are 4 suggestions of things that should be on yours.

 

Turn off the noise
My first 2016 resolution suggestion is pure #MonitoringGlory. Nobody wants to get an automated ticket, email, or text for something that isn't actually a problem, whether it comes in the middle of the day or at 2 am. Resolve to spend some quality time with your alert triggers and their results. Does the trigger logic identify a real, measurable, actionable problem, or is it an "FYI alert" that merely pesters an actual human to go check and see if something is ACTUALLY wrong? Now dig into the results over the last year. Did this alert generate storms of alerts? Almost none at all? What did people do when the alert came in?

 

All of these questions will help you create a better, more meaningful alert. This leads to the recipient of the alert believing it more, which leads to better responsiveness.
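If your alerting product can export its alert history, even a small script can help answer those questions. The sketch below assumes a CSV export with hypothetical trigger and actioned columns and simply ranks triggers by how often they fired versus how often anyone acted on them.

```python
# Rank alert triggers by how often they fired and how often they were acted on.
# Assumed CSV columns: trigger, actioned (yes/no) -- adjust to your export.
import csv
from collections import defaultdict

def alert_review(csv_path):
    stats = defaultdict(lambda: {'fired': 0, 'actioned': 0})
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            s = stats[row['trigger']]
            s['fired'] += 1
            if row.get('actioned', '').lower() == 'yes':
                s['actioned'] += 1
    # Triggers that fire constantly but are almost never acted on are the
    # "FYI alerts" worth reworking or retiring.
    return sorted(stats.items(), key=lambda kv: kv[1]['fired'], reverse=True)
```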

 

Enable IPv6

In the 20 years since the protocol was released, IPv6 has netted only 10% adoption in the workplace. With the oncoming storm of SDN, IoT, and BYOE—not to mention the general growth of networks and network complexity—there are alarming security risks in NOT understanding what is and isn't IPv6-enabled in your environment (and what it's doing). Finally, with the not-so-modest gains to be made with IPv6 in the area of clustered servers, domain controllers, multicast, and more, this is the time to get in front of the curve and start planning for, and even implementing, IPv6.
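A simple first step is finding out which of your own interfaces already carry IPv6 addresses. The sketch below uses the psutil library (an assumption; it must be installed) to list them; link-local fe80:: addresses turn up even in environments where IPv6 was supposedly never deployed.

```python
# List IPv6 addresses per local interface to see what is already IPv6-enabled.
import socket
import psutil

def ipv6_addresses():
    result = {}
    for nic, addrs in psutil.net_if_addrs().items():
        v6 = [a.address for a in addrs if a.family == socket.AF_INET6]
        if v6:
            result[nic] = v6
    return result

# Usage: print(ipv6_addresses())  e.g. {'eth0': ['fe80::1c2d:...%eth0']}
```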

 

Commit to learning and testing now so you aren't under the gun when it's really crunch time.

 

Commit to security

In the same vein, your IT resolutions should include at least one security-related commitment. Maybe you make friends with the audit team for once. Maybe you scan your network device configurations and see if they meet SOX or DISA STIG standards. Maybe you use NetFlow or Deep Packet Inspection to identify the types of traffic on your network (as well as the source and destination of that traffic).

 

Heck, even just choosing and using a password manager for your own personal accounts would be a great start. If for no other reason than it would get you thinking about all the OTHER users in your organization and how they are managing their passwords. Which, as we saw throughout 2015, was the first line of defense to fail in every major breach.

 

Whatever it is, don't let security be someone else's responsibility this year.

 

Know the value of your monitoring

Coming back around to monitoring for my last point, commit to taking the time to understand what monitoring provides you. What I mean by that is, every time a specific alert triggers, what have you saved in terms of minutes of outage, staff time, and/or predictive vs reactive repair costs?

 

Calculating this may be time consuming, but it's not complex, as I've described in the past (https://thwack.solarwinds.com/community/solarwinds-community/geek-speak_tht/blog/2015/01/09/the-cost-of-not-monitoring).
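In back-of-the-envelope form, the calculation looks something like the sketch below. Every number in the example is an assumption you would replace with your own figures; the point is only the shape of the math.

```python
# Estimate the yearly value of a single alert: what it saves each time it fires
# minus what it costs to maintain. All inputs are placeholders.
def alert_value(outage_minutes_avoided, cost_per_outage_minute,
                staff_hours_saved, staff_hourly_rate,
                maintenance_hours, times_fired_per_year):
    savings = times_fired_per_year * (
        outage_minutes_avoided * cost_per_outage_minute
        + staff_hours_saved * staff_hourly_rate
    )
    upkeep = maintenance_hours * staff_hourly_rate
    return savings - upkeep   # negative means the alert costs more than it saves

# Example with made-up numbers:
# alert_value(30, 100, 1.5, 60, 8, 12) -> 12 * (3000 + 90) - 480 = 36600
```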

 

Why should this be on your 2016 resolutions? Because it helps you identify which tools, monitors, and alerts cost your company (in time to create, maintain, and respond to) more than they are worth, and which have a high return on investment. Not only that, but doing this for existing monitors helps you evaluate which of your upcoming requests is most worth digging into.

 

Finally, having these numbers handy gives you the ammunition you need to face the bosses and bean counters when you request additional licenses—they need a justification.

 

Because I don't know about where you work, but it feels like my whole management and purchasing team is related to old Mrs. Tabatchnik.


Those are MY recommendations for what you should have on your 2016 IT resolutions, but your list probably looks a lot different. You really should put those resolutions to work. How? By entering them in the 2016 IT Resolutions contest (https://thwack.solarwinds.com/community/solarwinds-community/contests-missions/it-resolutions/overview?CMP=THW-GS-SWI-2016_IT_Resolutions_LA-X-THW). Meanwhile, I want to hear your thoughts about my list, your list, or your IT plans for 2016 in general, in the comments below.

Network monitoring has historically relied on SNMP as a primary means of gathering granular statistics. SNMP works on a pull model: the network monitoring station reaches out and pulls values from OIDs, and then reacts to that data. There are also monitoring options where a network device pushes statistics to data collectors such as network management stations, flow collectors, or syslog engines.
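To make the pull model concrete, here is a minimal poll of sysUpTime.0 using the pysnmp library, with an SNMPv2c community string and a placeholder target address: the monitoring station asks, and the device answers.

```python
# Minimal SNMP GET of sysUpTime.0 using pysnmp (SNMPv2c); host and community
# string are placeholders.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

def poll_sysuptime(host, community='public'):
    error_indication, error_status, _, var_binds = next(
        getCmd(SnmpEngine(),
               CommunityData(community, mpModel=1),               # SNMPv2c
               UdpTransportTarget((host, 161)),
               ContextData(),
               ObjectType(ObjectIdentity('1.3.6.1.2.1.1.3.0'))))  # sysUpTime.0
    if error_indication or error_status:
        raise RuntimeError(error_indication or error_status.prettyPrint())
    return [(str(name), str(value)) for name, value in var_binds]

# Usage: poll_sysuptime('192.0.2.1')
```

Telemetry, as discussed below, inverts this flow: the device streams the data without being asked.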


In researching Ethernet switches, I've run across the term telemetry being used to describe the datasets coming from these devices. Vendors are positioning telemetry as if it were some new feature you need to be on the lookout for.


So, is telemetry something new? In digging through vendor literature, watching presentations, and talking to one of my vendor contacts specifically, I’ve concluded network telemetry represents both old and new forms of network statistics, and new ways of gathering and exposing data.


First, in presentations, it's clear that some networking vendors use the term “telemetry” generically. As they work through their demo, they display, for example, sFlow and syslog data. Those are not new data formats to network engineers. We know flow data in its various formats, including sFlow. We also know syslog well. And we also know that those formats typically contain information pushed to our data collectors in real-time or near real-time.


However, I do think that for some vendors, telemetry is more than a fancy way to describe the same old data. For instance, Juniper Networks shared several facts with me about their Junos Telemetry Interface that are a bit different than what network engineers might be used to. Here are the more relevant points:


  • Junos telemetry is streamed in a push model, like syslog or flow data.
  • Juniper uses Google's Protobuf message format to stream the data. Protobuf is interesting. The big idea according to Google is to define your message format and fields, and then compile code optimized to read that data stream. This means that Juniper doesn't have to shoehorn telemetry into a format that might be ill-suited to the data. They can build their structured message format and optimize it however they like, and extend as they go.
  • Juniper is not exposing every conceivable value via their telemetry interface (which proprietary SNMP vendor MIBs tend to do). Rather, they’ve focused on performance management data: I/O & error counters, queue statistics, and so on.
  • The Junos telemetry interface is open to anyone that wants to parse the data. Therefore, any vendor that wishes to create a custom application for end users could work with Juniper, get the data format details, and go to town.


Other vendors that come up when talking about telemetry include Cisco with their ACI data center fabric, and Arista with the telemetry interface in their EOS operating system. While I don't have specific details on how Cisco and Arista telemetry interfaces might differ from the Junos telemetry interface, they all seem to emphasize the near real-time pushing of descriptive network data to a collector that can aggregate the data and present it to a network operator.


So whether the term telemetry is being used generically to mean "data from the network" or specifically to mean "pushing specific network metrics to a data collector," I believe it's a term we're going to see used more and more.


While the data gathered via telemetry might be familiar, I believe the method used to gather the data, as well as what's being done with that data, is where the magic lies.


This raises another question: could network telemetry be the end of SNMP? While my crystal ball remains murky, I believe SNMP has a long run still ahead of it. To supplant the familiar and ubiquitous SNMP, vendors will need to get their heads together on exactly what this new telemetry format should be.


From what I can tell looking at just three vendors — Cisco, Juniper, and Arista — network telemetry is implemented differently for each of them. Differences slow technology adoption, as the variant solutions place monitoring vendors in the unenviable position of having to pick and choose which telemetry solutions to align themselves with.


Whatever SNMP's shortcomings might be, all you have to do is sort out the OIDs. The industry has already agreed upon the rest.


As 2015 ends, businesses are busy closing deals, evaluating project success, and planning for the New Year. For IT professionals, this transitional period is crucial in building a foundation for success in the upcoming year. This year, I resolve to keep IT stupid simple. It’s guaranteed to KISS away all IT issues.

 

A Walk in the Clouds, containers and loosely coupled services

It’s easy to get lost in the myriad of new technologies and the associated vendor FUD (fear, uncertainty, and doubt) that fills the IT landscape. It certainly doesn’t make an IT professional’s job any easier when it’s hard to discern between what’s fact or fiction. Especially when one can solve a problem in so many different ways.

Ultimately, what’s the most efficient and effective method to troubleshoot and remediate problems?


Keep IT Stupid Simple

So let's start with the obvious: keeping IT stupid simple. This means that if you don't understand the ins and outs of your solution stack, then it shouldn't be your solution. When an application or system slows down, breaks, or fails (and it will), your job is on the line to quickly root cause and resolve the issue. Keeping IT stupid simple makes troubleshooting and remediation much easier.


The USE Method

For performance bottlenecks, a great and simple framework to follow is the USE Method by Netflix’s Brendan Gregg.

USE stands for utilization, saturation, and errors. Each aspect is defined below:

  1. Utilization – the average time that a resource was busy working
  2. Saturation – the degree to which a resource has more work than it can service, often resulting in queuing
  3. Errors – count of error events

Think of it as a checklist to keep things simple when troubleshooting and remediating issues. For every key resource, check the utilization, saturation, and errors. These aspects are all interconnected and provide different clues for identifying bottlenecks.
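In code form, the checklist for a single resource can be as small as the sketch below; the thresholds are assumptions you would tune per resource.

```python
# Apply the USE checklist to one resource given recent samples.
# Thresholds are assumptions; tune them per resource type.
def use_check(resource, utilization_pct, queue_depth, error_count,
              util_limit=80.0, queue_limit=1, error_limit=0):
    findings = []
    if utilization_pct >= util_limit:
        findings.append(f"{resource}: high utilization ({utilization_pct:.0f}%)")
    if queue_depth > queue_limit:
        findings.append(f"{resource}: saturated (queue depth {queue_depth})")
    if error_count > error_limit:
        findings.append(f"{resource}: {error_count} errors logged")
    return findings or [f"{resource}: OK"]

# Usage: use_check('host CPU', 98, 4, 0)
# -> ['host CPU: high utilization (98%)', 'host CPU: saturated (queue depth 4)']
```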


The complete picture

Utilization covers what’s happening over time, but depending on sampling rate and incident intervals, it may not provide the complete picture. Saturation provides insights on overloaded conditions, but may not show up by just viewing utilization metrics for the reasons mentioned above. And errors give clues on operations that have failed and lead to retries. Combining all three can provide a clear view of what is happening during a performance bottleneck condition.

A proper monitoring tool will collect, aggregate, and visualize utilization while alerting on saturation conditions and logging and correlating errors.


A virtualization example of USE Method

Let’s walk through a simple graphical virtualization example utilizing the USE method focused on one resource metric.


  • Utilization – Let's examine the host CPU utilization of host bas-esx-02.lab.tex, which shows 98% utilization.

Figure 1: Host CPU utilization

  • Saturation – Next, let's dig in to verify whether a triggered alert indicates a potential saturation condition on the host CPU utilization resource.

Figure 2: Saturation alert on host CPU

  • Errors – Finally, let's see whether this saturation event had any bearing on the host's availability. It appears there was a window, 2 days and 8 hours prior to these screen captures, in which the host server had some availability issues.

Figure 3: Host availability errors

The USE method, paired with a proper virtualization tool, epitomizes the keep-IT-stupid-simple principle as a means to troubleshoot any potential bottlenecks.

 

What is your resolution?

So in the upcoming year, what do you resolve to do to complete your data center picture, whether it resides on your premises or in the cloud? And what tools will you use, or do you need, to get it done right? Please chime in below in the comments section.


And join the 2016 IT Resolutions Contest!


Hello Geek Speakers,


As 2015 comes to an end and 2016 begins, SolarWinds once again tapped its band of experts - the Head Geeks - to take a look inside their crystal balls and provide a glimpse into IT trends to watch for in the coming year. Will their predictions come to fruition in 2016? Let us know what you think, and don't forget to revisit last year's Pro-dictions to see where the Head Geeks were right or wrong.


Be sure to @mention each Geek to continue the conversation with them about their predictions. Here are each of their thwack handles:


Kong Yang - kong.yang


Patrick Hubbard - patrick.hubbard


Leon Adato - adatole


Thomas LaRock - sqlrockstar


As we did last year, we will continue to revisit these predictions and see if they are becoming a reality, or if they were entirely incorrect. The latter, of course, will likely NOT happen!






We hope you all have a Happy New Year - Enjoy!


 

After taking a look at what it means to monitor the stability of Microsoft Exchange, and choosing a product option that won't keep your staff busy for months configuring it, we will now look at what it means to monitor Exchange Online in the Office 365 platform. Yes, you read that correctly: Exchange Online. Isn't Microsoft monitoring Exchange Online for me? Well, yes, there is some level of monitoring, but as customers we typically do not get frontline insight into the aspects of the product that are not working until something breaks. So, let's dive into this a little bit further.


Exchange Online Monitoring


If your organization has chosen Exchange Online, your product SLAs will generally be around 99.9x%. The uptime varies from month to month, but historically Microsoft is right on track with its marketed SLA or slightly exceeds it. As a customer of this product, your organization is still utilizing the core Exchange features such as a Database Availability Group (DAG) for your databases, Outlook Web App, Azure Active Directory, Hub Transport servers, CAS servers, etc.; the only difference is that Microsoft maintains all of this for your company. Assuming that Office 365/Exchange Online meets the needs of your organization, this is great, but what happens when something is down? 99.9x% is good, but it's not 99.999%, so there are guaranteed to be some occurrences of downtime.


Do I really need monitoring?


Not convinced monitoring is required? If your organization has chosen to move to the Exchange Online platform, being able to understand what is and isn't working in the environment can be very valuable. As an Exchange administrator within the Exchange Online platform, if something isn't working I can guarantee that leadership is still going to look to me to understand why, even if the product is not maintained onsite. Having a quick and simple way to see that everything is functioning properly (or not) through a monitoring tool allows you to quickly give your leaders the feedback they need to communicate to the business what is happening, even if the only thing I can do next is contact Microsoft to report the issue.


Corrective Measures


Choose a monitoring tool for Exchange Online that will provide valuable insights into your email functionality. My guidance here is relatively similar to the suggestions I would make for Exchange on-premises.


  • Evaluate several tools that offer Exchange Online monitoring, and then decide which one will best suit your organization's requirements.
  • Implementation of monitoring should be a project with a dedicated resource.
  • The tool should be simple and not time consuming to configure. 
  • Choose a tool that monitors Azure connectivity too. Exchange Online depends heavily on Azure Active Directory and DNS, so being aware of the health of your organization's connectivity to the cloud is important.
  • Make sure you can easily monitor your primary email functionality. This can include email flow testing (a bare-bones probe is sketched after this list), Outlook Web App, directory synchronization, ActiveSync, and more.
  • Ensure that the tool selected has robust reporting. This will save time compared with scripting your own reports, and allow for better historical trending of email information. These reports should include things such as mail flow SLAs, large mailboxes, abandoned mailboxes, top mail senders, public folder data, distribution lists, and more.
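For the mail-flow testing item above, a do-it-yourself probe can be as simple as the sketch below: send a uniquely tagged message through Exchange Online and confirm it lands in a test mailbox. The hostnames, ports, and basic-auth login are placeholders and only an assumption about your tenant's settings; a commercial tool wraps this kind of check in alerting and SLA reporting.

```python
# Synthetic mail-flow probe: send a tagged message and wait for it to arrive.
# Hostnames and credentials are placeholders for illustration only.
import imaplib
import smtplib
import time
import uuid
from email.message import EmailMessage

def mail_flow_probe(user, password, timeout=300):
    tag = f"mailflow-probe-{uuid.uuid4()}"
    msg = EmailMessage()
    msg['From'] = msg['To'] = user
    msg['Subject'] = tag
    msg.set_content('Synthetic mail-flow test.')

    with smtplib.SMTP('smtp.office365.com', 587) as smtp:   # placeholder host
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)

    deadline = time.time() + timeout
    while time.time() < deadline:
        with imaplib.IMAP4_SSL('outlook.office365.com') as imap:  # placeholder host
            imap.login(user, password)
            imap.select('INBOX')
            _, data = imap.search(None, 'SUBJECT', f'"{tag}"')
            if data[0]:
                return True          # message round-tripped successfully
        time.sleep(30)
    return False                      # did not arrive in time; raise an alert
```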


These considerations will help you determine which product solution is best for your organizational needs.


Concluding Thoughts


Monitoring the health of your organization's connectivity to the cloud is valuable for providing insight into your email system. There are options that will give you and your organizational leadership instant insight, ensuring an understanding of overall system health, performance, and uptime.

Now that I'm finally recovered from Microsoft Convergence in Barcelona, I've had a chance to compare my expectations going into the event with the actual experience of attending. And as always for SolarWinds staff, especially a Head Geek, that experience is all about speaking with customers. What was different about Convergence, aside from Barcelona always being its wonderful self, is that the mix of conversations tended more toward IT managers and less toward hands-on admins. And lately I'm finding more and more managers who actually understand the importance of taking a disciplined approach to monitoring.

 


MS Convergence is focused on MS Dynamics, specifically Dynamics AX CRM. Generally, when you look at the front page for a software product and the first button is "Find a partner" rather than "Live Demo" or "Download," it lets you know it's a complex platform. But in the case of Dynamics AX, there's a reason: a surprising number of integrations for the platform. There's no way Microsoft can be expert in SAP, TPF, Informix, Salesforce and 496 other large platforms, so they rely on partners. For us in the SolarWinds booth, of course, it meant we had lots of familiar conversations about complex infrastructure and the challenges of managing everything from networking to apps to storage and more.

 

One thing was clear, at least with the European audience: cloud for Dynamics customers seems to be driving more discipline, not less. For admins, cloud too often means shadow IT: even more junk that we need to monitor, but with less control. With Dynamics the challenge isn't remaining calm when your NetFlow reports show dozens of standalone Azure or AWS instances; it's the reverse.

 

With a single Azure service endpoint for the platform, firewall and router traffic analysis uncovers dozens of niche domain publishers across the organization, each pushing critical business data to Dynamics in the cloud. Some integrations are well behaved, run by ops and monitored, but others are hiding under the desk of a particularly clever analyst. These developers have credentials to extend the CRM data picture, but no budget to assure operations. Managers I spoke with were of course nervous about that, plus regulatory compliance and evolving EU data privacy laws: not a trivial CRM endeavor. (It makes me actually grateful for U.S. PCI, HIPAA, SOX, and GLBA, which, though headaches, are reasonably stable.)

 

It was interesting to look at the expo hall, filled with dozens of booths of boutique partners, each specializing in a particular CRM integration nuance, and to remember the fundamental concerns of everyone in IT. We're on the hook for availability, security, cost management, and end-user quality of experience, and we share the same challenges. Establishing and maintaining broad insight into all elements of production infrastructure is not a first step, and it certainly mustn't be the last step. Monitoring is a critical service admins provide to management and teammates that makes everything else work. In hybrid IT, with on-premises, cloud, SaaS, and everything else in between, there are more dependencies, not fewer, more to configure, and more to break. It feels good to be needed just as much by the big systems owners and IT managers as by the admins working helpdesk tickets.

 

Of course it was also nice to see that Microsoft finally streamlined the doc for Surface Pro 4.

I've had the opportunity over the past couple of years to work with a large customer of mine on a refresh of their entire infrastructure. Network management tools were one of the last pieces to be addressed, as emphasis had been on legacy hardware first and the direction for management tools had not been established. This mini-series will highlight this company's journey: the problems solved, the insights gained, and the unresolved issues that still need addressing in the future. Hopefully this helps other companies or individuals going through the process. Topics will include discovery around the types of tools, how they are being used, who uses them and for what purpose, their fit within the organization, and lastly what more they leave to be desired.


Blog Series

One Company's Journey Out of Darkness, Part I: What Tools Do We Have?

One Company's Journey Out of Darkness, Part II: What Tools Should We Have?

One Company's Journey Out of Darkness, Part III: Justification of the Tools

One Company's Journey Out of Darkness, Part IV: Who Should Use the Tools?

One Company's Journey Out of Darkness, Part V: Seeing the Light

One Company's Journey Out of Darkness, Part VI: Looking Forward



If you've followed the series this far, you've seen a progression through a series of tools being rolled out. My hope is that this last post in the series spawns some discussion around the tools, features, and functionality the market still needs. These are the top three things that we are looking at next.

 

Event Correlation

The organization acquired Splunk to correlate events happening at the machine level throughout the organization, but this is far from fully implemented and will likely be the next big focus. The goal is to integrate everything from clients to manufacturing equipment to networking to find information that will help the business run better, experience fewer outages and issues, and increase security. Machine data is being collected to learn about errors in the manufacturing process as early as possible. This error detection allows for on-the-fly identification of faulty machinery and enables quicker response time, which decreases the amount of bad product and waste and improves overall profitability. I still believe there is much more to be gained here in terms of user experience, proactive notifications, etc.


Software Defined X

The organization is looking to continue its move into the software-defined world for networking, compute, storage, etc. These offerings vary greatly, and the decision to go down a specific path shouldn't be taken lightly. In our case we are looking to simplify network management across a very large organization, and to do so in a way that enables not only IT workflows but those of other business units as well. This will likely be OpenFlow based and start with the R&D use cases. Organizationally, IT has now set standards in place that all future equipment must support OpenFlow as part of the SDN readiness initiative.

Software-defined storage is another area of interest, as it reduces the dependency on any one particular hardware type and allows for ease of provisioning anywhere. The ideal use case again is for R&D teams as they develop new products. The products that will likely lead here are those that are pure software and open; evaluation has not really begun in this area yet.


DevOps on Demand

IT getting a handle on the infrastructure needed to support R&D teams was only the beginning of the desired end state. One of the loftiest goals is to create an on-demand lab environment that provides compute, storage, and network on demand in a secure fashion, as well as intelligent request monitoring and departmental bill-back. We've been looking into Puppet Labs, Chef, and others, but do not have a firm answer here yet. This is a relatively new space for me personally, and I would be very interested in further discussion around how people have been successful in this space.

 

Thank you all for your participation throughout this blog series.  Your input is what makes this valuable to me and increases learning opportunities for anyone reading.


Recently we covered what it means to configure server monitoring correctly, and the steps we can take to ensure that the information we get alerted on is useful and meaningful. We learned that improper configuration leads to support teams that ignore their alerts, and system monitoring becomes noise. Application monitoring isn't any different, and what your organization sets up for these needs is likely to be completely different from what was done for server monitoring. In this article we will focus on monitoring Microsoft Exchange on-premises, and what should be considered when choosing and configuring a monitoring tool to ensure that your organizational email is functioning smoothly to support your business.


Email Monitoring Gone Wrong


In the early days of server monitoring it wasn't unusual for system administrators to spend months configuring their server monitoring tool for their applications. With some applications, that may be completely appropriate, but with Microsoft Exchange I have found that server monitoring tools typically are not enough to successfully monitor your organizational email system, even if the tool comes with a "package" specifically for monitoring email. The issue is that, by default, these tools will either alert on too much or too little, never giving your application owner exactly what they need.


Corrective Measures


So how can your business ensure that email monitoring is set up correctly, and that the information received from the tool is useful? Well, it really comes down to several simple things.

  • Evaluate several Exchange monitoring tools, and then choose a tool that will best suit your Exchange needs.  In most cases this tool is not the same as your server monitoring tool.
  • Implementation of Exchange monitoring should be a project with a dedicated resource.
  • The tool should be simple and not time consuming to configure. It should NOT take six months to be able to properly monitor your email system.
  • Choose a tool that monitors Active Directory too.  Exchange depends heavily on Active Directory and DNS, so Active Directory health is also vital.
  • Make sure you can easily monitor your primary email functionality. This includes email flow testing, your Exchange DAG, DAG witness, Exchange databases, ActiveSync, Exchange Web Services, and any additional email functionality that is important to your organization.
  • Ensure that the tool selected has robust reporting. This will save time compared with scripting your own reports and allow for better historical trending of email information. These reports should include things such as mail flow SLAs, large mailboxes, abandoned mailboxes, top mail senders, public folder data, distribution lists, and more.

This approach will ensure that your email system will remain functional, and alert you before a real issue occurs.  Not after the system has gone down.


Concluding Thoughts


Implementing the correct tool set for Microsoft Exchange monitoring is vital to ensuring the functionality and stability of email for your business. This is often not the same tool used for server monitoring, and it should include robust reporting options to ensure your SLAs are being met and that email remains functional for your business purposes.
