Geek Speak

80 Posts authored by: kong.yang

Logs are insights into events, incidents, and errors recorded over time on monitored systems, with the operative word being monitored. That’s because logging may need to be enabled for those systems that depend on defaults, or if you’ve inherited an environment that was not configured for logging. For the most part, logs are retained to maintain compliance and governance standards. Beyond this, logs play a vital role in troubleshooting.


For VMware® ESXi and Microsoft® Hyper-V® nodes, logs represent quintessential troubleshooting insights across that node’s stack, and can be combined with alerts to trigger automated responses to events or incidents. The logging process focuses on which logs to aggregate, how to tail and search those logs, and what analysis needs to look like with the appropriate reactions to that analysis. And most importantly, logging needs to be easy.
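To make the "search and react" part of that process concrete, here is a minimal sketch in Python. The log lines, the line format, and the names `search` and `hosts_over_threshold` are all assumptions made up for illustration; real ESXi or Hyper-V logs are formatted differently, and a log management product would do this for you.

```python
import re
from collections import Counter

# Hypothetical sample lines; real ESXi or Hyper-V logs differ in format.
SAMPLE_LOG = """\
2017-08-01T10:00:01Z host01 vmkernel: info: heartbeat ok
2017-08-01T10:00:05Z host01 vmkernel: warning: path redundancy degraded
2017-08-01T10:00:09Z host02 hostd: error: failed to power on VM web01
2017-08-01T10:00:12Z host02 hostd: error: failed to power on VM web02
"""

# Assumed layout: timestamp, host, source, severity, message.
LINE_RE = re.compile(r"^(\S+) (\S+) (\S+): (\w+): (.*)$")


def search(log_text, severity):
    """Return (host, message) pairs for lines logged at the given severity."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(4) == severity:
            hits.append((match.group(2), match.group(5)))
    return hits


def hosts_over_threshold(hits, threshold):
    """Return hosts with at least `threshold` matching lines, sorted by name."""
    counts = Counter(host for host, _ in hits)
    return sorted(host for host, n in counts.items() if n >= threshold)


errors = search(SAMPLE_LOG, "error")
alerting = hosts_over_threshold(errors, threshold=2)
```

In this sketch, `alerting` is the list an alert or automated response would act on. The threshold of two errors is arbitrary and would need tuning per environment.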


Configuring system logs for VMware and Microsoft is a straightforward process. For VMware, one can use the esxcli command or host profiles. For Microsoft, look in the Event Viewer under Application and Services Logs -> Microsoft -> Windows and, specifically, the Hyper-V-VMMS (Hyper-V Virtual Machine Management service) event logs. The challenge is handling the logging process efficiently and effectively as the number of nodes and VMs in your virtual environment increases. The economies of scale can introduce multi-level logging complexities, creating troubleshooting nightmares instead of troubleshooting silver bullets. You can certainly follow the Papertrail if you want the easy log management button at any scale.


The question becomes, would your organization be comfortable with, and actually approve of, cloud-hosted log management, even with encrypted logging, where the storage is Amazon® S3 buckets? Let me know in the comment section below.

What's in an IT title?

Posted by kong.yang Jul 21, 2017

Continuous integration. Continuous delivery. Cloud. Containers. Microservices. Serverless. IoT. Buzzworthy tech constructs and concepts are signaling a change for IT professionals. As IT pros adapt and evolve, the application remains the center of the change storm. More importantly, the end goal for IT remains essentially the same as it always has been: keep the revenue-impacting applications performing as optimally as possible. Fundamental principles remain constant below the surface of anything new, disruptive, and innovative. This applies to IT titles and responsibilities as well.


Take, for example, the role of site reliability engineer (SRE), which was Ben Treynor’s 2003 creation at Google. He describes it as what happens when you ask a software engineer to perform an operations function. Google lists it as a discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. Even before the coining of the term SRE, there were IT professionals who came before and built out massively distributed, fault-tolerant, large-scale systems. They just weren’t called SREs. Fast forward to 2008, and another title started to gain momentum: DevOps engineer aka the continuous integration/continuous delivery engineer. Regardless of their titles, core competencies remain fundamentally similar. 


Speaking of IT titles, how do you identify yourself with respect to your professional title? I've been a lab monitor, a systems engineer, a member of technical staff, a senior consultant, a practice leader, and now a Head Geek™. Does your title bring you value? Let me know in the comment section below.

On the surface, application performance management (APM) is simply defined as the process of maintaining acceptable user experience with respect to any given application by "keeping applications healthy and running smoothly." The confusion comes when you factor in all the interdependencies and nuances of what constitutes an application, as well as what “good enough” is.


APM epitomizes the nature vs nurture debate. In this case, nurture is the environment, the infrastructure, and networking services, as well as composite application services. On the other hand, nature is the code level elements formed by the application’s DNA. The complexity of nature and nurture also plays a huge role in APM because one can nurture an application using a multitude of solutions, platforms, and services. Similarly, the nature of the application can be coded using a variety of programming languages, as well as runtime services. Regardless of nature or nurture, APM strives to maintain good application performance.


And therein lies the million dollar APM question: What is good performance? And similarly, what is good enough in terms of performance? Since every data center environment is unique, good can vary from organization to organization, even within the same vertical industry. The key to successful APM is to have proper baselines, trends reporting, and tracing to help ensure that Quality-of-Service (QoS) is always met without paying a premium in terms of time and resources while trying to continuously optimize an application that may be equivalent to a differential equation.
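As one way to picture what a baseline buys you, the sketch below computes a mean and standard deviation from historical response times and flags new readings that fall far outside that range. The sample values, the function names, and the three-standard-deviation rule are all assumptions chosen for illustration, not a recommendation for any particular application.

```python
from statistics import mean, stdev


def baseline(samples):
    """Summarize historical samples as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, base_mean, base_stdev, k=3.0):
    """Flag a reading more than k standard deviations from the baseline mean."""
    return abs(value - base_mean) > k * base_stdev


# Hypothetical response times in milliseconds from a "normal" period.
history_ms = [120, 125, 118, 130, 122, 127, 121, 124]
base_mean, base_stdev = baseline(history_ms)

normal_reading = is_anomalous(126, base_mean, base_stdev)
slow_reading = is_anomalous(180, base_mean, base_stdev)
```

Here "good enough" is simply whatever the organization's own history says is normal, which is why the same reading can be fine in one environment and an incident in another.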


Let me know in the comment section what good looks like with respect to the applications that you’re responsible for.

IT Right Equals Might?

Posted by kong.yang Jun 23, 2017

If I learned anything from Tetris, it’s that errors pile up and accomplishments disappear.

– Andrew Clay Shafer (@littleidea on Twitter).


In IT, we make our money and maintain our jobs by being right. And we have to be right more often than not because the one time we are wrong might cost us our job. This kind of pressure can lead to a defensive, siloed mentality. If right equals might, then expect an IT working environment that is conducive to hostilities and sniping.


I’ve witnessed firsthand the destructive nature of a dysfunctional IT organization. Instead of working as a cohesive team, that team was one in which team members would swoop in to fix issues only after a colleague made a mistake. It was the ultimate representation of trying to rise to the top over the corpses of colleagues. Where did it all go wrong? Unfortunately, that IT organization incentivized team members to outdo one another for the sake of excellent performance reviews and to get ahead in the organization. It was a form of constant hazing. There were no mentors to help guide young IT professionals to differentiate between right and wrong.


Ultimately, it starts and ends with leadership and leaders. If leaders allow it, bad behaviors will remain pervasive in the organization’s culture. Likewise, leaders can nip such troubling behavior in the bud if they are fair, firm, and consistent. That IT team’s individual contributors were eventually re-organized and re-assigned once their leaders were dismissed and replaced.


Rewards and recognition come and go. Sometimes they're well-deserved, and other times we don’t get the credit that’s due. Errors, failures, and mistakes do happen. Don’t dwell on them. Continue to learn and move forward. A career in IT is a journey and a long one at that. Mentees do have fond memories of mentors taking the time to help them become a professional. Lastly, remember that kindness is not weakness, but rather an unparalleled kind of strength.

I was fortunate enough to be in the audience for my friend, Dave McCrory's presentation at Interop during the Future of Data Summit. Dave is currently the CTO of Basho, and he famously coined the term "data gravity" in 2010. Data gravity, or as friends have come to call it, McCrory's Law, simply states that data is attracted to data. Data now has such critical mass that processing is moving to it versus data moving to processing.


Furthermore, Dave introduced the notion of data agglomeration, where data will migrate to and stick with services that provide the best advantages. Examples of this concept include car dealerships and furniture stores being in the same vicinity, just as major cities of the world tend to be close to large bodies of water. In terms of cloud services, this is the reason why companies that incorporate weather readings are leveraging IBM Watson. IBM bought The Weather Company and all its IoT sensors, which have produced and continue to produce massive amounts of data.


I can't do enough justice to the quality of Dave's content and its context in our current hybrid IT world. His presentation was definitely worth the price of admission to Interop. Do you think data has gravity? Do you think data agglomeration will lead to multi-cloud service providers within an organization that is seeking competitive advantages? Share your thoughts in the comment section below.

I have lots of conversations with colleagues and acquaintances in my professional community about career paths. The question that inevitably comes up is whether they should continue down their certification path with specific vendors like VMware, Microsoft, Cisco, and Oracle, or should they pursue new learning paths, like AWS and Docker/Kubernetes?


Unfortunately, there is no answer that fits each individual because we each possess different experiences, areas of expertise, and professional connections. But that doesn't mean you can't fortify your career.  Below are tips that I have curated from my professional connections and personal experiences.

  1. Never stop learning. Read more, write more, think more, practice more, and repeat.
  2. Be a salesperson. Sell yourself! Sell the present and the potential you. If you can’t sell yourself, you’ll never discover opportunities that just might lead to your dream job.
  3. Follow the money. Job listings at sites like dice.com will show you where companies are investing their resources. If you want to really future-proof your job, job listings will let you know what technical skills are in demand and what the going rate is for those skills.


As organizations embrace digital transformation, there are three questions that every organization will ask IT professionals:

  1. A skill problem? Aptitude
  2. A hill problem? Altitude
  3. A will problem? Attitude

How you respond to these questions will determine your future within that organization and in the industry.


So what do you think of my curated tips? What would you add or subtract from the list? Also, how about those three organizational questions? Are you being asked those very questions by your organization? Let me know in the comment section.


A final note: I will be at Interop ITX in the next few weeks to discuss this among all the tech-specific conversations. If you will be attending, drop me a line in the comment section and let’s meet up.

Raise your hand if you have witnessed firsthand rogue or shadow IT. This is when biz, dev, or marketing goes directly to cloud service providers for infrastructure services instead of going through your IT organization. Let's call this Rogue Wars.


Recently, I was talking to a friend in the industry about just such a situation. He was frustrated with non-IT teams, especially marketing and web operations, procuring services from other people’s servers. These rogue operators were accessing public cloud service providers to obtain infrastructure services for their mobile and web app development teams. My friend's biggest complaint was that his team was still responsible for supporting all aspects of ops, including performance optimization, troubleshooting, and remediation, even though they had zero purview over or access to the rogue IT services.


They were challenged by the cloud’s promise of simplified self-service. The fact that it's readily available, agile, and scalable was killing them softly with complexities that their IT processes were ill-prepared for. For example, the non-IT teams did not follow proper protocol to retire the self-service virtual machines (VMs) and infrastructure resources that formed the application stack. That meant that the organization was paying for resources that no longer did any work for it. Tickets were also being opened for slow application performance, but the IT teams had zero visibility into the public cloud resources. For this reason, they could only let the developers know that the issue was not within the purview of internal IT. Unfortunately, they were handed the responsibility of resolving the performance issue anyway.


This is how the easy button of cloud services is making IT organizations feel the complex burn. Please share your stories of rogue/shadow IT in the comments below. How did you overcome it, or are you still cleaning up the mess?

SolarWinds recently released the 2017 IT Trends Report: Portrait of a Hybrid IT Organization, which highlights the current trends in IT from the perspective of IT professionals. The full details of the report, as well as recommendations for hybrid IT success, can be found at it-trends.solarwinds.com.


The findings are based on a survey fielded in December 2016. It yielded responses from 205 IT practitioners, managers, and directors in the U.S. and Canada from public and private-sector small, mid-size, and enterprise companies that leverage cloud-based services for at least some of their IT infrastructure. The results of the survey illustrate what a modern hybrid IT organization looks like, and show the cost benefits of the cloud, as well as the struggle to balance shifting job and skill dynamics.


The following are some key takeaways from the 2017 IT Trends Report:

  1. Moving more applications, storage, and databases into the cloud.
  2. Experiencing the cost efficiencies of the cloud.
  3. Building and expanding cloud roles and skill sets for IT professionals.
  4. Increasing complexity and lacking visibility across the entire hybrid IT infrastructure.

Cloud and hybrid IT are a reality for many organizations today. They have created a new era of work that is more global, interconnected, and flexible than ever. At the same time, the benefits of hybrid IT introduce greater complexity and technology abstraction. IT professionals are tasked with devising new and creative methods to monitor and manage these services, as well as prepare their organizations and themselves for continued technology advancements.


Are these consistent with your organizational directives and environment? Share your thoughts in the comment section below.

Troubleshooting efficiency and effectiveness are core to uncovering the root cause of incidents and bad events in any data center environment. As I noted in my previous post about the troubleshooting radius and the IT seagull, troubleshooting efficacy is the key performance indicator in fixing it fast. But troubleshooting is an avenue that IT pros dare not walk too often for fear of being blamed for being incompetent or incorrect.


We still need to be right a lot more than we are wrong. Our profession gives no quarter when things go wrong. The blame game, anyone? When I joined IT operations many years ago, one of my first mentors gave me some sage advice from his own IT journey. It’s similar to the three-envelope CEO story that many IT pros have heard before.

  1. When you run into your first major (if you can’t solve it, you’ll be fired) problem, open the first envelope. The first envelope’s message is easy – blame your predecessor.
  2. When you run into the second major problem, open the second envelope. Its message is simply – reorganize, i.e., change something, whether it’s your role or your team.
  3. When you run into the third major problem, open the third envelope. Its message is to prepare three envelopes for your successor because you’re changing companies, willingly or unwillingly.


A lifetime of troubleshooting comes with its ups and downs. Looking back, it has provided many an opportunity to change my career trajectory. For instance, troubleshooting the lack of performance boost from a technology invented by the number one global software vendor almost cost me my job; but it also re-defined me as a professional. I learned to stand up for myself professionally. As Agent Carter states, "Compromise where you can. And where you can’t, don’t. Even if everyone is telling you that something wrong is something right, even if the whole world is telling you to move. It is your duty to plant yourself like a tree, look them in the eye and say, no. You move." And I was right.


It’s interesting to look back, examine the events and associated time-series data to see how close to the root cause signal I got before being mired in the noise or vice-versa. The root cause of troubleshooting this IT career is one that I’m addicted to, whether it’s the change and the opportunity or all the gains through all the pains.


Share your career stories, and how a troubleshooting mishap or troubleshooting gold brought you shame or fame, in the comment section below.

Most of the time, IT pros gain troubleshooting experience via operational pains. In other words, something bad happens and we, as IT professionals, have to clean it up. Therefore, it is important for you to have a troubleshooting protocol in place that is specific to dependent services, applications, and a given environment. Within those parameters, the basic troubleshooting flow should look like this:


      1. Define the problem.
      2. Gather and analyze relevant information.
      3. Construct a hypothesis on the probable cause for the failure or incident.
      4. Devise a plan to resolve the problem based on that hypothesis.
      5. Implement the plan.
      6. Observe the results of the implementation.
      7. Repeat steps 2-6.
      8. Document the solution.


Steps 1 and 2 usually lead to a world of pain. First of all, you have to define the troubleshooting radius, the surface area of systems in the stack that you have to analyze to find the cause of the issue. Then, you must narrow that scope as quickly as possible to remediate the issue. Unfortunately, remediating in haste may not actually lead to uncovering the actual root cause of the issue. And if it doesn’t, you are going to wind up back at square one.


You want to get to the single point of truth with respect to the root cause as quickly as possible. To do so, it is helpful to combine a troubleshooting workflow with insights gleaned from tools that allow you to focus on a granular level. For example, start with the construct that touches everything, the network, since it connects all the subsystems. In other words, blame the network. Next, factor in the application stack metrics to further shrink the troubleshooting area. This includes infrastructure services, storage, virtualization, cloud service providers, web, etc. Finally, leverage a collaboration of time-series data and subject matter expertise to reduce the troubleshooting radius to zero and root cause the issue.


If you think of the troubleshooting area as a circle, as the troubleshooting radius approaches zero, one gets closer to the root cause of the issue. If the radius is exactly zero, you’ll be left with a single point. And that point should be the single point of truth about the root cause of the incident.


Share examples of your troubleshooting experiences across stacks in the comments below.

One of the hottest topics in IT today is IoT, which usually stands for the Internet of Things. Here, however, I’d like to assign it another meaning: the internet of trolls and their tolls.


What do the internet of trolls and their tolls have to do with the data center and IT in particular? A lot, since we IT professionals have to deal with the mess created by end-users falling for the click-bait material at its heart. Without a doubt, the IT tolls from these internet trolls can cause real IT headaches. Think security breaches and ransomware, as well as the additional strain on people, processes, and technological resources.


One example of the internet of trolls and their tolls is the rise of fake online news. It’s an issue that places the onus on the end-user to discern between fact and fiction, and often plays on an end-user’s emotions to trigger an action, such as clicking on a link. Again, what does this have to do with us? Social media channels like Facebook and Twitter are prominent sources of traffic on most organizations’ infrastructure services, whether it be the routers and switches, or the end-user devices that utilize those network connections and bandwidth, plus compute resources.


Fake news, on its own, may provide water cooler conversation starters, but throw in spear phishing and ransomware schemes, and it can have fatal consequences in the data center. Compromised data, data or intellectual property held for ransom, and disruption to IT services are all common examples of what can be done with just a single click on a fake news link by IT’s weakest link – our end-users.


Both forms of IoT have their basis in getting data from systems. The biggest challenges revolve around the integrity of the data and the validity of the data analysis. Data can be framed to tell any story. The question is: Are you being framed by faulty data and/or analysis when dealing with the other IoT?


Let me know what you think in the comment section below.

A Never Ending IT Journey around Optimizing, Automating and Reporting on Your Virtual Data Center



IT reporting at its best is pure art backed by pure science and logic. It is storytelling with charts, figures, and infographics. The intended audience should be able to grasp key information quickly. In other words, keep it stupid simple. Those of you following this series and my 2016 IT resolutions know that I’ve been beating the “keep it stupid simple” theme pretty hard. This is because endless decision-making across complex systems can lead to second-guessing, and we don’t want that. Successful reporting takes the guesswork out of the equation by framing the problem and solution in a simple, easily consumable way.


The most important aspect of reporting is knowing your target audience and creating the report just for them. Next, define the decision that needs to be made. Make the report pivot on that focal point, because a decision will be made based on your report. Finally, construct the reporting process in a way that will be consistent and repeatable.

  • excerpted from Skillz To Master Your Virtual Universe SOAR Framework


Reporting in the virtual data center details the journey of the virtualization professional in the virtual data center. The story will start with details of the virtual data center and its key performance indicators. It will evolve into a journey of how to get what is needed to expand the delivery capabilities of the virtual data center. With agility, availability, and scalability at the heart of the virtual data center show, reporting is the justification for optimization and automation success.


Audience and context matter

Reporting is an IT skill that provides the necessary context for decision-makers to make their singular decision. The key aspects of reporting are the audience and the context. You need to know who the audience is, and that will guide you on the context, i.e., the data, the data analysis, and the data visualization required in the report. To report adeptly, an IT professional needs to answer the following questions: for whom is the report intended, and what needs to be included for a decision?


Reporting molds data and events into a summary highlighting key truths for decision makers to make quick, sound decisions. It is neither glamorous nor adrenaline-pumping, but it shows IT mastery in its last, evolved form: a means to an end.


This post is a shortened version of the eventual eBook chapter. Stay tuned for the expanded version in the eBook.

Master of Your Virtual IT Universe: Trust but Verify at Any Scale

A Never Ending IT Journey around Optimizing, Automating and Reporting on Your Virtual Data Center



Automation is a skill that requires detailed knowledge, including comprehensive experience around a specific task. This is because you need that task to be fully encapsulated in a workflow script, template, or blueprint. Automation, much like optimization, focuses on understanding the interactions of the IT ecosystem, the behavior of the application stack, and the interdependencies of systems to deliver the benefits of economies of scale and efficiency to the overall business objectives. And it embraces the do-more-with-less edict that IT professionals have to abide by.


Automation is the culmination of a series of brain dumps covering the steps that an IT professional takes to complete a single task. These are steps that the IT pro is expected to complete multiple times with regularity and consistency. The singularity of regularity is a common thread in deciding to automate an IT process.


Excerpted from Skillz To Master Your Virtual Universe SOAR Framework


Automation in the virtual data center spans workflows. These workflows can encompass management actions such as provisioning or reclaiming virtual resources, setting up profiles and configurations in a one to many manner, and reflecting best practices in policies across the virtual data center in a consistent and scalable way.


Embodiment of automation

Scripts, templates, and blueprints embody IT automation. They are created from an IT professional’s best practice methodology - tried and true IT methods and processes. Unfortunately, automation itself cannot differentiate between good and bad. Therefore, automating bad IT practice will lead to unbelievable pain at scale across your virtual data centers.


To prevent that from happening, keep automation stupid simple. First, automate at a controlled scale following the mantra, “Do no harm to your production data center environment.” Next, monitor the automation process from start to finish in order to ensure that every step executes as expected. Finally, analyze the results and use your findings to make necessary adjustments to optimize the automation process.
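The three steps above (controlled scale, monitor every step, analyze the results) can be sketched as a small batch runner. Everything here is hypothetical: `run_in_batches`, the VM names, and the `apply_profile` and `check_profile` callbacks stand in for whatever script, template, or blueprint you are actually rolling out.

```python
def run_in_batches(targets, action, verify, batch_size=2):
    """Apply `action` in small batches and stop at the first failed check.

    Returns (touched, failed): every target the action ran against, and the
    targets in the last batch that did not pass `verify`.
    """
    touched = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        for target in batch:
            action(target)
            touched.append(target)
        failed = [t for t in batch if not verify(t)]
        if failed:
            # Halt here so a bad change never reaches the remaining targets.
            return touched, failed
    return touched, []


# Hypothetical example: applying a configuration profile to five VMs.
state = {}


def apply_profile(vm):
    # Simulate one VM where the change does not take.
    state[vm] = "error" if vm == "vm-03" else "configured"


def check_profile(vm):
    return state.get(vm) == "configured"


vms = ["vm-01", "vm-02", "vm-03", "vm-04", "vm-05"]
touched, failed = run_in_batches(vms, apply_profile, check_profile)
```

In this run the process halts after the second batch, so `vm-05` is never touched. The `touched` and `failed` lists are the raw material for the analysis step before the next attempt.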


Automate with purpose

Start with an end goal in mind. What problems are you solving for with your automation work? If you can’t answer this question, then you’re not ready to automate any solution.


This post is a shortened version of the eventual eBook chapter. Stay tuned for the expanded version in the eBook. Next week, I will cover reporting in the virtual data center.

Master of Your Virtual IT Universe: Trust but Verify at Any Scale

A Never Ending IT Journey around Optimizing, Automating and Reporting on Your Virtual Data Center


Optimization is a skill that requires a clear end-goal in mind. Optimization focuses on understanding the interactions of the IT ecosystem, the behavior of the application stack, and the interdependencies of systems inside and outside their sphere of influence in order to deliver success in business objectives.


If one were to look at optimization from a theoretical perspective, each instantiation of optimization would be a mathematical equation with multiple variables. Think multivariate calculus, as an IT pro tries to find the maxima while other variables change with respect to one another.


Excerpted from Skillz To Master Your Virtual Universe SOAR Framework


Optimization in the virtual data center spans the virtual data center health across resource utilization and saturation while encompassing resource capacity planning and resource elasticity. Utilization, saturation, and errors play key roles in the optimization skill. The key question is: what needs to be optimized in the virtual data center?


Resources scalability

Similar to other IT disciplines, optimization in the virtual environment boils down to optimizing resources i.e. do more with less. This oftentimes produces an over-commitment of resources and the eventual contention issues that follow the saturated state. If the contention persists over an extended period of time or comes too fast and too furious, errors usually crop up. And that’s when the “no-fun” time begins.


Resource optimization starts with tuning compute (vCPUs), memory (vRAM), network and storage. It extends to the application and its tunable properties through the hypervisor to the host and cluster.


Sub-optimal scale


vCPU and vRAM penalties manifest in saturation and errors, which lead to slow application performance and tickets being opened. There are definite costs to oversizing and undersizing virtual machines (VMs). Optimization seeks to find the fine line with respect to the entire virtual data center environment.


To optimize compute cycles, look at vCPU utilization and its counters, as well as processor queue length. For instance, in VMware, the CPU counters to examine are %USED, %RDY, and %CSTP. %USED shows how much time the VM spent executing CPU cycles on the physical CPU. %RDY is the percentage of time a VM wanted to execute but had to wait to be scheduled by the VMkernel. %CSTP is the percentage of time that an SMP VM was ready to run but incurred delay because of co-vCPU scheduling contention. The performance counters in Microsoft are System\Processor Queue Length, Process\% Processor Time, Processor\% Processor Time, and Thread\% Processor Time.
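As a rough illustration of how those three VMware counters might be read together, the sketch below checks a sample against thresholds. The `CpuSample` type, the function name, the VM names, and especially the limits (10% for %RDY, 3% for %CSTP, 90% for %USED) are assumptions for the example; treat them as starting points to tune against your own baselines, not as vendor guidance.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    vm: str
    used_pct: float  # %USED
    rdy_pct: float   # %RDY
    cstp_pct: float  # %CSTP


# Assumed rule-of-thumb limits; tune these against your own environment.
USED_LIMIT = 90.0
RDY_LIMIT = 10.0
CSTP_LIMIT = 3.0


def cpu_findings(sample):
    """Return a list of human-readable findings for one VM sample."""
    findings = []
    if sample.used_pct > USED_LIMIT:
        findings.append("high %USED: VM may be undersized on vCPUs")
    if sample.rdy_pct > RDY_LIMIT:
        findings.append("high %RDY: VM is waiting to be scheduled")
    if sample.cstp_pct > CSTP_LIMIT:
        findings.append("high %CSTP: co-scheduling contention, VM may have too many vCPUs")
    return findings


healthy = cpu_findings(CpuSample("web01", used_pct=45.0, rdy_pct=2.0, cstp_pct=0.0))
contended = cpu_findings(CpuSample("db01", used_pct=60.0, rdy_pct=18.5, cstp_pct=6.2))
```

Note that a VM can show moderate %USED and still be unhealthy: in the `db01` sample the contention shows up only in %RDY and %CSTP, which is why the counters need to be read together.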


To optimize memory, look for memory swapping, guest-level paging, and overall memory utilization. For VMware, the counters are SWR/s and SWW/s (swap reads and swap writes per second), while for Microsoft, the counter is Memory\Pages/sec. For Linux VMs, leverage vmstat and the swap counters si and so, swap in and swap out respectively.
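For the Linux case, here is a minimal sketch of pulling si and so out of captured vmstat output. The sample text is hypothetical, and the sketch assumes the usual vmstat layout with a column header row that names `si` and `so`; the helper name `swap_activity` is made up for the example.

```python
# Hypothetical capture of `vmstat 1 2`-style output; real values will differ.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     5     8  110  220  3  1 96  0  0
 2  1  20480  10212   2048  90112  150  310   900  1200  450  800 20 10 40 30  0
"""


def swap_activity(vmstat_text):
    """Return a list of (si, so) integer pairs, one per vmstat sample row."""
    rows = [line.split() for line in vmstat_text.strip().splitlines()]
    # Locate the header row by the column names rather than a fixed position.
    header_idx = next(i for i, cols in enumerate(rows) if "si" in cols and "so" in cols)
    header = rows[header_idx]
    si_col, so_col = header.index("si"), header.index("so")
    return [(int(row[si_col]), int(row[so_col])) for row in rows[header_idx + 1:]]


samples = swap_activity(SAMPLE_VMSTAT)
# Any non-zero si or so means the guest was swapping during that interval.
swapping = [pair for pair in samples if pair[0] > 0 or pair[1] > 0]
```

Sustained non-zero values in `swapping` are the signal to investigate. A single blip is usually less interesting than a trend.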


Of course, a virtualization maestro needs to factor in hypervisor kernel optimization/reclamation techniques as well as the application stack and the layout of their virtual data center infrastructure into their optimization process. 


This post is a shortened version of the eventual eBook chapter. For a longer treatment, stay tuned for the eBook. Next week, I will cover automation in the virtual data center.

A Neverending IT Journey around Optimizing, Automating, and Reporting on Your Virtual Data Center



The journey of one begins with a single virtual machine (VM) on a host. The solitary instance in a virtual universe with the vastness of the data center as a mere dream in the background. By itself, the VM is just a one-to-one representation of its physical instantiation. But virtualized, it has evolved, becoming software defined and abstracted. It’s able to draw upon a larger pool of resources should its host be added to a cluster. With that transformation, it becomes more available, more scalable, and more adaptable for the application that it is supporting.


The software abstraction enabled by virtualization provides the ability to quickly scale across many axes without scaling the overall physical footprint. The skills required to do this efficiently and effectively are encompassed by optimization, automation, and reporting. The last skill is key because IT professionals cannot save their virtual data center if no one listens to and seeks to understand them. Moreover, the former two skills are complementary. And as always, actions speak louder than words.




In the following weeks, I will cover practical examples of optimization, automation, and reporting in the virtual data center. Next week will cover optimization in the virtual data center. The week after will follow with automation. And the final week will discuss reporting. In this case, order does matter. Automation without optimization consideration will lead to work being done that serves no business-justified purpose. Optimization and automation without reporting will lead to insufficient credit for the work done right, as well as misinforming decision makers of the proper course of actions to take.


I hope you’ll join me for this journey into the virtual IT universe.
