
Observability - Control Systems Theory

Level 13

Observability is key to successful hybrid IT deployments because it goes hand in hand with controllability. If you can observe a system’s internals well, you can control that system’s output equally well. But what is observability?

Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. It is a concept that originates from control systems theory. In the current landscape, opinions differ on what is sufficient for observability. Traditional IT professionals believe that metrics and logging are sufficient. IT professionals who focus on logging, however, believe that metrics are too noisy and make one susceptible to paralysis by overanalysis. Meanwhile, site reliability engineers, those who embrace the DevOps culture, associate observability with tracing microservices and the code stack.
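For readers curious about the control-theory roots, here is a minimal sketch of the classic rank test for a linear system x' = Ax, y = Cx: the system is observable when the stacked matrix [C; CA; ...; CA^(n-1)] has rank n. The matrices and helper names (`mat_mul`, `rank`, `is_observable`) are illustrative assumptions, not something from the post, and the code uses plain Python rather than a numerical library.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(rows):
    """Rank via Gaussian elimination with exact Fraction arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        if r == len(m):
            break
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def is_observable(A, C):
    """True if every state of x' = Ax can be inferred from outputs y = Cx."""
    n = len(A)
    stacked, block = [], C
    for _ in range(n):
        stacked.extend(block)       # append C, then CA, then CA^2, ...
        block = mat_mul(block, A)
    return rank(stacked) == n


# Hypothetical system: two independent internal states.
A = [[1, 0],
     [0, 2]]

if __name__ == "__main__":
    print(is_observable(A, [[1, 0]]))  # sensor reads only the first state
    print(is_observable(A, [[1, 1]]))  # sensor reads the sum of both states
```

With a sensor that sees only the first state, the second state never influences the output, so the test fails. A sensor that reads the sum of both passes, because the two states evolve at different rates and can be separated over time.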

Who’s right? Who’s wrong? There are so many shades of observability grey that finding common ground can be a challenge in and of itself. Application stacks are evolving with such velocity, variety, and volume of change that a combination of all three (metrics, logs, and traces) is needed to build a viable observability protocol. Thus, a proper toolset needs to encompass all three aspects while allowing subject matter experts to impart their rigor and discipline.

Has your organization incorporated observability concepts into your day-to-day? Please share your observability stories in the comment section below.

9 Comments

Observability in my workplace includes:

  • Solarwinds output
  • Syslogging and analysis through Splunk
  • Nagios output
  • Help Desk tickets / complaints
  • "Experts" who are long on theory and short on practicality.  One expects me to provide throughput statistics and analysis for every port (75,000 of them) on my network for EVERY SECOND.  Talk about your "paralysis through analysis"!  As I understand it, NPM can't handle the load, switches & routers aren't designed to accept that demand, and I KNOW some of my WAN circuits don't have available overhead for that extra data.  Plus our SQL database hasn't the room for the additional data.
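To put rough numbers on the per-second request in the last bullet, here is a back-of-the-envelope sketch. Only the 75,000 port count comes from the comment; the two counters per port and the 16 bytes per stored sample are assumptions chosen for illustration, and real polling and storage overhead would differ.

```python
PORTS = 75_000                   # from the comment above
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
COUNTERS_PER_PORT = 2            # assumption: one inbound and one outbound counter
BYTES_PER_SAMPLE = 16            # assumption: 8-byte timestamp + 8-byte value


def daily_samples(ports=PORTS, counters=COUNTERS_PER_PORT):
    """Number of samples collected per day at one sample per second."""
    return ports * counters * SECONDS_PER_DAY


def daily_bytes(ports=PORTS, counters=COUNTERS_PER_PORT, size=BYTES_PER_SAMPLE):
    """Raw payload per day, ignoring indexes, protocol, and row overhead."""
    return daily_samples(ports, counters) * size


if __name__ == "__main__":
    print(f"{daily_samples():,} samples per day")
    print(f"{daily_bytes() / 1e9:.2f} GB per day (raw, decimal gigabytes)")
```

Under these assumptions that is about 13 billion samples and roughly 207 GB of raw data per day, before any database or transport overhead.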

Don't get me wrong--I'd LOVE to be able to call up that data for every port.  We'll just have Q come in and adjust the local constants up for electricity's speed and computational capabilities of CPUs and throughput and storage capabilities of memory, spinning disk, SSD, CPUs, etc.

Or, how about when observation actually changes behavior?  That's a part of particle physics that continues to scroll my nerd.

The OBSERVER EFFECT of QUANTUM PHYSICS says: "Your THOUGHTS affect REALITY" - YouTube

and

The Most Beautiful Experiment | the observer effect

Observability of network performance, application performance, server performance, employee performance--all are of interest in the work place and for personal network use at home, too.

Certainly I behave differently when I'M aware I'm being observed.  Whether my boss is sitting at my side, or my wife is listening to me talk on the phone, I sometimes behave differently under observation.  How about you?  If a police car happens to be directly behind you when you're driving, do you tend not to speed?  I bet you do.

MVP

Nice write up

Level 14

I'm currently tuning SolarWinds (I'm new here) to remove the noise storm they get.  We have the metrics but really need to focus on what's important when alerting and reporting.  Too many irrelevant alerts and people stop responding or miss the really important stuff.  Too few and some really important stuff doesn't get reported.  It's a living, growing thing which needs monitoring and managing.  Looks like I'm going to be busy.
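One common way to cut that kind of noise is to alert only when a threshold stays breached for several consecutive polls. The sketch below shows the idea in plain Python; it is not any product's API, and the function name, threshold, and sample values are all hypothetical.

```python
def sustained_breaches(samples, threshold, min_consecutive):
    """Return the sample indexes at which an alert would fire.

    An alert fires once per breach streak, at the moment the value has
    exceeded `threshold` for `min_consecutive` polls in a row.
    """
    fired = []
    streak = 0
    for i, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == min_consecutive:
                fired.append(i)
        else:
            streak = 0  # condition cleared, start counting again
    return fired


# Hypothetical CPU utilization (%) from nine polls: three blips, one real problem.
cpu = [95, 40, 96, 42, 91, 93, 97, 98, 50]

if __name__ == "__main__":
    print(sustained_breaches(cpu, threshold=90, min_consecutive=1))
    print(sustained_breaches(cpu, threshold=90, min_consecutive=3))
```

Requiring three consecutive breaches reduces three alerts to one in this data. The trade-off is the one described above: set the window too long and a short but important event never gets reported.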

MVP

As rschroeder was saying about the expected data mountain: when other teams come back a week or three later and want all the details around a 24-hour or even 6-hour period when a problem they recently discovered began, you no longer have the details, since everything has been aggregated up.  So as you spin up all these microservices and servers, how do you maintain all that data over time to be able to manage scale or to determine problems?  How do you measure changes you make against past instances?  It requires a big database... enter the need for a data warehouse.  This idea has been present on Thwack, and even now you are building the case for it.  Even the current database throws stuff away after a node or other key is removed.  While Splunk and other tools can support the log data over time, where do you keep all the other metric-centric observable data?  PerfStack is a great tool, being able to correlate metrics across a "stack".  This is an extension of that, but you still have to have that long-term detailed metric data for analysis that doesn't impact the production database.
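The loss of detail from aggregation is easy to show with a small sketch. The data below is made up for illustration (60 one-minute utilization samples with a three-minute burst), and `rollup` is a hypothetical helper, not how any particular monitoring database summarizes its data.

```python
from statistics import mean


def rollup(samples, bucket_size):
    """Average consecutive samples into buckets of `bucket_size`."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Hypothetical hour of one-minute samples: steady 20% with a 3-minute burst to 100%.
minute_samples = [20] * 60
minute_samples[30:33] = [100, 100, 100]

if __name__ == "__main__":
    print("peak in detailed data:", max(minute_samples))
    print("hourly rollup:", rollup(minute_samples, 60))
```

Once only the hourly average is kept, the burst shows up as a modest bump from 20% to 24%, and the peak that explains the problem is gone. Keeping the detailed samples somewhere outside the production database is what makes the later investigation possible.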

https://thwack.solarwinds.com/ideas/2637#start=250

My company was one of the first 7 companies to adopt SAP HANA for the pure reason of near real-time BI/Analytics/Reporting of our warehouse operations. How many cases and orders were being processed every night? How much was being spent on warehouse staff? Was the location of our products in the warehouse the most efficient? To make all of this function, HANA received multiple data sources, not just an SAP database. There were dashboards presented on TV screens in each warehouse for the managers to watch and make decisions on. Yes, there were KPIs and analytics, but there was Observability too.

Right now I am receiving far too much data for me to process: all of the network devices, firewalls, VM hosts, CommVault backups, etc. There is no way for me to keep up. I pick and choose my KPIs monthly. I believe that if you subscribe to the "Traditional... belief that metrics and logging are sufficient," you will one day in the not-so-distant future be out of a job. Monitoring engineers, and Infrastructure as a whole, have to be ahead of this DevOps thing to survive.

Level 20

SolarWinds tools would seem to really help with this.  Fewer alerts but better alerts is how I try to do it.

One of the best parts of Geek Speak Blogs is enjoying how SW staff carefully and cautiously remind us of problems.  We don't see product recommendations--and that's commendable.  kong.yang could have said "You've got problem situations X, Y, and Z, and you should buy our matching products A, B, and C to help you better handle those issues."  But he didn't, and I'm so happy with SW for this philosophy.

Geek Speak Blogs lead us to think.  To think about new problems, or about existing problems with a new framework.  Yes, some of the blogs also lead us to think "How could I address this problem with the tools I have today?  What different tools would make addressing these issues easier and faster and better?"  And we might not even go to Solarwinds for those tools.

I would, though.  Keeping life simple and intuitive, and sticking to a product line that offers that single pane of glass--that's just plain good sense.  Maybe it's not "common" sense yet, but it's good sense!

Thanks, Kong, and all your peers, for keeping the Geek Speak articles flowing, and for keeping us thinking and aware of problems and their solutions.

MVP

The toolset needs to get its data off the source server quickly and then process it out of band, to prevent the observation from affecting the application/server/stack and thus skewing the results.

MVP

Agreed. A good product sells itself. All of us on Thwack are already customers and don't need further "pushing." It's nice to have a forum where we can share ideas including competing products and come to good conclusions for what we really need to accomplish.

On the other side of that I'm glad that SolarWinds is constantly growing and listening to our conversations for ideas and improvements.

About the Author
Mo Bacon Mo Shakin' Mo Money Makin'! vHead Geek. Inventor. So Say SMEs. vExpert. Cisco Champion. Child please. The separation is in the preparation.