
Logging and beyond

Compared to logging, monitoring is nice and clean. In monitoring you are used to looking at data that is already normalized, so statistics from different sources, e.g. switches, routers and servers, have the same look and feel. Of course you will have different services and checks across the different device types, but some of these interface statistics can easily be compared with each other. You are always looking at normalized data.

In the logging world you face very different log types and formats. An interface that is down looks identical in the monitoring, no matter whether it is on the switch or on the connected server. But if you want to find the error message for that down interface in the logs of the switch and the server, you will find two completely different outputs. Even how you access the logs is different: on a switch it is usually an SSH connection and a show command, and on a Windows-based server perhaps an RDP session and the Event Viewer.

Comparing the different logs with each other to find the root cause of a problem is mostly a manual and time-consuming process. Another problem is that many devices have only limited storage for logs, or even worse, lose all stored logs after a reboot. Sometimes, after an unexpected reboot of a device, you end up with nothing in your hands to figure out what caused it.

We can do better by sending all the logs to a centralized logging server. It stores all log data independently of the origin, which also reduces the time needed for information gathering. Once all the logs are concentrated in one place, you will often see that many devices have different timestamps on their log messages. To make the logs easy to consume, it is important that all log sources use the same time source and point to a synchronized NTP server.

Once the centralization problem is solved, the biggest benefit comes from normalizing the log data into logical fields that are searchable. This is often done by a SIEM solution implemented to address the security aspect of logging, but I have seen many SIEM projects where the centralized logging and normalization approach also significantly improved troubleshooting capabilities. With all the logs in the same place and format, you can find dependencies that are not visible in the monitoring. For example, I was facing periodic reboots on a series of modular routers. In the monitoring, all the performance graphs looked normal and the routers answered all SNMP- and ICMP-based checks as expected, until they rebooted without any warning. So I looked into the log data and found that, 24 hours before each reboot, a memory error message showed up on all of the routers.
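Searching across such different sources only works once the raw lines are parsed into common fields. A minimal Python sketch of that normalization step (the line formats and field names here are assumptions for illustration, not any specific vendor's output):

```python
import re

# Parse raw syslog-style lines into a common dict of searchable fields.
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<process>[\w%/-]+):\s*"
    r"(?P<message>.*)$"
)

def normalize(raw_line):
    """Turn one raw log line into a dict of fields, or None if unparsable."""
    match = LINE_RE.match(raw_line)
    return match.groupdict() if match else None

lines = [
    "Mar 12 08:15:02 core-sw01 %LINK-3-UPDOWN: Interface Gi1/0/1, changed state to down",
    "Mar 12 08:15:03 srv-web01 kernel: eth0: link down",
]
records = [normalize(line) for line in lines]

# Both events now have the same shape, so a single field-based search finds
# the "interface down" condition on the switch and the server alike.
down_events = [r for r in records if r and "down" in r["message"]]
```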

Because the vendor needed some time to deliver a bug-fix release addressing this issue, we needed proper alerting in the meantime. Every time the centralized logging server captured the memory error message that preceded a reboot, we created an alarm, so that we could at least prepare a scheduled, manually initiated reboot in a time frame when it affected fewer users. That was a blind spot in the monitoring system, and sometimes we can improve the alerting by combining logging with active checks. So after you have found the root cause of a problem, ask yourself how you can prevent it from causing an outage the next time. When logging makes that possible, it is worth the effort. You can start small and add more log messages over time that trigger the events that are important to you.
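This kind of log-triggered alarm can be sketched in a few lines of Python; the precursor pattern, the sample messages and the alarm hook are illustrative assumptions, not the actual setup:

```python
import re

# Watch the centralized log stream for a known precursor message and raise
# an alarm early, before the device reboots on its own.
PRECURSOR = re.compile(r"memory alloc|memory error", re.IGNORECASE)

def raise_alarm(host, line):
    # In a real setup this would page the on-call team or open a ticket;
    # printing stands in for that here.
    print(f"ALARM: possible reboot precursor on {host}: {line}")

def watch(stream):
    """Scan (host, line) pairs and alarm once per host on a precursor hit."""
    alarmed = set()
    for host, line in stream:
        if host not in alarmed and PRECURSOR.search(line):
            raise_alarm(host, line)
            alarmed.add(host)
    return alarmed

events = [
    ("rtr-01", "%SYS-5-CONFIG_I: Configured from console"),
    ("rtr-02", "%SYS-2-MALLOCFAIL: Memory allocation of 65536 bytes failed"),
]
alarmed_hosts = watch(events)
```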

16 Comments
MVP

Totally agree. I do a fair bit of alerting using my syslog server. But these alerts are created after the issue has shown up. I monitor all BGP and OSPF alarms this way.

MVP

Logs are great if you can filter out the normal "noise".  Then you can find what's not normal so much easier.

Then if you can correlate a message to an event that happens later, you can at least get a heads-up or maybe take some corrective action.

Certain systems generate prodigious quantities of logs (e.g.: Cisco WLC 5508s and Cisco ASAs).  Relying on them for troubleshooting can be daunting unless you have a SIEM--AND the training and familiarity to be efficient and competent in its use!

Other systems have local logs that can be helpful if they're correctly sized and configured.  Having 20K of log space allocated on a switch or router while that router also has debug level logging enabled is not going to help you discover what happened an hour ago.  But setting the system to log what's useful to your needs--perhaps only Warnings or worse--can help you go straight to the heart of the matter.

Setting off-box syslogging levels to include lower-level data, perhaps Alerts and higher, leaves you the ability to leverage your SIEM or syslog reporter for deeper investigations.

Level 14

Having an accurate time source is critical if you want to use a SIEM for correlation.  NTP is also a must if your logs are going to be used forensically in a court of law.  

I responded to a similar blog about this a while ago.

First we logged server logs. Then we logged switch logs. Then we logged firewall logs. And then application logs. But we had all this data and nothing to do with it. Then we went back and set up alerts. And more alerts, and more alerts. "Who is looking at all these alerts?" we would ask. So then we set up reporting, and more reporting.

So.Much.Data!

Level 13

I agree!!!! So Much Data!

If I had enough time on my hands to review ALL the log data, I would not have any time to do my real job.  Same goes for all the alerts one receives about the goings-on on your network.  "White Noise"

Over the past week I have seen the error of my ways by not setting up alerts in LEM sooner.  More to come...

RT

Level 20

Nothing like a bugged Cisco IOS memory problem or leak!  Ugh, I've had a few of those... some were very slow, too!

Had a Cisco issue this last week. I was running a script in NCM that was adding a restricted-privilege user, with some specific commands added to the priv level. One of them was "show spanning-tree detail". Fairly innocuous in and of itself, right? But not so on the 12.2 version of IOS and this specific model of switch (3750Es)! A few seconds after adding the command, the switch would run out of compute and hang, needing to be rebooted.

Not good when these are the core fibre switches...

IOS 15.0_2 fixes it, so that's on the cards for this client's infra! Gotta love on-boarding migrations

Level 20

Wow, that sounds horribly bad! Nothing like dying core switches!!!  I could have easily fallen into that trap!

Any vendor (Cisco, or others) that provides a core or distribution switch, that isn't compatible with hitless reloads and hitless upgrades (e.g.: ISSU) needs to re-examine their future viability.

Those of us who buy those non-ISSU devices and deploy them in core and distribution positions can recommend alternate solutions that better support Five 9's.

On the other hand, I work in health care, and we get no maintenance windows for 7x24 hospital and data center critical care systems, so Five 9's means much to my ability to sleep well.

Yeah, I have to admit that it's made me a little paranoid when it comes to running new commands. I'm looking into using some bug-checking routines before running scripts in future. In my defence, I did test it in the lab first, but I can't lab up every combo of IOS version and model.

I'm just glad that I didn't write mem as part of the script, and the reboot reverted to the unmodified config. It could have been terribad! I tend to back up both running and startup configs first, make changes, back up the running config post-change, leave things for 24 hours, and then write the changes. At least I have good points to revert back to if things go FUBAR.

Zero downtime environments are the worst. Got to have resilience everywhere, which is good practice anyway, but you have to KNOW the alternate paths on your network will work if you perform maintenance on the primary path. I don't envy you your maintenance windows!

MVP

Yeah, I really had to sort out the syslog server, as there was a huge amount of messages that were not necessary. So I wrote rules to delete these messages when they arrive. I also have a rule that emails us all level 0, 1 & 2 syslogs. Then from there I write custom rules for whatever else I'd like to know about. BGP and OSPF come in as syslog level 4.

This works well (with a bit of planning) and is super easy to set up.
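A rough Python sketch of a rule chain like that (the drop pattern and the severity cutoff are made-up examples, not any syslog server's actual rule syntax):

```python
# Route each incoming message: delete known noise on arrival, email the
# most severe levels, keep the rest for later searching.
DROP_PATTERNS = ("%SYS-5-CONFIG_I",)   # assumed example of known noise
EMAIL_MAX_SEVERITY = 2                 # email on syslog levels 0, 1 and 2

def route(severity, message):
    """Return the action for one incoming message."""
    if any(pattern in message for pattern in DROP_PATTERNS):
        return "drop"
    if severity <= EMAIL_MAX_SEVERITY:
        return "email"
    return "store"   # kept on the syslog server for custom rules later
```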

MVP

Using the debug command can also cause this issue. Back in the day, I stopped some switches from working by entering an incorrect debug command.

Live and learn as they say.

Level 21

Log Management and SIEM is near and dear to my heart!  We have been using SolarWinds LEM since shortly after SolarWinds acquired the product and I still feel like I am just scratching the surface of the potential value and capabilities.  As we have rolled it out to more and more systems I find it more common for my team to come to me looking to have me pull data from LEM to help them solve problems.

SIEM tools certainly have a much steeper learning curve than a classic monitoring system; however, if done right they are totally worth the time you put into them.

About the Author
I have worked for 15+ years in the networking industry, across many different sectors such as industry, car manufacturers and government. I am a monitoring enthusiast and have done monitoring for large-scale environments. I blog at networkautobahn.com and my recently started podcast can be found at networkbroadcaststorm.com