With monitoring, we try to achieve end to end visibility for our services. So everything that is running for business critical applications needs to be watched . For the usual suspects like switches, servers and firewalls we have great success with that. But in all environments you have these black spots on the map that nobody is taking care of. There are two main categories why something is not monitored, the organisational (not my department) and the technical.
Not my Department Problem
In IT sometimes the different departments are only looking after the devices that they are responsible for. Nobody has established a view over the complete infrastructure. That silo mentality ends up with a lot of finger pointing and ticket ping pong. Even more problematic are devices that are under the control of a 3rd party vendor or non IT people. For example, the power supply of a building is the responsibility of the facility management. In the mindset of the facility management monitoring has a completly different meaning to the one we have in IT. We have build up fully redundant infrastructures. We have put a lot of money and effort into making sure that every device has a redundant power supply. Only to find that it ends up in a single power cord that is going to a single diesel power generator that was build in the 1950s. The monitoring by the facility management is to go to the generator two times per day and take a look at the front panel of the machine.
TECHNICAL HARD TO MONITOR
And than you have the technical problems that can be a reason why something is not monitored. Here are some examples why it is sometimes hard to implement monitoring from a technical perspective. Ancient devices: Like the mentioned Diesel Power generator there are old devices that come from an era without any connectors that can be used for monitoring. Or it is a very old Unix or Host machine. I have found all sorts of tech that was still important for a specific task. So when it couldn´t be decommissioned it is still a dependency for a needed application or task. If it is still that important than we have to find a way to monitor it. It is needed to find a way to connect like we do with SNMP or an agent. If the devices simply support none of this connections we can try to watch the service that is delivered through the device or implement an extra sensor that can be monitored. For example of the Power generator, maybe we can not watch the generator directly but we can insert some devices like an UPS that can be watched over SNMP and shows the current power output. With intelligent PDU in every rack you can achieve even more granularity on the power consumption of your components. Often all the components of a rack have been changed nearly every two years, but the Rack and the power connector strip have been used for 10+ years. The same is true for the cooling systems. There are additional sensor bars available that feed your monitoring with data for the case the cooling plant can not deliver these data. With a good monitoring you can react before something happens.
IT IS PASSIVE
Another case are passive technologies like CWDM/DWDM or antennas. These also can only be monitored indirectly with other components that are capable of proper monitoring. With GBICs that have an active measurement / DDI interface you have access to real time data that can be implemented into the monitoring. Once you have this data in your monitoring you have a baseline and know how the damping across your CWDM/DWDM fibres should look like. As a final thought, try and take a step back to figure out what is needed so that your services can run. Think in all directions and expect nothing as given. Include everything that you can think of from climate, power and include all dependancy of storage, network and applications. And with that in mind take a look at the monitoring and check if you cover everything.