I'm guessing most folks have moved away from traps in favor of syslog. Any chance you can do that? It'll simplify your life, so I'd seriously consider it if possible...
Technically, switching to syslogs from traps would be possible; getting people to accept that change may be a bit more difficult.
If I do switch to syslogs we still run into a lot of the same or similar problems, though: alerting my techs to the important stuff while squelching all of the noise.
There are certainly logs that we know we want alerts for, and we have set those up without a problem. The problem is the ones you don't know about in advance but that are important, and that you will want alerts for if they come through.
One approach I've found useful to filter out syslog noise is the "artificial ignorance" technique. I believe this technique was named by Marcus Ranum, probably close to 20 years ago by now. The idea is that you filter out stuff that you know isn't interesting, and refine that filter list over time. The simplest way is with little shell scripts, like:
grep -vf exclusions.txt syslog.txt
where exclusions.txt is a list of stuff you know you don't want to see, e.g. LINEPROTO-5-UPDOWN. I've seen people write small VB apps to do the same sort of thing in a pure Windows environment.
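For anyone without grep handy, here is a rough Python sketch of the same filtering idea. The function names are made up for illustration, and unlike grep -vf it skips blank lines and lines starting with "#" in the exclusion list (my own assumption, so you can annotate the list as you refine it). Each remaining line is treated as a regular expression.

```python
import re


def load_exclusions(lines):
    """Compile one regex per exclusion line.

    Blank lines and '#' comment lines are skipped (an assumption of this
    sketch; grep -f does not treat '#' specially).
    """
    patterns = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            patterns.append(re.compile(text))
    return patterns


def filter_noise(log_lines, exclusions):
    """Yield only the log lines that match none of the exclusion patterns."""
    for line in log_lines:
        if not any(pattern.search(line) for pattern in exclusions):
            yield line.rstrip("\n")
```

Both functions accept any iterable of lines, so you can pass open file objects for exclusions.txt and syslog.txt and print whatever filter_noise yields. Whatever shows up is, by definition, something you haven't yet decided is boring.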
A more sophisticated way would be to create syslog summary scripts that give you a quick count of how many instances of each message type occurred during a window. This thread:
has an example of how to do this with Kiwi, and a link to a blog post by Terry Slattery on how to do it with Perl scripts. (I haven't tried the Kiwi method yet, but I will probably get to it next week.)
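To give a feel for what such a summary script does, here is a minimal Python sketch. It is not the Kiwi or Perl method mentioned above, just the same counting idea. It assumes the messages carry Cisco-style tags of the form %FACILITY-SEVERITY-MNEMONIC (like LINEPROTO-5-UPDOWN); the regex and the function names are my own and would need adjusting for other devices.

```python
import re
from collections import Counter

# Assumed tag shape: %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC,
# e.g. %LINEPROTO-5-UPDOWN. Adjust for devices that log differently.
MESSAGE_ID = re.compile(r"%[A-Z0-9_]+(?:-[A-Z0-9_]+)*?-[0-7]-[A-Z0-9_]+")

NO_ID = "(no message ID found)"


def summarize(log_lines):
    """Count how many times each message ID appears in the given lines."""
    counts = Counter()
    for line in log_lines:
        match = MESSAGE_ID.search(line)
        # Drop the leading '%' so keys read like LINEPROTO-5-UPDOWN.
        key = match.group(0)[1:] if match else NO_ID
        counts[key] += 1
    return counts


def print_summary(counts, top=20):
    """Print the most frequent message IDs, highest count first."""
    for message_id, count in counts.most_common(top):
        print(f"{count:8d}  {message_id}")
```

Feed it only the lines from the window you care about (say, the last 24 hours) and the rare entries at the bottom of the list are usually the ones worth a look.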
For the reasons you have outlined, we have over time moved away from both SNMP traps and syslog.
Instead, we tend to rely more on active probes (i.e. the Orion-provided ones for interfaces, up/down etc., and custom pollers for array status, temperature, power status, fan status etc.). We do still have some alerts defined for both syslogs and SNMP traps, just not a huge number (considering the size of our implementation and the number of device types).
This may delay alert detection, but in most cases that is outweighed by ease of setup and maintenance, no false positives, and certainty over what we are monitoring in our environment. Given how many times we have discussed this internally (with different viewpoints), I'm sure that not everyone will agree with this approach. The alternative, however, is spending a lot of time tailoring device configs, alert filters, testing etc. (and redoing these activities for each software upgrade or hardware refresh).
A big part of the problem is vendors like Cisco that just don't implement standard message formats (or even have a message ID format). If you have ever looked at the number of different messages Cisco devices can generate to indicate that a power supply fan has failed, you will understand the problems of defining an alert (they even change within the same IOS train). Cisco could learn a lot from the IBM mainframe approach: once a message format is published (with a unique message ID), it never changes.
We do still collect syslog and SNMP traps, however. These are mainly used for ad hoc reporting; for example, we look for the most frequently occurring alerts, top talkers etc. They are of course also available to the support staff if we want to see what alerts were produced for a particular failure event.
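As a rough illustration of the "top talkers" style of report, here is a small Python sketch. It assumes the collected logs have been exported as delimited text with the sending host in a known column (tab-separated, host in the second field by default). That layout, the function name and its parameters are all assumptions for the example, not a description of any particular collector.

```python
from collections import Counter


def top_talkers(log_lines, host_field=1, delimiter="\t", top=10):
    """Return (host, message_count) pairs for the noisiest senders.

    host_field is the zero-based column holding the source host; lines
    with too few columns are ignored rather than counted.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > host_field:
            counts[fields[host_field]] += 1
    return counts.most_common(top)
```

Running this over a day or a week of collected messages gives a quick ranking of which devices generate the most log volume, which is often a good starting point for deciding what to tune or investigate.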