The 4 Operations of Alerting on F5 BIG-IP

published by admin on Tue, 04/28/2020 - 21:57

Introduction

This tutorial covers alerting on F5 BIG-IP, and in particular how to configure a BIG-IP to generate more or fewer alerts, of different types or of the same type when that is needed. We call it the 4 Operations of Alerting because the four basic arithmetic operations make a handy guideline for remembering how things work and how to configure the device. The four operations are written + - * / and we will give each of them a meaning in the context of alerting.

Alerting basics
Monitoring and alerting are inseparable in any mature IT environment. There are different ways to implement them, but the pairing makes the most sense when we rely on SNMP, because this protocol provides both polling for statistics collection and unsolicited messages, commonly called traps (or, loosely, alerts). An alert can also result from a polled statistic that is later compared to a threshold, or from a more or less complicated calculation (sum, derivative, ...). SNMP is well established and supported by all the major hardware vendors, so we naturally find it at the center of F5's monitoring and alerting strategy. This tutorial focuses on SNMP and on making alerting more or less verbose, but be aware that F5 goes beyond this and offers complementary solutions that are also worth exploring, such as Analytics, HSL, and sFlow.

[+] operator
The + operation means adding more alerts to the system, of new types that are not handled at the moment.

Adding alerts is done by adding a dedicated section to the /config/user_alert.conf configuration file, as explained in the official documentation K3727: Configuring custom SNMP traps.

I suggest two use cases: one for expired certificates and one for more or less brutal clock adjustments. A clock adjustment can indicate a serious system overload or simply a temporary unavailability of the NTP source. In both cases the adjustment is good in itself, but it has side effects because all timers are impacted (from individual TCP timers to persistence sessions), and for this reason we may want to be alerted.

Practice: we add the following to the /config/user_alert.conf file named above:
alert CERTIFICATE_EXPIRED "CN(.*) expired on" {
   snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.301";
}
alert CLOCK_ADJUSTMENT_SOFT "^01010029:5: Clock advanced by(.*)" {
   snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.302";
}
alert CLOCK_ADJUSTMENT_HARD "^01010040:4: Clock has unexpectedly adjusted by(.*)" {
   snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.303";
}
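
Before touching the device, the three patterns can be sanity-checked offline. The following is a minimal sketch, assuming that alertd patterns behave like POSIX extended regular expressions as implemented by grep -E. The helper name match_alert and the sample log messages are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the first custom alert whose
# pattern matches the given log message, or "none" if no pattern matches.
match_alert() {
    msg=$1
    if printf '%s\n' "$msg" | grep -Eq 'CN(.*) expired on'; then
        echo CERTIFICATE_EXPIRED
    elif printf '%s\n' "$msg" | grep -Eq '^01010029:5: Clock advanced by(.*)'; then
        echo CLOCK_ADJUSTMENT_SOFT
    elif printf '%s\n' "$msg" | grep -Eq '^01010040:4: Clock has unexpectedly adjusted by(.*)'; then
        echo CLOCK_ADJUSTMENT_HARD
    else
        echo none
    fi
}

# Made-up sample messages, one per pattern plus one that should not match.
match_alert "Certificate 'CN=www.example.com' in file /Common/www.crt expired on Jan 1 00:00:00 2020 GMT"
match_alert "01010029:5: Clock advanced by 518 ticks"
match_alert "01010040:4: Clock has unexpectedly adjusted by 1200 ms"
match_alert "01070727:5: Pool /Common/web_pool monitor status up."
```

Note that the two clock patterns are anchored with ^ on the message code, while the certificate pattern matches anywhere in the message.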

Remember that in the real world we would not wait for a certificate to expire; we would want to be warned before receiving this alert. Alerting before a certificate expires can be done with the same method, and I leave it as an exercise for those who want to practice what we have just covered.

While relying on system-generated messages to raise new alerts is already a good step, we can also benefit from iRules for application monitoring. Let's suppose that we have an iRule which performs a DNS resolution and that, because of the criticality of this service, we want to raise an alert when one DNS server is unreachable. We can combine the iRule with an alert this way:
if [RESOLV::lookup ..] log local0. "ERROR DNS server unreachable : $dnsserver1"
alert APPLICATION_DNS_UNREACHABLE "^ERROR DNS server unreachable(.*)" { snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.304"; }

By now you have probably noticed that we use different OIDs for our custom alerts; this is explained at the end of the tutorial.

[-] operator
The - operation means reducing the types of alerts that the system generates. We may want to remove all alerts of one type completely, or only the alerts related to a restricted set of objects, or even to a single object.

In the first case, we consider one or several alert types meaningless and get rid of them once and for all. The solution is to cut at the root by creating a LogFilter that prevents the messages from being logged and, as a consequence, from generating a trap.
For example, we don't want to receive any alert related to bad SSL handshakes, as they add no value and could (should) be checked at the profile level. Although this will also silence all other connection errors, we configure this LogFilter:
Name : Filter_TMM_SSL_messages
Severity : Warning
Source : all
Message ID : 01260009
Log Publisher : None

The second case is also very realistic. It is not all-or-nothing: some alerts of a given type will still be sent.
For example, we are budget constrained and host both production and development applications on the same F5 device. The former are very stable, while the latter are restarted over and over by the DEV team. We need to stay alert to production events while ignoring the development ones.

As with the + operator, the solution is again to edit /config/user_alert.conf and add one section per alert that we want to neutralize. Here you will notice the importance of a good naming convention, as we rely on regular expressions to identify our development objects. Let's say our dev virtual servers and pools are stored in the Common partition and have names starting with dev_vs_ and dev_pool_. We remove the traps signaling UP or DOWN for these objects with the following content in /config/user_alert.conf:

alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS_UP "^01070727:5: Pool /Common/dev_pool_(.*)" {}
alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS "^01070638:5: Pool /Common/dev_pool_(.*)" {}
alert BIGIP_MCPD_MCPDERR_VIRTUAL_AVAIL "^01071681:5: SNMP_TRAP: Virtual /Common/dev_vs_(.*)" {}
alert BIGIP_MCPD_MCPDERR_VIRTUAL_UNAVAIL "^01071682:5: SNMP_TRAP: Virtual /Common/dev_vs_(.*)" {}
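
Because a pattern that is too broad would silence production as well, it is worth checking that only development objects are caught. The following is a rough sketch, assuming grep -E is a fair stand-in for alertd's regex matching. The function name is_suppressed and the sample messages are hypothetical, and the way alertd combines these entries with its built-in definitions is not modeled here.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when a log message matches one of
# the four empty-action dev patterns, i.e. when no trap is expected for it.
is_suppressed() {
    printf '%s\n' "$1" | grep -Eq \
        -e '^01070727:5: Pool /Common/dev_pool_(.*)' \
        -e '^01070638:5: Pool /Common/dev_pool_(.*)' \
        -e '^01071681:5: SNMP_TRAP: Virtual /Common/dev_vs_(.*)' \
        -e '^01071682:5: SNMP_TRAP: Virtual /Common/dev_vs_(.*)'
}

# Report the verdict for one message.
check() {
    if is_suppressed "$1"; then
        echo "suppressed: $1"
    else
        echo "kept: $1"
    fi
}

# Made-up sample messages: one dev object, one production object.
check "01071682:5: SNMP_TRAP: Virtual /Common/dev_vs_shop has become unavailable"
check "01071682:5: SNMP_TRAP: Virtual /Common/prod_vs_shop has become unavailable"
```

The first message should be reported as suppressed and the second as kept, which is the behavior we want from the naming convention.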

[*] operator
The * operation means multiplying the same alert. The difference with + is that we don't want new types of alerts, we just want more of the same.

One practical case is when you want to be notified by email in addition to the usual monitoring tool (which may be temporarily unreachable, for example). The same alert is then propagated twice, by SNMP trap and by email, and again we configure the /config/user_alert.conf file, for example:

alert BIGIP_MCPD_MCPDERR_VIRTUAL_UNAVAIL "^01071682:5: SNMP_TRAP: Virtual /Common/prod_vs_(.*)" {
   snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.305";
   email toaddress="team@mycompany.com"
   fromaddress="noreply@big-ip-device.mycompany.com"
   body="Virtual /Common/prod_vs_ is down"
}

As you may already know, the BIG-IP won't send two similar traps in a row because of a throttling mechanism. A second case would therefore be to adjust this throttling so that similar alerts are sent more often. In practice, however, we are more likely to want the opposite, which is why this setting is covered in the / operator section below; you will easily deduce how to use it for the * operator.

[/] operator
In many cases one alert is enough to know that something happened, even if this alert is propagated twice by two different paths as mentioned above. A configuration parameter called the suppress interval can be changed to extend the period during which a similar alert will not be sent again, as explained in K24258199.
In practice, for example, run this tmsh command to get a 5-minute interval:

modify /sys db snmp.bigiptraps.suppress.interval value 300

Additional steps
Restarting the alerting daemon might be necessary, although I've noticed that in practice it restarts by itself when the configuration changes. If it does not, just run this:

bigstart restart alertd

Of course you wouldn't go to production without testing your configuration. For this you just need to send a syslog message that will trigger the corresponding alert. For example:

logger -p local0.notice "01070727:5: Pool /Common/dev_pool_mytestpool monitor status up."
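
When several custom alerts have been added, the same logger call can be wrapped in a small script. This is only a sketch: the DRY_RUN variable and the send_test function are hypothetical names, and the sample messages are made up. By default the script only prints the commands it would run, so it can be reviewed before anything is written to the log.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default) or send (DRY_RUN=0) one test syslog
# message per alert that we want to exercise.
DRY_RUN=${DRY_RUN:-1}

send_test() {
    # $1 = syslog message expected to match an alert pattern
    if [ "$DRY_RUN" = "1" ]; then
        printf 'logger -p local0.notice "%s"\n' "$1"
    else
        logger -p local0.notice "$1"
    fi
}

send_test "01010029:5: Clock advanced by 100 ticks"
send_test "01070727:5: Pool /Common/dev_pool_mytestpool monitor status up."
```

With the configuration from this tutorial, the first message is expected to raise the clock adjustment trap, while the second one should raise nothing because dev pools have an empty action.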

Alert naming convention
As seen previously, every trap comes with a unique identifier. For the additional traps that you create, choose an id between 300 and 999 (.1.3.6.1.4.1.3375.2.4.0.300 to .1.3.6.1.4.1.3375.2.4.0.999). If you have many additional traps, you can simply use the next free id or choose an ordering system (for example 300 to 399 for system alerts, 400 to 499 for application alerts, and so on).
If you choose an ordering system, it makes sense to also choose a naming convention, and at that point you need to consider which BIG-IP modules you are using. Here are some ideas to start your configuration:
300 to 399 : SYSTEM_HEALTH_ , anything related to system issues (cpu, memory, network, ...)
400 to 499 : SECURITY_ , anything related to security events (ASM and AFM modules particularly)
500 to 599 : APPLICATION_ , anything related to application events or metrics, likely to be used by iRules
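
The ranges suggested above can be captured in a small helper so that a new alert gets a consistent name and OID. This is a minimal sketch; the function names trap_prefix and trap_oid and the UNASSIGNED_ fallback are hypothetical and not part of any F5 tooling.

```shell
#!/bin/sh
# Hypothetical helper: map a custom trap id to the name prefix suggested
# above. Ids outside the three listed ranges get "UNASSIGNED_".
trap_prefix() {
    id=$1
    if [ "$id" -ge 300 ] && [ "$id" -le 399 ]; then
        echo "SYSTEM_HEALTH_"
    elif [ "$id" -ge 400 ] && [ "$id" -le 499 ]; then
        echo "SECURITY_"
    elif [ "$id" -ge 500 ] && [ "$id" -le 599 ]; then
        echo "APPLICATION_"
    else
        echo "UNASSIGNED_"
    fi
}

# Build the full trap OID for a custom id under the base used in this tutorial.
trap_oid() {
    echo ".1.3.6.1.4.1.3375.2.4.0.$1"
}

# Example: an application alert with id 501.
echo "$(trap_prefix 501)DNS_UNREACHABLE $(trap_oid 501)"
```

Under this convention an iRule-driven alert would take an id in the 500 range rather than the next free id.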

Reference documentation
K3727: Configuring custom SNMP traps
K10095: Error Message: Clock advanced by <number> ticks
K41651928: Error Message: 01010040:4: Clock has unexpectedly adjusted by <#> ms
K3667: Configuring alerts to send email notifications
Supplemental Document : Log Messages Reference
