Deep Dive

by Marek

Broadband Network Telemetry over MQTT

At FAELIX we’ve been using message brokers for many years. Back when we started mitigating attack traffic across our entire network, our tech stack already included queues and topics to coordinate the various moving parts. Lately we’ve become big fans of MQTT as the protocol gains support across the IoT ecosystem, and it got us thinking…

The User Story

A wholesale network provider fault struck around six o’clock one Sunday morning. Our network monitoring noticed this and our engineers were alerted to the issue. The engineers emailed the affected backhaul connectivity customer to let them know that their WISP would, unfortunately, be without service until the fibre splicing team had reconnected the area.

The first thing the network technicians at the affected WISP did, having ruled out any issues local to the site, was contact our support team to let us know there was a fault. Unfortunately they hadn’t checked their email, but they were soon reassured that the fault had been detected and had already been raised as a priority. As the day progressed the splicing team did their work, finishing slightly ahead of the published estimated time to resolve.

Learning from this, both we and the affected customer asked ourselves: what if, in future, FAELIX allowed customers to opt in to monitoring of their connections?

Our Internal Implementation

We decided that one of the best ways to achieve this — which wouldn’t rely on being able to ping a customer IP address — would be to leverage our RADIUS platform. Each time a customer broadband connection establishes or drops, our anycast RADIUS cluster receives L2TP steering Access-Request messages, along with accounting Start and Stop messages at the beginning and end of the PPP session. The challenge for us would be how to aggregate this messaging from the distributed cluster into one central location for processing.

We chose to proxy RADIUS accounting messages to MQTT. Each RADIUS server, in addition to its local processing, converts each accounting request into JSON with a very simple mapping using FreeRADIUS’s rlm_python module. After some testing we deployed the relatively short scripts to our RADIUS servers using Salt. To stop a network issue between any RADIUS server and the central MQTT broker from causing AAA timeouts, we also started a small private broker on each RADIUS server, configured to bridge the relevant topics across. This way the MQTT publish is local to the RADIUS server and almost instantaneous. Further checks ensure that our monitoring detects a failure of the local-to-central bridging and alerts our engineering teams to the problem.

import json

import paho.mqtt.publish as publish

def accounting( p ):
    # attributes_to_dict() and dict_to_qos() are helpers defined elsewhere.
    datadict = attributes_to_dict( p )

    # Split "user@realm"; a missing User-Name or realm gives empty strings
    # instead of raising an exception inside the RADIUS server.
    username = datadict.get( 'User-Name' ) or ''
    ( username, _, realm ) = username.partition( "@" )
    datadict[ 'Realm' ] = realm

    # RADIUS reports 64-bit counters as two 32-bit halves; recombine them.
    for direction in ( 'Input', 'Output' ):
        octets = 'Acct-%s-Octets' % direction
        gigawords = 'Acct-%s-Gigawords' % direction
        if octets in datadict and gigawords in datadict:
            datadict[ octets ] = ( int( datadict.pop( gigawords ) ) << 32 |
                                   int( datadict.pop( octets ) ) )

    # One retained topic per user and status type, so that the
    # "radius/+/Start" and "radius/+/Stop" subscriptions match.
    path = "radius/%s@%s/%s" % ( username, realm, datadict.get( 'Acct-Status-Type' ) )
    qos = dict_to_qos( datadict )
    data = json.dumps( datadict )

    publish.single( path, data, qos = qos, retain = True, hostname = ... )
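The two helpers called above, attributes_to_dict and dict_to_qos, are not shown in this post. A minimal sketch of what they might look like follows. It assumes rlm_python passes the request as a tuple of (attribute name, value) pairs with string values, and that Start and Stop events are worth QoS 1 while anything else can go at QoS 0. Both the list of integer attributes and the QoS policy are illustrative guesses rather than our production code.

```python
# Hypothetical helpers for illustration; the real ones are not shown in the post.

# Attributes converted to integers in this sketch (an assumption).
INTEGER_ATTRIBUTES = {
    "Acct-Input-Octets", "Acct-Input-Gigawords",
    "Acct-Output-Octets", "Acct-Output-Gigawords",
    "Acct-Session-Time",
}

def attributes_to_dict(p):
    """Turn a tuple of (name, value) pairs into a plain dict."""
    datadict = {}
    for name, value in p:
        if name in INTEGER_ATTRIBUTES:
            value = int(value)
        datadict[name] = value
    return datadict

def dict_to_qos(datadict):
    """Pick an MQTT QoS level: 1 for Start/Stop, 0 for everything else."""
    if datadict.get("Acct-Status-Type") in ("Start", "Stop"):
        return 1
    return 0
```

With these in place, a Stop request carrying ("Acct-Input-Octets", "10") ends up as the integer 10 in the JSON payload and is published at QoS 1.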

Once all the MQTT infrastructure was in place it was relatively simple for our control plane to subscribe to the relevant topics. Upon receiving a Start or Stop RADIUS message via MQTT, our control plane consults the data in our CRM (which records the associations between PPP credentials, broadband lines, and customers interested in broadband monitoring alerts) and determines if and how to alert the customer. Because we persist the topics for Start and Stop, our subscribed alerting process can easily work out how long a line was online or offline and include that in the alert message. For customers who have opted into receiving SMS notifications we’re using the Simwood Partner platform, and we’ve also implemented email and Pushover alerts. The code below is an elided version of the logic behind the alerting:

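import json

# integ_simwood and integ_mqtt are internal integration modules (not shown).

def radius_process_mqtt_accounting( client, userdata, message ):
    msg = json.loads( message.payload.decode( "utf-8" ) )
    # Topics look like "radius/<user>/<Start|Stop>".
    ( channel, user, *bits ) = message.topic.split( "/" )

    # Retained messages are stored state replayed on subscribe, not new
    # events, so they never trigger an alert.
    destination_smss = []
    if not message.retain:
        status = msg.get( 'Acct-Status-Type' )
        if status == 'Start':
            destination_smss = radius_process_mqtt_accounting_start( user, msg )
        elif status == 'Stop':
            destination_smss = radius_process_mqtt_accounting_stop( user, msg )

    for ( destination, sms ) in destination_smss:
        # Only destinations that look like mobile numbers (07...) get an SMS.
        if destination.startswith( "07" ):
            integ_simwood.send_sms( destination, sms )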

def monitor_broadband():
    mqtt_client = integ_mqtt.Connection( subscriptions = [ "radius/+/Start",
                                                           "radius/+/Stop",
                                                           ],
                                         on_message = radius_process_mqtt_accounting,
                                         )
    mqtt_client.process_messages_forever()
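The retained Start and Stop topics are what make the online and offline durations cheap to work out. As a rough sketch of that idea (not the elided logic itself), the functions below assume each JSON payload carries an Event-Timestamp attribute already converted to Unix seconds, and that a Stop message includes the standard Acct-Session-Time attribute. The function names and the wording of the alerts are hypothetical.

```python
def format_duration(seconds):
    """Render a number of seconds as hours, minutes and seconds."""
    seconds = max(0, int(seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return "%dh %02dm %02ds" % (hours, minutes, secs)

def describe_transition(new_msg, retained_opposite=None):
    """Build alert text for a live Start or Stop accounting message.

    new_msg: decoded JSON payload of the live (non-retained) message.
    retained_opposite: the retained payload from the other topic, e.g. the
        last Stop when new_msg is a Start, or None if nothing is retained.
    """
    status = new_msg.get("Acct-Status-Type")
    if status == "Stop":
        # Acct-Session-Time is the length of the session in seconds.
        online = new_msg.get("Acct-Session-Time")
        if online is None:
            return "Line went offline"
        return "Line went offline after %s online" % format_duration(online)
    if status == "Start":
        started = new_msg.get("Event-Timestamp")
        stopped = (retained_opposite or {}).get("Event-Timestamp")
        if started is None or stopped is None:
            return "Line came online"
        offline = int(started) - int(stopped)
        return "Line came online after %s offline" % format_duration(offline)
    # Interim updates and anything else produce no alert.
    return None
```

A Start arriving 600 seconds after the retained Stop would produce "Line came online after 0h 10m 00s offline".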

Opening Up Telemetry

What if one of our customers wants alerts delivered into a different third-party system? We’d love to accommodate their requirements, but it’s not realistic for us to engage in a software development exercise to integrate with the myriad on-call and alerting systems our customers use. So why don’t we just expose the MQTT endpoints to the customers as well? Well, we did just that!

Now any customer can connect to our MQTT brokers and consume real-time information about their PPP sessions. The credentials required for this are exactly the same as for the broadband connection itself. Because we operate and automate our entire stack, it was relatively straightforward to pull the usernames and passwords out of our LNS secrets management system, hash them into the right format, and distribute them to the public-facing MQTT broker along with an appropriate ACL.
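As an illustration of what consuming this might look like from the customer side, here is a minimal sketch using the paho-mqtt subscribe helper. The broker hostname is a placeholder, the credentials are dummies, and the topic layout (radius/<username>/<Start|Stop>) is assumed from the internal subscriptions shown earlier; check the service documentation for the real values, including whether TLS settings need to be passed via the tls argument.

```python
import json

BROKER_HOSTNAME = "mqtt.example.net"  # placeholder, not the real broker
PPP_USERNAME = "user@realm"           # your broadband login
PPP_PASSWORD = "secret"               # your broadband password

def topic_filter(ppp_username):
    """Topic filter for one line, assuming radius/<user>/<Start|Stop>."""
    return "radius/%s/+" % ppp_username

def summarise(message):
    """Describe one received MQTT message as a single line of text."""
    event = json.loads(message.payload.decode("utf-8"))
    # Retained messages describe the last known state, not a fresh event.
    kind = "last known" if message.retain else "new"
    return "%s: %s %s" % (message.topic, kind, event.get("Acct-Status-Type"))

def on_accounting(client, userdata, message):
    """paho-mqtt message callback: print each event as it arrives."""
    print(summarise(message))

def main():
    # Imported here so the helpers above work without paho-mqtt installed.
    import paho.mqtt.subscribe as subscribe
    # Blocks forever, invoking on_accounting for every matching message.
    subscribe.callback(
        on_accounting,
        topic_filter(PPP_USERNAME),
        hostname=BROKER_HOSTNAME,
        auth={"username": PPP_USERNAME, "password": PPP_PASSWORD},
    )
```

Calling main() starts listening; swapping the print in on_accounting for a call into your own paging or chat system is where an integration would hook in.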

It probably doesn’t make sense to connect to our MQTT broker from the broadband connection being monitored: your MQTT connection dropping might tell you that the PPP session has dropped, but that isn’t a useful alert. Instead we suggest you consume this telemetry off-net, e.g. on one of our VPS servers or via a third-party provider. The lightweight nature of MQTT makes this possible even on a low-end device like a Raspberry Pi.

Next Level

When we announced MQTT APIs for our broadband service, the announcement received positive attention. We’re already planning where to extend them next, both to simplify our own internal tooling and to empower our customers.