BGP.guru

Nerd blog.

28 May 2019

An Exercise in Systems Thinking or Stupidity... Your Choice

I was recently read into a Problem where the vendor was downplaying the severity of the issue and playing the usual Enterprise games of ‘saving face.’ They were harangued into providing a means of detecting the failure state and gave ‘us’ a script to run. This is where I got involved: the interested parties had no infrastructure to run said tool and, since no real manual was provided, were at a loss. No biggy, that’s my jam, so I agreed to get it working for them.

First thing I noticed was the script was designed for one platform but expected to be run on another. Not surprising, but this introduced a rather annoying series of vulnerabilities I wasn’t willing to live with. So I fixed those, factored out a bunch of stuff I didn’t need and implemented a more ‘manageable’ logging strategy. Shared a diff back to the vendor; the usual response was received.

Now that I had something working it was a matter of making it operational and supportable. Took another look and realized this damned thing was never meant to be either. It is a script written in the while (true): do stuff style and requires a TTY. Originally it was supposed to be run on Windows, which would have required a privileged account to be permanently logged into a terminal session with the script running in a foreground command prompt. Yeah, that’s not going to ‘work.’ Can’t schedule it due to the interactivity requirement… and there are a dozen devices that need to be ‘watched’ so yeah, we need to run it a whole bunch of times.

So with that and the commitment to make it work I set out to put together another solution without modifying the original tool too much (again, vendor stuff). My first thought, having gotten this running and ported to (not-Windows), was to run it in tmux (’cause ‘ain’t nobody got time’ to run a session in the foreground forever). But then the stupid thought came to me… why not run the monitors for all the devices in one tmux session?! This would be a bit annoying to get going manually so I would have to figure that out.

Due to the nature of the tool, in that I barely trust it and it requires some privilege, I started working on isolating it with an unprivileged service user and similar process segmentation. Got it running and watching all devices, moved from ‘watch this terminal’ to ‘monitor syslog’ (again, more manageable) and delivered that as the ‘minimally viable product.’

This was a stupid hack and not very operational if you were not VERY familiar with how to use tmux etc. Also, ‘monitor syslog for this string’ is a terrible set of instructions, so I knew I could do better. Plus, if the system got rebooted due to a patch cycle (which happened), monitoring stopped without notification.

My first step was to make setup and tear down easier…

I added signal handling support for SIGINT and SIGTERM as a means of, well, making a graceful exit feasible. That signal handler emits an appropriate log message and cleanly breaks out of the while(true) business.
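
The vendor script itself isn't shown here, so what follows is only a shell sketch of the same idea, not the actual change. The names run_monitor, on_signal and check_device are made up for illustration, and echo stands in for the real log message. The demo at the bottom runs the loop in the background and sends it SIGTERM so the sketch terminates on its own.

```shell
#!/bin/bash
# Sketch only: a "while true" monitor loop that exits cleanly on SIGINT/SIGTERM.
# run_monitor, on_signal and check_device are hypothetical names.

on_signal() {
    # The real handler would write to syslog; echo keeps the sketch self-contained.
    echo "monitor: caught SIG$1, shutting down"
    running=0
}

check_device() {
    # Placeholder for whatever the vendor script actually polls.
    :
}

run_monitor() {
    running=1
    trap 'on_signal INT' INT
    trap 'on_signal TERM' TERM
    while [ "$running" -eq 1 ]; do
        check_device
        sleep 1   # a pending trap runs once this sleep returns
    done
    echo "monitor: loop exited cleanly"
}

# Demo: run the loop in the background, then ask it to stop.
run_monitor &
monitor_pid=$!
sleep 2
kill -TERM "$monitor_pid"
wait "$monitor_pid"
status=$?
echo "monitor exit status: $status"
```

Without the trap, the TERM would kill the loop outright and the exit status would be non-zero; with it, the loop finishes its current pass and returns normally.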

The next step was to get the tmux session stood up and torn down. The tmux manual was helpful here, as it turns out you can set up sessions and populate windows, which is exactly what I needed here…

$ cat start-monitoring.sh
#!/bin/bash
cd /apps/<BASE_REDACTED> || exit 1
session="<NAME_REDACTED>"
devices="A B C D E F G"
echo "Starting tmux session $session."
# Detached session with a placeholder window named "."
tmux new-session -s "$session" -n . -d
for device in $devices; do
    echo "Starting to monitor $device."
    # Double quotes so $device is expanded here; in single quotes tmux would
    # receive the literal string '$device'.
    tmux new-window -t "$session:" "<COMMAND_REDACTED> $device"
    sleep 2 # be kind, don't introduce races that need not be raced
done
echo "Done."
$ cat stop-monitoring.sh
#!/bin/sh
session="<NAME_REDACTED>"
echo "Terminating monitoring processes."
# SIGTERM lets each monitor log and exit via its signal handler
pkill -TERM -u <REDACTED_USER> '<COMMAND_REDACTED>'
echo "Terminating tmux $session session."
tmux kill-session -t "$session"
echo "Done"

This isn’t perfect but it does work, and pretty cleanly all things considered. The darned thing can be started and stopped without knowing how it works or much else. The ‘does it work’ test is pretty straightforward and simple. Far more operational than before, but it doesn’t fix the issue of surviving a reboot.
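
One way the ‘does it work’ test could be scripted is sketched below. This is an assumption about what such a check might look like, not the check actually used: monitoring_ok and its arguments are hypothetical names. It relies on tmux's default behaviour of closing a window when its command exits, so a short window count hints that a monitor has died.

```shell
#!/bin/bash
# Sketch only: a "does it work" check for the tmux-hosted monitors.
# monitoring_ok and its arguments are hypothetical, not part of the scripts above.

monitoring_ok() {
    session="$1"
    expected="$2"   # number of devices that should be monitored

    # No session at all means nothing is being monitored.
    tmux has-session -t "$session" 2>/dev/null || return 1

    # One window per device plus the placeholder "." window from new-session.
    windows=$(tmux list-windows -t "$session" | wc -l)
    [ "$windows" -ge $((expected + 1)) ]
}
```

Usage would be something like `monitoring_ok "<NAME_REDACTED>" 7 || echo "monitoring is degraded"`, run as the same service user that owns the tmux session.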

This problem was extra annoying because of Linux and systemd. If this was *BSD or the tried-and-true legacy sysv-init it would have been trivial to manage. My excuse for actively avoiding learning anything about systemd has come to an end. The process was not pleasurable but I learned just enough to have system state persistence, which is nice.

The first thing one needs to do is set up a manifest/service, which is very similar in nature to SMF or other ‘frameworks.’ The trick is finding the right parameters, and that was not a very intuitive process; the documentation was even more terse than my own and largely confusing. I managed to find a solution with the good old tried and true ‘find other manifests on the system, choose bits and pieces that seem relevant and try’…

I came up with the following:

# cat device-monitor.service
[Unit]
Description=A service to start/shutdown <NAME_REDACTED> monitoring.
After=syslog.target network.target network-online.target

[Service]
Type=forking
User=<REDACTED_USER>
ExecStart=<REDACTED_PATH>/start-monitoring.sh
ExecStop=<REDACTED_PATH>/stop-monitoring.sh
TimeoutStartSec=0

[Install]
WantedBy=default.target

To install it:

cp device-monitor.service /etc/systemd/system/
systemctl enable device-monitor.service

To disable it:

systemctl disable device-monitor.service

To ‘monitor’ service state:

systemctl status device-monitor.service

There, now the damned thing is operational and manageable with system state persistence. But there was one more thing… someone had to keep an eye on it. Having a procedure for the ‘on-call’ folk to run periodically is a waste. Sure, I could have just left it there, but being the ‘on-call’ person and doing stupid things for no direct value was something that annoyed me to no end, so I couldn’t do that to them without sacrificing some humanity.

My usual go-to for this kind of thing is curl+pushover but that’s not ‘Enterprise grade’ and would mean buying something, so that’s a non-starter for reasons I’ll leave as an exercise for the reader to intuit. So back in the muck and mire I worked with another team to implement a paging tool using the Enterprise baloney we have (and it is ever a Rube Goldberg Machine of crazy). The trick was convincing my partners to get me what I needed without trying to solve my problem for me, which netted me a nice and generic tool I can use to page out to the on-call folks without jumping through hoops or navigating byzantine frameworks of process. Said tool (after testing, and annoying the on-call folk for a bit) was then integrated into the above scripts and the device monitoring script to page when monitoring is stopped, when it is started and when the original failure state (the thing the damned tool was supposed to monitor in the first place) is detected. Humans need not do stupid boring watching of the logs, manage some crazy set of scripts or stare at a terminal.
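
The paging tool itself is internal and not shown, so the following is only a sketch of how the start/stop scripts might call it through one small wrapper. PAGE_CMD, LOG_TAG and notify are hypothetical names, and echo stands in for the real pager so the sketch runs anywhere.

```shell
#!/bin/bash
# Sketch only: one function that logs an event to syslog and pages on-call.
# PAGE_CMD, LOG_TAG and notify are hypothetical; the real paging tool is not shown.

PAGE_CMD="${PAGE_CMD:-echo}"            # stand-in: echo instead of a real pager
LOG_TAG="${LOG_TAG:-device-monitor}"

notify() {
    message="$1"
    # Best-effort syslog entry; skipped quietly if logger is unavailable.
    if command -v logger >/dev/null 2>&1; then
        logger -t "$LOG_TAG" "$message" 2>/dev/null || true
    fi
    # Hand the same message to whatever pages the on-call folk.
    "$PAGE_CMD" "$message"
}

notify "monitoring started"
```

Keeping the pager behind a single variable means the start script, the stop script and the monitor can all share the same call, and swapping the paging backend later touches one line.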

This met all my criteria and the damned thing is running in production and has been pretty damned resilient. I’m surprised as I expected this to be more painful given I now ‘own’ the Problem by contributing to it.

Regardless, not being boxed into a particular line of thinking is what is important here… take what is given and adapt the tools at hand to make something that solves the problem, respects the operators’ time and yet is still simple enough to maintain.

Now you might be thinking… I wouldn’t have done it that way… using tmux as a daemon shell is absolutely nuts. Well that’s wonderful, and I’m glad you have a different approach, but I’m fine with using the tool in this way, and it made me appreciate its functionality even more. This speaks to the Unix philosophy of taking small tools and composing them together to achieve something. Systemd is still crazy and non-intuitive but it can be made to do what you need, though I wish the documentation was geared towards actually accomplishing something instead of stating a panoply of options without intent or guidance. As well, I came out of this with a few new tools and a composable way to page our operations teams when it matters, so all in all pretty successful. Usually these requests don’t go so well, but I enjoy having these kinds of things to work on.


Sean Cody - Sean Cody is....