Thursday, February 16, 2017

Test Driven Infrastructure - Validating Layer 1 Networking with Nagios

Previously we've talked about how we use Nagios / Icinga for three broad types of monitoring at LMAX: alerting, metrics, and validation. The difference between our definitions of alerting and validation is a fine one; it has more to do with the importance of the state of the thing we are checking and the frequency with which we check it. An example of what I consider an "Alert" is whether Apache is running on a web server. The version of Apache, on the other hand, is something I might "Validate" with Nagios, but I wouldn't bother checking it every few minutes, and if there was a discrepancy I wouldn't react as fast as if the entire Apache service was down. It's a loose distinction, but a distinction nonetheless.

The vast majority of our network infrastructure is implemented physically in a data centre by a human being. Someone has to go plug in all those cables, and there's usually some form of symmetry, uniformity and standard to how we patch things that gives Engineers like me warm fuzzy feelings. Over many years of building our Exchange platforms we've found that going back to correct physical work costs a lot of time, so we like to get it right the first time, or be told very quickly if something is not where it's expected to be. Thus enters our Test Driven Networking Infrastructure - our approach uses Nagios / Icinga as the validation tool, Puppet as the configuration and deployment engine, LLDP as the protocol everything runs on top of, and Patch Manager as the source of truth.

Validating Network Patching

I've written about our Networking Puppet module before and how we use it to separate our logical network design from its physical implementation. The same Puppet Networking module also defines the monitoring and validation for our network interfaces. Specifically this is defined inside the Puppet class networking::monitoring::interface, which has a hard dependency on LMAX's internal Nagios module, which unfortunately is not Open Source at this time (and would be one long blog post of its own to explain).

Since you can't see the code I'll skip over the implementation and go straight to the result. Here is what our Puppet Networking module gives us in terms of alerts:



Pretty self explanatory. Here's the end result of our networking infrastructure validation, with server names and switch names obfuscated:



However a green "everything-is-ok" screenshot is probably not a helpful example of why this is so useful, so here are some examples of failing checks from our build and test environments:




To summarise the above, our validation fails when:
  • we think an interface should be patched somewhere but it's not up or configured
  • an interface is patched into something different from what it should be
  • an interface is up (and maybe patched into something) but not in our source of truth
Next I'll describe how the Nagios check works. Combined with a specific provisioning process which I describe below, the above checks give us Test Driven Infrastructure that helps us quickly correct physical patching errors.

How The Nagios Check Actually Works

The idea behind the check is for the Nagios server to first retrieve what the server says the LLDP neighbour of each interface is, then compare this with its own source of truth and raise an appropriate OK, WARNING or CRITICAL check result.

Nagios knows which interfaces to check because Puppet describes every interface that should be monitored. Nagios makes an SNMP call to the server and gets back CSV output that looks like this:

em1,yes,switch01,1/31,10,Brocade ICX6450-48
em2,yes,switch02,1/31,10,Brocade ICX6450-48

The fields are:
  1. interface name
  2. link
  3. remote LLDP device name
  4. remote LLDP device port
  5. VLAN
  6. remote LLDP device model
A version of this script is up on GitHub here. It contains a lot of conditional logic to handle the LLDP information from different vendor hardware. For example, certain Brocade switches don't mention the word "Brocade", so we infer that from the MAC address. Different switches also use different fields for the same information, and the script parses the right field based on the remote side's model type, e.g. Brocades and Linux kernels put the Port ID in the "descr" field but other devices put it in the "id" field.
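
That field selection boils down to something like the snippet below - purely illustrative, with hypothetical argument names; the real script on GitHub handles many more cases:

def remote_port(model, port_id, port_descr):
    """Pick whichever LLDP field holds the remote port for this kind of device."""
    if "Brocade" in model or "Linux" in model:
        return port_descr   # Brocades and Linux hosts report the port in "descr"
    return port_id          # most other devices report it in "id"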

The Nagios check then cross-references this LLDP data against its own records, the "source of truth" file, which looks like this:

server01,em1,switch01,0/31
server01,em2,switch02,0/31
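
To make the comparison concrete, here is a minimal Python sketch of the cross-referencing logic. It is not our actual check script: the function names and the exact mapping of the failure conditions listed earlier onto WARNING and CRITICAL are illustrative choices, and the vendor and stacked-switch handling described in this post is left out.

import csv

OK, WARNING, CRITICAL = 0, 1, 2

def parse_lldp(text):
    # iface,link,switch,port,vlan,model -- the SNMP output shown earlier
    return {row[0]: {"up": row[1] == "yes", "switch": row[2], "port": row[3]}
            for row in csv.reader(text.splitlines()) if row}

def parse_patchplan(text, hostname):
    # host,iface,switch,port -- the source of truth file shown above
    return {row[1]: {"switch": row[2], "port": row[3]}
            for row in csv.reader(text.splitlines()) if row and row[0] == hostname}

def check_interface(iface, expected, actual):
    if expected and not (actual and actual["up"]):
        return CRITICAL, "%s should be patched to %s %s but has no LLDP neighbour" % (
            iface, expected["switch"], expected["port"])
    if expected and (actual["switch"], actual["port"]) != (expected["switch"], expected["port"]):
        return CRITICAL, "%s sees %s %s, expected %s %s" % (
            iface, actual["switch"], actual["port"], expected["switch"], expected["port"])
    if actual and not expected:
        return WARNING, "%s sees %s %s but is not in the patch plan" % (
            iface, actual["switch"], actual["port"])
    return OK, "%s is patched to %s %s as expected" % (
        iface, expected["switch"], expected["port"])

def run_checks(lldp_csv, patchplan_csv, hostname):
    actual, expected = parse_lldp(lldp_csv), parse_patchplan(patchplan_csv, hostname)
    return [check_interface(i, expected.get(i), actual.get(i))
            for i in sorted(set(actual) | set(expected))]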

The Nagios check script has some smarts built in to handle logical implementations that don't model well in Patch Manager. One of the complexities is stacked switches. The LLDP information from the server will describe a stacked switch port as something like "3/0/10", where 3 is the Stack ID. In Patch Manager it would get confusing if we labelled every device in a stack the same, so instead we name them switch1-3, where the "-3" indicates the stack number. The Nagios script looks for and parses this as the Stack ID.
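
A rough illustration of that stack handling (the switch1-3 naming convention is ours, but these helper functions are hypothetical):

def split_stack_port(lldp_port):
    """Split an LLDP port like '3/0/10' into a stack ID and port: ('3', '0/10').
    Ports without a stack component, e.g. '0/31', return a stack ID of None."""
    parts = lldp_port.split("/")
    if len(parts) == 3:
        return parts[0], "/".join(parts[1:])
    return None, lldp_port

def split_stack_switch(patch_manager_name):
    """Split a Patch Manager device name like 'switch1-3' into ('switch1', '3')."""
    base, sep, stack_id = patch_manager_name.rpartition("-")
    if sep and stack_id.isdigit():
        return base, stack_id
    return patch_manager_name, None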

Our TDI Workflow

The Nagios checks are the critical part of a much larger workflow which gives us Test Driven Infrastructure when we provision new machines. The workflow roughly follows the steps below, and I go into each step in more detail in the following sections:
  1. Physical design is done in Patch Manager, including placement in the rack and patching connections
  2. Connections are exported from Patch Manager into a format that our Nagios servers can parse easily
  3. Logical design is done in Puppet - Roles are assigned and necessary data is put in Hiera
  4. Hardware is physically racked and the management patches are put in first
  5. Server is kickstarted and does its first Puppet run; Nagios updates itself and begins to run checks against the new server
  6. Engineers use the Nagios checks as their test results, fixing any issues
As you might have deduced already, the workflow is not perfectly optimised; the "tests" (Nagios checks) come from Puppet, so you need a machine to be installed before you get any test output. We also need at least some patching done in order to kickstart the servers before we can get feedback on any of the other patching.

Physical Design in Patch Manager

We use Patch Manager's Software-As-A-Service solution to model our physical infrastructure in our data centres. It is our source of truth for what's in our racks and what connections are between devices. Here's an example of a connection (well, two connections really) going from Gb1 in a server, through a top of rack patch panel, and into a switch:



Exporting Patch Manager Connections

Having all our Nagios servers continually reach out to the Patch Manager API to search for connections would be wasteful, considering that day to day the data in Patch Manager doesn't change much. Instead we export the connections from Patch Manager and at the same time filter out any intermediate patch panels or devices we don't care about - we only want to know about the two ends of the connection. Each Nagios server has a copy of the "patchplan.txt" file, which is an easy-to-parse CSV that looks like this:

server01,em1,switch01,0/31
server01,em2,switch02,0/31
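
The filtering step amounts to collapsing each connection path down to its two endpoints. The sketch below assumes a made-up intermediate format (a list of device/port hops per connection) and a made-up patch panel naming test; only the idea of dropping everything between the two ends is taken from what we actually do.

def collapse(path, is_intermediate=lambda device: device.startswith("panel")):
    """Reduce a patched path to its two endpoints, dropping patch panels."""
    ends = [(dev, port) for dev, port in path if not is_intermediate(dev)]
    (a_dev, a_port), (b_dev, b_port) = ends[0], ends[-1]
    return "%s,%s,%s,%s" % (a_dev, a_port, b_dev, b_port)

# A server -> patch panel -> switch path becomes one patchplan.txt line:
print(collapse([("server01", "em1"), ("panel-a4", "12"), ("switch01", "0/31")]))
# server01,em1,switch01,0/31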


Logical Design In Puppet

As part of creating the new server in Puppet, the networking configuration is defined and modelled in line with what has been planned in Patch Manager. So for example, if a Dell server has its first two on-board NICs connected to management switches in Patch Manager, somewhere in Puppet a bonded interface will be defined with NICs em1 and em2 as slaves (em1 and em2 being the default on-board NIC names on a Dell server).

How we model our logical network design in Puppet is covered in much more detail here.

Hardware is Physically Racked

Obviously someone needs to go to the data centre and rack the hardware. If it's a large build it can take several days, or weeks if there's restricted time we can work in the data centre (like only on weekends). We try to prioritise the management patching first so we're able to kickstart machines as quickly as possible.

Kickstarts and Puppet Runs

Once a new server has done its first Puppet run and its catalog is compiled, a set of Exported Puppet Resources that describe Nagios checks for this server are available for collection. The Puppet runs on our Nagios servers collect all these resources, turn them into the relevant Nagios configuration files and begin running the service checks.

Make the Red and Yellow go Green

Since this is a newly built server it's expected that a lot of the validation-style Nagios checks will fail, especially if only the management networks are patched but our Puppet code and Patch Manager are expecting other NICs to be connected. Our engineers use the Nagios check results for the new server as the feedback loop in our Test Driven Infrastructure approach to provisioning: make the tests pass (make the red and yellow go green) and the server is ready for production.

Monday, January 30, 2017

Puppet Networking Example

The majority of our Puppet modules - and I think most organisations that adopted Puppet over 5 years ago are in the same boat - are nothing to be proud of. They were written very quickly, for a specific internal purpose, for a single operating system, and have no tests. These internal modules aren't really worth sharing, as there are much better modules on the Forge or GitHub. Over the years I've been slowly retiring our home grown modules in favour of superior, publicly available modules, but it's a long process.

There are a couple of our internal modules that I am quite proud of though. One of them is our Networking module, which we use to write the configuration files that describe the interfaces, bonds, vlans, bridges, routes and rules on our Red Hat derived systems. Our networking module allows us to quickly define an interface for a VM with a piece of Hiera config if we want something simple, but its real strength comes from how we use it to model our defence in depth networking architecture for our platform.

The module's not perfect, but we've been able to largely abstract our logical network design from how we implement it physically. Our Puppet roles and profiles describe themselves as having "Application Networks" and the module takes care of what that looks like on servers in different environments - perhaps it's an untagged bonded network in production but it's vlan tagged in staging with a completely different IP range.

Here is the module + accompanying documentation on GitHub, along with the first few paragraphs of the Preface.


LMAX-Exchange/puppet-networking-example

This is not a "real" Puppet module - it's not designed to be cloned or put on the Forge. It even refers to other Puppet modules that are not publicly available. In fact, if you did blindly install this module into your infrastructure, I guarantee it will break your servers, eat your homework, and kill your cat.

This module is a fork of LMAX's internal networking module with a lot of internal information stripped out of it. The idea behind releasing this is to demonstrate a method of abstracting networking concepts from networking specifics. It is designed to educate new LMAX staff, plus a few people on the Puppet Users list who expressed some interest. The discussion in Puppet Users thread How to handle predictable network interface names is what motivated me to fork our internal module to describe it to other people.

I'm now going to fabricate a scenario (or a "story") that will explain the goals we are trying to reach by doing networking this way in Puppet. While the scenario's business is very loosely modelled on our own Financial systems architecture, the culture and values of the Infrastructure team in the scenario match our own Infrastructure team much more closely - which is how our Puppet networking evolved into what it is now.

If the scenario sounds completely alien to you - for example if you run a Cloud web farm where every instance is a transient, short-lived VM - then the design pattern this module is promoting probably won't be that helpful to you. Likewise if you are a one-man Sys Admin shop then this level of abstraction will read like a monumental waste of time. If however you run an "enterprise" shop, manage several hundred servers and "things being the same" is very important to you, then hopefully you'll get something from this.

Saturday, January 14, 2017

Leaping Seconds

Just before New Year 2017 a leap second was inserted into Coordinated Universal Time (UTC). At LMAX we had some luxury to play with how we handled the leap second. January 1st is a public holiday, there's no trading, so we are free to do recovery if something didn't go according to plan. This blog post is an analysis of the results of various time synchronisation clients (NTP and PTP) using different methods to handle the leap second.

Some Research

Red Hat have a comprehensive article about the different clock synchronisation software they support on their Operating Systems and each one's capabilities. The section "Handling of the Leap Second" is especially worth a read to understand the various options and which ones would be applicable to you.

Since there's no financial trading on New Year's Day, this event became a real "live test" opportunity for us. We were able to consider all the available methods for correcting time. If the leap second had been inserted in the middle of the year (June 30th), chances are the next day would have been a working day (in 2016 it was) and we'd have had fewer options to consider.

Our platform code assumes that time never goes backwards - it is always expected to go forwards at some rate. If it does go backwards, our application logic simply uses the last highest time it saw until the underlying clock source has progressed forwards again. In other words, our platform's view of time will "freeze" for one second if the clock is stepped back for one leap second.
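
As a rough illustration (this is not our actual application code), the rule amounts to wrapping the clock source so that it only ever reports the highest time seen so far:

import time

class ForwardOnlyClock(object):
    """Illustrative only: keep returning the highest time seen so far, so the
    view of time 'freezes' while the underlying clock is stepped backwards."""

    def __init__(self, source=time.time):
        self._source = source          # underlying clock, e.g. the system clock
        self._highest = source()

    def now(self):
        self._highest = max(self._highest, self._source())
        return self._highest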

During trading hours this can be a problem. For previous leap seconds we've ignored the event and let NTP handle the clock drift naturally. The Red Hat page describes the clock being off for "hours" when you use this method; from our past experience it's more like days. Ideally we want clock synchronisation to recover rapidly and we want time to always progress forward - the "client slew" method.

Most of our platform uses the tried and tested NTP Daemon for clock synchronisation. The standard NTP Daemon doesn't have a fast slewing option; only Chrony can do this. Upgrading to Chrony before the leap second event wasn't an option for us unfortunately, so our hand was forced to use the "daemon step" method for this leap second. We judged it safer than the kernel step method (less likely to trigger kernel bugs), but we knew our platform code needed to be tested heavily.

Some of our platform uses PTPd, and it's due to be rolled out more widely soon. PTPd's in-built help describes its leap second handling methods:

setting: clock:leap_second_handling (--clock:leap_second_handling)
   type: SELECT
  usage: Behaviour during a leap second event:
         accept: inform the OS kernel of the event
         ignore: do nothing - ends up with a 1-second offset which is then slewed
         step: similar to ignore, but steps the clock immediately after the leap second event
        smear: do not inform kernel, gradually introduce the leap second before the event
               by modifying clock offset (see clock:leap_second_smear_period)
options: accept ignore step smear
default: accept

Personally I was interested in knowing how quickly PTPd could bring the clock back in sync if we simply ignored the leap second and let its normal error correction mechanism slew the clock. This would probably be our preferred method if a leap second were introduced during trading hours.

NTP Planning and Expectations

The plan was to have NTP step and PTP ignore the leap second.

Telling NTPd to step the clock is simple - we just needed to remove the "-x" flag from ntpd - but we had to make sure our platform code would handle it. To do this we isolated one of our performance test environments and set up a fake stratum 1 NTP server by fudging a stratum 0 server. The configuration for this fake NTP server is:

restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict 127.127.1.0
restrict -6 ::1
restrict 10.101.0.0 mask 255.255.0.0 notrap nomodify
server 127.127.1.0
driftfile /var/lib/ntp/drift
fudge 127.127.1.0 stratum 0
leapfile /etc/ntp/leap-seconds.list

We set the fake NTP server's system clock to Dec 31st 23:45:00, force-sync'd all performance machines to this NTP server, then started a performance run. This particular run generally takes 10 minutes to get going, so by 23:59:59 the environment would be running its normal performance load, which is a simulation based on real production traffic patterns. This is one of the best tests we can come up with to simulate what would happen if the leap second occurred during business hours.

This leap second test was repeated a number of times and, as expected, the timestamp 23:59:59.999 was used for the second time the clock ticked 23:59:59. Once the clock moved to 00:00:00 the exchange time progressed forward normally.

PTP Calculations

We wanted to test PTP slewing the clock post leap second, which is the method we'd be considering if the leap second occurred during trading hours. We know that NTP can take a long time to recover from a leap second. The inbuilt PTP configuration docs describe the three options we set to slew the clock and improve recovery speed:

ptpengine:panic_mode=n
clock:leap_second_handling=ignore
clock:max_offset_ppm=1000

The first option is to stop the PTP daemon entering panic mode, which can result in the daemon stepping the clock (we want to avoid steps).

The second option simply tells PTPd to ignore the leap second from upstream, which will begin the slewing process after the leap second event occurs.

The third option sets the maximum frequency shift of a software clock. It's measured in Parts Per Million, where 1ppm is a shift of 1us per second. A value of 1000 means that we should be able to recover the clock by 1ms every second, which is 1000 seconds to recover from the leap second event.

There is also a default setting "clock:leap_second_pause_period=5" which makes the PTP daemon stop doing clock updates for 5 seconds before and 5 seconds after the leap second event, basically as a safety measure.

1000 seconds is 16 minutes and 39 seconds; adding the 5 second pause period, we estimated that our PTP disciplined server clocks should be back in sync by 00:16:44 on January 1st.
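
For anyone who wants to check the arithmetic, the estimate works out like this:

# 1 ppm = 1 microsecond of correction per second of real time, so slewing at
# max_offset_ppm recovers an offset of offset_s seconds in
# offset_s * 1e6 / max_offset_ppm seconds.
def recovery_seconds(offset_s, max_offset_ppm):
    return offset_s * 1e6 / max_offset_ppm

leap, pause = 1.0, 5                         # one leap second, clock:leap_second_pause_period
print(recovery_seconds(leap, 1000))          # 1000.0 seconds = 16 minutes 39 seconds
print(recovery_seconds(leap, 1000) + pause)  # ~1005 s, i.e. back in sync by about 00:16:44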

What Actually Happened: NTP

The actual leap second event was, overall, fine. For the NTP disciplined servers the testing of our code held up and, as expected, our platform stopped processing for 1 second until real time caught up with its view of time. If we look at the clock drift of one of our NTP disciplined servers at this time, there's no perceivable clock drift after Sunday 00:00 (the scale of the graph is in microseconds):



A much more interesting graph is from a non-production machine that didn't pick up the NTP configuration change that removed the "-x" flag. On this hardware NTPd ignored the leap second and disciplined the clock using its normal algorithms:



If you look at the X axis, it takes almost 12 hours for this NTP daemon to get remotely close to zero, and even after that it's not until Monday 12:00 that the system clock is within a 10ms offset. This behaviour fits our observations during the previous leap second - it took much longer to recover than we expected.

The ntpd man page says the maximum slew rate the Linux kernel allows is 500ppm, so it will take a minimum of 2000 seconds for NTP to correct 1 second of inaccuracy. What we're looking at here, though, is days. While we will be moving almost all platform servers to PTP, we will still use NTP in our estate, and thus I'd like to understand the above behaviour. We haven't done much research into improving NTP recovery times, but I'd be surprised if there's not a way to tune the daemon to bring this down significantly.

A simpler option of course is to just replace ntpd with chronyd. Chrony supports a client slew method and while I don't have any hard data, Red Hat describe chronyd's leap second recovery as "minutes".

What Actually Happened: PTP

I calculated it would take a little over 15 minutes for PTP to bring the clock back into sync. It actually took 45 minutes. When using PHC hardware timestamping, the PTPd daemon manages several clocks. The master clock is the PHC of whatever interface the PTP signal is coming over, and then there's the system clock, which is a slave of the master clock. If the configured interface is also a bonded interface, then any non-active interfaces are also managed as slave clocks.

Slave clocks are synchronised from the master clock using the same algorithm and rate limits, but more importantly slave clocks are not synchronised until the master clock is stable (i.e. in the LOCKED state). So what actually happens is that the master clock - which in our graph below is the PHC device attached to interface em1 - synchronises its time to the upstream PTP master clock first, and only once it is in sync do the rest of the slave clocks in the server start to be disciplined:



This is why the offset of em1 begins to track back into sync a little after 00:01:00. em2 and the System clock only begin to synchronise after 00:10:00, once em1 is LOCKED.  Why are the NIC clocks synchronising faster than the System clock though?

PTPd has the "clock:max_offset_ppm_hardware" setting, which defaults to 2000ppm and is also the daemon's maximum. This means it will take 1000000/2000/60 = 8.33 minutes to correct one second of offset. However the system clock is a software clock, whose rate is controlled by the "clock:max_offset_ppm" option, which we specifically set to the maximum value of 1000ppm. The system clock should be recovering by 1ms every second but it's actually taking 2 seconds to recover 1ms, which is clearly visible in the slope of the graph if you zoom in (see below):



It looks like our value of 1000 for "clock:max_offset_ppm" didn't do anything. Wojciech Owczarek provided the answer - it is a known issue with the version of PTPd we're running. Support for slewing system clocks above the kernel maximum of 500ppm isn't finished yet, but will be in the final version.
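
Putting the numbers from this section side by side, using the same parts-per-million arithmetic as before:

# Minutes to slew away one second of offset at a given ppm limit.
def recovery_minutes(offset_s, ppm):
    return offset_s * 1e6 / ppm / 60.0

print(recovery_minutes(1, 2000))  # NIC (PHC) clocks at 2000ppm: ~8.3 minutes
print(recovery_minutes(1, 1000))  # system clock at our configured 1000ppm: ~16.7 minutes
print(recovery_minutes(1, 500))   # the ~500ppm we effectively got: ~33 minutes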

While it's not as fast as I'd predicted, PTPd recovery is a lot faster than our NTP recovery.  We still want to know why our standard NTP recovery time is measured in days rather than hours, but that's less important if we move to Chrony for NTP.