Friday, June 12, 2020

Puppet Acceptance Tests, the hard way (part 1)

The first in a series of posts about building acceptance test infrastructure for my team's Puppet.

This first post is mainly background, thoughts, and ranting.  It's been quite cathartic writing it, but it's probably not super useful.  If you just want to see some tech stuff, skip to the next post.

Here are (or will be) the related posts, as I write them:
  1. ... There aren't any yet, this is the first one.

Testing @ LMAX

Our Development team has an amazing CI and Staging setup of their own (my team doesn't use it, we just keep it running).  They've been iterating on their CI/CD toolkit for 10+ years, and while bits and pieces have been swapped out and replaced with other components, it's essentially the same.  It helps enormously when a culture of continuous delivery is instilled in your development team from the get-go, rather than trying to add CI in later.

It is very mature, and while it's not perfect and there are many things I'd like to improve, it is a lot better than most.  Every now and then someone moves on and gets to experience how other companies do their testing.  When meeting up for beers a few months later, often the feedback goes something like "No one does it as well as LMAX".  It's nice to hear, it's nice to be validated, but it's also nice to know there are others that would give us a run for our money.

Over on my side, where the infrastructure is, things are not as rosy as they are for my coder colleagues.  We are primarily designers and builders; hardware experts, systems and networking specialists.  We're not developers, but we do code to an extent.

We've used Puppet for about 8 years, but the testing of our Puppet code has been lacking; highlighted all the more when I glance up from my desk at the 60" TVs displaying the Dev team's CI results in real time.

Puppet describes the vast majority of our operating systems, and I would say we are on the Power User side of the community.  Our Puppet configures complex storage and networking, makes BIOS and performance tuning changes, and drives the monitoring of our server infrastructure.

Such complexity without tests slowed us down; catalog compilation failures, missing Hiera data.  Stuff that works for 90% of machines, but then there are those few exceptions.  Touch one thing over here, and something all the way over there breaks.

In the last 4 years we've gotten a lot better.  We introduced rspec-puppet unit tests.  We adopted the Roles and Profiles design pattern.  We built our own CI, using GitLab.  We moved to Puppet 4, then 5, then 6.  We've retired many lmax-* modules and replaced them with upstream ones that are better tested.  We have a lot fewer catalog compilation failures now, and the culture is shifting to "I'll write a test for that".

As we learn and our testing gets better, I find we are slowed down less by the simple Puppet problems and more by the interactions between our complexity, our data and our platform.  The kind of problems where combining this Puppet Role, with this Hiera data, on top of a certain shape of system will cause a machine to provision incorrectly.

These complex failures are difficult to model using unit tests.  We need to actually run Puppet on a system, not just compile a catalog.  To test how our Puppet interacts with a system, we need to run it on something that looks as much like hardware as possible.

Provisioning

A good portion of what my team does is provisioning and re-provisioning systems, both hardware and VMs.  When it works, it works great; you send a piece of hardware for network boot, it kickstarts itself, reboots into what we call our initial Puppet run, and if that has no failures it reboots again.  After the second reboot the machine should be ready for use - it will be running the kernel we want, networking has come up on boot, storage is in place, our monitoring auto-magically updates to add the new machine, and everything goes green.

If it doesn't work perfectly you've got to figure out what went wrong.  In some cases it might be a simple failure that a second Puppet run will fix.  We aim for single-pass Puppet, but sometimes there are bugs.  If it's a serious problem like a catalog compilation failure, it takes longer to repair: console, root login, check the logs, etc.

Being able to continually test provisioning machines over and over again, and notify the team of any failures, would make us faster.  Taking this idea further, the long term dream would be to provision sets of machines as different things and run tests across these systems to make sure they interact with each other correctly.  If eventually we could simulate the provisioning of an entire financial exchange at once...  Now that would be something to brag about.

I would need to be able to test the low level things that our Puppet does now; things like creating Logical Volumes and file systems, and making kernel changes.  A container-based acceptance test system would probably be easier to start with, but it's not an accurate representation of hardware, so it would not provide as much value long term.  For these reasons I wanted to concentrate on testing with VMs.

It was simple enough to "break" one of our KVM kickstarts, then convert that disk image into a Vagrant image.  I now had a repeatable image of a machine that looked like it had just finished our kickstart - the perfect starting point from which to test our Puppet Roles.  We use libvirt/KVM exclusively, so having Vagrant create libvirt VMs via the vagrant-libvirt plugin would keep us in familiar territory.

Getting Started (or not)

Being a Systems Engineer by trade, my approach to implementing a solution to a problem usually goes something like this:
  1. Figure out that what we've got right now doesn't solve my problem, nor will it do so sensibly
  2. Google about my problem
  3. Get annoyed that nothing appears on page 1 of the results, so add a few more words to Google I think might help
  4. Read some blog posts or marketing material of other people who have sort of done something similar, and learn about some New Thing
  5. Read a bit more about New Thing, maybe read some of the manual, enough to vaguely understand how it works and if it will solve the problem
  6. Look for example code that works with this New Thing that I can borrow
  7. Utilise upstream code in my own estate
  8. Deploy everywhere
  9. Give 60 minute presentation to my team about how amazing I am having done this all by myself and not copied anything from the Internet at all
Using this approach, I managed to get a small number of Puppet acceptance tests running several years ago; I had some tests of our internal networking module, run ad-hoc on my laptop.

Then I did a Fedora upgrade and Vagrant broke.

I fixed that; then a new version of rspec-puppet clashed with Beaker.

I had to wait for a fix for that, and by then another Fedora upgrade broke Vagrant again.

We managed to get a magical combination of Gems working on a CentOS 6 GitLab Runner at some point in time; then it broke.

Then Beaker 4 got released and a bunch of stuff stopped working.

Then my own Vagrant broke again, or more specifically vagrant-libvirt (I think we're at about Fedora 29 at the moment, I've suppressed a lot of these memories).

Then all the Beaker modules were archived on GitHub.  That was a rather large nail in the coffin, but inconsequential at the time, as my Vagrant was still broken.

My SysAdmin approach of "let's just whack some things together" wasn't working.  I wanted an acceptance test pipeline to test complex provisioning, but kept getting stuck provisioning the simplest SUT.

Even after some people that are smarter than me solved the current instance of my vagrant-libvirt problem, I'd pretty much convinced myself that my current approach was too fragile.  A lot of my problems boiled down to trying to provision machines using Vagrant while inside a Ruby development environment designed to test Puppet.  I simply didn't have the time or skill set to get it working.

I looked at other things.  Test Kitchen still had the same problems with Vagrant and Ruby clashing.  Puppet have their own provisioners built on VMware, vmpooler and ABS, but we don't use VMware.  I even thought about Testinfra...  Do the provisioning and testing in Python, a language I'm more familiar with, but mixing two different languages and test suites seemed like a bad idea.

It frustrated me that I could go to a public cloud and do this in 15 seconds, but was struggling so hard with Vagrant and vagrant-libvirt on my own laptop (now Fedora 32).

Getting Started (again)

A year or so ago Puppet announced Litmus, but it was still very new.  From what I gathered reading a few conversations, it is mainly designed to run the same tests on SUTs in parallel - very useful if you want to spin up 14 different operating systems and run the tests on each, like if you were testing a Puppet module for mass consumption.  Litmus has a provisioning system built on top of Bolt, which at the time supported Docker only, so it didn't immediately help me provision Vagrant machines.

Over time Bolt has gained more provisioners.  Then recently someone in the Puppet Community Slack posted this YouTube video from Daniel Carabas that I found very exciting.  It was a quick demonstration of using Terraform to provision some Vagrant machines to run tests against, all driven by Litmus.

At first I missed the point; I saw Terraform and the Terraform Vagrant Provider as a way to get around my Vagrant provisioning problems.  I'd never used Terraform, Bolt, or Litmus before, so I paused the video every 10 seconds, typing out exactly what the video described so I could learn the bits and pieces.

I was able to provision Vagrant machines using Terraform directly.  Then, when provisioning while running the Litmus tests, the tests just... stopped.  It turns out Terraform just stopped, and it stopped because it didn't handle the Vagrant failure - which failed in the exact same way it always failed when run from Bundler inside the Ruby environment.  I was right back where I started.

The Actual Point

Litmus runs on top of Bolt, Bolt interfaces with Terraform, and Terraform can do a lot.  Rather than trying so hard to make Vagrant work because it is easy to make Vagrant resemble our hardware, can I spend less effort making cloud infrastructure - which is easy to provision - resemble our hardware?

I'm swapping one problem for another, and hoping what I swap to is less work.  Terraform also has the potential to pay off long term, as it is very capable of modelling complex infrastructure.

In the next posts we'll do some real work as I explore this.

Thursday, February 16, 2017

Test Driven Infrastructure - Validating Layer 1 Networking with Nagios

Previously we've talked about how we use Nagios / Icinga for three broad types of monitoring at LMAX: alerting, metrics, and validation. The difference between our definitions of alerting and validation is a fine one; it has more to do with the importance of the state of the thing we are checking and the frequency with which we check it. An example of what I consider an "Alert" is whether Apache is running or not on a web server. The version of Apache might be something I "Validate" with Nagios as well, but I wouldn't bother checking it every few minutes, and if there was a discrepancy I wouldn't react as fast as if the entire Apache service was down. It's a loose distinction, but a distinction nonetheless.

The vast majority of our network infrastructure is implemented physically in a data centre by a human being. Someone has to go plug in all those cables, and there's usually some form of symmetry, uniformity and standardisation to how we patch things that gives Engineers like me warm fuzzy feelings. Over many years of building our Exchange platforms we've found that going back to correct physical work costs a lot of time, so we like to get it right the first time, or be told very quickly if something is not where it's expected to be. Thus enters our Test Driven Networking Infrastructure - our approach uses Nagios / Icinga as the validation tool, Puppet as the configuration and deployment engine, LLDP as the protocol everything runs on top of, and Patch Manager as the source of truth.

Validating Network Patching

I've written before about our Puppet Networking module and how we use it to separate our logical network design from its physical implementation. The same Puppet Networking module also defines the monitoring and validation for our network interfaces. Specifically this is defined inside the Puppet class networking::monitoring::interface, which has a hard dependency on LMAX's internal Nagios module, which unfortunately is not Open Source at this time (and would be one long blog post of its own to explain).

Since you can't see the code, I'll skip over the implementation and go straight to the result. Here is what our Puppet Networking module gives us in terms of alerts:



Pretty self explanatory. Here's the end result of our networking infrastructure validation, with server names and switch names obfuscated:



However a green "everything-is-ok" screenshot is probably not a helpful example of why this is so useful, so here are some examples of failing checks from our build and test environments:




To summarise the above, our validation fails when:
  • we think an interface should be patched somewhere but it's not up or configured
  • an interface is patched in to something different than to what it should be
  • an interface is up (and maybe patched in to something) but not in our source of truth
Next I'll describe how the Nagios check works. Combined with a specific provisioning process which I describe below, the above checks give us Test Driven Infrastructure that helps us quickly correct physical patching errors.

How The Nagios Check Actually Works

The idea behind the check is for the Nagios server to first retrieve what the server says the LLDP neighbour of each interface is, then compare this with its own source of truth and raise an appropriate OK, WARNING or CRITICAL check result.

Nagios knows which interfaces to check because Puppet describes every interface to monitor. Nagios makes an SNMP call to the server, getting back CSV output that looks like this:

em1,yes,switch01,1/31,10,Brocade ICX6450-48
em2,yes,switch02,1/31,10,Brocade ICX6450-48

The fields are:
  1. interface name
  2. link
  3. remote LLDP device name
  4. remote LLDP device port
  5. VLAN
  6. remote LLDP device model
A version of this script is up on GitHub here. It contains a lot of conditional logic to handle the LLDP information from different vendors' hardware. For example, certain Brocade switches don't mention the word "Brocade", so we infer that from the MAC address. Different switches also use different fields for the same information, and the script parses the right field based on the remote side's model type, e.g. Brocades and Linux kernels put the Port ID in the "descr" field but other devices put it in the "id" field.
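The parsing half of such a check can be sketched in a few lines of Python. This is a simplified illustration, not the real script: the field names come from the list above, and all the vendor-specific quirks are omitted.

```python
import csv
import io

# Field order as returned by the server's SNMP call (see the list above)
FIELDS = ["interface", "link", "remote_name", "remote_port", "vlan", "remote_model"]

def parse_lldp_csv(raw):
    """Turn the CSV returned over SNMP into a dict of records keyed by interface."""
    return {row[0]: dict(zip(FIELDS, row))
            for row in csv.reader(io.StringIO(raw)) if row}

snmp_output = ("em1,yes,switch01,1/31,10,Brocade ICX6450-48\n"
               "em2,yes,switch02,1/31,10,Brocade ICX6450-48\n")
neighbours = parse_lldp_csv(snmp_output)
print(neighbours["em1"]["remote_name"])  # → switch01
```

From here the check only needs to look up each expected interface in `neighbours` and compare.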

The Nagios check cross references this data against its own records - the "source of truth" file - which looks like this:

server01,em1,switch01,0/31
server01,em2,switch02,0/31

The Nagios check script has some smarts built in to handle logical implementations that don't model well in Patch Manager. One of the complexities is stacked switches. The LLDP information from the server will describe a stacked switch port as something like "3/0/10", where 3 is the Stack ID. In Patch Manager it would get confusing if we labelled every device in a stack the same, so instead we name them like switch1-3, where the "-3" indicates the stack number. The Nagios script looks for and parses this as the Stack ID.
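To make the comparison concrete, here is a hedged sketch (again, not the real script) of the stack ID handling and the OK/WARNING/CRITICAL decision. The function names and the exact status mapping are my assumptions for this illustration.

```python
import re

def parse_stack(port, switch_name):
    """Pull a stack ID out of either side: the LLDP port ("3/0/10" means
    stack 3, port "0/10") or the Patch Manager name ("switch01-3" means
    stack 3 of switch01).  Returns (stack_id, port, switch_name)."""
    m = re.match(r"^(\d+)/(\d+/\d+)$", port)       # stacked port reported via LLDP
    if m:
        return int(m.group(1)), m.group(2), switch_name
    m = re.match(r"^(.+)-(\d+)$", switch_name)     # "-N" stack suffix in Patch Manager
    if m:
        return int(m.group(2)), port, m.group(1)
    return None, port, switch_name                 # not a stacked switch

def check_interface(expected, actual):
    """Return a Nagios-style status for one interface.  `expected` comes from
    the source-of-truth file, `actual` from the parsed LLDP data."""
    if expected is None:
        return "WARNING"    # interface is up but not in our source of truth
    if actual is None or actual.get("link") != "yes":
        return "CRITICAL"   # should be patched somewhere but isn't up
    if (actual["remote_name"], actual["remote_port"]) != (expected["switch"], expected["port"]):
        return "WARNING"    # patched into something different than it should be
    return "OK"
```

The three failure modes listed earlier map directly onto the three non-OK branches.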

Our TDI Workflow

The Nagios checks are the critical part of a much larger workflow which gives us Test Driven Infrastructure when we provision new machines. The workflow follows the steps below roughly, and I go into each step in more detail in the following sections:
  1. Physical design is done in Patch Manager, including placement in the rack and patching connections
  2. Connections are exported from Patch Manager into a format that our Nagios servers can parse easily
  3. Logical design is done in Puppet - Roles are assigned and necessary data is put in Hiera
  4. Hardware is physically racked and the management patches are put in first
  5. Server is kickstarted and does its first Puppet run; Nagios updates itself and begins to run checks against the new server
  6. Engineers use the Nagios checks as their test results, fixing any issues
As you might have deduced already, the workflow is not perfectly optimised; the "tests" (Nagios checks) come from Puppet, so you need a machine to be installed before you get any test output. Also, we need at least some patching done in order to kickstart the servers before we can get feedback on any of the other patching.

Physical Design in Patch Manager

We use Patch Manager's Software-as-a-Service solution to model our physical infrastructure in our data centres. It is our source of truth for what's in our racks and what connections exist between devices. Here's an example of a connection (well, two connections really) going from Gb1 in a server, through a top of rack patch panel, and into a switch:



Exporting Patch Manager Connections

Having all our Nagios servers continually reach out to the Patch Manager API to search for connections is wasteful, considering that day to day the data in Patch Manager doesn't change much. Instead we export the connections from Patch Manager and at the same time filter out any intermediate patch panels or devices we don't care about - we only want to know about both ends of the connection. Each Nagios server has a copy of the "patchplan.txt" file, an easy to parse CSV that looks like this:

server01,em1,switch01,0/31
server01,em2,switch02,0/31


Logical Design In Puppet

As part of creating the new server in Puppet, the networking configuration is defined and modelled in line with what has been planned in Patch Manager. So for example, if a Dell server has its first two on-board NICs connected to management switches in Patch Manager, somewhere in Puppet a bonded interface will be defined with NICs em1 and em2 as slaves (which are the default on-board NIC names on a Dell server).

How we model our logical network design in Puppet is covered in much more detail here.

Hardware is Physically Racked

Obviously someone needs to go to the data centre and rack the hardware. If it's a large build it can take several days, or weeks if there's restricted time we can work in the data centre (like only on weekends). We try to prioritise the management patching first so we're able to kickstart machines as quickly as possible.

Kickstarts and Puppet Runs

Once a new server has done its first Puppet run and its catalog is compiled, a set of exported Puppet resources that describe Nagios checks for this server are available for collection. The Puppet runs on our Nagios servers collect all these resources, turn them into the relevant Nagios configuration files, and begin running these service checks.

Make the Red and Yellow go Green

Since this is a newly built server, it's expected that a lot of the validation-style Nagios checks will fail, especially if only the management networks are patched but our Puppet code and Patch Manager expect other NICs to be connected. Our engineers use the Nagios check results for the new server as the feedback loop in our Test Driven Infrastructure approach to provisioning new servers - make the tests pass (make the red and yellow go green) and the server is ready for production.

Monday, January 30, 2017

Puppet Networking Example

The majority of our Puppet modules - and I think most organisations that adopted Puppet over 5 years ago are in the same boat - are nothing to be proud of. They were written for a specific internal purpose, for a single operating system, and have no tests. They were written very quickly to achieve our purpose. These internal modules aren't really worth sharing, as there are much better modules on the Forge or GitHub. Over the years I've been slowly retiring our home grown modules in favour of superior, publicly available ones, but it's a long process.

There are a couple of our internal modules that I am quite proud of, though. One of them is our Networking module, which we use to write the configuration files that describe the interfaces, bonds, VLANs, bridges, routes and rules on our Red Hat derived systems. Our Networking module allows us to quickly define an interface for a VM with a piece of Hiera config if we want something quickly, but its real strength comes from how we use it to model the defence-in-depth networking architecture of our platform.

The module's not perfect, but we've been able to largely abstract our logical network design from how we implement it physically. Our Puppet roles and profiles describe themselves as having "Application Networks", and the module takes care of what that looks like on servers in different environments - perhaps it's an untagged bonded network in production but VLAN tagged in staging with a completely different IP range.

Here is the module + accompanying documentation on GitHub, along with the first few paragraphs of the Preface.


LMAX-Exchange/puppet-networking-example

This is not a "real" Puppet module - it's not designed to be cloned or put on the Forge. It even refers to other Puppet modules that are not publicly available. In fact, if you did blindly install this module into your infrastructure, I guarantee it will break your servers, eat your homework, and kill your cat.

This module is a fork of LMAX's internal networking module with a lot of internal information stripped out of it. The idea behind releasing this is to demonstrate a method of abstracting networking concepts from networking specifics. It is designed to educate new LMAX staff, plus a few people on the Puppet Users list who expressed some interest. The discussion in Puppet Users thread How to handle predictable network interface names is what motivated me to fork our internal module to describe it to other people.

I'm now going to fabricate a scenario (or a "story") that will explain the goals we are trying to reach by doing networking this way in Puppet. While the scenario's business is very loosely modelled on our own Financial systems architecture, the culture and values of the Infrastructure team in the scenario match our own Infrastructure team much more closely - which is how our Puppet networking evolved into what it is now.

If the scenario sounds completely alien to you - for example if you run a Cloud web farm where every instance is a transient short-lived VM - then the design pattern this module is promoting probably won't be that helpful to you. Likewise if you are a one-man Sys Admin shop then this level of abstraction will read like a monumental waste of time. If however you run an "enterprise" shop, manage several hundred servers and "things being the same" is very important to you, then hopefully you'll get something from this.

Saturday, January 14, 2017

Leaping Seconds

Just before New Year 2017 a leap second was inserted into Coordinated Universal Time (UTC). At LMAX we had some luxury to play with in how we handled the leap second: January 1st is a public holiday and there's no trading, so we were free to do recovery if something didn't go according to plan. This blog post is an analysis of the results of various time synchronisation clients (NTP and PTP) using different methods to handle the leap second.

Some Research

Red Hat have a comprehensive article about the different clock synchronisation software they support on their Operating Systems and each one's capabilities. The section "Handling of the Leap Second" is especially worth a read to understand the various options and which ones would be applicable to you.

Since there's no financial trading on New Year's Day, this event became a real "live test" opportunity for us. We were able to consider all the available methods for correcting time. If the leap second had been inserted in the middle of the year (June 30th), chances are the next day would have been a working day (and in 2016 it was) and we'd have had fewer options to consider.

Our platform code assumes that time never goes backwards - it is always expected to go forwards at some rate. If it does go backwards, our application logic simply uses the last highest time it saw until the underlying clock source has progressed forwards again. In other words, our platform's view of time will "freeze" for one second if the clock is stepped back for one leap second.
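That "last highest time" behaviour can be sketched like so. This is a toy illustration of the idea, not our platform code:

```python
import time

class MonotonicView:
    """A clock wrapper whose view of time never goes backwards: if the
    underlying clock is stepped back (e.g. a leap second), keep returning
    the highest time seen until the real clock catches up again."""
    def __init__(self, clock=time.time):
        self._clock = clock
        self._highest = float("-inf")

    def now(self):
        self._highest = max(self._highest, self._clock())
        return self._highest

# Simulate the clock being stepped back after reaching t=100.0
samples = iter([99.0, 100.0, 99.5, 99.9, 100.2])
view = MonotonicView(clock=lambda: next(samples))
result = [view.now() for _ in range(5)]
print(result)  # → [99.0, 100.0, 100.0, 100.0, 100.2]
```

The view "freezes" at 100.0 while the underlying clock replays the stepped-back second, exactly as described above.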

During trading hours this can be a problem.  For previous leap seconds we've ignored the event and let NTP handle the clock drift naturally.  The Red Hat page describes the clock being off for "hours" when you use this method; from our past experience it's more like days.  Ideally we want clock synchronisation to recover rapidly and we want time to always progress forwards - the "client slew" method.

Most of our platform uses the tried and tested NTP daemon for clock synchronisation.  The standard NTP daemon doesn't have a fast slewing option; only Chrony can do this.  Upgrading to Chrony before the leap second event wasn't an option for us unfortunately, so our hand was forced to use the "daemon step" method for this leap second.  We judged it safer than the kernel step method (less likely to trigger kernel bugs) but we knew our platform code needed to be tested heavily.

Some of our platform uses PTPd, and it's due to be rolled out more widely soon.  PTPd's built-in help describes its leap second handling methods:

setting: clock:leap_second_handling (--clock:leap_second_handling)
   type: SELECT
  usage: Behaviour during a leap second event:
         accept: inform the OS kernel of the event
         ignore: do nothing - ends up with a 1-second offset which is then slewed
         step: similar to ignore, but steps the clock immediately after the leap second event
        smear: do not inform kernel, gradually introduce the leap second before the event
               by modifying clock offset (see clock:leap_second_smear_period)
options: accept ignore step smear
default: accept

Personally I was interested in knowing how quickly PTPd could bring the clock back into sync if we simply ignored the leap second and let its normal error correction mechanism slew the clock. This would probably be our preferred method if a leap second were introduced during trading hours.

NTP Planning and Expectations

The plan was to have NTP step and PTP ignore the leap second.

Telling NTPd to step the clock is simple - we just needed to remove the "-x" flag from ntpd - but we had to make sure our platform code would handle it.  To do this we isolated one of our performance test environments and set up a fake stratum 1 NTP server by fudging a stratum 0 clock source.  The configuration for this fake NTP server is:

restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict 127.127.1.0
restrict -6 ::1
restrict 10.101.0.0 mask 255.255.0.0 notrap nomodify
server 127.127.1.0
driftfile /var/lib/ntp/drift
fudge 127.127.1.0 stratum 0
leapfile /etc/ntp/leap-seconds.list

We set the fake NTP server's system clock to Dec 31st 23:45:00, force-sync'd all performance machines to this NTP server, then started a performance run.  This particular run generally takes 10 minutes to get going, so by 23:59:59 the environment would be running its normal performance load, which is a simulation based on real production traffic patterns. This is one of the best tests we can come up with to simulate what would happen if the leap second occurred during business hours.

This leap second test was repeated a number of times and, as expected, the timestamp 23:59:59.999 was used for the second time the clock ticked over 23:59:59. Once the clock moved to 00:00:00 the exchange time progressed forward normally.

PTP Calculations

We wanted to test PTP slewing the clock post leap second, which is the method we'd consider if a leap second occurred during trading hours. We know that NTP can take a long time to recover from a leap second. The inbuilt PTP configuration docs describe three options, which we set to slew the clock and improve recovery speed:

ptpengine:panic_mode=n
clock:leap_second_handling=ignore
clock:max_offset_ppm=1000

The first option is to stop the PTP daemon entering panic mode, which can result in the daemon stepping the clock (we want to avoid steps).

The second option simply tells PTPd to ignore the leap second from upstream, which will begin the slewing process after the leap second event occurs.

The third option sets the maximum frequency shift of a software clock. It's measured in parts per million, where 1 ppm is a shift of 1 microsecond per second. A value of 1000 means we should be able to recover the clock by 1 ms every second, which is 1000 seconds to recover from the leap second event.

There is also a default setting "clock:leap_second_pause_period=5", which makes the PTP daemon stop clock updates for 5 seconds before and 5 seconds after the leap second event, basically as a safety measure.

1000 seconds is 16 minutes and 40 seconds; adding the 5 second pause period, we estimate that our PTP disciplined server clocks should be back in sync by 00:16:45 on January 1st.
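The arithmetic behind that estimate as a quick sanity check (the helper name is mine, just for illustration):

```python
def slew_recovery_seconds(offset_s, max_ppm):
    """Seconds needed to slew away a clock offset at a given frequency-shift
    limit: 1 ppm of shift corrects 1 microsecond per real second."""
    return offset_s * 1_000_000 / max_ppm

slew = slew_recovery_seconds(1.0, 1000)   # 1 s leap second at clock:max_offset_ppm=1000
pause = 5                                 # clock:leap_second_pause_period after the event
print(slew, slew + pause)  # → 1000.0 1005.0
```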

What Actually Happened: NTP

The actual leap second event was, overall, fine. For the NTP disciplined servers, the testing of our code held up and, as expected, our platform stopped processing for 1 second until real time caught up with its view of time. If we look at the clock drift of one of our NTP disciplined servers at this time, there's no perceivable clock drift after Sunday 00:00 (the scale of the graph is in microseconds):



A much more interesting graph comes from a non-production machine that didn't pick up the NTP configuration change that removed the "-x" flag. On this hardware NTPd ignored the leap second and disciplined the clock using its normal algorithms:



If you look at the X axis, it takes almost 12 hours for this NTP daemon to get remotely close to zero, and even after that it's not until Monday 12:00 that the system clock is within a 10ms offset. This behaviour fits our observations during the previous leap second - it took much longer to recover than we expected.

The ntpd man page says the maximum slew rate the Linux kernel allows is 500 ppm, so it will take a minimum of 2000 seconds for NTP to correct 1 second of inaccuracy. What we're looking at here, though, is days. While we will be moving almost all platform servers to PTP, we will still use NTP in our estate, and thus I'd like to understand the above behaviour. We haven't done much research into improving NTP recovery times, but I'd be surprised if there's not a way to tune the daemon to bring this down significantly.
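The same ppm arithmetic gives that lower bound directly (helper name assumed for illustration):

```python
KERNEL_MAX_PPM = 500  # maximum slew rate ntpd can ask of the Linux kernel

def min_correction_time_s(offset_s, ppm=KERNEL_MAX_PPM):
    """Lower bound on slewing away an offset: at 500 ppm the clock gains or
    loses at most 500 microseconds per real second."""
    return offset_s * 1_000_000 / ppm

print(min_correction_time_s(1.0))  # → 2000.0 (seconds, i.e. roughly 33 minutes)
```

So anything beyond ~33 minutes of recovery is down to the daemon's own discipline algorithm, not the kernel limit.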

A simpler option of course is to just replace ntpd with chronyd. Chrony supports a client-slew method and, while I don't have any hard data, Red Hat describe chronyd's leap second recovery as taking "minutes".

What Actually Happened: PTP

I calculated it would take a little over 16 minutes for PTP to bring the clock back into sync. It actually took 45 minutes. When using PHC hardware timestamping, the PTPd daemon manages several clocks. The master clock is whatever interface the PTP signal is coming over, and then there's the system clock, which is a slave of the master clock. If the configured interface is also a bonded interface, then any non-active interfaces are also managed as slave clocks.

Slave clocks are synchronised from the master clock using the same algorithm and rate limits, but more importantly slave clocks are not synchronised until the master clock is stable (i.e. the LOCKED state). So what actually happens is the master clock - which from our graph below is the PHC device attached to interface em1 - synchronises its time to the upstream PTP master clock first, and only once it is in sync do the rest of the slave clocks in the server start to be disciplined:



This is why the offset of em1 begins to track back into sync a little after 00:01:00. em2 and the System clock only begin to synchronise after 00:10:00, once em1 is LOCKED.  Why are the NIC clocks synchronising faster than the System clock though?

PTPd has the "clock:max_offset_ppm_hardware" setting which defaults to 2000ppm, which is also the daemon's maximum. This means it will take 1000000/2000/60 = 8.33 minutes to correct one second of offset. However the System clock is a software clock, whose rate is controlled by the "clock:max_offset_ppm" option which we specifically set to the maximum value of 1000ppm. The system clock should be recovering by 1ms every second but it's actually taking 2 seconds to recover 1ms, clearly seen in the slope of the graph if you zoom in (see below):
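The two rate caps translate into recovery times like so (same 1000000/ppm/60 arithmetic as in the text):

```shell
# Minutes needed to correct 1 s of offset at each ptpd rate cap:
# hardware clocks capped at 2000 ppm, system clock at our 1000 ppm setting.
awk 'BEGIN {
  printf "hardware (2000 ppm): %.2f min per second of offset\n", 1000000 / 2000 / 60
  printf "software (1000 ppm): %.2f min per second of offset\n", 1000000 / 1000 / 60
}'
```

So even if our 1000ppm setting had worked, the system clock would still take twice as long as the NIC clocks to slew the same offset.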



It looks like our value of 1000 for "clock:max_offset_ppm" didn't do anything. Wojciech Owczarek provided the answer - it is a known issue with the version of PTPd we're running. Support for slewing system clocks above the kernel maximum of 500ppm isn't finished yet, but will be in the final version.

While it's not as fast as I'd predicted, PTPd recovery is a lot faster than our NTP recovery.  We still want to know why our standard NTP recovery time is measured in days rather than hours, but that's less important if we move to Chrony for NTP.

Wednesday, October 5, 2016

How computer clocks work and how they are adjusted

A month or two ago I was asked by someone in our Operations team what clock synchronisation is and why we need to do it. I gave them a very basic few sentence answer. That got me thinking that I never read an easy explanation when I myself got started in this area, and the terminology used can be confusing if it's the first time you come across it. Below is a copy-paste out of our internal documentation where I attempt to explain computer clock synchronisation and the reason for it. The explanation is aimed at non-technical people; it skips over many intricacies and is "wrong" from certain points of view, but hopefully it describes the problem and solution accurately enough.

So here it is...

There is some terminology for time synchronisation that you have to get your head around (like Step, Slew, and Frequency), and to get that you have to understand how a "Clock" is implemented in a computer.

A computer only knows how to work with digital signals. Imagine a small device (called an Oscillator) where if you pass an electrical signal into it, it sends out pulses of electrical signals at constant intervals. If these intervals are small enough, we can approximate the passage of time versus the number of signal pulses.

For the sake of easy mathematics let's assume that 6000 pulses means 1 minute has passed, so that is 100 pulses per second, or 100 Hertz (Hz). If we put one of these devices onto a motherboard and send an electrical signal through it, and count the number of signals coming out, the computer will be able to know how much time has passed.

That's nice, but we need to know what the current time actually is. Knowing how many seconds have passed is not helpful if I don't even know what year it is. I can ask a clock that I trust what the real time is and just set my time to what it tells me. This is called a clock STEP - where the clock is forcibly set to a given value. Fantastic, my computer can now draw the correct time on my Desktop Clock.


Now imagine that like all computer components, the Oscillator is mass produced cheaply and never perfect. So even though it's supposed to send 100 pulses every second, my one actually only sends 90 pulses per second due to a manufacturing discrepancy, so my computer clock ends up running slow (when it counts 100 pulses, more than 1 second of real time has passed). In this case I can't trust the passage of time on my own machine. I can solve this problem by asking the trusted clock twice. After I step the clock the first time, if I wait what I think is 1 second and ask the same trusted time source what the time is I'll be able to figure out that I am 1/10 of a second slow, or, my OFFSET or DRIFT is 0.1 seconds. From there I can figure out that I need to count 10 fewer pulses to get to one second. If I now "remember" that I need to count 10 pulses less and step the clock again... Great, I have now compensated for my Oscillator defect and my desktop clock runs at the correct speed.


Now imagine that, like all computer components, the Oscillator behaves differently under different conditions even though we don't want it to. Maybe if it gets hotter it sends pulses faster, and if it gets colder it sends pulses slower. Or maybe, due to more manufacturing faults, the pulses aren't really constant even though they're supposed to be.

This means that my +10 frequency adjustment is only correct for a single moment. Since I've got no idea how stable the Oscillator pulses are, in order to keep time as accurate as possible I have to constantly be referring to a clock that I can trust and adjust the frequency as needed. Maybe every time I ask it turns out I'm between +0.1 and -0.1 seconds out, so my frequency adjustment range is +/- 10 pulses. I don't particularly want to step the clock each time though, that can be a bit brutal to the programs running on my computer - time would appear to be passing in "jumps". What I can do though is count more or less pulses to catch me up to where time is supposed to be, so the time will SLEW in the right direction.

For example if I am 0.1 seconds behind after one second passes, if I count 10 fewer pulses I will still be 0.1 seconds behind (I've compensated for the drift but not caught up). Instead I'm going to count 20 fewer pulses, so that 1 second from now I should be at the exact same point as my trusted clock. This is nicer to my computer programs than stepping the clock all the time; it is a smooth FREQUENCY ADJUSTMENT.
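The pulse arithmetic above can be written out explicitly (using the toy numbers from this example, not a real clock):

```shell
# Toy slew calculation for the 100 Hz oscillator example above.
awk 'BEGIN {
  nominal = 100   # pulses the clock counts per "second"
  drift   = 10    # pulses/s the oscillator is off by
  behind  = 0.1   # seconds of offset accumulated so far
  # drift compensation stops us falling further behind;
  # behind * nominal extra pulses work off the backlog in one second
  printf "count %d fewer pulses for the next second\n", drift + behind * nominal
}'
```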

I have to make these adjustments all the time though because my Oscillator varies, so what generally ends up happening is I adjust too fast and overshoot the real time, so I slow down, undershoot, speed up again, and so on. If I were to draw my clock adjustments as a graph it would end up looking a bit like a "wave" pattern. It's not a big deal though because we're only talking about a couple of pulses each time, and my desktop clock is generally hovering around the real time.


I can now keep time accurately, but I have to be in constant communication with the trusted clock, and make constant frequency adjustments. Since I know my Oscillator can lose up to 1/10 of a second every second, that means my computer clock can run freely (or FREERUN) for 10 of my seconds before I could be 1 second out. Let's say I don't care about seconds, only minutes, so I can free run for at most 600 seconds (or 10 minutes) before I would have to assume my clock could be 1 minute inaccurate. Maybe I do updates once every 10 seconds then, that is enough to keep me happy.


One Oscillator pulse is about 100 milliseconds in my pretend computer, so I can only accurately time stamp an event with 100 millisecond accuracy. There's no other faster signal to measure by in my PC, so it's physically impossible to tell the difference between events that happen at 0.01 and 0.02 seconds. Luckily real computer oscillators have much higher frequencies than 100 Hertz, they are in the range of hundreds of MHz (nanosecond accuracy). The same problems still exist though - they need constant adjustment to keep them in line, otherwise they go free running off in some direction.

Wednesday, May 18, 2016

Presenting at ITSF's MiFID II Workshop

The International Timing & Sync Forum (ITSF) and Executive Industry Events (EIE) have put together a workshop on MiFID II UTC Traceability specifically for the Finance sector. There will be quite a lot of knowledge in the room: Wojciech Owczarek (the developer of PTPd) is presenting, and Neil Horlock of Credit Suisse - a fountain of information regarding the regulations and how an organisation may satisfy them - is chairing a session.

I will also be presenting my own findings in the MiFID II Clock Synchronisation space, specifically regarding PTP distribution designs and the monitoring around it, plus where I'd like to get to in the future. It should be a lot less Sales-y than other events, like STAC, so if you're in Fin Tech and want to come down for a chat, I encourage you to register.

Friday, April 8, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 7)

In this series we are attempting to solve a clock synchronisation problem to a degree of accuracy in order to satisfy MiFID II regulations, and we're trying to do it without spending a lot of money.

So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB, and
  4. Finished recording metrics
  5. Drawn lots of graphs with Grafana and found contention on our firewall
  6. Tried a dedicated firewall for PTP
The start of 2016 opened up a few new avenues for this project. Both Solar Flare and the developer of PTPd showed an interest in what we were up to, and both suggested alternative designs to what we've tried so far. What I'll look at first though is a purely switched design, ie: no firewall. This is for two reasons. First, the PTP Bridge's mean variance is 100ns from the Grand Master; if I can get the clients close to this, it's a good achievement. Second, I've had to return the firewall I loaned from our Network Engineering team, and I've had to find myself two more clients for testing with (in other words I need to rebuild most of my test bed).

Multicast out multiple interfaces

Here's an adjusted design diagram of what we're doing, emphasising that the bridge is now publishing individual streams of PTP onto more than one LAN:



With Puppet it's pretty easy to turn one ptp4l instance and one supervisord program into two. In fact the amount of configuration needed has reduced quite a lot. We don't have any multi-homed routing issues, so we don't need any firewall marking for policy based routing on the PTP Bridge. A Puppet run and restart of supervisord now has the PTP Bridge multicasting out two separate interfaces.

My PTP clients are now receiving PTP multicast, but they are not getting any Delay Responses back:

2016-03-15 18:03:44.968187: warning: failed to receive DelayResp for DelayReq sequence number 0
2016-03-15 18:03:45.280687: warning: failed to receive DelayResp for DelayReq sequence number 1
2016-03-15 18:03:45.843188: warning: failed to receive DelayResp for DelayReq sequence number 2
2016-03-15 18:03:45.843201: warning: failed to receive DelayResp 3 times in hybrid mode. Reverting to multicast mode.

Another problem: a client on one network is getting the multicast from both interfaces. This should be impossible, as there shouldn't be any multicast routing or Rendezvous Point here, but our symptoms suggest there still is:

17:59:31.057579 IP 192.168.0.4.319 > 224.0.1.129.319: UDP, length 44
17:59:31.057626 IP 192.168.0.4.320 > 224.0.1.129.320: UDP, length 44
17:59:31.261173 IP 192.168.1.4.319 > 224.0.1.129.319: UDP, length 44
17:59:31.261227 IP 192.168.1.4.320 > 224.0.1.129.320: UDP, length 44

The double multicast feeds are messing with the client - it will only be able to Unicast back to one of them (the one on the local subnet) so the sfptpd daemon is rightly getting confused about who its master should be. The Best Master Clock algorithm won't help in this circumstance as the clients are only able to communicate to one of the Masters by design, and the BMC algorithm is probably telling them to use the wrong one.

A quick look around our older software firewall that we used a few blog posts ago shows that we've still got RP config on there and the firewall is doing IGMP Querying on various subnets. Turning this off stops the second flow of multicast, but it also cuts off our local flow; the switch has stopped forwarding multicast to the clients. The switch also says it knows nothing about any PTP multicast group membership.

We always turn on IGMP Snooping here, so the switch maintains a list of ports to forward multicast to rather than broadcast out all multicast packets on the local subnet. The issue above arises because there's no IGMP Querier on the subnet any more - it used to be the firewall but we've turned that off - so the switch doesn't "know" where to send the multicast any more. We have an old home-grown fix for this, a little daemon written in C that we call 'igmpqd' that periodically broadcasts out IGMP Queries to force all the clients to reply. If we configure and start igmpqd on the PTP Bridge, very soon we have our switch aware of multicast group membership and PTP is flowing again.

The graphs below show a few minutes of offset tolerance from both clients. The top set is PTP through the ASIC firewall from my last post, the bottom set is the switch-only design:



The results look good, ±2μs is the best I've seen so far. The results look even better if you take a look over several hours:



Once again the top graph through the ASIC firewall, and the bottom is our most recent results through just a switch. What's immediately apparent is the large reduction in spikes when we're not running through the firewall.

Before we jump to any conclusions (or go running to the Networking team with a big "I told you so" grin on our faces), let's remind ourselves that we've had to use two different clients from what we had before. While the clients themselves are the same hardware as each other at this moment, they are different to the previous pair of clients. They are in the same rack though, so they are not physically further away than the previous clients (think switch hops).

The last big thing that could be different is the way the clients were configured - without having the old clients in front of me now it's impossible to tell whether they were actually busy running something that was interfering with our results, rather than idle. Looking at the new results I now have a sneaking suspicion that my previous clients were in fact running some part of our production workload. The down side of doing this work on the cheap without a dedicated test bed is you risk running into these repeatability problems...

Anyway, we have some evidence (whether it is as credible as it looks or not) to support the theory that the switch-only design is "better". But why are we not a lot closer to the 100ns accuracy that the PTP Bridge is reporting it's getting from the Grand Master Clock?

Running with Wrong Assumptions

Last post I said "The PTP Bridge generally keeps a baseline of 100ns accuracy" and I then went running with the assumption that we should be able to achieve a similar accuracy on the clients. Turns out this idea was flawed. The 100ns as a value is correct, but this is a measurement based on the hardware timestamp of the ingress of the PTP packets from the GM. To then go and assume that it would take another 100ns to get to a PTP Client ignores quite a lot of the other moving parts. I've extended a diagram I made previously that showed how each of the PTP Bridge components interact to show exactly what is and what is not in that 100ns from GM to Bridge:




As you can see, the 100ns of ingress is only a small part of the system. This does not account for the time spent synchronising the ingress PTP Hardware Clock time to other interface PHCs, and it's during these times that we can be hit by scheduling delays or other interruptions. The phc2sys program outputs statistics for this though, so let's look.

Here's a sample of the output from one of the "multicast" phc2sys processes, which is reading from the PHC that is receiving PTP from the grand master and writing to the PHC that is sending multicast to the client:

phc2sys[11830930.330]: phc offset       -11 s2 freq  +34431 delay   4959
phc2sys[11830931.330]: phc offset         1 s2 freq  +34439 delay   5327
phc2sys[11830932.331]: phc offset       -27 s2 freq  +34412 delay   5711
phc2sys[11830933.331]: phc offset         1 s2 freq  +34431 delay   5599
phc2sys[11830934.331]: phc offset        21 s2 freq  +34452 delay   4895
phc2sys[11830935.331]: phc offset      -120 s2 freq  +34317 delay   5695

Over time it's consistently around ±100ns for every phc2sys process running. Our experience with performance tuning our Java applications makes me distrust these numbers at first; they sound too small to be true. Looking at the phc2sys man page and code, the program takes a -N option which controls how many times it reads the master clock, and then chooses the values that were read the fastest, thus trying its best to reduce any scheduling or bus delays. So perhaps the numbers are more accurate than I originally thought, but I lack an independent reference time source in the PTP Bridge to verify them.
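The sample-selection idea can be illustrated with the log lines above: of the six readings, keep the one with the smallest read delay. This mimics the intent of -N, not phc2sys's actual code:

```shell
# Pick the clock reading with the smallest read delay, as phc2sys -N
# aims to do; delay/offset values taken from the log excerpt above.
awk 'BEGIN {
  n = split("4959 5327 5711 5599 4895 5695", delay, " ")
  split("-11 1 -27 1 21 -120", offset, " ")
  best = 1
  for (i = 2; i <= n; i++) if (delay[i] < delay[best]) best = i
  printf "fastest read: delay %d ns, offset %d ns\n", delay[best], offset[best]
}'
```

Note the -120ns outlier in the sample came with one of the slower reads, which is exactly the kind of measurement this filtering is meant to discard.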

Hardware Timestamping on a Client

I've wanted to try hardware timestamps on a client for a while now. It might also bring the design closer to sub microsecond accuracy. There is a development branch of PTPd that supports the Linux PHC API on bonded interfaces, so we're going to give it a go. Building it into an RPM is pretty easy, there's an rpmbuild wrapper script provided. If you are going to try it out I would suggest you add --enable-runtime-debug to the ./configure lines in the RPM Spec file, which allows you to set various levels of debug logging in the configuration file.

To get this done very quickly I haven't done any Puppet work for PTPd, it's all configured by hand (gasp!). I will do it later when I try PTPd as the replacement software on the PTP Bridge. Turning off sfptpd and turning on ptpd is simple - just need to set which interface, the mode, and turn on stats logging, the rest of the defaults in /etc/ptpd.conf are fine:

ptpengine:interface=bond1
ptpengine:ip_mode=hybrid
global:statistics_file=/var/log/ptpd.stats

With the daemon running it is consuming multicast PTP from the PTP Bridge. There is a statistics file that reports the most recent metrics in a nice to read format, however at the moment I prefer to plot every data point, so will be looking at the statistics log, which is in CSV format:

2016-04-01 15:38:44.759881, slv, 001b21fffe6fa06c(unknown)/1,  0.000000000,  0.000012141,  0.000000000,  0.000024282, 550.827026367, S, 05623, 0.000000000, 0, 0.000000000, 0 0.000024282,  0.000000000

Parsing this file should be easy, it is comma separated and there's even a tail_csv plugin for collectd. Unfortunately tail_csv got added in collectd-5.3, and I've got the old CentOS 6 collectd-4.1. After finding a 5.5 RPM for CentOS 6 I wasn't able to get the tail_csv plugin working (Segfault) so gave up after a short amount of time.

I've been following the Influx Data projects, so wanted to see if Telegraf, their metrics collector, could do what I want. Unfortunately they also don't have a "tail" like plugin (there is a feature request), only an Exec plugin, so wouldn't quite work for what I wanted to do. So back to collectd's normal tail plugin. Writing and debugging non-Perl regular expressions for collectd turned out to be like pulling teeth - I seemed to be making what should be a simple problem unnecessarily difficult.

I have a stream of CSV lines that I want to extract certain fields from and then shoot them off to Influx; why I didn't think of Tail + Awk before I'll never know. I already have an Exec job in collectd that prints out the NTP offset every second, I can use the same approach to tail the PTPd stats log, send it to Awk for easy field splitting and then output the keys in collectd format that will get sent to Influx. Here's the Awk script:
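A minimal sketch of such a tail-plus-awk pipeline follows; the collectd type name, interval, and field position are my assumptions for illustration, not necessarily what the original script used:

```shell
#!/bin/sh
# Hypothetical sketch: tail the PTPd stats CSV and emit lines in
# collectd's Exec plugin PUTVAL format. Field 5 (offset from master,
# in seconds) and the type "gauge-offset_ns" are assumptions.
tail -F /var/log/ptpd.stats | awk -F', *' -v host="$(hostname)" '
/^[0-9]/ {
  # convert seconds to nanoseconds for readability in Grafana
  printf "PUTVAL \"%s/ptpd/gauge-offset_ns\" interval=1 N:%.0f\n", host, $5 * 1e9
  fflush()  # collectd reads stdout line by line; do not buffer
}'
```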

It uses my custom Collectd types, and converts from PTPd's unit of seconds into nanoseconds. This is purely a cosmetic thing - Grafana is good at upscaling metrics (if you have 1000ms it will say 1s) but bad at downscaling (0.00000012s is hard to read).

There's a small side effect with using Collectd's Exec plugin: it wakes up every second, which ends up sending multiple data points per second all at the same time, rather than an even spacing of 4 Sync messages per second. This is not a huge problem because a lot of the Grafana graphs will do an implicit Mean() over a given time period, and my Tolerance graphs use Min() and Max() functions. If we ever look at the raw data though it would look rather strange. I keep saying I will move away from Collectd, but the streaming-style Exec plugin is just too easy to use at the moment.

Delay Messages, and PTP on cheap NICs

There's a problem indicated in the PTPd stats line above - the fourth column, One Way Delay, and the sixth column, Slave to Master, are always zero. Slave to Master communication is the Delay Request / Response messages, so there's something wrong with the Slave sending Unicast back to the Master (Unicast because we are in hybrid mode). This mode works with sfptpd on this machine, so architecturally there is no problem (no routing issues, etc). If we tcpdump we can see the Delay Request leave the Slave and we get a Delay Response from the Master, so there's no networking problem.

If we turn on runtime debugging (hence the --enable-runtime-debug RPM build option above), we see that PTPd is attempting to handle a Delay Response message, but it doesn't think it's waiting for a response so drops it:

2016-04-05 15:23:16.947792 ptpd[101366].bond242 (debug1)    (slv) ==> Handle DelayResp (9)
2016-04-05 15:23:16.947797 ptpd[101366].bond242 (debug3)    (slv) Timer DELAY_RECEIPT start requested at 20.   0 sec interval
2016-04-05 15:23:16.947804 ptpd[101366].bond242 (debug2)    (slv) timerStart:     Set timer DELAY_RECEIPT to 20.000000
2016-04-05 15:23:16.947811 ptpd[101366].bond242 (debug1)    (slv) Ignored DelayResp sequence 9 - wasn't waiting for one

This appears to be a software issue, so I got in touch with Wojciech Owczarek, the developer of PTPd. The first diagnosis is that PTPd had an issue reading the packet transmission timestamp of the Delay Request message, and thus was not expecting a Delay Response. This is almost always caused by PTPd being unable to read the TX hardware timestamp from the NIC, which is almost always caused by the NIC not providing a hardware timestamp for the Unicast Delay Request packet. This is strange, as "ethtool -T" says this Intel i350 GbE Controller supports hardware timestamps...

High grade network cards designed for packet capture workloads (like Solar Flare 7322s) have full hardware timestamping support, as in they will provide a NIC generated timestamp for every packet. This obviously requires a bit more silicon than normal and drivers to support the extra timestamp data as well as packet data, and so these high end cards are expensive. Many other server grade NICs support hardware timestamps, but for PTP packets only, and the cards themselves enforce this.

Exactly what a certain model of card does can be difficult to determine - the Intel Data Sheets are largely beyond my ability to interpret. There is a good white paper that talks about the hardware timestamping capabilities of some Intel NICs in a sensible way (Section 6). One of the methods cheaper NICs use to enforce hardware timestamps only on PTP packets is a hardware packet filter on certain MAC addresses. Since the Multicast address for PTP is a standard address, a NIC can filter on this and provide hardware timestamps just for PTP into a single register, and the driver can just read from that single register to get the packet timestamp. Since PTP is generally no more than a couple of messages a second, one register is enough. My guess is it's a great deal cheaper for NIC manufacturers to add a single register for PTP support than implementing a high grade "timestamp every packet" solution.

Wojciech's diagnosis was that this was what was happening here - the Intel I350 was not providing a hardware timestamp for the Unicast PTP Delay Response message because the packet didn't match the hardware filter. His suggestion was to change from hybrid mode back to multicast mode. This took all of seven seconds and after a restart we now have stats about the Slave To Master offset and PTPd appears to be operating normally.

Interference

PTPd gives you a little bit more information than sfptpd. It will report the Master to Slave offset and the Slave to Master offset, which are the two offsets used to calculate One Way Delay (reported by sfptpd as Path Delay). While the values of these two metrics are largely useless when looked at individually, compared together over a period of time they can indicate a problem with PTP. After collecting metrics for a few minutes, an interesting pattern started to emerge regarding the Slave to Master offset.

The first graph below shows the Path Delay for Client1 (which is still running sfptpd for comparison): it is consistently around 31μs ±1μs, the number it has been at for months, so nothing surprising there:


The Master to Slave offset of Client2, now running PTPd, is pretty stable at 35μs. Note that the actual value compared to Client1 here is not important, this value is greatly dependent on network architecture, and it could be that Client2 is taking a few more switch port hops than Client1, depending on which interface is the active member of the bond:


The one that looks strange is Client2 Slave to Master offset. This should be roughly the same as Master to Slave, it's the same path across the switches, but here we see 30μs with spikes up to 50μs:


The Slave to Master offset looks strange, and because these intermediate offsets are part of what PTPd uses to calculate adjustments, our accuracy is slightly affected:


Client1 is ±2μs while Client2 is ±4μs - how is sfptpd with software timestamps performing better than PTPd with hardware timestamps?

The hint is in the variance in the Slave to Master offset. The graph shows us we're getting upwards of 20μs variance in either the egress/TX time from the Slave or the ingress/RX time on the Master, which are the two timestamps that would affect Slave to Master offset. Client1 does not appear to be affected, so probably not the RX time on the master but something to do with Client2.

With the server now suspect, it was pretty quick to find the problem. Somewhere during the development of the extra monitoring and trying out new versions I'd gotten Collectd into a Segfault-restart loop (it was crashing and respawning over and over). My NTP offset sanity check script is a Collectd Exec plugin that "always runs", and it looks like every time Collectd crashed and restarted it spawned a new NTP monitoring script. I had 140 instances of a script running, doing 140 NTP queries every second to an NTP server. This was the cause of the interference. You can probably guess at what time I killed the runaway scripts:




Client2 has now stabilised, but the accuracy is still nowhere near what I was expecting. In fact, it's on par with Client1 in almost every metric taken. Why are software timestamps equal to hardware timestamps?

Hardware Timestamps on a Client, Actually

Something was not right. When Wojciech first told me about the 2.3.3-hwtest branch of PTPd he also gave me a sample configuration file with a number of additional filtering options. I'd been using the defaults, which, aside from the extra filtering, were missing one very important option:

ptpengine:hardware_timestamping=y

I think you can guess what that does. Configuration problem fixed, PTPd is now using the Linux PHC API:

2016-04-07 11:26:05.863773 ptpd[58310].bond242 (info)      (init) bond242 is a bonded interface, current active slave: em2
2016-04-07 11:26:05.863787 ptpd[58310].bond242 (info)      (init) Underlying physical device: em2
2016-04-07 11:26:05.866849 ptpd[58310].bond242 (info)      (init) Hardware timestamping successfully enabled on em2
2016-04-07 11:26:05.867864 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Linux PHC clock driver em2 starting
2016-04-07 11:26:05.867888 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Successfully started Linux PHC clock driver em2 (/dev/ptp1) clock ID ffffff8b
2016-04-07 11:26:05.867895 ptpd[58310].bond242 (notice)    (init) clock: Clock em2 changed state from INIT to FREERUN
2016-04-07 11:26:05.868921 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Linux PHC clock driver em1 starting
2016-04-07 11:26:05.868942 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Successfully started Linux PHC clock driver em1 (/dev/ptp0) clock ID ffffff83
2016-04-07 11:26:05.868955 ptpd[58310].bond242 (notice)    (init) clock: Clock em1 changed state from INIT to FREERUN

After running for a few minutes we're now looking a lot better. Our offset tolerance is ±200ns, the best we've seen so far:


If we look over a longer period of time, PTPd with hardware timestamps is generally a lot more stable, thanks both to the hardware timestamps and to the better outlier filtering it does:


Another interesting change in metrics is the Slave to Master and Master to Slave offsets, we've dropped from 30μs to 21.6μs:


By moving the point in where we timestamp from being in software to being ingress into the NIC, we've taken off 10μs of latency. From here you can observe that the time it takes to go from the NIC, through the kernel and into software is 10μs, which is comparable to the performance gain we see when using kernel bypass stacks like Solar Flare Onload. Not very relevant for clock sync, but it's nice to have independent numbers line up.

This raises a question though: if hardware timestamping was not enabled when we had the problem with PTPd ignoring Delay Response messages, then why was PTPd failing to read the transmit timestamp of the Unicast message? It would mean it was failing to read the software timestamp of a Unicast packet. One for the ptpd project GitHub issues, once this branch of code is merged. It does mean though that we can try Hybrid mode again... Yes, it works, so Unicast TX timestamps do work on the Intel i350 in this server and we're back to using a slightly less network intensive mode of PTP.

Terminology and Language

The MiFID II regulations have forced the Finance sector into areas it may not have cared about before, so there are people, like myself, jumping into the unknown territory of clock synchronisation. One of the problems whenever you get started in new territory is finding what already exists. It is important to remember that other industries have solved these sorts of clock sync problems before, so there's a lot of learning that's already been done and standards that already exist; if we're not careful we'll end up reinventing the wheel and coming up with a whole new set of language for the same thing.

One example is what I've been graphing as Offset Tolerance: the minimum and maximum offset over a set time period, which shows how bad the offset was for a given range of time. Turns out there is a well defined term for this, Maximum Time Interval Error (MTIE), standardised by the International Telecommunication Union over a decade ago.

Guilty. I'll be renaming that, and will endeavour to use similar terms as our Telecom colleagues.

What's Next?

I'm pretty happy with these results. ±200ns is the best we've seen, and it seems very stable; I think for our use case we don't need to try to go any lower at the moment. We've got a proof of concept design, although now I'd really like to go back and try the ASIC firewall again, as I suspect my previous results were tainted by system load. Most importantly we've got a PTP client that supports bonded interfaces and hardware timestamps on the model of NIC controller we have in some of our servers here.

There are a few things I think need to be tackled first before we start looking at a roll out:
  1. PTPd to sync Solar Flare cards - we want to make sure we can receive accurate PTP over a cheap NIC and sync the clocks of expensive Solar Flare capture cards in the same server.
  2. Failure monitoring and recovery - we've seen PTPd work well, what if it fails? It's infeasible to constantly watch hundreds of graphs - how do we alarm about it? Can we recover nicely or automatically? Can we somehow revert to NTP if we're in trouble?
  3. PTP Bridge Resiliency - the PTP Bridge currently uses a mix of linuxptp software (ptp4l and phc2sys), can we use PTPd to replace this and get a little resiliency from bonded interfaces? What if the Bridge dies?
Failure monitoring is probably the most important here before we can look at any sort of internal roll out. The PTP Bridge is missing some Puppet work though, so perhaps next will be a Puppet module for PTPd.

Wednesday, February 10, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 6)

In this series we are attempting to solve a clock synchronisation problem to a degree of accuracy in order to satisfy MiFID II regulations, and we're trying to do it without spending a lot of money.

So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB, and
  4. Finished recording metrics
  5. Drawn lots of graphs with Grafana and found contention on our firewall
In this post we will look at introducing a second ASIC based firewall to route PTP through so that PTP accuracy is not affected by our management firewall getting busy.

The New Design

Here is the adjusted design diagram, it's not that much more complicated:


Routing To Another Firewall

In order for this design to work, every PTP Client must send its PTP traffic back through the PTP firewall rather than the management firewall. Each client needs a routing rule to reflect this. Similar changes are needed on the PTP Bridge to send through the PTP firewall.

With the new firewall in place and the routing changes applied, everything should "just work", but it doesn't. We see multicast leaving the PTP Bridge correctly:

[root@ptp-bridge ~]# tcpdump -i p4p3 multicast -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p4p3, link-type EN10MB (Ethernet), capture size 65535 bytes
16:10:07.519357 IP 192.168.0.4.319 > 224.0.1.129.319: UDP, length 44
16:10:07.519405 IP 192.168.0.4.320 > 224.0.1.129.320: UDP, length 44

Restarting sfptpd on a client, we see it never receives any PTP messages:

2015-12-15 16:01:18.306138: info: using SO_TIMESTAMPNS software timestamps
2015-12-15 16:01:18.406259: notice: Now in state: PTP_LISTENING
2015-12-15 16:01:18.406353: info: creating NTP sync-module
2015-12-15 16:01:30.306177: warning: failed to receive Announce within 12.000 seconds

A packet trace shows that we see PIM announcements from the right firewall, and the IGMPv3 packets from sfptpd are leaving the client:

16:07:15.444343 IP 192.168.0.2 > 224.0.0.13: PIMv2, Hello, length 34
16:07:18.307942 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)
16:07:18.369939 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)
16:07:18.686936 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)

Looking at the multicast groups on the new firewall, we don't have the PTP multicast addresses appearing as registered groups that the firewall can forward.

To cut a very, very long story short, this was a TTL problem compounded by two firewall brands behaving differently. The ptp4l process sets the TTL of its multicast packets to 1 to restrict them to the same subnet. Our management firewall accepted this and did the multicast forwarding anyway; the PTP firewall did not. Setting udp_ttl=2 for the ptp4l process that's doing the multicast fixed this. The lmaxexchange-linuxptp module has been changed to take this option now too.
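For reference, the fix boils down to a single ptp4l setting, shown here as a standalone configuration fragment (in our case the change is applied through the Puppet module rather than by editing the file directly):

```
# ptp4l configuration fragment
# The default udp_ttl of 1 keeps multicast on the local subnet;
# 2 allows it to cross the one routed hop through the PTP firewall.
[global]
udp_ttl                 2
```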

The performance improvement moving from the software firewall to an ASIC firewall is quite apparent. Whereas before our "normal" range of variance was ±20μs (top image), we're now getting ±5μs (bottom image):


Our mean accuracy has improved substantially - at a glance the hardware change has saved 15% of the tolerance the regulations allow.

There are interesting patterns in the most recent data; regular and somewhat uniform spikes in client offset are seen on both clients. The graph below draws both client offsets together:



The spikes are recorded at almost exactly the same second, they last for a few seconds, they appear to come in groups of two and are generally over 9000ns in size. Identifying how regular these spikes are may indicate what the problem is. If they are perfectly regular on both clients, I would start to think it might be something on the firewall, like an administrative or housekeeping function.

We can use an InfluxDB query to get the raw data:

SELECT "time","value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'client1' AND time >= '2016-02-05 09:43:19' AND "value" > 9000

Now with a little bit of Python we can work out the difference in time between spikes:
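The script itself didn't make it into this post, but it's only a few lines. Here's a sketch of what diff.py does, assuming the query output carries the timestamp in the first two whitespace-separated fields; the delta is printed as the fourth field to match the awk pipeline below:

```python
#!/usr/bin/env python
# diff.py (sketch) - print the gap between consecutive spike timestamps.
import sys
from datetime import datetime

def spike_deltas(lines):
    """Yield (timestamp, delta-from-previous) for each parsable row."""
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            # Assumed layout: date in field 1, time in field 2
            ts = datetime.strptime(fields[0] + " " + fields[1],
                                   "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip header and separator lines
        if prev is not None:
            yield ts, ts - prev
        prev = ts

if __name__ == "__main__":
    for ts, delta in spike_deltas(sys.stdin):
        # The delta lands in field 4, matching awk '{print $4}'
        print(ts, "->", delta)
```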

Unfortunately there's not as much uniformity in the spikes as I was hoping to see:

$ python diff.py < foo  | awk '{print $4}' | sort | uniq -c  | sort -n | tail
      6 0:01:08
      6 0:01:10
      6 0:01:14
      6 0:01:16
      8 0:01:00
      8 0:01:03
      8 0:01:17
      9 0:01:04
      9 0:01:05
     10 0:01:06

The most common time difference is around 60-70 seconds, which is the gap between the two spikes within a pair. The gap between the pairs themselves is quite varied, anywhere from 2 to 8 minutes apart. This doesn't rule out the firewall, but the data doesn't reliably pin the problem to some X-minute recurring job in the internals of the firewall's operating system. The network engineers in my team are also quick to point out that this is an ASIC based firewall doing nothing but shifting a couple of PTP packets per second, so 20μs spikes from the firewall would be a very strange thing to be encountering. Before getting lost in a rabbit hole chasing performance spikes (something our engineers here know all too much about), there may be another way to mitigate this.

Attracting Attention

January was an interesting month for this project. We had several Solarflare representatives in to talk specifically about sfptpd, how it might be better utilised here and what feedback we could give to the Solarflare developers. A few days later I was tracked down on LinkedIn by Wojciech Owczarek, developer of the PTPd project. There are some very interesting features coming to PTPd, most of which we'll get to in a later blog post, but Wojciech did recommend raising the PTP Sync and DelayReq message rates to give the filters more samples to work with. This is a very simple thing to adjust and measure, so let's see if it smooths out our spikes to any degree.

Faster Message Rates

Changing the sync rate of the ptp4l process on the PTP Bridge requires a minor addition to the lmaxexchange-linuxptp Puppet module; at the same time I'll rename the repositories on GitHub to make it clearer that they are Puppet modules. Setting logSyncInterval to -2 means ptp4l will send out 4 Sync messages a second. The sfptpd daemons recognise this immediately:

2016-02-05 14:47:24.187549: info: received new Sync interval -2 from master (was 0)
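These log-interval options are the base-2 logarithm of the message interval in seconds, so -2 means an interval of 0.25s, or 4 messages a second. The ptp4l change amounts to a one-line configuration fragment:

```
# ptp4l configuration fragment
# logSyncInterval is log2 of the Sync interval in seconds:
# -2 => 0.25s between Syncs => 4 Sync messages per second
[global]
logSyncInterval         -2
```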

After half an hour the spikes appear to have shrunk from upwards of 20μs to around 15μs. Observing over a longer period, the intensity of each spike looks slightly lower; see the before (top) and after (bottom) graphs below:


Changing the Delay Request rate of the sfptpd client is also simple. Setting ptp_delayreq_interval to -2 in the sfptpd configuration file means the client will send a DelayReq message 4 times a second. We also have two clients, so by changing only one we can compare the difference between the two.
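As a sketch, the sfptpd change is also a single line; exactly where it sits in the file will depend on your existing sfptpd configuration:

```
# sfptpd configuration fragment
# log2 of the Delay Request interval: -2 => 4 DelayReq messages per second
ptp_delayreq_interval -2
```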

An interesting pattern has emerged a day after changing one of our clients to send 4 Delay Requests per second; the spikes are no longer uniform across both clients as seen in the graph below:


This debunks the theory that the firewalls are causing the jitter; each client is seeing its own variance at different intervals. One of our key performance guys, Mark Price, tells me 10-20μs is well within the range of Operating System jitter he's seen, so on an un-tuned system, even with no workload, there's a good chance this is what we are seeing.

A Different Architecture

Let's step back a bit and look at what we've found so far. Routing PTP through a firewall is possible, but we are heavily affected by contention. We can reduce this by using a dedicated firewall, but we still see 20μs spikes of variance, and there's a good chance they come from the Operating System. We could continue down the OS tuning route now - it's definitely something we'll have to do eventually - but it would be a big time investment, and right now we're still experimenting with architecture and design. Finally, our Networking team want their spare ASIC firewall back, so I'm losing my dedicated PTP firewall. The PTP Bridge generally holds a baseline of 100ns accuracy (excluding jitter), so perhaps it's time to look at switching-only designs.

In the very first post I mused about a design that didn't require a firewall: connecting the PTP Bridge to multiple separate management networks and having it send multiple independent PTP multicast streams. The idea has security implications - the PTP Bridge would connect networks that are normally kept apart, protected from each other by a firewall. With careful attention to detail these implications can be mitigated.

Alongside this, I have new options in my choice of software. Solarflare is interested in us replacing the PTP Bridge component with a purely Solarflare approach - I'm told there is an undocumented feature that allows sfptpd to work in UDP Unicast mode, so it could talk directly to our S300 Grand Master. There have also been some advances in PTPd that make it viable to replace linuxptp on the PTP Bridge, as well as sfptpd on the clients.

In the next post we'll remove the PTP firewall from the mix and look at purely switched designs.