Tuesday, January 12, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 5)

In this series we are attempting to solve a clock synchronisation problem to the degree of accuracy required by the MiFID II regulations. So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB, and
  4. Finished recording metrics
In this post we will look at the data we've gathered to see how accurate my first design is, and also see if we can explain why NTP and PTP think the time is so different.

Here's a quick refresher on what the design looks like:


Minor Improvements

I've made some slight improvements to Collectd and InfluxDB since I last posted code examples (if the queries look different to previous posts, this is why). I figured out why the Tail plugin always came up as "tail_" in Influx - it's because I used a custom Collectd Type that InfluxDB didn't know about. To fix this I pointed InfluxDB at my custom Collectd types database file, and now the measurement names contain the unit. I also fixed the type used for remote NTP offset to be in microseconds. The measurements now look like this, which is a bit more descriptive:
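To make that a little more concrete, here is roughly what the two pieces of configuration look like. Treat this as a sketch rather than something to copy verbatim - the type names (ns and us) and the file path are assumptions based on my setup and will likely be different on yours. The custom collectd types database just maps each type name to a value definition:

# custom types - names assumed; one gauge value per type, no min/max limits
ns    value:GAUGE:U:U
us    value:GAUGE:U:U

and the [collectd] section of influxdb.conf gets pointed at a types.db that contains these entries alongside the standard ones:

[collectd]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  # path is hypothetical - point this at a types.db that includes the custom entries above
  typesdb = "/etc/collectd/types-with-custom.db"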


Graphing PTP Client Metrics

I've got data from the PTP Bridge as well as two test servers I'm using as PTP Clients. Let's look at the offset of the clients, arguably the most important metric:



The query used for the above graphs (substituting each host's FQDN) is:

SELECT mean("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

Generally we are ±20μs offset from the PTP Bridge. Most of the time the PTP Bridge itself is ±200ns offset from the Grand Master Clock; however, there are some spikes. Here is the PTP Bridge offset for the same time period:



A better way to visualise the offset variance is to plot the min() and max() on the same graph using a larger GROUP BY time period. The two queries for an sfptpd client would be:

SELECT min("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

SELECT max("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

The PTP Bridge queries are:

SELECT min("value") AS "value" FROM "tail_ns" WHERE "instance" = 'ptp-master' AND "type_instance" = 'offset' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

SELECT max("value") AS "value" FROM "tail_ns" WHERE "instance" = 'ptp-master' AND "type_instance" = 'offset' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

I find this makes the offset variance easier to read - it's the space between the lines. The image below contains the PTP Bridge (top) and the two clients:


It would be nice to be able to represent "offset variance" as a single number. I know the formula - absolute(Max) plus absolute(Min) - but I don't know how to do that in InfluxDB or Grafana.
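One option that might do it is InfluxQL's spread() function, which - at least in the InfluxDB versions where I've seen it documented - returns max minus min for each group. That's the same as absolute(Max) plus absolute(Min) whenever the offset swings either side of zero, so something like the query below (reusing the WHERE clause from above) should produce a single "variance" series per interval. I haven't actually tried this in these dashboards yet, so treat it as a sketch:

SELECT spread("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)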

If we ignore that one spike on the Master for now, our PTP Clients have about ±20μs variance. This is "OK", but not great. It's within the tolerance we need to hit to be compliant, but this is not the only step in the chain. Our financial software is what timestamps events, and that is subject to some amount of operating system jitter. A finger-in-the-air number from some of our developers is that we should expect around ±40μs of jitter from our software at the moment. Adding ±20μs of PTP variance takes that to ±60μs, which is getting uncomfortably close to our compliance limit of ±100μs.

The difference in offset variance is not just down to the firewall - although that's probably a large part of it, there are two other factors. First, there is no hardware timestamping on the clients receiving PTP multicast: the sfptpd software only supports hardware timestamping on Solarflare interfaces, and we are using an existing management network that doesn't have Solarflare NICs. Second, we haven't tuned or isolated the PTP software, so it's subject to operating system jitter. While the OS jitter is probably very small right now (the test clients are essentially idle machines), it will become more of an issue once we put PTP on servers doing a normal workload.

The time period in the graphs above is a small 3-minute window I chose during a "good" period to show the data. When you look at the data over a larger period, a much more serious problem emerges - one that is going to force our hand in terms of architecture changes.

The problem with re-using infrastructure

Here is a client's offset tolerance from 21:30 to 23:30:


In case you missed the scale of the graph, that's spikes of ±700μs in this 2 hour period. I won't hold you in suspense for too long - in short, it's network resource contention on the switch / firewall introducing variance into PTP. 22:00 UTC is a special time in the finance world: it's 17:00 EST (5pm New York time), which is when the financial markets close in NY. Our platform does some housekeeping work at this time, including several backup jobs. These backup jobs ship large amounts of data across our management network - the same switch and firewall that PTP is running through.

To look at this problem in detail here are tiles of all metrics for both the PTP Bridge and PTP Clients for that time period (the top image is the PTP Bridge):



Eyeballing all the graphs, we see that the backup period affects both the PTP Bridge and the PTP Clients, but on different scales. The PTP Bridge's Offset Tolerance is interesting because it shows the switch has ±7μs of overhead when under load, but the PTP Client offsets spike up to hundreds of microseconds. This indicates that while the switch is also noticeably busy, the firewall is contributing the majority of this variance - which makes sense, as it's a software (CPU-based) firewall that is also trying to ship gigabytes of backups around at the same time.

If we zoom in on one of these spikes we can see what is most likely a queuing backlog on the firewall:


You can see the Path Delay spike up and then drain out slowly. The offset smooths out reasonably quickly because the PTP formula is able to work out the correct offset as long as the delay is reasonably stable. It's the initial spike that gives us all the trouble.
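As a bit of background on why a queuing backlog hurts so much: in the standard PTP delay request-response exchange, the Sync message leaves the master at t1 and arrives at the client at t2, and the Delay_Req leaves the client at t3 and arrives at the master at t4. The client then calculates:

delay  = ( (t2 - t1) + (t4 - t3) ) / 2
offset = (t2 - t1) - delay

The delay calculation assumes the path is symmetric, so when the firewall suddenly queues messages in one direction only, roughly half of that one-sided delay lands straight in the calculated offset. The error only washes out once the delay stabilises again and the clock servo catches up - which matches the shape in the graph above.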

When zoomed in like this, the spike may not seem like such a big deal - after all, the offset appears to settle back to normal in about 10-15 seconds. But our platform can process hundreds of thousands of client messages in that window, so it's not a trivial amount of time in the financial world, nor is it anywhere near compliant.

Rethinking Architecture

We know that it's contention on the firewall that's causing the majority of these issues. The switch is also affected, but it's not that big a contributor. When discussing this problem in the office the next simplest idea we came up with was to just route the PTP traffic through a different firewall. It wouldn't even need to be a very powerful device, as long as it wasn't congested during backup periods (we joked about using someone's home Netgear at one point).

We use ASIC-based firewalls in other areas where we care about speed, and the brand in question has a pretty inexpensive low-end model. Even though it's a break from our "don't spend any money" stance, the cost of these small firewalls is low enough that it's not a big deal. We happen to have a similar unit available for testing, so in the next post we'll look at adding this ASIC firewall into the mix, configuring PTP to route through it, and the results we get.
