Thursday, December 24, 2015

Solving MiFID II Clock Synchronisation with minimum spend (part 4)

This post is a continuation from Part 3, where we started on collecting metrics from our PTP Bridge and clients. We started graphing NTP offset as a proof of concept and noticed data that could indicate a problem with the accuracy of PTP, but we need more information. We're now diving right in where we left off - collecting PTP metrics.


Collecting PTP Metrics

We've got NTP Offset data into InfluxDB, but what we really want is data from PTP. I'll start with the PTP Bridge. In Part 2 of this series we set up various linuxptp processes running under supervisord with Puppet, and we specified that their STDOUT was written to specific log files under /var/log/linuxptp. We're interested in the stats from the ptp-master instance of ptp4l; this is the Slave Clock that receives Layer 2 PTP from the S300 Grandmaster Clock. A single line of stats output looks like this:

ptp4l[2089949.676]: master offset         38 s2 freq  +34476 path delay     35290

The three interesting stats here are offset (38), frequency adjustment (+34476) and path delay (35290). If you look at the output from a phc2sys process, the format is almost the same:

phc2sys[2090397.362]: phc offset         2 s2 freq  -12406 delay   2416

As mentioned before, we chose Collectd as a stats gatherer because it can tail log files. Since the two output formats are almost identical, we should be able to create a Defined Type in Puppet that writes a collectd Tail plugin configuration file and can be re-used for any ptp4l or phc2sys log file we want to collect stats from:
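The configuration the Defined Type generates looks roughly like this (a sketch: the real Types come from our custom PTP entries in types.db rather than the stock gauge, and depending on collectd version the 1-second interval may need to go on the LoadPlugin block instead):

```
<Plugin "tail">
  <File "/var/log/linuxptp/ptp-master.log">
    Instance "ptp-master"
    Interval 1
    <Match>
      Regex "offset +(-?[0-9]+)"
      DSType "GaugeLast"
      Type "gauge"
      Instance "offset"
    </Match>
    <Match>
      Regex "freq +([+-]?[0-9]+)"
      DSType "GaugeLast"
      Type "gauge"
      Instance "freq"
    </Match>
    <Match>
      Regex "delay +(-?[0-9]+)"
      DSType "GaugeLast"
      Type "gauge"
      Instance "path-delay"
    </Match>
  </File>
</Plugin>
```

The File-level Instance and the per-Match Instance become the "instance" and "type_instance" tags we'll query on later.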

There are a lot of lessons learned in the above code that need explaining. First, the type in each of the matches is one of our custom PTP collectd types. The dstype parameter controls how collectd accumulates values. A Tail plugin matcher is designed to read a chunk of a file and accumulate its contents into one metric - for example, read /var/log/maillog every minute and count how many emails were sent. It's not designed to generate a metric for every line of a file, which is what I want to do.

It can be made to do this by setting the collectd interval to match the PTP interval - in my case every 1 second - and setting DSType to "GaugeLast". This makes collectd wake up every second, tail the log file (which should find just one new line since the last time it woke up), and use the value it finds as a Gauge.

Finally, the collectd Tail plugin regular expressions use the system regex library, as described in the man page regex(7), which is not the same as Perl. I mention this because I spent almost a day debugging Perl-style regular expressions in collectd, only to find out that I'd misread the collectd documentation about which regex engine was used under the hood (I was trying to use PCRE character classes: \w, \s, etc).

The way Collectd is designed to always aggregate data suggests that it is perhaps not the best tool for this job. A better approach might be a log file tailer script written in Python that sends metrics to InfluxDB using its own wire format (not the collectd wire format). Collectd will do for now though, and I'll slowly build up a list of reasons to move away from it.
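As a sketch of that alternative, a minimal tailer could parse each stats line itself and emit one InfluxDB line-protocol point per statistic, tagged by host and instance. Everything here is illustrative rather than the config used in this series, and it assumes InfluxDB's UDP listener is enabled on port 8089:

```python
import re
import socket
import time

# ptp4l and phc2sys stats lines share (almost) the same shape.
STATS_RE = re.compile(
    r'(?:master|phc) offset\s+(-?\d+)\s+s\d\s+'
    r'freq\s+([+-]?\d+)\s+(?:path )?delay\s+(-?\d+)'
)

def to_lines(host, instance, raw, ts_ns):
    """Turn one stats line into InfluxDB line-protocol points, one
    measurement per statistic, tagged by host and process instance."""
    m = STATS_RE.search(raw)
    if not m:
        return []
    tags = 'host=%s,instance=%s' % (host, instance)
    names = ('offset', 'freq', 'path_delay')
    return ['%s,%s value=%di %d' % (name, tags, int(value), ts_ns)
            for name, value in zip(names, m.groups())]

def follow(path):
    """Yield complete lines as they are appended to path, like tail -f
    (no log-rotation handling)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        buf = ''
        while True:
            chunk = f.readline()
            if not chunk:
                time.sleep(0.2)
                continue
            buf += chunk
            if buf.endswith('\n'):
                yield buf
                buf = ''

def main(path, host, instance, addr=('localhost', 8089)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for line in follow(path):
        for point in to_lines(host, instance, line, int(time.time() * 1e9)):
            sock.sendto(point.encode(), addr)
```

This also fixes the schema complaint that comes up below: each statistic becomes its own measurement instead of everything landing in a generic collectd one.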

Now if we go back to the Puppet Profile for the PTP Bridge, we can use the above collectd tailer Defined Type for every linuxptp instance log file:

You may remember that we have four linuxptp processes running under supervisord, but only three collectd tailers here. The fourth process is the ptp4l multicast process (the one that multicasts to our PTP Slaves through a firewall); it is configured as a Master Clock and doesn't log statistics, so there's no point tailing its log file.

Let's look at our data in InfluxDB:


If you want to copy it, the query is:

SELECT value from "tail_" WHERE "instance" = 'ptp-master' and "type_instance" = 'offset' AND time > now() - 10s

InfluxDB 0.9 introduces "tags" to namespace sets of data points. In the query above we are filtering on tag "instance" having the value of 'ptp-master' and tag "type_instance" being 'offset'. These tags and their values are translated by InfluxDB from the collectd wire format. If you compare what tags are being created to the Puppet code, you can see the relationship.

The name of the measurement, tail_, is not the most descriptive. InfluxDB's translation of the collectd tail plugin data leaves a lot to be desired. A better way to structure this data in InfluxDB would be to have a measurement called "offset" (because that's what we're actually measuring) and tags of "instance" and "host". This is another +1 to dropping collectd and writing our own tailer script so we can write to InfluxDB in a better way. Anyway, it's working so let's soldier on.

Graphing PTP Metrics

Let's draw the ptp-master instance's offset for a small window of time:


The query for this graph is:

SELECT mean("value") AS "value" FROM "tail_" WHERE "instance" = 'ptp-master' AND "type_instance" = 'offset' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

You may wonder why I've put in the value > 0 and value < 0 conditions. It is a workaround for this bug, which is actually a Go bug. To summarise, the collectd Tail plugin writes "NaN" (Not a Number) into InfluxDB. It appears to do this on startup, when there are no metrics to read from the files. When a simple InfluxDB query in Grafana returns these NaNs, it triggers that JSON error and the graphs don't draw. Since NaN is neither less than zero nor greater than zero (because it's not a number), you can effectively "filter out" the NaNs in the WHERE clause. This is why I always write the full InfluxDB query in Grafana rather than use the query builder form - you can't specify brackets or OR statements with Grafana's query builder.
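The trick works because IEEE 754 defines every ordered comparison against NaN as false, which is easy to check in any language; in Python for example:

```python
nan = float('nan')

print(nan > 0)    # False
print(nan < 0)    # False
print(nan == nan) # False - NaN is not even equal to itself
```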

The data in this period says we're getting ±100-250ns maximum offset. I'm pretty happy with that result - it means that to receive PTP from our S300 Grandmaster Clock we're losing at most a quarter of a microsecond in accuracy. That's a good start, but there are plenty more hops to go.

If we draw the ptp-master path delay we can see the time it takes for our PTP packets to cross the switch:


The query is very similar, only the type_instance value needs to change:

SELECT mean("value") AS "value" FROM "tail_" WHERE "instance" = 'ptp-master' AND "type_instance" = 'path-delay' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

Between 35.25μs and 35.3μs. This may seem alarmingly high at first but remember that the value of path delay is not important to PTP, as long as it is consistent. The variance here, ±50ns, is not too bad.

(Not) Explaining the Negative NTP Offset

Hopefully you remember that I ended the previous post after graphing the NTP Offset of the PTP Bridge (the value reported by the "ntpdate -q" command), and that the data we got threw into question whether PTP or NTP was working correctly.

Here's the data for the same time range as the graphs above:


I said that if we looked closely at the metrics and the network architecture, I'd be able to explain the difference. When I wrote that, I had convinced myself that the NTP Offset discrepancy was caused by the time it takes to cross the switch, which we measure as the ptp-master Path Delay.

If we assume the clocks are perfectly synchronised by PTP and we know that it takes 35μs to cross the switch, the timestamp in the NTP packet would be 35μs behind. Add a little bit of operating system jitter... That explains our -25μs NTP offset, right?

... Not really. If the NTP packets were delayed by 35μs then our PTP Bridge would be 35μs ahead, not behind. Also, the assumptions about network architecture are incorrect: the NTP packets do not just cross a switch.

Here is a revised diagram of the network architecture from Part 2 showing the path that NTP takes:



The NTP packets flow from the PTP Bridge through the switch to the firewall, then back onto the same switch to the S300. I also said that ntpdate was a "one time" query with no smarts built in to calculate network delays - this is also incorrect (lesson learned: don't blindly trust what you read on Stack Exchange). The ntpdate man page says it makes several requests to a single NTP server and attempts to calculate the offset, so it's not a "dumb" time query; it is trying to figure out its own delay and compensate for it. Explaining the NTP offset on the PTP Bridge just got a lot harder.
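For reference, the offset and delay an NTP client computes come from the four standard NTP timestamps; as a sketch:

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Classic NTP calculation (all times in seconds):
    t1: client sends request, t2: server receives it,
    t3: server sends reply,   t4: client receives it.
    A symmetric network path is assumed."""
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```

If both clocks agree and the path really is symmetric, the offset comes out at zero regardless of how large the delay is - which is why a big path delay alone doesn't explain a -25μs offset.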

The ntpdate command reports a delay of 25.79 milliseconds, which seems way too high: the time to ping the NTP interface on the S300 is only 0.2ms. The delay to other NTP servers in the same network is around 25ms as well, so this could be a property of how ntpdate works, or of the NTP server software. I seem to have more questions than answers at the moment, so let's look at gathering the stats from two PTP Clients and see if that sheds any light.


Collecting sfptpd Metrics

The Puppet work we've done earlier for the PTP Bridge can mostly be re-used for our PTP Clients. The NTP Offset statistic is exactly the same, and the sfptpd stats format is similar to the linuxptp one, so we've pretty much got the regular expressions we need. There are two things that make it more complicated though: how sfptpd writes to disk, and how sfptpd synchronises all Solar Flare adapter clocks.

The first problem has to do with sfptpd's output. The linuxptp processes are line buffered, which works nicely with our collectd tailer that reads a new line from the file every second, so we never miss a data point. sfptpd is block buffered, so you get a disk write every 5-10 seconds and it's almost never at the end of a line. Collectd's Tail plugin will read the latest data from the file, but this will be several lines of output. Due to the way the Tail plugin aggregates, it will only ever send one data point even if it read 10 lines.

The issue really lies with collectd's Tail plugin, which we talked about before. We can make sfptpd behave in a way that fits with collectd though; we can run the stdbuf(1) command beforehand to force sfptpd's STDOUT to be line buffered. I can do this by modifying the sfptpd init script, or by running sfptpd with supervisord.

I've gone with the second option because I already use supervisord for the linuxptp processes, and I think modifying the init script is "more dirty". Here is the modified Puppet code for sfptpd:
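What Puppet ends up writing is a supervisord program entry that wraps sfptpd in stdbuf; roughly like this (paths and sfptpd arguments here are illustrative):

```ini
[program:sfptpd]
; stdbuf -oL forces line buffering so the collectd tailer
; sees whole lines every second
command=/usr/bin/stdbuf -oL /usr/sbin/sfptpd -i em1 -f /etc/sfptpd.conf
redirect_stderr=true
stdout_logfile=/var/log/sfptpd/stats.log
stdout_logfile_maxbytes=0
autorestart=true
```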


Nothing too hard there, the complicated bit is generating the collectd Tail Plugin config for the sfptpd stats file.

sfptpd will synchronise the adapter clocks of all Solar Flare adapters in the server, as well as the system clock. In the machines I'm using as PTP Clients a single second of stats logs looks like this:

2015-12-22 17:14:10.759688 [ptp-gm->system(em1)], offset: 348.000, freq-adj: 1444.090, in-sync: 1, one-way-delay: 57616.000, grandmaster-id: 0000:0000:0000:0000
2015-12-22 17:14:10.759688 [system->phc4(p5p1/p5p2)], offset: -227.562, freq-adj: -334.226, in-sync: 0
2015-12-22 17:14:10.759688 [system->phc5(p4p1/p4p2)], offset: -227.938, freq-adj: -261.973, in-sync: 0
2015-12-22 17:14:10.759688 [system->phc6(p3p1/p3p2)], offset: -223.562, freq-adj: 88.446, in-sync: 0


The first line is the "Local Reference Clock". The output "ptp-gm->system(em1)" shows that we are receiving PTP from a Grand Master using software timestamping on interface em1. The subsequent lines are Solar Flare adapters; there are three dual port cards, each with a single PTP Hardware Clock. The output "system->phc4(p5p1/p5p2)" shows that the adapter clock phc4 is being synchronised, which is the clock for interfaces p5p1 and p5p2.
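The per-line structure is regular enough to pull apart with a single expression. A Python sketch of the parse (the field names are mine, not sfptpd's):

```python
import re

# Matches sfptpd stats lines like:
# 2015-12-22 17:14:10.759688 [system->phc4(p5p1/p5p2)], offset: -227.562, ...
SFPTPD_RE = re.compile(
    r'\[(?P<source>[\w-]+)->(?P<clock>\w+)\((?P<ifaces>[^)]+)\)\], '
    r'offset: (?P<offset>-?[\d.]+), freq-adj: (?P<freq_adj>-?[\d.]+), '
    r'in-sync: (?P<in_sync>[01])'
)

def parse_sfptpd(line):
    """Return the clock being synchronised, its interfaces and the
    numeric stats, or None for lines that aren't stats output."""
    m = SFPTPD_RE.search(line)
    if not m:
        return None
    d = m.groupdict()
    d['offset'] = float(d['offset'])
    d['freq_adj'] = float(d['freq_adj'])
    d['in_sync'] = int(d['in_sync'])
    d['ifaces'] = d['ifaces'].split('/')
    return d
```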

The system line is the most interesting to me right now, but we should capture metrics for all the cards as well. Specifying each Solar Flare adapter clock for every node would be very tedious, and we can't use Collectd's Tail plugin multiple times on the same file, so we have to build one big Puppet array of regular expressions covering everything we want to match. Here's how I've done it automatically.

First, we need a Facter Fact so we "know" which interfaces are our Solar Flare adapters. Here is a Custom Fact that creates a Fact for every network driver in a machine, listing the interfaces that use it. An example of these Facts on a PTP Client:

nic_driver_bonding => bond0,bond1
nic_driver_igb => em1,em2,em3,em4
nic_driver_sfc => p3p1,p3p2,p4p1,p4p2,p5p1,p5p2
nic_driver_virtual => dummy0,lo


... and here is the Fact itself. It comes from an internal "standard lib" Puppet module which is not available to the public:


Next we need a way of knowing which PHC devices belong to our Solar Flare adapters. Many other brands of network card will create PHC devices (/dev/ptpX) on recent kernels; we just want the Solar Flare ones. Here is a Fact from the lmaxexchange-linuxptp Puppet module:


The last thing we want to do is take a template of Collectd Tail Plugin Matches and substitute in the PHC devices we are looking for. We need a chunk of imperative programming logic to do this efficiently, something Puppet 3.x is very poor at. I've written a Ruby function that substitutes and replicates a template hash against an array; the Gist below includes docs with a worked example:
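The behaviour itself is simple even if Puppet 3 makes it awkward; in Python terms it is roughly this (names and the placeholder convention are illustrative, the real function is Ruby):

```python
def replicate_template(template, values, placeholder='__SUB__'):
    """Produce one copy of the template dict per value, substituting
    the placeholder into every string field."""
    results = []
    for value in values:
        copy = {}
        for key, field in template.items():
            if isinstance(field, str):
                field = field.replace(placeholder, value)
            copy[key] = field
        results.append(copy)
    return results

# One Tail plugin Match template stamped out per PHC clock:
matches = replicate_template(
    {'regex': r'__SUB__\([^)]+\)\], offset: (-?[0-9.]+)',
     'instance': '__SUB__-offset',
     'dstype': 'GaugeLast'},
    ['phc4', 'phc5', 'phc6'],
)
```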


Now I put all these things together in a Puppet class to create the Collectd Tail Plugin config file for our sfptpd stats file:


The resulting file is large and repetitive; you can see an example from one of my PTP Clients here. With the above Puppet class applied, we are now gathering PTP stats on the test clients.

All those code examples have made this a long post. The next post will be drawing graphs and analysing data.
