Wednesday, December 16, 2015

Solving MiFID II Clock Synchronisation with minimum spend (part 3)

In my last post we created a "PTP Bridge" which translates from one Precision Time Protocol transport mechanism to another in order to distribute precise time to a PTP client. I'm doing this to try to solve specific clock synchronisation requirements that are coming in new regulations. You can read about the regulations, my limitations, and subsequent design ideas in the first post in this series. This post will be half technical and half observation as I go into how I'm capturing statistics, and what we can see from the data.

We know that the design we've started with is inherently flawed. PTP is most accurate when it runs on dedicated network infrastructure where every device in the network is PTP aware. Every device in the network path that's not PTP aware adds some amount of variable delay. Switches will add some; software firewalls and routers will add more.

There are a few reasons we started with a flawed design. The first is that the effort involved was low - multicasting over a network that already exists required only a small amount of configuration change. We only had to build one new server and attach half a dozen network cables to get up and running. Expenditure is also a big factor - we haven't had to buy anything new so far, we've just used what we've got on hand. It's also a nicely Agile-ish approach - let's get our hands dirty with PTP early, learn what we need to learn and make small improvements as we need them. After all, we are trying to hit a specific level of clock sync accuracy by 2017; it's going to be an iterative improvement process over a period of time.

When you adopt a design you know will be "bad", you soon ask yourself a very important question...

How bad is bad?

How slow is that firewall? How much time is lost going through that switch? Very good questions, and they're answered with statistics. Before we get into that though, there is another important point that I haven't talked about so far.

An amendment to Regulatory Technical Standard 25 (RTS 25) in the Markets in Financial Instruments Regulation (MiFIR) talks about venues documenting the design of their clock synchronisation system, and specifications showing that the required accuracy is maintained. What this means is that we need to be able to prove to any auditors that we stay within ±100μs of UTC at all times. To do this we have to monitor our system clocks and how accurate PTP is.

So, by answering the question "how bad is bad" we're taking early steps towards our requirement to prove we can maintain the accuracy we need to. Enough theory for now, let's get technical.

PTP Metrics

We are using two software implementations of PTP - ptp4l and sfptpd. Both output the standard PTP statistics. ptp4l is very basic and can only write to STDOUT, while sfptpd can write to STDOUT, a file, or syslog. The three main PTP statistics are offset, frequency adjustment and path delay.

Offset is how far the PTP protocol has calculated that the Slave Clock is off from the Master Clock. Frequency Adjustment is how much the oscillator of the clock is adjusted to try to make it run at the same rate as the Master Clock. Path Delay is how long it takes for a PTP packet to get from Master Clock to Slave Clock (or vice versa, because PTP assumes they are the same).

The file formats of the two daemons are not identical, but they are very similar. If we can get something to constantly read from the files and parse metrics from them, then we should be able to handle both ptp4l and sfptpd in an almost identical manner.

Capturing, Storing and Displaying Metrics

Historically it's common to store server metrics in Round Robin Database (RRD) files, and for most metrics this is perfectly fine. After all, most metrics become less important the longer ago they were recorded. My use case is a little different though. RRD's smoothing-over-time effect is not helpful if you are trying to compare a day six months ago with yesterday at per-second granularity; the detail just gets normalised away. The smoothing effect is also not appropriate for proving compliance at any point in time over the past year. So I want to avoid RRD backends for the moment, which also rules out Graphite's Whisper/Carbon backend, which is similar in concept to RRD.

I've used an early version of InfluxDB before and found it reasonably intuitive. We also use it in a few places elsewhere in the company. InfluxDB seems to change data storage engines every phase of the moon, and what should be harmless RPM upgrades have corrupted data stores here in the past. From version 0.9.5 InfluxDB uses a storage engine they developed themselves. Hopefully it remains stable, but even if it doesn't, we're only playing around at the moment and we can change storage backends if need be.

There are a number of ways to send data into InfluxDB, and it can also run as a collector for various different wire formats. One of these is collectd, a popular metrics collection system with a huge library of plugins. Collectd has a "Tail" plugin which reads from a file and applies regular expressions in order to gather metrics - that sounds like it can extract the numbers from the ptp4l and sfptpd stats files for me.
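As a rough sketch of what that Tail configuration might look like for ptp4l (the log path, the regex, and the custom ptp_offset type are my assumptions, not a tested config):

```
<Plugin "tail">
  <File "/var/log/ptp4l-stats.log">
    Instance "ptp4l"
    # ptp4l logs lines like "master offset -14 s2 freq +3400 path delay 2000";
    # the first capture group in the Regex becomes the metric value.
    <Match>
      Regex "master offset +(-?[0-9]+)"
      DSType "GaugeLast"
      Type "ptp_offset"
      Instance "master-offset"
    </Match>
  </File>
</Plugin>
```

A similar File block with a slightly different regex should cover sfptpd's stats file.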

For displaying data I've used Graphite in the past but I'm really taken with Grafana, which I find much prettier and modern. Grafana also works with InfluxDB 0.9.

Adding Statistics Collection

For lack of a better place right now I'm going to send my time metrics to the PTP Bridge server, so it is both the Master Clock and the source of record for how accurate all its Slave Clocks are. Until I'm happy that Grafana and InfluxDB are the right choices I'm not going to code their setup into Puppet just yet; I've just installed the influxdb and grafana RPMs by hand. In hindsight this was a good move - I've upgraded the InfluxDB RPM twice now, and it has moved data and config files each time. Backporting these constant changes into Puppet would have annoyed me.

I am using the puppetlabs-apache module to proxy Grafana:
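Something along these lines (a sketch rather than my exact manifest; the hostname is a placeholder, and I'm assuming Grafana's default port 3000):

```puppet
# Proxy Grafana (listening on its default port 3000) through Apache.
class profile::grafana_proxy {
  include ::apache
  include ::apache::mod::proxy_http

  apache::vhost { 'grafana.example.com':
    port       => 80,
    docroot    => '/var/www/html',
    proxy_pass => [
      { 'path' => '/', 'url' => 'http://localhost:3000/' },
    ],
  }
}
```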

You'll notice there's only a non-SSL Apache vhost here. I ran into issues proxying InfluxDB's API over SSL. In the interest of getting some results on screen I've allowed my browser to simply fetch data from InfluxDB directly on its default port 8086. The problem was a certificate / mixed content issue so it's probably solvable; I'll come back to it later.

Turning our PTP clients into statistics generators is pretty simple; the puppet-community-collectd module has all of the features I've needed so far. Here is a profile to install collectd on a server and add some extra collectd types that are more relevant for PTP than the standard types:
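A sketch of that profile, assuming the collectd module's typesdb parameter (the class name, file paths, and type names are mine):

```puppet
# Install collectd and register an extra types database for PTP metrics.
class profile::collectd_ptp_types {
  class { '::collectd':
    purge        => true,
    recurse      => true,
    purge_config => true,
    typesdb      => [
      '/usr/share/collectd/types.db',   # standard collectd types
      '/etc/collectd.d/ptp_types.db',   # our PTP-specific types
    ],
  }

  # ptp_offset and ptp_path_delay are in nanoseconds (path delay can never
  # be negative, so its minimum is zero); ptp_frequency is in parts per billion.
  file { '/etc/collectd.d/ptp_types.db':
    ensure  => file,
    content => "ptp_offset\tvalue:GAUGE:U:U\nptp_frequency\tvalue:GAUGE:U:U\nptp_path_delay\tvalue:GAUGE:0:U\n",
  }
}
```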

Collectd has native types - time_offset, frequency_offset and delay - that are almost appropriate to use for PTP but not quite, so I've created my own. PTP Offset is always measured in nanoseconds rather than collectd's time_offset which is in seconds. PTP Frequency is measured in Parts Per Billion (ppb), not frequency_offset's Parts Per Million (ppm). PTP Path Delay is also measured in nanoseconds, and should never be negative so the minimum value is zero.

The Annoying Thing About Time

We're working towards collecting PTP statistics, which is great, but how do we know that PTP is actually doing its job right? It would be nice to be able to compare the clocks that PTP is disciplining to some other time source so we can see how accurate it is - a digital version of watching your wrist watch's seconds hand tick around and comparing it to the speed of the clock on the wall, so to speak.

This is actually a very difficult thing to do, made even harder because we're working with such small units of time. What we would do in an NTP environment is run "ntpdate -q TIMESOURCE" and check that the drift is acceptable. PTP is supposed to be more accurate than NTP though, so this is not going to give us the level of depth we want. The PTP protocol does output statistics about its accuracy, but these can only be trusted if it calculates its path delay correctly. If there's a device in our network that's introducing a consistent one-sided delay, the PTP algorithm will not be able to correct for it and we'll end up with a clock that's slightly out of sync - and PTP will never know.

The only way to test PTP accuracy is to compare it to a second reference clock that you trust more. MicroSemi have a good paper on how you would go about doing this. In short, you plug a 1 Pulse-Per-Second (1PPS) card into your PTP slave and compare the PTP card to the 1PPS card.

I don't have a 1PPS card (yet...), so collecting NTP offset while we're messing around with PTP is better than nothing. NTP running as a daemon has some clever math to improve its accuracy over time. ntpdate as a one-shot command doesn't have that, so it will be subject to all sorts of network delays. I'm not expecting it to be very accurate or usable as a trusted reference clock, but at the very least it will make a good starting point for our metrics collection.

Collecting NTP Offset

Now to test that we can gather metrics correctly with collectd. I've created a simple collectd Exec plugin that executes a Bash script in an endless loop, outputting the offset from a given NTP server in the correct collectd format. The NTP server I'll be querying is the same Symmetricom S300. This is the Puppet class:
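A sketch of the class, assuming the collectd module's Exec plugin support (the class name, script path and parameter are stand-ins for my real ones):

```puppet
# Run a script via the collectd Exec plugin that reports remote NTP offset.
class profile::remote_ntp_offset (
  String $ntp_server = 'ntp.example.com',  # in reality, the Symmetricom S300
) {
  file { '/usr/local/bin/remote-ntp-offset.sh':
    ensure => file,
    mode   => '0755',
    source => 'puppet:///modules/profile/remote-ntp-offset.sh',
  }

  collectd::plugin::exec::cmd { 'remote_ntp_offset':
    user  => 'collectd',
    group => 'collectd',
    exec  => ['/usr/local/bin/remote-ntp-offset.sh', $ntp_server],
  }
}
```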

The Bash script is pretty simple; I translate the output from seconds to microseconds because I think it's better to visualise (fewer decimal places):
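Something like the following (a sketch; the PUTVAL identifier and the exact parsing are my assumptions, and the real script may differ):

```shell
#!/bin/bash
# Parse "offset <seconds>" out of ntpdate output and convert to microseconds.
to_micros() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "offset") { printf "%.3f", $(i + 1) * 1000000; exit }
  }'
}

# Query the given NTP server once; prints the offset in microseconds.
query_offset() {
  ntpdate -q "$1" 2>/dev/null | to_micros
}

# Endless loop emitting collectd Exec plugin PUTVAL lines, when a server
# argument is given. COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by
# collectd when it runs the script.
if [ -n "${1:-}" ]; then
  HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
  INTERVAL="${COLLECTD_INTERVAL:-10}"
  while true; do
    offset=$(query_offset "$1")
    [ -n "$offset" ] && \
      echo "PUTVAL \"$HOST/exec-remote_ntp_offset/time_offset\" interval=$INTERVAL N:$offset"
    sleep "$INTERVAL"
  done
fi
```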

We also need a bit of collectd configuration to do something with our statistics. When debugging it's helpful to quickly add the CSV plugin, which writes to disk locally. We want to send our data to the InfluxDB server, which can be configured to accept the collectd format. Here's a simple Puppet class to send data to a remote server using the collectd Network plugin:
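In sketch form (the class name and hostname are placeholders):

```puppet
# Forward collectd metrics to the InfluxDB server's collectd listener.
class profile::collectd_network (
  String $influx_host = 'ptpbridge.example.com',  # placeholder hostname
) {
  collectd::plugin::network::server { $influx_host:
    port => 25826,  # collectd's default network port; InfluxDB listens here
  }
}
```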

We can now add the above components to the PTP Bridge Puppet Profile:
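In outline it ends up looking something like this (class names are stand-ins for my real ones):

```puppet
# The PTP Bridge role: the bridge itself plus the new metrics pieces.
class role::ptp_bridge {
  include profile::ptp_bridge         # the PTP Bridge from the previous post
  include profile::collectd_ptp_types
  include profile::remote_ntp_offset
  include profile::collectd_network
}
```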

After a Puppet run on the PTP Bridge server I can see my first InfluxDB Measurement "remote_ntp_offset" being created and storing data points:

The name of the measurement in the screenshot above is confusing. It gets constructed from the plugin instance remote_ntp_offset that I wrote, plus the unit of measurement of the type I used. I chose time_offset, which the default collectd types database defines as being measured in seconds. My original script worked in seconds but I didn't like the long decimal places, so I translated the script output to microseconds but didn't change the collectd type - I'll fix it later.

Graphing NTP Offset

We have data in InfluxDB, so now let's visualise it in Grafana. I find the Query Builder editor mode doesn't work very well for me; the JavaScript has issues (clicking on certain controls doesn't work) and you just can't represent some features of the InfluxDB query language using the form (like an OR statement). Below I show the raw InfluxDB query to draw the remote NTP offset for the PTP Bridge server:

If you want to copy it, the InfluxDB query is:

SELECT mean("value") AS "value" FROM "remote_ntp_offset_seconds" WHERE "host" = 'FQDN' AND $timeFilter GROUP BY time($interval)

First thing to note: I have obfuscated the "host" tag value; in the real query it's an actual FQDN rather than the literal string 'FQDN'. We're selecting mean("value") and grouping by a dynamic time($interval). This is a common pattern Grafana uses so that you don't request every single data point over a large period of time and grind your browser to a halt. The $interval is worked out from whatever $timeFilter is applied to the graph. Grafana has a very easy to use time range selector and zoom button in the top right corner, so the $timeFilter and $interval change each time you "move" the graph to a different time range. If you do want to see every single data point, it's simple enough to modify the query, but be careful not to zoom out too far.

The astute among you will also note that the actual values in the graph above are pretty strange. It shows that the PTP Bridge is -25 ±5 μs behind the GPS time source. A quick glance at the log file of the ptp4l instance that is the Slave Clock of the PTP Grandmaster (the same time source as NTP) shows that we barely drift over a microsecond of offset from the Master Clock, and most PTP Sync messages are offset by only ±100 nanoseconds.

So which is correct? PTP or NTP? Which do I trust more? If only we had a perfect reference clock plugged in to the server for comparison... Even without a fully trusted reference clock, we can still deduce with reasonable confidence which of NTP or PTP is more correct. To do this we will first gather more data and look at the PTP metrics from both our PTP Bridge and PTP clients, and then take a close look at the network architecture to see if we can make some reasonable conclusions.

This post has already gotten very long, so I'll continue with PTP metrics collection and what we can infer from the data in the next post.


  1. Any idea how to convert frequency adjustment (which is in parts per billion) to nanoseconds? My initial thought was to divide the value by the CPU clock speed in MHz, but there isn't enough variance in the gps->system log lines for that calculation to be correct. On top of that, the freq_adj is always lower for the Solarflare ASIC to GPS, which makes me assume the Solarflare clock runs at a slower speed

    1. Sounds tricky, as there will be a number of variables dragging the oscillator frequency slower and faster compared to pure wall clock time, so it'll be relative to those various unknowns. Reminds me of my boss' blog post about getting sanity out of CPU clock sources - have a read, because even though it's talking about CPUs, some of the core problems are probably going to be the same (CPU power levels, etc).

      What exactly is your need to try to convert from frequency to nanos?

    2. How confident are you that each offset really is the true offset from the grandmaster? As for your question, the short answer is that I want to know how many nanoseconds PTP is actually adjusting the clock during each adjustment.

      The longer explanation follows...

      Offset is how inaccurate PTP thinks the host is compared to the PTP grandmaster. Freq-adj is the dispersion (correction) that PTP actually applies to try to bring the host back into sync. Assume you are perfectly in sync (0 ns offset), then the next packet is +100 ns offset. Now consider two scenarios:

      1. If the packets continue at +100ns PTP will slew you forward using the freq-adj until the offset is 0 ns, but it won't jump your time in one interval. Knowing how freq-adj gets converted to nanoseconds answers how long it takes to get in sync and how aggressively PTP acts to make this happen.

      2. If the packets immediately go back to 0 ns, perhaps the +100 ns packet was an "error". PTP has probably started to slew your time towards +100 ns and will need to slew your time back. Knowing how freq-adj converts to ns in this case allows you to know by how much your clock was inaccurate.

    3. Completely not confident :-) PTP will not report offset that it itself doesn't see, and I'm sure there's "hidden" delay in all sorts of places in this test infrastructure, as none of the layer 1 gear is PTP aware at the moment.

      This holds true for frequency adjustment as well, because PTP will only slew the clock based on the offset that it thinks it knows about. Any freq-adj values are made in response to the calculated offset (even if this offset is wrong). So if we stick with your first scenario - a sudden +100ns shift in the signal that stays constant - using a P/I controller the actual deltas of offset when graphed might look something like this:

      The freq-adj would be a similar-ish shape but opposite; PTP should be slowing down the oscillator to compensate for it thinking that the clock is now 100ns ahead of the master. When the offset is detected as being 0ns again, the freq-adj should be stable as well. The time it takes for PTP to be in sync is the same time as the offset was not 0ns, which is the same time freq-adj was not stable. Since both the offset and freq-adj numbers come from within PTP, both are as trustworthy as each other, so using either offset or freq-adj to figure out how long before you are back to 0ns should be the same, in theory. Plugging in a magic "perfect" clock to compare with might show something very different, of course.

      Are you trying to figure out the offset at any arbitrary point in time? To explain the question...

      You'll only get as many data points as your PTP sync interval, so let's assume once a second, and the pulses land on the second boundary (.000000). Imagine the second sync pulse of +100ns, the next packet after the first time PTP adjusted the frequency. Let's assume PTP now thinks it is +60ns out, so in the second that's gone past it has made a certain frequency adjustment, and that adjustment has corrected 40ns of drift.

      However you don't care what the offset is at 1.00000, you want to know what the offset was at 0.43716, so somewhere between +100ns and +60ns. Is that why you want freq-adj to nanos, so you can retrospectively figure out what the clock offset was *between* the frequency adjustments?

    4. The "why" is a bit academic. We seem to get about +/-100ns of accuracy, but with occasional blips to +/-200 to 300ns. When that happens, I want to understand how sfptpd is correcting (and whether that correction is consistent across the servers).

      Thought you might be interested - the following is what Solarflare support tells me:

      "The frequency correction values are in parts-per-billion (PPB). The clock frequency doesn't matter, but the timescale is significant. One way to think of the parts-per-billion value is as a ns/s (nanoseconds per second) slew rate. It is typical for there to be a "base" correction to the oscillator to make its frequency in the correct range. Your freq-adj of around -640PPB = -0.64PPM (parts-per-million) is well within the 4.6PPM Free Run frequency stability for a Stratum 3 oscillator. Across these two readings, the freq-adj for phc0 is modified by -638.674 - (-640.160) = 1.486 ns/s so the slew rate to keep everything in sync is very modest."

      We record sfptpd stats every second, so any ns/s values get updated every second, which means that for me they show the ns slew applied.