Thursday, December 24, 2015

Solving MiFID II Clock Synchronisation with minimum spend (part 4)

This post is a continuation from Part 3, where we started on collecting metrics from our PTP Bridge and clients. We started graphing NTP offset as a proof of concept and noticed data that could indicate a problem with the accuracy of PTP, but we need more information. We're now diving right in where we left off - collecting PTP metrics.


Collecting PTP Metrics

We've got NTP Offset data into InfluxDB, but what we really want is data from PTP. I'll start with the PTP Bridge. In Part 2 of this series we used Puppet to set up various linuxptp processes running under supervisord, and we specified that STDOUT was written to specific log files under /var/log/linuxptp. We're interested in the stats from the ptp-master instance of ptp4l; this is the Slave Clock that receives Layer 2 PTP from the S300 Grandmaster Clock. A single line of stats output looks like this:

ptp4l[2089949.676]: master offset         38 s2 freq  +34476 path delay     35290

The three interesting stats here are offset (38), frequency adjustment (+34476) and path delay (35290). If you look at the output from a phc2sys process, the format is almost the same:

phc2sys[2090397.362]: phc offset         2 s2 freq  -12406 delay   2416

As mentioned before, we chose Collectd as a stats gatherer because it has the ability to tail log files. Since the files are almost identical, we should be able to create a Defined Type in Puppet that writes a collectd Tail plugin configuration file and can be re-used for any ptp4l or phc2sys log file we want to collect stats from:
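
Stripped down to the essentials, the Defined Type looks something like the sketch below (the resource, file and type names are illustrative, and join() comes from puppetlabs-stdlib; the real thing is in the embedded Gist):

define profile::collectd::linuxptp_tail ($logfile) {
  # One collectd Tail <File> block per linuxptp/phc2sys stats file. The
  # regexes are POSIX extended expressions (regex(7)), not PCRE, and
  # DSType "GaugeLast" turns the single new log line found on each 1 second
  # collectd interval into one gauge data point.
  $conf = join([
    '<Plugin "tail">',
    "  <File \"${logfile}\">",
    "    Instance \"${name}\"",
    '    <Match>',
    '      Regex "offset *(-?[0-9]+) "',
    '      DSType "GaugeLast"',
    '      Type "ptp_offset"',
    '      Instance "offset"',
    '    </Match>',
    '    <Match>',
    '      Regex "freq *([-+][0-9]+) "',
    '      DSType "GaugeLast"',
    '      Type "ptp_frequency"',
    '      Instance "frequency"',
    '    </Match>',
    '    <Match>',
    '      Regex "delay *([0-9]+)$"',
    '      DSType "GaugeLast"',
    '      Type "ptp_path_delay"',
    '      Instance "path-delay"',
    '    </Match>',
    '  </File>',
    '</Plugin>',
  ], "\n")

  file { "/etc/collectd.d/tail_${name}.conf":
    ensure  => file,
    content => "${conf}\n",
    notify  => Service['collectd'],
  }
}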

There are a lot of lessons learned in the above code that need explaining. First, the type in each one of the matches is one of our custom PTP collectd types. The dstype parameter has to do with how collectd accumulates values. A Tail plugin matcher is designed to read a chunk of a file and accumulate its contents into one metric. For example, read /var/log/maillog every minute and count how many emails were sent. It's not designed to generate a metric for every line of a file, which is what I want to do.

It can be made to do this by changing the collectd interval to be as fast as the PTP interval - in my case every 1 second - and setting DSType to "GaugeLast". This will have collectd wake up every second, tail the log file (which should contain just one new line since the last time it woke up), and then use the value it finds as a Gauge value.

Finally, the collectd Tail Plugin regular expressions are compatible with the regex library on the machine as described in the man page regex(7), which is not the same as Perl. I'm mentioning this because I spent almost a day debugging Perl-like regular expressions in collectd, only to find out that I misread the collectd documentation about what Regex engine was used under the hood (I was trying to use PCRE character classes; \w, \s, etc).

The way Collectd is designed to always aggregate data highlights that perhaps it is not the best tool for this job. A better approach might be to write a log file tailer script in Python and send metrics to InfluxDB using its own wire format (not the collectd wire format). Collectd will do for now though, and I'll slowly build up a list of reasons to move away from it.

Now if we go back to the Puppet Profile for the PTP Bridge, we can use the above collectd tailer Defined Type for every linuxptp instance log file:
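
Roughly, that means one tailer resource per stats-producing process (a sketch; the log paths are the supervisord STDOUT files from Part 2):

profile::collectd::linuxptp_tail { 'ptp-master':
  logfile => '/var/log/linuxptp/ptp-master.log',
}

profile::collectd::linuxptp_tail { 'phc2sys_multicast':
  logfile => '/var/log/linuxptp/phc2sys_multicast.log',
}

profile::collectd::linuxptp_tail { 'phc2sys_system':
  logfile => '/var/log/linuxptp/phc2sys_system.log',
}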

You may remember that we have four linuxptp processes running under supervisord, but we have only three collectd tailers here. The fourth process is the multicast ptp4l instance (the one that multicasts to our PTP Slaves through a firewall); it is configured as a Master Clock and so doesn't log statistics, so there's no point tailing its log file.

Let's look at our data in InfluxDB:


If you want to copy it, the query is:

SELECT value from "tail_" WHERE "instance" = 'ptp-master' and "type_instance" = 'offset' AND time > now() - 10s

InfluxDB 0.9 introduces "tags" to namespace sets of data points. In the query above we are filtering on tag "instance" having the value of 'ptp-master' and tag "type_instance" being 'offset'. These tags and their values are translated by InfluxDB from the collectd wire format. If you compare what tags are being created to the Puppet code, you can see the relationship.

The name of the measurement, tail_, is not the most descriptive. InfluxDB's translation of the collectd tail plugin data leaves a lot to be desired. A better way to structure this data in InfluxDB would be to have a measurement called "offset" (because that's what we're actually measuring) and tags of "instance" and "host". This is another +1 to dropping collectd and writing our own tailer script so we can write to InfluxDB in a better way. Anyway, it's working so let's soldier on.

Graphing PTP Metrics

Let's draw the ptp-master instance's offset for a small window of time:


The query for this graph is:

SELECT mean("value") AS "value" FROM "tail_" WHERE "instance" = 'ptp-master' AND "type_instance" = 'offset' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

You may wonder why I've put the value>0 and value<0. It is a workaround for this bug, which is actually a Go bug. To summarise, the collectd Tail plugin is writing "NaN" (Not a Number) into InfluxDB. It appears to do this on startup, when there are no metrics to read from the files. When you create a simple InfluxDB query in Grafana that returns these NaNs, you trigger that JSON error and the graphs don't draw. Since NaN is not less than zero nor is it greater than zero (because it's not a number), you can effectively "filter out" the NaNs by adding to the WHERE clause. This is the reason why I will always write the full InfluxDB query in Grafana, rather than use the query builder form - you can't specify brackets or OR statements using Grafana's query builder.

The data in this period says we're getting ±100-250ns maximum offset. I'm pretty happy with that result - it means that to receive PTP from our S300 Grandmaster Clock we're losing at most 1/4 of a microsecond in accuracy. That's a good start, but there are plenty more hops to go.

If we draw the ptp-master path delay we can see the time it takes for our PTP packets to cross the switch:


The query is very similar, only the type_instance value needs to change:

SELECT mean("value") AS "value" FROM "tail_" WHERE "instance" = 'ptp-master' AND "type_instance" = 'path-delay' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

Between 35.25μs and 35.3μs. This may seem alarmingly high at first but remember that the value of path delay is not important to PTP, as long as it is consistent. The variance here, ±50ns, is not too bad.

(Not) Explaining the Negative NTP Offset

Hopefully you remember I stopped the previous post after graphing the NTP Offset of the PTP Bridge (which is the value of the "ntpdate -q" command) and that the data we got threw into question whether PTP or NTP was working correctly.

Here's the data for the same time range as the graphs above:


I said that if we looked closely at the metrics and the network architecture, I'd be able to explain the difference. When I wrote that I had convinced myself that the NTP Offset discrepancy was caused by the time it takes to cross the switch, which we measure as the ptp-master Path Delay.

If we assume the clocks are perfectly synchronised by PTP and we know that it takes 35μs to cross the switch, the timestamp in the NTP packet would be 35μs behind. Add a little bit of operating system jitter... That explains our -25μs NTP offset, right?

... Not really. If the NTP packets were delayed by 35μs then our PTP Bridge would be 35μs ahead, not behind. Also, the assumptions about the network architecture are incorrect; the NTP packets do not just cross a switch.

Here is a revised diagram of the network architecture from Part 2 showing the path that NTP takes:



The NTP packets flow from the PTP Bridge through the switch to the firewall, then back onto the same switch to the S300. I also said that ntpdate was a "one-time" query with no smarts built in to calculate network delays - this is also incorrect (lesson learned: don't blindly trust what you read on Stack Exchange). The ntpdate man page says it makes several requests to a single NTP server and attempts to calculate the offset, so it's not a "dumb" time query; it is trying to figure out its own delay and compensate for it - explaining the NTP offset on the PTP Bridge just got a lot harder.

The ntpdate command reports a delay of 25.79 milliseconds. This seems way too high. The time to ping the NTP interface on the S300 is only 0.2ms. The delay to other NTP servers in the same network is around 25ms as well, so this could be a property of how ntpdate works, or of the NTP server software. I seem to have more questions than answers at the moment, so let's look at gathering the stats from two PTP Clients and see if that enlightens anything.


Collecting sfptpd Metrics

The Puppet work we've done earlier for the PTP Bridge can be mostly re-used for our PTP Clients. The NTP Offset collection is exactly the same, and the sfptpd stats file is similar in format to linuxptp's, so we've pretty much got the regular expressions we need. There are two things that make it more complicated though: how sfptpd writes to disk, and how sfptpd synchronises all Solar Flare adapter clocks.

The first problem has to do with sfptpd's output. The linuxptp processes are line buffered, so that works nicely with our collectd tailer that reads a new line from the file every second, and we never miss a data point. sfptpd is block buffered, so you get a disk write every 5-10 seconds and it's almost never at the end of a line. Collectd's Tail plugin will read the latest data from the file, but this will be several lines of output. Due to the way the Tail plugin aggregates, it will only ever send one data point even if it read 10 lines.

The issue really lies with collectd's Tail plugin, which we talked about before. We can make sfptpd behave in a way that fits with collectd though; we can run the stdbuf(1) command beforehand to force sfptpd's STDOUT to be line buffered. I can do this by modifying the sfptpd init script, or by running sfptpd with supervisord.

I've gone with the second option because I already use supervisord for the linuxptp processes, and I think modifying the init script is "more dirty". Here is the modified Puppet code for sfptpd:
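
In sketch form it's just a supervisord::program wrapping sfptpd in stdbuf (the binary path, config flag and log file location here are illustrative, and the parameter names mirror supervisord's own config keys):

# Stop the packaged init script and let supervisord own the process instead.
service { 'sfptpd':
  ensure => stopped,
  enable => false,
}

supervisord::program { 'sfptpd':
  # stdbuf -oL forces line-buffered STDOUT, so the stats land in the
  # supervisord log file one whole line at a time, every second.
  command        => '/usr/bin/stdbuf -oL /usr/sbin/sfptpd -f /etc/sfptpd.conf',
  autostart      => true,
  autorestart    => true,
  stdout_logfile => '/var/log/sfptpd_stats.log',
  require        => Service['sfptpd'],
}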


Nothing too hard there, the complicated bit is generating the collectd Tail Plugin config for the sfptpd stats file.

sfptpd will synchronise the adapter clocks of all Solar Flare adapters in the server, as well as the system clock. In the machines I'm using as PTP Clients a single second of stats logs looks like this:

2015-12-22 17:14:10.759688 [ptp-gm->system(em1)], offset: 348.000, freq-adj: 1444.090, in-sync: 1, one-way-delay: 57616.000, grandmaster-id: 0000:0000:0000:0000
2015-12-22 17:14:10.759688 [system->phc4(p5p1/p5p2)], offset: -227.562, freq-adj: -334.226, in-sync: 0
2015-12-22 17:14:10.759688 [system->phc5(p4p1/p4p2)], offset: -227.938, freq-adj: -261.973, in-sync: 0
2015-12-22 17:14:10.759688 [system->phc6(p3p1/p3p2)], offset: -223.562, freq-adj: 88.446, in-sync: 0


The first line is the "Local Reference Clock". The output "ptp-gm->system(em1)" shows that we are receiving PTP from a Grand Master using software timestamping on interface em1. The subsequent lines are Solar Flare adapters; there are three dual port cards, each with a single PTP Hardware Clock. The output "system->phc4(p5p1/p5p2)" shows that the adapter clock phc4 is being synchronised, which is the clock for interfaces p5p1 and p5p2.

The system line is the most interesting to me right now, but we should capture metrics for all cards as well. Specifying each Solar Flare adapter clock for every node would be very tedious. We also can't use Collectd's Tail plugin multiple times on the same file, so we have to build one big list of regular expressions covering everything we want to match, in a single Puppet Array. Here's how I've done it automatically.

First, we need a Facter Fact to "know" which interfaces are our Solar Flare adapters. Here is a Custom Fact that will create Facts for every network interface in a machine and what driver they are using. An example of these Facts on a PTP Client:

nic_driver_bonding => bond0,bond1
nic_driver_igb => em1,em2,em3,em4
nic_driver_sfc => p3p1,p3p2,p4p1,p4p2,p5p1,p5p2
nic_driver_virtual => dummy0,lo


... and here is the Fact itself. It comes from an internal "standard lib" Puppet module which is not available to the public:


Next we need a way of knowing what PHC devices belong to our Solar Flare adapters. Many other brands of network card will create PHC devices (/dev/ptpX) on recent kernels; we just want the Solar Flare ones. Here is a Fact from the lmaxexchange-linuxptp Puppet module:


The last thing we want to do is take a template of Collectd Tail Plugin Matches and substitute in the PHC devices we are looking for. We need a chunk of imperative programming logic to do this efficiently, something which Puppet 3.x is very poor at. I've written a Ruby function that substitutes and replicates a template hash against an array. The Gist below includes documentation and an example that explains it all:


Now I put all these things together in a Puppet class to create the Collectd Tail Plugin config file for our sfptpd stats file:
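
As a heavily simplified illustration of where this ends up - with the Match blocks for the 'system' clock and a single PHC written out by hand and only the offset field captured, rather than generated from the Facts and the function above - the class boils down to something like:

class profile::collectd::sfptpd_tail {
  # Heavily simplified: the real class generates a Match block like these for
  # every Solar Flare PHC reported by the Facts, and for every stats field.
  $conf = join([
    '<Plugin "tail">',
    '  <File "/var/log/sfptpd_stats.log">',
    '    Instance "sfptpd"',
    '    <Match>',
    '      Regex "->system.*offset: (-?[0-9.]+),"',
    '      DSType "GaugeLast"',
    '      Type "ptp_offset"',
    '      Instance "system-offset"',
    '    </Match>',
    '    <Match>',
    '      Regex "->phc4.*offset: (-?[0-9.]+),"',
    '      DSType "GaugeLast"',
    '      Type "ptp_offset"',
    '      Instance "phc4-offset"',
    '    </Match>',
    '  </File>',
    '</Plugin>',
  ], "\n")

  file { '/etc/collectd.d/tail_sfptpd.conf':
    ensure  => file,
    content => "${conf}\n",
    notify  => Service['collectd'],
  }
}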


The resulting file is large and repetitive; you can see an example from one of my PTP Clients here. With the above Puppet class applied we are now gathering PTP stats on the test clients.

All those code examples have made this a long post. The next post will be drawing graphs and analysing data.

Wednesday, December 16, 2015

Solving MiFID II Clock Synchronisation with minimum spend (part 3)

In my last post we created a "PTP Bridge" which translates from one Precision Time Protocol transport mechanism to another in order to distribute precise time to a PTP client. I'm doing this to try to solve specific clock synchronisation requirements that are coming in new regulations. You can read about the regulations, my limitations, and subsequent design ideas in the first post in this series. This post will be half technical and half observation as I go into how I'm capturing statistics, and what we can see from the data.

We know that the design we've started with is inherently flawed. PTP is the most accurate when it runs on dedicated network infrastructure where every device in the network is PTP aware. Every device in the network path that's not PTP aware adds some amount of variable delay. Switches will add some, software firewalls and routers will add more.

There are a few reasons we started with a flawed design. The first is that the effort involved was low - multicasting over a network that already exists required only a small number of configuration changes. We only had to build one new server and attach half a dozen network cables to get up and running. Expenditure is also a big factor - we haven't had to buy anything new so far, we've just used what we've got on hand. It's also a nicely Agile-ish approach - let's get our hands dirty with PTP early, learn what we need to learn and make small improvements as we need to. After all, we are trying to hit a specific level of clock sync accuracy by 2017; it's going to be an iterative improvement process over a period of time.

When you adopt a design you know will be "bad", soon you ask yourself a very important question...

How bad is bad?

How slow is that firewall? How much time is lost through that switch? Very good questions that are answered with statistics. Before we get into that though there is another important point that I haven't talked about so far.

An amendment to Regulatory Technical Standard 25 (RTS 25) in the Markets in Financial Instruments Regulation (MiFIR) talks about venues documenting the design of their clock synchronisation system, and specifications showing that the required accuracy is maintained. What this means is that we need to be able to prove to any auditors that we stay within ±100μs of UTC at all times. To do this we have to monitor our system clocks and how accurate PTP is.

So, by answering the question "how bad is bad" we're taking early steps towards our requirement to prove we can maintain the accuracy we need to. Enough theory for now, let's get technical.

PTP Metrics

We are using two software implementations of PTP - ptp4l and sfptpd. Both output the standard PTP statistics. ptp4l is very basic and can only write to STDOUT, while sfptpd can write to STDOUT, a file, or syslog. The three main PTP statistics are offset, frequency adjustment and path delay.

Offset is how far the PTP protocol has calculated the Slave Clock to be from the Master Clock. Frequency Adjustment is how much the oscillator of the clock is adjusted to try to make it run at the same rate as the Master Clock. Path Delay is how long it takes for a PTP packet to get from Master Clock to Slave Clock (or vice versa, because PTP assumes they are the same).

The log formats of the two daemons are not exactly the same, but they are very similar. If we can get something to constantly read from the files and parse metrics from them, then we should be able to handle both ptp4l and sfptpd in an almost identical manner.

Capturing, Storing and Displaying Metrics

Historically it's common to store server metrics in Round Robin Database (RRD) files, and for most metrics this is perfectly fine. After all, most metrics get less important the longer ago in time they were recorded. My use case is a little different though. RRD's smoothing-over-time effect is not helpful if you are trying to compare a day six months ago with yesterday at per-second granularity; the detail just gets normalised away. The smoothing effect is also not appropriate for proving compliance at any point in time over the past year. So I want to avoid RRD backends for the moment, which also rules out Graphite's Whisper/Carbon backend, which is similar in concept to RRD.

I've used an early version of InfluxDB before and found it reasonably intuitive. We also use it in a few places elsewhere in the company. InfluxDB seems to change data storage engines every phase of the moon, and what should be harmless RPM upgrades have corrupted data stores here in the past. From version 0.9.5 InfluxDB uses a storage engine they developed themselves. Hopefully it remains stable, but even if it doesn't, we're only playing around at the moment and we can change storage backends if need be.

There are a number of ways to send data into InfluxDB, and it can run as a collector for various different wire formats as well. One of these is collectd, which is a popular metrics collection system with a huge library of Plugins. Collectd has a "Tail" plugin which reads from a file and applies regular expressions in order to gather metrics - that sounds like it can extract numbers from the ptp4l and sfptpd stats files for me.

For displaying data I've used Graphite in the past but I'm really taken with Grafana, which I find much prettier and modern. Grafana also works with InfluxDB 0.9.

Adding Statistics Collection

For lack of a better place right now I'm going to send my time metrics to the PTP Bridge server. So it is both the Master Clock and the source of record for how accurate all its Slave Clocks are. Until I'm happy that Grafana and InfluxDB are the right choices I'm not going to code their setup into Puppet just yet; I've just installed the influxdb and grafana RPMs by hand. In hindsight this was a good move - I've upgraded the InfluxDB RPM twice now, and it has moved data and config files each time. Back-porting these constant changes into Puppet would have annoyed me.

I am using a puppetlabs-apache module to proxy Grafana:
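
Something along these lines (a sketch; the vhost name is made up, and I'm assuming Grafana is listening locally on its default port 3000):

include ::apache

apache::vhost { 'grafana.example.com':
  port       => 80,
  docroot    => '/var/www/html',
  proxy_pass => [
    { 'path' => '/', 'url' => 'http://localhost:3000/' },
  ],
}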




You'll notice there's only a non-SSL Apache vhost here. I ran into issues proxying InfluxDB's API over SSL. In the interest of getting some results on screen I've allowed my browser to simply fetch data from InfluxDB directly on its default port 8086. The problem was a certificate / mixed content issue so it's probably solvable; I'll come back to it later.

Turning our PTP clients into statistics generators is pretty simple; the puppet-community-collectd module has all of the features I've needed so far. Here is a profile to install collectd on a server and add some extra collectd types that are more relevant for PTP than the standard types:
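
In sketch form (the custom type names - ptp_offset, ptp_frequency and ptp_path_delay - and the file locations are my own naming; the real profile leans on the puppet-community-collectd module's classes):

class profile::collectd::ptp {
  include ::collectd

  # PTP-friendly units: offsets and path delay in nanoseconds, frequency
  # adjustment in parts per billion. Path delay can never go below zero.
  file { '/usr/share/collectd/ptp_types.db':
    ensure  => file,
    content => "ptp_offset\tvalue:GAUGE:U:U\nptp_frequency\tvalue:GAUGE:U:U\nptp_path_delay\tvalue:GAUGE:0:U\n",
  }

  # Once TypesDB is overridden, the default database must be listed as well.
  file { '/etc/collectd.d/ptp_types.conf':
    ensure  => file,
    content => "TypesDB \"/usr/share/collectd/types.db\" \"/usr/share/collectd/ptp_types.db\"\n",
    require => File['/usr/share/collectd/ptp_types.db'],
    notify  => Service['collectd'],
  }
}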


Collectd has native types - time_offset, frequency_offset and delay - that are almost appropriate to use for PTP but not quite, so I've created my own. PTP Offset is always measured in nanoseconds rather than collectd's time_offset which is in seconds. PTP Frequency is measured in Parts Per Billion (ppb), not frequency_offset's Parts Per Million (ppm). PTP Path Delay is also measured in nanoseconds, and should never be negative so the minimum value is zero.

The Annoying Thing About Time

We're working towards collecting PTP statistics, which is great, but how do we know that PTP is actually doing its job right? It would be nice to be able to compare the clocks that PTP is disciplining to some other time source so we can see how accurate it is. A digital version of watching your wrist watch's second hand tick around and comparing it to the clock on the wall, so to speak.

This is actually a very difficult thing to do, made even harder because we're working with such small units of time. What we would do in an NTP environment is run "ntpdate -q TIMESOURCE" and check the drift is acceptable. PTP is supposed to be more accurate than NTP though, so this is not going to give us the level of depth we want. The PTP protocol does output statistics about its accuracy, but these can only be trusted if it calculates its path delay correctly. If there's a device in our network that's introducing a consistent one-sided delay, the PTP algorithm will not be able to correct for it and we'll end up with a clock that's slightly out of sync - and PTP will never know.

The only way to test PTP accuracy is to compare it to a second reference clock that you trust more. MicroSemi have a good paper on how you would go about doing this. In short, you plug a 1 Pulse-Per-Second (1PPS) card into your PTP slave and compare the PTP card to the 1PPS card.

I don't have a 1PPS card (yet...), so collecting NTP Offset while we're messing around with PTP is better than nothing. NTP running as a daemon has some clever math to improve its accuracy over time. ntpdate as a one-shot command doesn't have that, so it will be subject to all sorts of network delays. So I'm not expecting it to be very accurate or usable as a trusted reference clock, but at the very least it will make a good starting point for our metrics collection.

Collecting NTP Offset

Now to test we can gather metrics correctly with collectd. I've created a simple collectd Exec plugin that executes a Bash script in an endless loop, outputting the offset from a given NTP server in the correct collectd format. The NTP server I'll be querying is the same Symmetricom S300. This is the Puppet class:

The Bash script is pretty easy - I translate the output from seconds to microseconds because I think it's better to visualise (fewer decimal places).

We also need a bit of collectd configuration to do something with our statistics. When debugging it's helpful to quickly add the CSV plugin, which writes to disk locally. We want to send our data to the InfluxDB server, which you can configure to accept the collectd format. Here's a simple Puppet class to send data to a remote server using the collectd Network plugin:
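
In essence (the hostname is illustrative; 25826 is the default port for collectd's network protocol, which is what InfluxDB's collectd listener speaks):

class profile::collectd::network (
  $influx_host = 'ptp-bridge.example.com',
) {
  file { '/etc/collectd.d/network.conf':
    ensure  => file,
    content => "<Plugin network>\n  Server \"${influx_host}\" \"25826\"\n</Plugin>\n",
    notify  => Service['collectd'],
  }
}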

We can now add the above components to the PTP Bridge Puppet Profile:

After a Puppet run on the PTP Bridge server I can see my first InfluxDB Measurement "remote_ntp_offset" being created and storing data points:



The name of the measurement in the screenshot above is confusing. It gets constructed from the plugin instance remote_ntp_offset that I wrote, plus the unit of measurement of the type I used. I chose time_offset, which is in the default collectd types database as being measured in seconds. My original script worked in seconds but I didn't like the long decimal places, so I translated the script output to microseconds but didn't change the collectd type - I'll fix it later.

Graphing NTP Offset

We have data in InfluxDB, now let's visualise this in Grafana. I find the Query Builder editor mode doesn't work for me very well; the JavaScript has issues (clicking on certain controls doesn't work) and you just can't represent some features of the InfluxDB query language using the form (like an OR statement). Below I show the pure InfluxDB query to draw the remote NTP Offset for the PTP Bridge server:


If you want to copy it, the InfluxDB query is:

SELECT mean("value") AS "value" FROM "remote_ntp_offset_seconds" WHERE "host" = 'FQDN' AND $timeFilter GROUP BY time($interval)

First thing to note: I have obfuscated the "host" tag value - in the real query it's the server's actual FQDN, not the literal string 'FQDN'. We're selecting mean("value") and grouping by a dynamic time($interval). This is a common pattern that Grafana uses so that you don't request every single data point over a large period of time and grind your browser to a halt. The $interval is worked out from whatever $timeFilter is applied to the graph. Grafana has a very easy to use time range selector and zoom button in the top right corner, so the $timeFilter and $interval change each time you "move" the graph to a different time range. If you do want to see every single data point, it's simple enough to modify the query, but be careful not to zoom out too far.

The astute among you will also note that the actual values in the graph above are pretty strange. It shows that the PTP Bridge is -25 ±5 μs behind the GPS timesource. A quick glance at the log file of the ptp4l instance that is the Slave Clock of the PTP Grandmaster (the same time source as NTP) shows that we barely drift over a microsecond of offset from the Master Clock, and most PTP Sync messages are offset by only ±100 nanoseconds.

So which is correct? PTP or NTP? Which do I trust more? If only we had a perfect reference clock plugged in to the server for comparison... Even without a fully trusted reference clock, we can still deduce with reasonable confidence which of NTP or PTP is more correct. To do this we will first gather more data and look at the PTP metrics from both our PTP Bridge and PTP clients, and then take a close look at the network architecture to see if we can make some reasonable conclusions.


This post has already gotten very long, so I'll continue with PTP metrics collection and what we can infer from the data in the next post.

Saturday, November 28, 2015

Solving MiFID II Clock Synchronisation with minimum spend (part 2)

(Due to the embedded github.com Gists this post looks best in Blogger)

In my previous post I talked about the new legislation coming to the European Financial sector in the form of MiFID II, and specifically about the Clock Synchronisation requirements. We touched on the Precision Time Protocol (PTP) and how its accuracy is affected by various factors. I gave my opinion on various Linux software implementations, and I proposed a few designs that may make us MiFID II compliant.

The approach I am starting with is the easiest - it requires very little physical work and we don't have to go out and buy special PTP hardware. It makes use of what I call a "PTP Bridge", which translates PTP time from our time source and distributes it around our network. We will be Multicasting PTP through a firewall in order to reach all of the servers we need to. We know from the PTP theory that this is not a very optimal design, but as I said, it's the quickest and easiest to start off with.

In this post we will get into the technical details on building the PTP Bridge and attaching a single client. The majority of the configuration samples will be in Puppet code - who builds things by hand any more?

In future posts we'll go in to measuring the design's accuracy and either improving upon it or trying different designs.

PTP Bridge + Firewall Architecture

To re-cap very quickly, I have at my disposal a MicroSemi SyncServer S300 as a GPS time source. It only supports PTP using Unicast Signaling or the 802.3 (Ethernet / Layer 2) transport. We will use the linuxptp software to consume the L2 PTP from the S300 and to multicast it out to our PTP clients through a firewall. Our PTP clients will run SolarFlare's sfptpd daemon to consume multicast PTP. The previous post covers these design choices. The design looks roughly like this:


The S300 only broadcasts L2 PTP from a single interface (LAN2). We have created a dedicated VLAN just for this purpose, and our PTP Bridge will be on this VLAN as well. For the moment it will be the only other device attached, but in the future we may have other devices here as well.

The PTP Bridge is a CentOS 6 server with a bonded management interface for standard Linux services (SSH, etc). Since linuxptp does not support bonded interfaces, we need a separate dedicated interface to connect to the S300. We will also need another separate interface to multicast PTP traffic to the firewall.

The firewall is configured for IGMPv3 and has the necessary configuration to allow the PTP Bridge and PTP clients to join the standard PTP Multicast group - 224.0.1.129 as defined in IEEE 1588.

The sfptpd daemon can work on Bonded interfaces so all of our PTP clients should simply need to specify the management interface to receive PTP from and it should "just work".

Building A PTP Bridge

The interaction between the different hardware and software components in the PTP Bridge can be a little confusing; there are a lot of moving parts. Here's a colorful drawing to help out. What each component does and why is explained below, inline with the Puppet code that creates it.



I've written a Puppet module to manage the linuxptp software. It does not support configuring every single thing possible right now; I'm adding functionality to it as I need it. I'll definitely take Pull Requests for added functionality.

The init script that comes with the Red Hat RPM manages ptp4l and phc2sys as a single service, but for our PTP Bridge we need to run multiple instances of both. We have to disable the normal linuxptp services and use supervisord instead. I find the ajcrowe-supervisord module the most functional:


I need one ptp4l process configured in Layer 2 mode to be a slave of the S300 Grandmaster Clock, and we need to know what interface to have this ptp4l bind to. The linuxptp::ptp4l defined type takes care of writing the correct configuration file for me. I then use a supervisord::program to keep this ptp4l program running with the generated configuration file, also specifying that it writes to STDOUT rather than syslog so supervisord can handle the logging:
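
Boiled down, and with the config file written out by hand here instead of via linuxptp::ptp4l, it looks something like this sketch (interface names and paths are from my environment; slaveOnly, network_transport and uds_address are real ptp4l options):

file { '/etc/linuxptp/ptp-master.conf':
  ensure  => file,
  content => "[global]\nslaveOnly 1\nnetwork_transport L2\nuds_address /var/run/ptp4l-ptp-master\n",
}

supervisord::program { 'ptp-master':
  # -f config, -i interface facing the S300, -m log to STDOUT so supervisord
  # captures it in the log file below.
  command        => '/usr/sbin/ptp4l -f /etc/linuxptp/ptp-master.conf -i em1 -m',
  autostart      => true,
  autorestart    => true,
  stdout_logfile => '/var/log/linuxptp/ptp-master.log',
  require        => File['/etc/linuxptp/ptp-master.conf'],
}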


I now need a second instance of ptp4l to do our multicasting to the firewall, specifying a different interface to send multicast PTP out of:
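
The multicast instance is almost identical as a sketch - a different interface, the default UDPv4 transport and no slaveOnly:

file { '/etc/linuxptp/multicast.conf':
  ensure  => file,
  content => "[global]\nuds_address /var/run/ptp4l-multicast\n",
}

supervisord::program { 'multicast':
  # p4p3 faces the firewall; the default transport is UDP IPv4 multicast.
  command        => '/usr/sbin/ptp4l -f /etc/linuxptp/multicast.conf -i p4p3 -m',
  autostart      => true,
  autorestart    => true,
  stdout_logfile => '/var/log/linuxptp/multicast.log',
  require        => File['/etc/linuxptp/multicast.conf'],
}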

This takes care of the daemons handling the PTP protocol. If both the network interfaces shared the same PTP Hardware Clock (PHC) device then this would be all we need, but in my PTP Bridge server the network interfaces have separate clocks. The linuxptp Puppet module contains a Facter Fact that will map network interface to PHC device:

[root@ptp-bridge ~]# facter -p phc
{"em1"=>"ptp4", "em2"=>"ptp5", "p4p1"=>"ptp0", "p4p2"=>"ptp1", "p4p3"=>"ptp2", "p4p4"=>"ptp3"}


Or "ethtool -T " will tell you.

This means I have to use linuxptp's phc2sys program to synchronise the time on the interface connected to the PTP Master with the interface that's sending the Multicast. phc2sys does not have a configuration file, only command line arguments, so we only need a supervisord::program, specifying the Master interface as the master clock, and the multicast interface as the slave clock:
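
Something like this (a sketch; -s is the master clock, -c the clock to discipline, and -w waits for ptp4l, found via the -z socket, and picks up the UTC offset from it):

supervisord::program { 'phc2sys_multicast':
  # Source is the PHC behind em1, which ptp4l is steering from the S300;
  # sink is the PHC behind the multicast interface.
  command        => '/usr/sbin/phc2sys -s em1 -c p4p3 -w -m -z /var/run/ptp4l-ptp-master',
  autostart      => true,
  autorestart    => true,
  stdout_logfile => '/var/log/linuxptp/phc2sys_multicast.log',
}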

This takes care of everything PTP. We should probably synchronise the Linux System clock as well though, as nothing will be doing that by default. A fourth phc2sys supervisord::program is used:
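
And the system clock one (CLOCK_REALTIME is phc2sys's default sink, shown explicitly here for clarity; again a sketch):

supervisord::program { 'phc2sys_system':
  command        => '/usr/sbin/phc2sys -s em1 -c CLOCK_REALTIME -w -m -z /var/run/ptp4l-ptp-master',
  autostart      => true,
  autorestart    => true,
  stdout_logfile => '/var/log/linuxptp/phc2sys_system.log',
}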

You'll notice the '-z' argument to phc2sys - this option comes in with linuxptp version 1.5, but I recommend at least version 1.6 (more on this later). The flag is used to specify the ptp4l socket and is necessary to differentiate when you have multiple ptp4l processes on one box. The linuxptp::ptp4l defined type will configure distinctly named socket files.

linuxptp-1.6 is available in Fedora 24. I was able to take the linuxptp-1.5 RPM Spec file and just replace the source tarball and test suites. Here is the head of the file showing the test suite  and clknetsim GitHub hashes I used:
 
Now that I have all the components, I can run Puppet and I get the four supervisord programs I expect started up:

[root@ptp-bridge ~]# supervisorctl status
multicast                        RUNNING    pid 43206, uptime 0:01:57
phc2sys_multicast                RUNNING    pid 43207, uptime 0:01:57
phc2sys_system                   RUNNING    pid 43205, uptime 0:01:57
ptp-master                       RUNNING    pid 43208, uptime 0:01:57


I can see the PTP Master ptp4l process is synchronising with the S300 time source:

[root@ptp-bridge ~]# tail -f /var/log/linuxptp/ptp-master.log
ptp4l[874317.443]: selected /dev/ptp3 as PTP clock
ptp4l[874317.459]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874317.459]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874318.039]: port 1: new foreign master 000000.0000.000000-1
ptp4l[874322.038]: selected best master clock 000000.0000.000000
ptp4l[874322.038]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[874323.038]: master offset      35389 s0 freq  +34540 path delay         0
ptp4l[874324.037]: master offset      35415 s1 freq  +34566 path delay         0
ptp4l[874325.037]: master offset      -2372 s2 freq  +32194 path delay         0
ptp4l[874325.037]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[874326.038]: master offset        -85 s2 freq  +33769 path delay         0


I can also see the Multicast ptp4l process has gone into the Master state:

[root@ptp-bridge ~]# tail -f /var/log/linuxptp/multicast.log
ptp4l[874317.439]: selected /dev/ptp2 as PTP clock
ptp4l[874317.439]: port 0: hybrid_e2e only works with E2E
ptp4l[874317.440]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874317.440]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874324.062]: port 1: LISTENING to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[874324.062]: selected best master clock 001b21.fffe.6fa06c
ptp4l[874324.062]: assuming the grand master role


The server has joined the PTP Multicast group:

[root@ptp-bridge ~]# netstat -ng | grep -P '224.0.[01].(107|129)'
p4p3            2      224.0.0.107
p4p3            2      224.0.1.129


And if I snoop this network I can see it Multicasting out packets:

[root@ldprof-live-ptpb01 ~]# tcpdump -i p4p3 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p4p3, link-type EN10MB (Ethernet), capture size 65535 bytes
14:06:13.447594 IP 192.168.0.1.319 > 224.0.1.129.319: UDP, length 44
14:06:13.447657 IP 192.168.0.1.320 > 224.0.1.129.320: UDP, length 44


We should check that our PHC clocks and system clock are being synchronised correctly as well:

[root@ptp-bridge ~]# tail -n1 /var/log/linuxptp/phc2sys_multicast.log
phc2sys[875054.546]: phc offset       135 s2 freq  +34671 delay   5023


[root@ptp-bridge ~]# tail -n1 /var/log/linuxptp/phc2sys_system.log
phc2sys[875099.526]: phc offset        39 s2 freq  -12380 delay   2379


That takes care of everything on the PTP Bridge for now.

Slave Clock with sfptpd

Our PTP Bridge appears to be sending out Multicast fine; now I want to get another server consuming this Multicast using sfptpd. I have written a Puppet module for sfptpd as well.

We need to turn off NTP and configure sfptpd on the management interface that's connected to the firewall handling the PTP Multicast traffic. We'll also turn on stats logging to a file:
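
Roughly speaking (the class parameters shown here are illustrative rather than the module's real interface - the point is: NTP off, PTP slave on the bonded management interface, stats to a file):

service { 'ntpd':
  ensure => stopped,
  enable => false,
}

class { '::sfptpd':
  interface => 'bond241',
  ptp_mode  => 'slave',
  stats_log => '/var/log/sfptpd_stats.log',
  require   => Service['ntpd'],
}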

If we start sfptpd, it looks like we have joined the multicast group correctly:

[root@ptp-client ~]# netstat -ng | grep -P '224.0.[01].(107|129)'
bond241         2      224.0.0.107
bond241         2      224.0.1.129

However, looking at the log we have a problem:

2015-11-26 14:46:16.058064: info: running as a daemon
2015-11-26 14:46:16.058326: info: creating PTP sync-module
2015-11-26 14:46:16.058498: info: PTP clock: local reference clock is system, PTP clock is system
2015-11-26 14:46:16.063963: info: using SO_TIMESTAMPNS software timestamps
2015-11-26 14:46:16.164341: notice: Now in state: PTP_LISTENING
2015-11-26 14:46:16.164510: info: creating NTP sync-module
2015-11-26 14:46:17.497587: info: New best master selected: 0000:0000:0000:0000(unknown)/1
2015-11-26 14:46:17.497640: notice: Now in state: PTP_SLAVE, Best master: 0000:0000:0000:0000(unknown)/1
2015-11-26 14:46:18.497152: info: received first Sync from Master
2015-11-26 14:46:19.689019: warning: failed to receive DelayResp for DelayReq sequence number 0
2015-11-26 14:46:20.751518: warning: failed to receive DelayResp for DelayReq sequence number 1
2015-11-26 14:46:21.751518: warning: failed to receive DelayResp for DelayReq sequence number 2
2015-11-26 14:46:21.751554: warning: failed to receive DelayResp 3 times in hybrid mode. Reverting to multicast mode.
2015-11-26 14:46:23.439019: warning: failed to receive DelayResp for DelayReq sequence number 3
2015-11-26 14:46:24.564018: warning: failed to receive DelayResp for DelayReq sequence number 4
2015-11-26 14:46:25.064020: warning: failed to receive DelayResp for DelayReq sequence number 5

The sfptpd daemon is not getting back any response to our Delay Request messages. After three attempts it drops out of hybrid mode and reverts to multicast mode, but still does not get any response.

sfptpd starts in hybrid mode first by default, which is a property of the "Enterprise" PTP Profile where Delay messages are unicast back to the Master Clock, rather than being multicast back. In large networks, every client multicasting back their Delay messages could get very noisy. After hybrid mode fails it falls back to the default multicast mode for Delay messages, but still does not get any response.

Snooping the interfaces on the PTP Bridge reveals an asymmetric routing problem. This is a side effect of our network architecture, our firewall, and using dedicated interfaces on the PTP Bridge - the PTP Bridge is sending the unicast Delay Responses out the wrong interface; they need to be sent out the same interface as the multicast packets. This is solved with policy based routing and some iptables rules on the PTP Bridge:
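
A sketch of the idea using puppetlabs-firewall and the RHEL initscripts rule-/route- files (interface names, the gateway address, the mark value and the table number are all illustrative):

firewall { '100 mark outbound PTP unicast':
  table    => 'mangle',
  chain    => 'OUTPUT',
  proto    => 'udp',
  dport    => ['319', '320'],
  jump     => 'MARK',
  set_mark => '0x13f',   # 319 decimal; the real rules mark inbound PTP too
}

# Policy routing: anything marked 319 uses table 100, whose default route
# points at the firewall on the multicast VLAN. Persisted via the RHEL
# initscripts rule-/route- files so it survives a reboot.
file { '/etc/sysconfig/network-scripts/rule-p4p3':
  content => "fwmark 0x13f table 100\n",
}

file { '/etc/sysconfig/network-scripts/route-p4p3':
  content => "default via 192.168.0.254 table 100\n",
}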


The Puppet code above creates firewall rules to match inbound and outbound PTP Unicast traffic and mark it with firewall mark "319". A network routing rule then says to look up routing table 100 for packets marked 319, and table 100 says the default gateway for these packets is the firewall on the other side of the Multicast interface.

Now let's try sfptpd again on a client:

2015-11-26 15:00:42.366823: info: running as a daemon
2015-11-26 15:00:42.367083: info: creating PTP sync-module
2015-11-26 15:00:42.367240: info: PTP clock: local reference clock is system, PTP clock is system
2015-11-26 15:00:42.372742: info: using SO_TIMESTAMPNS software timestamps
2015-11-26 15:00:42.473143: notice: Now in state: PTP_LISTENING
2015-11-26 15:00:42.473285: info: creating NTP sync-module
2015-11-26 15:00:43.517501: info: New best master selected: 0000:0000:0000:0000(unknown)/1
2015-11-26 15:00:43.517540: notice: Now in state: PTP_SLAVE, Best master: 0000:0000:0000:0000(unknown)/1
2015-11-26 15:00:44.517268: info: received first Sync from Master
2015-11-26 15:00:44.517383: info: clock system: applying offset 35.999998423 seconds
2015-11-26 15:01:20.554657: info: clock phc4: applying offset 35.999978085 seconds
2015-11-26 15:01:20.554792: info: clock phc5: applying offset 35.999979747 seconds
2015-11-26 15:01:20.554903: info: clock phc6: applying offset 35.999979123 seconds
2015-11-26 15:01:21.498046: info: ignoring DelayResp because offset from master not valid
2015-11-26 15:01:21.498084: info: received first DelayResp from Master

That appears to be working; we're not missing any Delay Response messages from the Master Clock. However, the server time is wrong - very wrong, in fact. It's over 30 seconds too fast. We can see that in the output above, where the clocks are stepped forward with a +36 second offset.

In the year 2015, 35 / 36 seconds is a "magic" number in PTP land - it is the difference between International Atomic Time (TAI) and Coordinated Universal Time (UTC). Let's go into a little more theory to understand what this is.

TAI does not account for the slowing of the rotation of the earth, whereas UTC does. As of 30th June 2015 the offset between the two is exactly 36 seconds (the result of the leap seconds applied to UTC), which means TAI is 36 seconds ahead of UTC. International Atomic Time (TAI) is the time standard that all PTP Clocks run in. They don't run in UTC. You can see this on the PTP Bridge if you query the interface PHC:

 [root@ptp-bridge ~]# date; phc_ctl /dev/ptp3 get
Fri Nov 27 12:46:36 UTC 2015
phc_ctl[956362.783]: clock time is 1448628432.159550826 or Fri Nov 27 12:47:12 2015



So the PTP wire protocol is in TAI, and it's the job of the PTP software implementation to translate from TAI to UTC by applying the UTC offset, which may or may not be communicated from the Grandmaster Clock.

Now back to our issue - we're 36 seconds out, which is a strong indication our system clock is being set to TAI time. There could be one of two things going wrong here:
  1. sfptpd is using Software Timestamping on ingress, which is in UTC, but the PTP packets are stamped with TAI. When comparing these timestamps sfptpd thinks the system clock is 36 seconds out.
  2. sfptpd and ptp4l are using the PTP Valid Offset flag in different ways, and so sfptpd is not applying the UTC offset.
There is an option in the default sfptpd.conf template called ptp_utc_valid_handling (it is not mentioned in the Advanced User Guide). I think the comments say it all, so I will paste them verbatim:

# Configures how PTP handles the UTC offset valid flag. The specification is
# ambigious in its description of the meaning of the UTC offset valid flag
# and this has resulted in varying different implementations. In most
# implementations, if the UTC offset valid flag is not set then the UTC offset
# is not used but in others, the UTC offset valid is an indcation that the
# master is completely confident that the UTC offset is correct. Various
# options are supported:
#    default  If UTCV is set use the UTC offset, otherwise do not use it
#    ignore   Do not used the UTCV flag - always apply the indicated UTC offset
#    prefer   Prefer GMs that have UTCV flag set above those that don't
#    require  Do not accept GMs that do not set UTCV

I can't get my hands on the IEEE 1588-2008 specification (you have to buy it) so can't read about the flag myself. I still don't know whether what I'm seeing is a software timestamping issue or UTC Offset Valid flag issue. It can be worked around though by telling sfptpd to always apply the UTC offset:
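
Which is a one-liner in the config (file_line is from puppetlabs-stdlib; the config file path is an assumption, and "ignore" is documented above as "always apply the indicated UTC offset"):

file_line { 'sfptpd always apply utc offset':
  path => '/etc/sfptpd.conf',
  line => 'ptp_utc_valid_handling ignore',
}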


What's Next?

We now have PTP time being translated from Layer 2 to Multicast through our PTP Bridge and being consumed on one of our servers. We know the current approach has design flaws that will affect the accuracy of PTP, which is important because we need to achieve a minimum level of accuracy in the coming legislation.

In the next post we will look at capturing statistics from PTP software in order to see how accurate the current design is.

Solving MiFID II Clock Synchronisation with minimum spend (part 1)

This blog post - and what is now a series of blog posts because of how long this one became - will look at implementing Precision Time Protocol (PTP) with sufficient accuracy in my organisation in order to satisfy upcoming European Financial regulations.

I'll first talk about where the regulations are coming from and what they are. Then we'll go into the Precision Time Protocol (PTP), and then we'll move on to solving this problem with the infrastructure already at my disposal. In other words, I'm going to try to do it without buying anything fancy.

If you want to skip all this and go straight into something technical, you want the next blog post.

 

Clock Synchronisation in the Financial Sector

New legislation is coming to the European Financial Services sector. The Markets in Financial Instruments Directive (MiFID) II is due to come into effect on the 3rd of January 2017, although there are whispers of rumors that it will be delayed. The technical standards that need to be met are still yet to be confirmed by local governing bodies, but we already have some idea of what's coming, based on public consultation and feedback from the European Securities and Markets Authority (ESMA).

Regulatory Technical Standard (RTS) 25 is regarding clock synchronisation. There are two important pieces of information in RTS 25 that these blog posts will focus on solving:
  1. an institution's clock synchronisation must be traceable to UTC
  2. that minimum levels of accuracy must be maintained

The minimum level of accuracy for "High Frequency Trading" in the original draft was microsecond (μs) granularity with ±1μs accuracy. After a lot of feedback regarding the technical difficulties of hitting such a high accuracy, the RTS was amended so that the minimum level of accuracy is now ±100μs.

It is well known that standard Network Time Protocol (NTP) is only good to about a millisecond accuracy. This means that we have to move to something else. We also have to prove it's traceable to UTC, and we need to ensure that we don't fall outside our ±100μs accuracy.

 

Precision Time Protocol

The Precision Time Protocol (PTP) has been around for a while. PTP "version 2" was outlined in IEEE 1588-2008, and it is used a lot in industries that require precise timing, the Telecoms industry for example, as well as in audio video broadcasting.

A PTP Master Clock will send out PTP messages containing the current time that PTP Slaves consume. Slaves periodically send back their own messages to the Master. I won't go into the details of the protocol formula right now, there are better explanations elsewhere (Wikipedia has one, National Instruments has another with numbers). Very quickly though, the formula adds up a series of time stamps from these PTP messages and then divides by 2 so the slave can determine the network path delay from the master and thus determine the correct time.
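
For reference, using the usual four timestamps - t1 when the Master sends a Sync, t2 when the Slave receives it, t3 when the Slave sends a Delay Request, and t4 when the Master receives it - the calculation the slave performs boils down to:

path_delay = ((t2 - t1) + (t4 - t3)) / 2
offset_from_master = (t2 - t1) - path_delay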

In order for the mathematics in the protocol to work it assumes "network path symmetry", or rather that it takes the same amount of time from master to slave as it does from slave to master. Another big assumption is that both slave and master can accurately measure when they send and receive PTP messages. If there is something interfering with either of these two assumptions, then PTP becomes less accurate. Less accurate is a problem for me because I have regulations to satisfy.

The nemesis of the first assumption is network path asymmetry, which can occur when a PTP packet gets delayed in an intermediate network device, such as a switch, router or firewall, or even in the Operating System's networking stack. The second assumption can be improved by using  hardware timestamps, where the timestamps used in the PTP calculations are generated by the network interfaces when they actually send and receive packets, rather than when the Operating System thinks it has. If PTP software has to fall back to using Software timestamps then the accuracy will drop.

Considering the above, the most accurate PTP network would be each slave clock having a direct cable into the master clock, with no switches or routers in between, and with both devices supporting hardware timestamping. Considering the number of cables this would require for even a small sized data center this is not feasible, so some places use dedicated PTP switching infrastructure, where the switches themselves are "PTP aware" and can either eliminate their own switching delay from PTP messages (called a Transparent Clock) or act as a PTP Master Clock themselves (called a Boundary Clock). Using a 'dedicated' approach like this means you are not mixing your PTP traffic with your application traffic, and so the PTP messages are not subject to any queue delays caused by bursts in application traffic.

If you can afford to go out and just do the above, that's great. To do such a thing can be quite costly though. Most financial organisations will already have a reasonable investment in their network infrastructure, so if I could utilise what I've got in place now, that would be much more efficient. We still have to hit RTS 25 accuracy though, ±100μs...

 

Receiving UTC

 We are lucky enough to have some Symmetricom SyncServer S300s (now owned by MicroSemi) that have GPS satellite antennae on the roof of some of our data centres. This is nice for us because this should satisfy the requirement that our clock synchronisation be traceable to UTC. I say should because the regulations are not confirmed yet. We purchased the S300s a while ago because they supported PTP, even though we've only been using them as an NTP time source so far.

Receiving an accurate time is one thing, but distributing it to everywhere that is needed is a whole other ball game. The PTP standard (IEEE 1588) is actually rather broad and very flexible. It specifies multiple different transport mechanisms, multiple ways slaves can talk back to master clocks, a vast range of intervals that devices can talk to each other, etc. The standard being as flexible as it is means that when something says "supports PTP" on the side of the box, it may not work with everything else that supports PTP.

In theory what should be happening is that devices and software fully support one or more PTP "Profiles", which are a pre-defined subset of PTP options designed for a specific purpose. For example, the Telecom Profile is used to transfer frequency around telecommunications networks. The PTP standard itself defines only a single "Default" profile.

In practice it's a little bit more difficult than that. The hardware guys have mostly got it sorted out; I can see in the S300 manual that it supports a superset of the Default Profile options. The software side is not so specific.

Most useful for my environment is the emerging PTP "Enterprise" Profile (not sure if it's official yet, all I can find are drafts). This Profile supports Multicasting of Sync messages from a Master Clock but UDP unicasting of Delay messages back from Slave to Master (the default is to Multicast back Delay messages, which would get noisy in a large network).

 

Consuming PTP

Consuming PTP gets complicated because of all the different transport mechanisms the PTP standard defines. On the Linux software side there are a couple of choices available; I focused on the three I thought were the most prevalent: ptpd, linuxptp and sfptpd. Unfortunately the software project pages don't specifically mention what PTP Profile they support; they talk more about what PTP features they support. For example, the SolarFlare Advanced User's Guide doesn't even mention the word "profile". Here are my personal opinions on my three software choices, based on reading their respective manuals, email lists and forums.

 

SolarFlare's sfptpd

We'll start with sfptpd, which is SolarFlare's Enhanced PTP Daemon. SolarFlare have taken the ptpd project source code and modified it to add hardware support for their own adapters. They've also added support for things like VLAN tagged interfaces and bonded interfaces which the other projects don't have, which makes this software probably the friendliest PTP consumer for me, as we make heavy use of bonded interfaces and it will "just work". The disadvantage is that they only support hardware timestamping on SolarFlare adapters. While we use SolarFlare here for our latency sensitive systems, we don't have them everywhere.

The daemon can be a Master Clock or Slave Clock, but it only supports Multicast as a transport protocol, which is annoying. It does support "hybrid" mode for Delay messages, which means it Unicasts its delay messages back to the Master Clock rather than Multicasting them. The daemon also has a nice feature where it will synchronise every SolarFlare adapter in a machine to the incoming PTP time, even if the incoming message is not consumed on a SolarFlare adapter - it falls back to software timestamping.

 

linuxptp

The linuxptp project, started by Richard Cochran, is an implementation of PTP that's tied to the Linux kernel. In fact, Richard Cochran wrote the PTP Hardware Clock (PHC) Linux kernel API for hardware timestamping. The linuxptp project page mentions at which Linux kernel versions various drivers began to implement hardware timestamping, though you'll be happy to know that Red Hat have back-ported a lot of these patches into Red Hat 6. Certainly on CentOS 6.6 with a 2.6.32 kernel I was able to get hardware support for various Intel and Broadcom HBAs. It can be set up to act as a Boundary Clock, it supports Multicast and 802.3 (Ethernet) PTP messages, and version 1.6 now supports "hybrid" mode as well (PTP Unicast for Slave Clock Delay messages).

The linuxptp project is very "low to the wire", which is good in some ways, but it lacks a few bells and whistles that would make it useful in multiple situations. For example, the code makes a socket option call, SO_BINDTODEVICE, to bind to specific hardware devices, which means that it simply cannot consume Multicast messages from a bonded or VLAN tagged interface. This is because the Linux kernel delivers the packet to the bonded interface, not to the underlying slave. It can consume 802.3 (Ethernet) encapsulated messages that are part of a bond, but this is only somewhat useful as you won't get the high availability advantage of having a bonded interface if you are only consuming PTP from one of the bond members. The software is also split in an interesting way: there is the ptp4l daemon, which is designed to consume PTP messages off a HBA and write the time to that HBA's PTP Hardware Clock (PHC). Then there is phc2sys, which takes the time from a PHC device and synchronises the Linux system clock and other PHCs as well. All other software implementations do these two functions in the one daemon. This feature of linuxptp becomes important later.

Regarding the project itself, suggestions to problems on the mailing list are sometimes very... Kernel Developer-ish :-) ie: "comment out this option and re-compile". Several times I had to cross reference man pages to mailing list posts, and code-dive once or twice. As I said before, "low to the wire" :-) This is not a bad thing, once you get the hang of the moving parts and understand PTP concepts the software is easy to use, but I can understand how it could turn away some people.

ptpd

The ptpd project, started by Wojciech Owczarek, is the one I think will become the most widely adopted in maybe two years or so. It is the code base that people fork and copy to add their own hardware support (e.g. SolarFlare), and it's designed to run across multiple platforms, not just Linux, so my impression of the project is that it's very portable and "nice". I also like the long and detailed posts Wojciech leaves to questions on the ptpd Forum.

It supports all transport mechanisms (Multicast, Unicast and Ethernet), but the biggest downside is its software-only timestamping. Wojciech says that while adding Linux kernel PHC support would be relatively easy, it's not cross platform and not something the ptpd project wants to bolt on straight away; they want to do it properly with an abstraction layer.

 

A Compatability Matrix

Hopefully we've covered enough now that you can see that PTP != PTP across various hardware and software implementations. On top of the software I looked at above, I also compiled a list of PTP features for the other switches and appliances in my environment.

Probably the most frustrating discovery was that our SyncServer S300 does not support Multicast; it supports only 802.3 (Ethernet) encapsulation and Unicast Signaling - a method where a Slave negotiates the rate of PTP messages it wants to receive from the Master. I found this out a few years after we bought our S300s... I've now got a GPS antenna on the roof but a difficult job of getting it anywhere in our network. In case you were wondering, in order to get Multicast from a MicroSemi appliance you need to have one of the TimeProvider appliances.

I also discovered that Arista's PTP Transparent Clock support only came in with the 7150 Series of switches, which was not the prevalent switch series in use in the data centres I'm looking to put PTP into. Neither was there any PTP support on the Brocade switches we have. I've now got a GPS antenna on the roof, a difficult job getting it anywhere in our network, and a switching infrastructure that doesn't support PTP... I still don't want to go buy anything if I don't have to, so... Can we solve this with software?

Here is the compatibility matrix I compiled:
Device / Software | UDP Signaling | Multicast | 802.3 | Hardware Timestamping | Bonding | Boundary Clock
SyncServer S300   | Y             |           | Y     | n/a                   | n/a     | n/a
linuxptp          |               | Y         | Y     | Y                     |         | Y
ptpd              | Y             | Y         | Y     |                       |         |
sfptpd            |               | Y         |       | SF only               | Y       |

 

Distributing PTP

What's clear is that there are only two software daemons we can use to interface with the S300: ptpd in UDP Signaling or Ethernet mode, or linuxptp in Ethernet mode.

The sfptpd daemon is by far the best client for us, simply because it will work with bonded interfaces. It only consumes Multicast though, and you only get hardware timestamping on SolarFlare adapters. We've only got SolarFlare adapters in specific places, namely on our application networks where latency is critical. Architecture-wise, I don't particularly want to throw PTP messages onto our application network for two reasons: 1) I don't want bursty application traffic to interfere with PTP messages, and 2) it's nice to have a clear separation of application traffic from management traffic, and I consider PTP to be management traffic.

The ptpd project does not have hardware support as yet, and we know we may take an accuracy hit if we use it, but I don't know how much (it may be within our tolerance). It should be able to talk directly to the S300 though and negotiate PTP messages. The problem with this approach is that our network is not very flat, so no matter where I put the S300, the PTP packets will still have to traverse one or two firewalls and switches in order to get to every Linux server that needs to consume PTP. Every hop in the network potentially reduces PTP accuracy. Another unknown is how loaded the S300 appliance will become if we individually subscribe a reasonable number of PTP Slaves to it.

The linuxptp software has the most hardware support so makes it potentially the most accurate. Multicast doesn't work on bonded interfaces at all, but you can get Ethernet working on slaves of bonds. Since linuxptp can be configured as a Boundary Clock, one possible design would be to first consume Ethernet encapsulated PTP messages from the S300 and then broadcast them out onto a number of other Layer 2 networks in order to get it to all Linux servers. We would need to use linuxptp or ptpd to consume this though, rather than my favourite client, sfptpd, as the Slave Clock.

 

Getting a little bit crazy...

Considering the modular design of linuxptp, it's possible to run multiple instances of ptp4l and phc2sys on the one server quite easily. This makes for another interesting design. Theoretically we should be able to use one ptp4l process to consume PTP from the S300 and write the time to a PHC on a server. We can then use phc2sys to synchronise the first PHC with another PHC in the server. We then use a second ptp4l process to Multicast out on a different network. The goal is to translate from Ethernet encapsulation to Multicast so it can be consumed by sfptpd on a bonded interface. In PTP terms this is not quite a Boundary Clock - the ports may not have a shared internal clock (there is software in between), nor do I think translating between transport protocols is what IEEE 1588 would consider normal for a Boundary Clock. So I'm calling it a "PTP Bridge". The most important thing in this design is that I can do it "right now"; I don't have to add many cables or go buy any special hardware.

The use of hardware timestamps should mean that variance in the PTP Bridge software is mitigated somewhat. We've still got some variable path delay coming from the S300 to the PTP Bridge because they are connected via a switch that is not PTP aware. It is just a switch though, so hopefully that's not as variable as being stuck in a busy router or firewall queue. There is also some delay variance when reading from one PHC device and writing to another, caused by Linux Operating System jitter. This can be reduced by a bit of tuning and CPU pinning. In the environments I'm working with we will have to use sfptpd with software timestamping when we consume PTP messages, so that's another loss in accuracy.

There will also be another bit of variable path delay when we Multicast to the sfptpd instances. As mentioned before the Linux servers we're trying to get PTP to are not in a flat network, and some of them are heavily firewalled off from each other. The absolute simplest thing to do first is to Multicast through this firewall. We know this will have issues and it probably won't work long term, but we can also find out exactly how bad it is.

If the firewall variance is too much to keep us compliant, another option may be to attach the PTP Bridge to a network that every Linux server can reach, and run multiple ptp4l Multicasting daemons. This way we'd only get the switching delay, which should be a lot less variable than passing through a firewall. There are security implications that must be carefully considered with this approach, as we would be attaching a single server to every other server.

One last design may be to attach the PTP Bridge to one of the primary application networks and Multicast onto it, which will reach about 85% of the servers it needs to. For the remaining few servers that need PTP time, we could run another dedicated network for it. The advantage of this approach is we would get to utilise the SolarFlare adapters hardware timestamping and improve our accuracy. The downside is of course I'd be mixing PTP traffic with application traffic, so my PTP accuracy is subject to application network load. I'd also need to run a relatively small amount of cables for the remaining servers that need precise time but aren't on the application network.

 

Now for something actually technical

This post turned into a large theoretical ramble. In the next post I'll go in to how I've built the PTP Bridge server, complete with configuration examples, and then we'll move on to testing and measuring the various design options above.