Nazca Lines: Solving MiFID II Clock Synchronisation with minimum spend (part 1)

This blog post - and what is now a series of blog posts because of how long this one became - will look at implementing Precision Time Protocol (PTP) with sufficient accuracy in my organisation in order to satisfy upcoming European Financial regulations.

I'll first talk about where the regulations are coming from and what they are. Then we'll go into the Precision Time Protocol (PTP), and then we'll look at solving this problem with the infrastructure already at my disposal. In other words, I'm going to try to do it without buying anything fancy.

If you want to skip all this and go straight into something technical, you want the next blog post.

Clock Synchronisation in the Financial Sector

New legislation is coming to the European Financial Services sector. The Markets in Financial Instruments Directive (MiFID) II is due to come into effect on the 3rd of January 2017, although there are whispers it will be delayed. The technical standards that need to be met are yet to be confirmed by local governing bodies, but we already have some idea of what's coming based on public consultation and feedback from the European Securities and Markets Authority (ESMA).

Regulatory Technical Standard (RTS) 25 covers clock synchronisation. There are two important requirements in RTS 25 that these blog posts will focus on solving:

an institution's clock synchronisation must be traceable to UTC
minimum levels of accuracy must be maintained

The minimum level of accuracy for "High Frequency Trading" in the original draft was microsecond (μs) granularity with ±1μs accuracy. After a lot of feedback regarding the technical difficulty of hitting such a high accuracy, the RTS was amended so that the minimum level of accuracy is now ±100μs.

It is well known that standard Network Time Protocol (NTP) is only good to about a millisecond of accuracy. This means that we have to move to something else. We also have to prove it's traceable to UTC, and we need to ensure that we don't fall outside our ±100μs accuracy.

Precision Time Protocol

The Precision Time Protocol (PTP) has been around for a while. PTP "version 2" was outlined in IEEE 1588-2008, and it is used a lot in industries that require precise timing, for example the Telecoms industry and audio/video broadcasting.

A PTP Master Clock will send out PTP messages containing the current time that PTP Slaves consume. Slaves periodically send back their own messages to the Master. I won't go into the details of the protocol formula right now, as there are better explanations elsewhere (Wikipedia has one, National Instruments has another with numbers). Very quickly though, the slave measures the difference between the send and receive timestamps of these PTP messages in each direction, adds the two differences together and divides by 2. That gives the slave the network path delay from the master, and from that it can determine the correct time.

In order for the mathematics in the protocol to work it assumes "network path symmetry", or rather that it takes the same amount of time from master to slave as it does from slave to master. Another big assumption is that both slave and master can accurately measure when they send and receive PTP messages. If there is something interfering with either of these two assumptions, then PTP becomes less accurate. Less accurate is a problem for me because I have regulations to satisfy.
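To make the symmetry assumption concrete, here is a minimal Python sketch of the standard delay request-response arithmetic. The timestamps t1 to t4 follow the usual PTP naming, while the `Exchange` class, the `simulate` helper and all the numbers are my own illustration, not taken from any PTP implementation.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """Four timestamps from one Sync / Delay_Req exchange, in nanoseconds."""
    t1: int  # master sends Sync (read on the master clock)
    t2: int  # slave receives Sync (read on the slave clock)
    t3: int  # slave sends Delay_Req (read on the slave clock)
    t4: int  # master receives Delay_Req (read on the master clock)


def mean_path_delay(x: Exchange) -> float:
    # Average of the two one-way measurements; the clock offset cancels out.
    return ((x.t2 - x.t1) + (x.t4 - x.t3)) / 2


def offset_from_master(x: Exchange) -> float:
    # How far the slave clock is ahead of the master, assuming a symmetric path.
    return ((x.t2 - x.t1) - (x.t4 - x.t3)) / 2


def simulate(true_offset: int, delay_ms: int, delay_sm: int) -> Exchange:
    """Build an exchange for a slave whose clock is really `true_offset` ns ahead."""
    t1 = 1_000_000
    t2 = t1 + delay_ms + true_offset   # arrival time as seen by the slave clock
    t3 = t2 + 50_000                   # the slave replies a little later
    t4 = t3 - true_offset + delay_sm   # arrival time as seen by the master clock
    return Exchange(t1, t2, t3, t4)


# Same 7 us clock error, first over a symmetric path, then an asymmetric one.
symmetric = simulate(true_offset=7_000, delay_ms=20_000, delay_sm=20_000)
asymmetric = simulate(true_offset=7_000, delay_ms=30_000, delay_sm=10_000)

print(offset_from_master(symmetric))   # 7000.0
print(offset_from_master(asymmetric))  # 17000.0
```

With a symmetric path the slave recovers the true 7μs offset. With 30μs one way and 10μs the other, the calculated offset is wrong by half the asymmetry (10μs), and nothing in the four timestamps tells the slave that anything is off.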

The nemesis of the first assumption is network path asymmetry, which can occur when a PTP packet gets delayed in an intermediate network device, such as a switch, router or firewall, or even in the Operating System's networking stack. The second assumption can be improved by using hardware timestamps, where the timestamps used in the PTP calculations are generated by the network interfaces when they actually send and receive packets, rather than when the Operating System thinks they were sent or received. If PTP software has to fall back to using software timestamps then the accuracy will drop.

Considering the above, the most accurate PTP network would be each slave clock having a direct cable into the master clock, with no switches or routers in between, and with both devices supporting hardware timestamping. Given the number of cables this would require for even a small data centre, this is not feasible. So some places use dedicated PTP switching infrastructure, where the switches themselves are "PTP aware" and can either eliminate their own switching delay from PTP messages (called a Transparent Clock) or act as a PTP Master Clock themselves (called a Boundary Clock). Using a 'dedicated' approach like this means you are not mixing your PTP traffic with your application traffic, and so the PTP messages are not subject to any queue delays caused by bursts in application traffic.
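As a rough illustration of what a Transparent Clock buys you: the switch records how long each PTP message sat inside it (its residence time) in the message's correction field, and the slave subtracts that before doing the usual arithmetic. The sketch below assumes a simplified one-step exchange. The function name and all the numbers are hypothetical.

```python
def slave_view(t1, t2, t3, t4, sync_correction=0, delay_correction=0):
    """Return (offset, mean_path_delay) in ns after removing switch residence time."""
    master_to_slave = (t2 - t1) - sync_correction
    slave_to_master = (t4 - t3) - delay_correction
    delay = (master_to_slave + slave_to_master) / 2
    offset = master_to_slave - delay
    return offset, delay


# Hypothetical numbers: the clocks already agree, there is 5 us of cable each way,
# the Sync message queues in the switch for 40 us and the Delay_Req for only 2 us.
CABLE = 5_000
SYNC_QUEUE, DELAY_QUEUE = 40_000, 2_000
t1 = 0
t2 = t1 + CABLE + SYNC_QUEUE
t3 = t2 + 100_000
t4 = t3 + CABLE + DELAY_QUEUE

# An ordinary switch leaves the correction at zero; a Transparent Clock fills it in.
plain_switch = slave_view(t1, t2, t3, t4)
transparent = slave_view(t1, t2, t3, t4, SYNC_QUEUE, DELAY_QUEUE)

print(plain_switch)  # (19000.0, 26000.0)
print(transparent)   # (0.0, 5000.0)
```

Through the ordinary switch the slave believes it is 19μs out when it is actually spot on. With the residence times reported, the queueing asymmetry disappears and only the cable delay remains.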

If you can afford to go out and just do the above, that's great. Doing so can be quite costly though. Most financial organisations will already have a reasonable investment in their network infrastructure, so if I could utilise what I've got in place now that would be much more efficient. We still have to hit RTS 25 accuracy though, ±100μs...

Receiving UTC

We are lucky enough to have some Symmetricom SyncServer S300s (now owned by MicroSemi) that have GPS satellite antennae on the roof of some of our data centres. This is nice for us because this should satisfy the requirement that our clock synchronisation be traceable to UTC. I say should because the regulations are not confirmed yet. We purchased the S300s a while ago because they supported PTP, even though we've only been using them as an NTP time source so far.

Receiving an accurate time is one thing, but distributing it to everywhere that is needed is a whole other ball game. The PTP standard (IEEE 1588) is actually rather broad and very flexible. It specifies multiple different transport mechanisms, multiple ways slaves can talk back to master clocks, a vast range of intervals at which devices can talk to each other, etc. The standard being as flexible as it is means that when something says "supports PTP" on the side of the box, it may not work with everything else that supports PTP.

In theory what should be happening is devices and software should fully support one or more PTP "Profiles", which are pre-defined subsets of PTP options designed for a specific purpose. For example, the Telecom Profile is used to transfer frequency around telecommunications networks. The PTP standard itself defines only a single "Default" profile.

In practice it's a little bit more difficult than that. The hardware guys have mostly got it sorted out: I can see in the S300 manual that it supports a superset of the Default Profile options. The software side is not so specific.

Most useful for my environment is the emerging PTP "Enterprise" Profile (not sure if it's official yet, all I can find are drafts). This Profile supports Multicasting of Sync messages from a Master Clock but UDP unicasting of Delay messages back from a Slave to the Master (the default is to Multicast back Delay messages, which would get noisy in a large network).
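To put a rough number on "noisy", here is a back-of-the-envelope Python model. It assumes every slave sits in the same Multicast group as the master, so each multicast delay message is delivered to every other member of the group. The function name is mine and the model deliberately ignores Sync and Announce traffic.

```python
def deliveries_per_round(slaves: int, hybrid: bool) -> int:
    """Delay_Req + Delay_Resp packets delivered to hosts when every slave measures delay once."""
    if hybrid:
        # Unicast: each request reaches only the master, each response only its slave.
        return slaves + slaves
    # Multicast: each request reaches the master and the other slaves,
    # and each response reaches every slave.
    requests = slaves * (1 + (slaves - 1))
    responses = slaves * slaves
    return requests + responses


for n in (10, 100, 500):
    print(n, deliveries_per_round(n, hybrid=False), deliveries_per_round(n, hybrid=True))
```

Under these assumptions the multicast case grows with the square of the number of slaves while the unicast case grows linearly, and almost every packet a slave receives in the multicast case is one it has to throw away.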

Consuming PTP

Consuming PTP gets complicated because of all the different transport mechanisms the PTP standard defines. On the Linux software side, there are a couple of choices available. I focused on the three I thought were the most prevalent: ptpd, linuxptp and sfptpd. Unfortunately the software project pages don't specifically mention which PTP Profile they support; they talk more about which PTP features they support. For example, the SolarFlare Advanced User's Guide doesn't even mention the word "profile". Here are my personal opinions on my three software choices, based on reading their respective manuals, email lists and forums.

SolarFlare's sfptpd

We'll start with sfptpd, which is SolarFlare's Enhanced PTP Daemon. SolarFlare have taken the ptpd project source code and modified it to add hardware support for their own adapters. They've also added support for things like VLAN tagged interfaces and bonded interfaces which the other projects don't have, which makes this software probably the friendliest PTP consumer for me, as we make heavy use of bonded interfaces and it will "just work". The disadvantage is that they only support hardware timestamping on SolarFlare adapters. While we use SolarFlare here for our latency sensitive systems, we don't have them everywhere.

The daemon can be a Master Clock or Slave Clock, but it only supports Multicast as a transport protocol, which is annoying. It does support "hybrid" mode for Delay messages, which means it Unicasts its delay messages back to the Master Clock rather than Multicasting back. The daemon also has a nice feature where it will synchronise every SolarFlare adapter in a machine to the incoming PTP time, even if the incoming message is not consumed on a SolarFlare adapter - it falls back to software timestamping.

linuxptp

The linuxptp project, started by Richard Cochran, is an implementation of PTP that's tied to the Linux kernel. In fact, Richard Cochran wrote the PTP Hardware Clock (PHC) Linux kernel API for hardware timestamping. The linuxptp project page mentions the Linux kernel versions in which various drivers begin to implement hardware timestamping, though you'll be happy to know that Red Hat have back ported a lot of these patches into Red Hat 6. Certainly on CentOS 6.6 with a 2.6.32 kernel I was able to get hardware support for various Intel and Broadcom HBAs. It can be set up to act as a Boundary Clock, it supports Multicast and 802.3 (Ethernet) PTP messages, and version 1.6 now supports "hybrid" mode as well (PTP Unicast for Slave Clock Delay messages).

The linuxptp project is very "low to the wire", which is good in some ways, but it lacks a few bells and whistles that would make it useful in more situations. For example, the code makes a socket option call, SO_BINDTODEVICE, to bind to specific hardware devices, which means that it simply cannot consume Multicast messages from a bonded or VLAN tagged interface. This is because the Linux kernel delivers the packet to the bonded interface, not to the underlying slave. It can consume 802.3 (Ethernet) encapsulated messages on interfaces that are part of a bond, but this is only somewhat useful as you won't get the high availability advantage of having a bonded interface if you are only consuming PTP from one of the bond members. The software is also split in an interesting way: there is the ptp4l daemon, which is designed to consume PTP messages off a HBA and write the time to that HBA's PTP Hardware Clock (PHC). Then there is phc2sys, which takes the time from a PHC device and synchronises the Linux system clock and other PHCs as well. All other software implementations do these two functions in the one daemon. This feature of linuxptp becomes important later.
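For anyone who hasn't met SO_BINDTODEVICE before, this is roughly what pinning a Multicast PTP socket to one interface looks like, sketched in Python. It is an illustration of the socket option only, not a translation of linuxptp's C code. The port and group address are the standard PTP ones for UDP over IPv4, while the function name is hypothetical.

```python
import socket

PTP_EVENT_PORT = 319               # UDP port for Sync and Delay_Req messages
PTP_PRIMARY_GROUP = "224.0.1.129"  # primary PTP Multicast group for IPv4
# Linux-only option; 25 is its value in the Linux headers.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def open_event_socket(ifname: str) -> socket.socket:
    """Open a Multicast PTP event socket pinned to one interface (needs root on Linux)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Only packets the kernel delivers on `ifname` itself will reach this socket.
    sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, ifname.encode() + b"\0")
    sock.bind(("", PTP_EVENT_PORT))
    # Join the PTP group, letting the kernel choose the interface address.
    mreq = socket.inet_aton(PTP_PRIMARY_GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

Calling `open_event_socket("eth0")` where eth0 is a member of a bond is the situation described above: the socket opens without complaint, but the kernel hands incoming packets to the bond device, so a socket pinned to the member sees nothing.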

Regarding the project itself, suggested fixes for problems on the mailing list are sometimes very... Kernel Developer-ish :-) i.e. "comment out this option and re-compile". Several times I had to cross reference man pages with mailing list posts, and code-dive once or twice. As I said before, "low to the wire" :-) This is not a bad thing. Once you get the hang of the moving parts and understand PTP concepts the software is easy to use, but I can understand how it could turn away some people.

ptpd

The ptpd project, started by Wojciech Owczarek, is the one I think will become the most widely adopted in maybe two years or so. It is the code base that people fork and copy to add their own hardware support (e.g. SolarFlare), and it's designed to run across multiple platforms, not just Linux, so my impression of the project is that it's very portable and "nice". I also like the long and detailed replies Wojciech leaves to questions on the ptpd Forum.

It supports all transport mechanisms (Multicast, Unicast and Ethernet) but the biggest down side is that it's software-only timestamping. Wojciech says that while adding Linux kernel PHC support would be relatively easy, it's not cross platform and not something the ptpd project just want to bolt on straight away; they want to do it properly with an abstraction layer.

A Compatibility Matrix

Hopefully we've covered enough now that you can see that PTP != PTP across various hardware and software implementations. On top of the software I looked at above, I also compiled a list of PTP features for the other switches and appliances in my environment.

Probably the most frustrating was to find out that our SyncServer S300 does not support Multicast. It supports only 802.3 (Ethernet) encapsulation and Unicast Signaling, which is a method where a Slave negotiates the rate of PTP messages it wants to receive from the Master. I found this out a few years after we bought our S300s... I've now got a GPS antenna on the roof but a difficult job of getting it anywhere in our network. In case you were wondering, in order to get Multicast from a MicroSemi appliance you need to have one of the TimeProvider appliances.

I also discovered that Arista's PTP Transparent Clock support only came in with the 7150 Series of switches, which was not the prevalent switch series in use in the data centres I'm looking to put PTP into. Neither was there any PTP support on the Brocade switches we have. I've now got a GPS antenna on the roof, a difficult job getting it anywhere in our network, and a switching infrastructure that doesn't support PTP... I still don't want to go buy anything if I don't have to, so... Can we solve this with software?

Here is the compatibility matrix I compiled:

Device / Software   UDP Signaling   Multicast   802.3   Hardware Timestamping   Bonding   Boundary Clock
SyncServer S300     Y               -           Y       n/a                     n/a       n/a
linuxptp            -               Y           Y       Y                       -         Y
ptpd                Y               Y           Y       -                       -         -
sfptpd              -               Y           -       SF                      Y         -

(Y = yes, - = no, SF = SolarFlare adapters only)

Distributing PTP

What's clear is that there are only two software daemons we can use to interface with the S300: ptpd in UDP Signaling or Ethernet mode, or linuxptp in Ethernet mode.
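That conclusion falls straight out of the matrix if you treat the transport columns as sets and look for overlaps. A small Python sketch, with the data copied from the matrix above and a helper name of my own choosing:

```python
# Transports each device or daemon supports, copied from the matrix above.
TRANSPORTS = {
    "SyncServer S300": {"UDP Signaling", "802.3"},
    "linuxptp": {"Multicast", "802.3"},
    "ptpd": {"UDP Signaling", "Multicast", "802.3"},
    "sfptpd": {"Multicast"},
}


def common_transports(source: str) -> dict:
    """Map every other entry to the transports it shares with `source`."""
    shared = {}
    for name, transports in TRANSPORTS.items():
        if name == source:
            continue
        overlap = sorted(transports & TRANSPORTS[source])
        if overlap:
            shared[name] = overlap
    return shared


print(common_transports("SyncServer S300"))
# {'linuxptp': ['802.3'], 'ptpd': ['802.3', 'UDP Signaling']}
```

Note that sfptpd does not appear at all, because it shares no transport with the S300.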

The sfptpd daemon is by far the best client for us, simply because it will work with bonded interfaces. It only consumes Multicast though, and you only get hardware timestamping on SolarFlare adapters. We've only got SolarFlare adapters in specific places, namely on our application networks where latency is critical. Architecture-wise, I don't particularly want to throw PTP messages onto our application network for two reasons: 1) I don't want bursty application traffic to interfere with PTP messages, and 2) it's nice to have a clear separation of application traffic from management traffic, and I consider PTP to be management traffic.

The ptpd project does not have hardware support as yet, and we know we may take an accuracy hit if we use it, but I don't know how much (it may be within our tolerance). It should be able to talk directly to the S300 though and negotiate PTP messages. The problem with this approach is that our network is not very flat, so no matter where I put the S300, the PTP packets will still have to traverse one or two firewalls and switches in order to get to every Linux server that needs to consume PTP. Every hop in the network potentially reduces PTP accuracy. Another unknown is how loaded the S300 appliance will become if we individually subscribe a reasonable number of PTP Slaves to it.

The linuxptp software has the most hardware support, which makes it potentially the most accurate. Multicast doesn't work on bonded interfaces at all, but you can get Ethernet working on slaves of bonds. Since linuxptp can be configured as a Boundary Clock, one possible design would be to first consume Ethernet encapsulated PTP messages from the S300 and then broadcast them out onto a number of other Layer 2 networks in order to get the time to all Linux servers. We would need to use linuxptp or ptpd as the Slave Clock to consume this though, rather than my favourite client, sfptpd.

Getting a little bit crazy...

Considering the modular design of linuxptp, it's possible to run multiple instances of ptp4l and phc2sys on the one server quite easily. This makes for another interesting design. Theoretically we should be able to use one ptp4l process to consume PTP from the S300 and write the time to a PHC on a server. We can then use phc2sys to synchronise the first PHC with another PHC in the server. We then use a second ptp4l process to Multicast out on a different network. The goal is to translate from Ethernet encapsulation to Multicast so it can be consumed by sfptpd on a bonded interface. In PTP terms this is not quite a Boundary Clock - the ports may not have a shared internal clock (there is software in between), nor do I think translating between transport protocols is what IEEE 1588 would consider normal for a Boundary Clock. So I'm calling it a "PTP Bridge". The most important thing in this design is that I can do this "right now": I don't have to add many cables or go buy any special hardware.
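As a rough sketch of the three moving parts, the Python below just assembles the command lines involved. The interface names eth2 and eth3 are hypothetical. The flags are standard ptp4l and phc2sys options as I understand them from the man pages (-2 for 802.3 transport, -4 for UDP IPv4, -H for hardware timestamping, -s for slave only on ptp4l; -s and -c for source and target clock, -O for a fixed offset on phc2sys). It is not a tested configuration.

```python
import shlex

# Hypothetical interface names: eth2 faces the S300, eth3 faces the Multicast network.
UPSTREAM_IF = "eth2"
DOWNSTREAM_IF = "eth3"

BRIDGE_COMMANDS = [
    # 1. Slave-only ptp4l consuming 802.3 PTP with hardware timestamps.
    ["ptp4l", "-i", UPSTREAM_IF, "-2", "-H", "-s", "-m"],
    # 2. Copy time from the upstream PHC to the downstream PHC (no offset between them).
    ["phc2sys", "-s", UPSTREAM_IF, "-c", DOWNSTREAM_IF, "-O", "0", "-m"],
    # 3. Second ptp4l speaking UDP/IPv4 Multicast on the downstream side.
    ["ptp4l", "-i", DOWNSTREAM_IF, "-4", "-H", "-m"],
]

# Print the commands rather than run them; this is a design sketch only.
for argv in BRIDGE_COMMANDS:
    print(shlex.join(argv))
```

Two ptp4l instances on one host would presumably also need separate configuration files so they don't tread on each other, which this sketch leaves out.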

The use of hardware timestamps should mean that variance in the PTP Bridge software is mitigated somewhat. We've still got some variable path delay coming from S300 to PTP Bridge because they are connected via a switch that is not PTP aware. It is a switch though, so hopefully that's not as variable as being stuck in a busy router or firewall queue. There is also some delay variance when reading from one PHC device and writing to another, caused by Linux Operating System jitter. This can be reduced by a bit of tuning and CPU pinning. In the environments I'm working with we will have to use sfptpd with software timestamping when we consume PTP messages, so that's another loss in accuracy.
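One way to keep track of these losses is as a simple error budget against the ±100μs limit. The Python below shows the shape of that calculation only. Every figure in it is a made-up placeholder, not a measurement, and the real numbers can only come from testing.

```python
LIMIT_US = 100.0  # RTS 25 tolerance discussed earlier

# Placeholder worst-case figures in microseconds; NOT measurements.
budget_us = {
    "S300 to bridge via non-PTP switch": 10.0,
    "PHC to PHC copy on the bridge": 5.0,
    "sfptpd software timestamping": 20.0,
}

# Pessimistic view: assume every error source hits its worst case at once.
worst_case = sum(budget_us.values())
headroom = LIMIT_US - worst_case
print(f"worst case {worst_case:.0f} us, headroom {headroom:.0f} us")
```

Adding the worst cases together is deliberately pessimistic, since the errors are unlikely to all peak at the same moment, but it gives a quick feel for how much of the tolerance each hop is allowed to eat.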

There will also be another bit of variable path delay when we Multicast to the sfptpd instances. As mentioned before the Linux servers we're trying to get PTP to are not in a flat network, and some of them are heavily firewalled off from each other. The absolute simplest thing to do first is to Multicast through this firewall. We know this will have issues and it probably won't work long term, but we can also find out exactly how bad it is.

If the firewall variance is too much to keep us compliant, another option may be to attach the PTP Bridge to a network that every Linux server can reach, and run multiple ptp4l Multicasting daemons. This way we'd only get the switching delay, which should be a lot less variable than passing through a firewall. There are security implications that must be carefully considered with this approach, as we would be attaching a single server to every other server.

One last design may be to attach the PTP Bridge to one of the primary application networks and Multicast onto it, which will reach about 85% of the servers it needs to. For the remaining few servers that need PTP time, we could run another dedicated network for it. The advantage of this approach is we would get to utilise the SolarFlare adapters' hardware timestamping and improve our accuracy. The downside is of course I'd be mixing PTP traffic with application traffic, so my PTP accuracy is subject to application network load. I'd also need to run a relatively small number of cables for the remaining servers that need precise time but aren't on the application network.

Now for something actually technical

This post turned into a large theoretical ramble. In the next post I'll go into how I've built the PTP Bridge server, complete with configuration examples, and then we'll move on to testing and measuring the various design options above.

Saturday, November 28, 2015