Saturday, November 28, 2015

Solving MiFID II Clock Synchronisation with minimum spend (part 2)

(Due to the embedded Gists this post looks best in Blogger)

In my previous post I talked about the new Legislation coming to the European Financial sector in the form of MiFID II, and specifically about the Clock Synchronisation requirements. We touched on the Precision Time Protocol (PTP) and how its accuracy is affected by various factors. I gave my opinion on various Linux software implementations, and I proposed a few designs that may make us MiFID II compliant.

The approach I am starting with is the easiest - it requires very little physical work and we don't have to go out and buy special PTP hardware. It makes use of what I call a "PTP Bridge", which translates PTP time from our time source and distributes it around our network. We will be Multicasting PTP through a firewall in order to reach all of the servers we need to. We know from the PTP theory that this is not an optimal design, but as I said, it's the quickest and easiest to start off with.

In this post we will get into the technical details on building the PTP Bridge and attaching a single client. The majority of the configuration samples will be in Puppet code - who builds things by hand any more?

In future posts we'll go into measuring the design's accuracy and either improving upon it or trying different designs.

PTP Bridge + Firewall Architecture

To re-cap very quickly, I have at my disposal a MicroSemi SyncServer S300 as a GPS time source. It only supports PTP using Unicast Signaling or the 802.3 (Ethernet / Layer 2) transport. We will use the linuxptp software to consume the L2 PTP from the S300 and to multicast it out to our PTP clients through a firewall. Our PTP clients will run SolarFlare's sfptpd daemon to consume multicast PTP. The previous post covers these design choices. The design looks roughly like this:

The S300 only broadcasts L2 PTP from a single interface (LAN2). We have created a dedicated VLAN just for this purpose, and our PTP Bridge will be on this VLAN as well. For the moment it will be the only other device attached, but in the future we may have other devices here as well.

The PTP Bridge is a CentOS 6 server with a bonded management interface for standard Linux services (SSH, etc). Since linuxptp does not support bonded interfaces, we need a separate dedicated interface to connect to the S300. We will also need another separate interface to multicast PTP traffic to the firewall.

The firewall is configured for IGMPv3 and has the necessary configuration to allow the PTP Bridge and PTP clients to join the standard PTP Multicast group - as defined in IEEE 1588.

The sfptpd daemon can work on Bonded interfaces so all of our PTP clients should simply need to specify the management interface to receive PTP from and it should "just work".

Building A PTP Bridge

The interaction between the different hardware and software components in the PTP Bridge can be a little confusing; there are a lot of moving parts. Here's a colorful drawing to help out. What each component does and why is explained below, inline with the Puppet code that creates it.

I've written a Puppet module to manage linuxptp software. It does not support configuring every single thing possible right now; I'm adding functionality to it as I need it. I'll definitely take Pull Requests for added functionality.

The init script that comes with the Red Hat RPM manages ptp4l and phc2sys as a single service, but for our PTP Bridge we need to run multiple instances of both. We have to disable the normal linuxptp services and use supervisord instead. I find the ajcrowe-supervisord module the most functional:

I need one ptp4l process configured in Layer 2 mode to be a slave of the S300 Grandmaster Clock, and we need to know what interface to have this ptp4l bind to. The linuxptp::ptp4l defined type takes care of writing the correct configuration file for me. I then use a supervisord::program to keep this ptp4l program running with the generated configuration file, also specifying that it writes to STDOUT rather than syslog so supervisord can handle the logging:

I now need a second instance of ptp4l to do our multicasting to the firewall, specifying a different interface to send multicast PTP out of:

This takes care of the daemons handling the PTP protocol. If both the network interfaces shared the same PTP Hardware Clock (PHC) device then this is all we would need, but in my PTP Bridge server the network interfaces have separate clocks. The linuxptp Puppet module contains a Facter Fact that will map network interface to PHC device:

[root@ptp-bridge ~]# facter -p phc
{"em1"=>"ptp4", "em2"=>"ptp5", "p4p1"=>"ptp0", "p4p2"=>"ptp1", "p4p3"=>"ptp2", "p4p4"=>"ptp3"}

Or "ethtool -T <interface>" will tell you.
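As a quick way to check whether two interfaces share a clock, the Fact output above can be parsed and compared. This is only a minimal Python sketch: the function names are my own, and the interface pairing (p4p4 facing the PTP Master, p4p3 doing the multicast) is inferred from the /dev/ptp3 and /dev/ptp2 devices that appear in the ptp4l logs later in this post.

```python
import re

# Output of `facter -p phc` copied from above (a Ruby-style hash).
FACT_OUTPUT = (
    '{"em1"=>"ptp4", "em2"=>"ptp5", "p4p1"=>"ptp0", '
    '"p4p2"=>"ptp1", "p4p3"=>"ptp2", "p4p4"=>"ptp3"}'
)


def parse_phc_fact(text):
    """Return {interface: phc_device} from the Fact's hash output."""
    return dict(re.findall(r'"([^"]+)"=>"([^"]+)"', text))


def share_phc(mapping, iface_a, iface_b):
    """True if both interfaces are backed by the same PHC device."""
    return mapping[iface_a] == mapping[iface_b]


phc = parse_phc_fact(FACT_OUTPUT)

# Different PHC devices means phc2sys is needed between the two interfaces.
print(phc["p4p4"], phc["p4p3"], share_phc(phc, "p4p4", "p4p3"))
```

If share_phc() returned True for the two interfaces, the extra phc2sys instance between them would not be needed.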

This means I have to use linuxptp's phc2sys program to synchronise the time on the interface connected to the PTP Master with the interface that's sending the Multicast. phc2sys does not have a configuration file, only command line arguments, so we only need a supervisord::program, specifying the Master interface as the master clock, and the multicast interface as the slave clock:

This takes care of everything PTP. We should probably synchronise the Linux System clock as well though, as nothing will be doing that by default. A fourth supervisord::program, running a second instance of phc2sys, is used:

You'll notice the '-z' argument to phc2sys - this option comes in with linuxptp version 1.5, but I recommend at least version 1.6 (more on this later). The flag is used to specify the ptp4l socket and is necessary to differentiate when you have multiple ptp4l processes on one box. The linuxptp::ptp4l defined type will configure distinctly named socket files.

linuxptp-1.6 is available in Fedora 24. I was able to take the linuxptp-1.5 RPM Spec file and just replace the source tarball and test suites. Here is the head of the file showing the test suite and clknetsim GitHub hashes I used:

Now that I have all the components, I can run Puppet and I get the four supervisord programs I expect started up:

[root@ptp-bridge ~]# supervisorctl status
multicast                        RUNNING    pid 43206, uptime 0:01:57
phc2sys_multicast                RUNNING    pid 43207, uptime 0:01:57
phc2sys_system                   RUNNING    pid 43205, uptime 0:01:57
ptp-master                       RUNNING    pid 43208, uptime 0:01:57

I can see the PTP Master ptp4l process is synchronising with the S300 time source:

[root@ptp-bridge ~]# tail -f /var/log/linuxptp/ptp-master.log
ptp4l[874317.443]: selected /dev/ptp3 as PTP clock
ptp4l[874317.459]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874317.459]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874318.039]: port 1: new foreign master 000000.0000.000000-1
ptp4l[874322.038]: selected best master clock 000000.0000.000000
ptp4l[874322.038]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
ptp4l[874323.038]: master offset      35389 s0 freq  +34540 path delay         0
ptp4l[874324.037]: master offset      35415 s1 freq  +34566 path delay         0
ptp4l[874325.037]: master offset      -2372 s2 freq  +32194 path delay         0
ptp4l[874325.037]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[874326.038]: master offset        -85 s2 freq  +33769 path delay         0

I can also see the Multicast ptp4l process has gone into the Master state:

[root@ptp-bridge ~]# tail -f /var/log/linuxptp/multicast.log
ptp4l[874317.439]: selected /dev/ptp2 as PTP clock
ptp4l[874317.439]: port 0: hybrid_e2e only works with E2E
ptp4l[874317.440]: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874317.440]: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l[874324.062]: selected best master clock 001b21.fffe.6fa06c
ptp4l[874324.062]: assuming the grand master role

The server has joined the PTP Multicast group:

[root@ptp-bridge ~]# netstat -ng | grep -P '224.0.[01].(107|129)'
p4p3            2
p4p3            2

And if I snoop this network I can see it Multicasting out packets:

[root@ldprof-live-ptpb01 ~]# tcpdump -i p4p3 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p4p3, link-type EN10MB (Ethernet), capture size 65535 bytes
14:06:13.447594 IP > UDP, length 44
14:06:13.447657 IP > UDP, length 44

We should check that our PHC clocks and system clock are being synchronised correctly as well:

[root@ptp-bridge ~]# tail -n1 /var/log/linuxptp/phc2sys_multicast.log
phc2sys[875054.546]: phc offset       135 s2 freq  +34671 delay   5023

[root@ptp-bridge ~]# tail -n1 /var/log/linuxptp/phc2sys_system.log
phc2sys[875099.526]: phc offset        39 s2 freq  -12380 delay   2379
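The servo lines that ptp4l and phc2sys print share a common shape, so they are easy to pull apart if you want to watch the offsets over time. Below is a minimal Python sketch that parses the log lines shown above. The regular expression and field names are my own, not part of linuxptp; the state names follow linuxptp's servo states (s0 unlocked, s1 clock step, s2 locked), with offsets in nanoseconds and frequency adjustments in parts per billion.

```python
import re

# Matches servo lines from ptp4l ("master offset ... path delay")
# and from phc2sys ("phc offset ... delay").
SERVO_LINE = re.compile(
    r"^(?P<prog>ptp4l|phc2sys)\[(?P<uptime>\d+\.\d+)\]:\s+"
    r"(?P<source>master|phc) offset\s+(?P<offset>-?\d+)\s+"
    r"s(?P<state>[012])\s+freq\s+(?P<freq>[+-]?\d+)\s+"
    r"(?:path\s+)?delay\s+(?P<delay>-?\d+)\s*$"
)

SERVO_STATES = {"0": "unlocked", "1": "step", "2": "locked"}


def parse_servo_line(line):
    """Return a dict of servo fields, or None for any other log line."""
    m = SERVO_LINE.match(line.strip())
    if m is None:
        return None
    return {
        "program": m["prog"],
        "offset_ns": int(m["offset"]),
        "state": SERVO_STATES[m["state"]],
        "freq_ppb": int(m["freq"]),
        "delay_ns": int(m["delay"]),
    }


# Lines copied from the logs above.
SAMPLE = [
    "ptp4l[874322.038]: selected best master clock 000000.0000.000000",
    "ptp4l[874323.038]: master offset      35389 s0 freq  +34540 path delay         0",
    "ptp4l[874326.038]: master offset        -85 s2 freq  +33769 path delay         0",
    "phc2sys[875099.526]: phc offset        39 s2 freq  -12380 delay   2379",
]

for line in SAMPLE:
    fields = parse_servo_line(line)
    if fields is not None:
        print(fields["program"], fields["state"], fields["offset_ns"])
```

A clock that has settled should report the locked state with a small offset, as in the last two sample lines.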

That takes care of everything on the PTP Bridge for now.

Slave Clock with sfptpd

Our PTP Bridge appears to be sending out Multicast fine, now I want to get another server consuming this Multicast using sfptpd. I have written a Puppet module for sfptpd as well.

We need to turn off NTP and configure sfptpd on the management interface that's connected to the firewall handling the PTP Multicast traffic. We'll also turn on stats logging to a file:

If we start sfptpd and check the multicast group memberships, it looks like we have joined the multicast group correctly:

[root@ptp-client ~]# netstat -ng | grep -P '224.0.[01].(107|129)'
bond241         2
bond241         2

However, looking at the log we have a problem:

2015-11-26 14:46:16.058064: info: running as a daemon
2015-11-26 14:46:16.058326: info: creating PTP sync-module
2015-11-26 14:46:16.058498: info: PTP clock: local reference clock is system, PTP clock is system
2015-11-26 14:46:16.063963: info: using SO_TIMESTAMPNS software timestamps
2015-11-26 14:46:16.164341: notice: Now in state: PTP_LISTENING
2015-11-26 14:46:16.164510: info: creating NTP sync-module
2015-11-26 14:46:17.497587: info: New best master selected: 0000:0000:0000:0000(unknown)/1
2015-11-26 14:46:17.497640: notice: Now in state: PTP_SLAVE, Best master: 0000:0000:0000:0000(unknown)/1
2015-11-26 14:46:18.497152: info: received first Sync from Master
2015-11-26 14:46:19.689019: warning: failed to receive DelayResp for DelayReq sequence number 0
2015-11-26 14:46:20.751518: warning: failed to receive DelayResp for DelayReq sequence number 1
2015-11-26 14:46:21.751518: warning: failed to receive DelayResp for DelayReq sequence number 2
2015-11-26 14:46:21.751554: warning: failed to receive DelayResp 3 times in hybrid mode. Reverting to multicast mode.
2015-11-26 14:46:23.439019: warning: failed to receive DelayResp for DelayReq sequence number 3
2015-11-26 14:46:24.564018: warning: failed to receive DelayResp for DelayReq sequence number 4
2015-11-26 14:46:25.064020: warning: failed to receive DelayResp for DelayReq sequence number 5

The sfptpd daemon is not getting back any response to our Delay Request messages. After three attempts it drops out of hybrid mode and reverts to multicast mode, but still does not get any response.

sfptpd starts in hybrid mode first by default, which is a property of the "Enterprise" PTP Profile where Delay messages are unicast back to the Master Clock, rather than being multicast back. In large networks, every client multicasting back their Delay messages could get very noisy. After hybrid mode fails it falls back to the default multicast mode for Delay messages, but that still does not work.

Snooping the interfaces on the PTP Bridge reveals an asymmetric routing problem. This is a side effect of our network architecture, our firewall, and using dedicated interfaces on the PTP Bridge - the PTP Bridge is sending the unicast Delay Responses out the wrong interface, when they need to be sent out the same interface as the multicast packets. This is solved with policy based routing and some iptables rules on the PTP Bridge:

The Puppet code above creates firewall rules that match inbound and outbound PTP Unicast packets and mark them with firewall mark "319". A network routing rule then says to look up routing table 100 for packets marked 319, and table 100 says the default gateway for these packets is the firewall on the other side of the Multicast interface.
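To make the mark-then-route logic easier to follow, here is a toy Python model of it. It is not the Puppet code or real iptables: the mark (319) and table (100) come from the description above, but the match on UDP ports 319/320 (the standard PTP event and general ports), the gateway addresses, and the bond0 management interface name are assumptions for illustration only.

```python
import ipaddress
from dataclasses import dataclass

PTP_MARK = 319              # firewall mark from the text
PTP_TABLE = 100             # routing table from the text
PTP_UDP_PORTS = {319, 320}  # assumed match criteria: standard PTP ports


@dataclass
class Packet:
    dst_ip: str
    udp_port: int
    mark: int = 0


# Default route per table as (gateway, interface). Addresses are
# documentation-range placeholders and bond0 is a hypothetical name.
ROUTES = {
    "main": ("192.0.2.1", "bond0"),
    PTP_TABLE: ("198.51.100.1", "p4p3"),
}


def mark_ptp_unicast(pkt):
    """Firewall step: mark unicast PTP packets, leave everything else."""
    is_unicast = not ipaddress.ip_address(pkt.dst_ip).is_multicast
    if is_unicast and pkt.udp_port in PTP_UDP_PORTS:
        pkt.mark = PTP_MARK
    return pkt


def select_route(pkt):
    """Routing rule: packets marked 319 look up table 100, others use main."""
    table = PTP_TABLE if pkt.mark == PTP_MARK else "main"
    return ROUTES[table]


delay_resp = mark_ptp_unicast(Packet("203.0.113.10", 320))
ntp_reply = mark_ptp_unicast(Packet("203.0.113.10", 123))

print(select_route(delay_resp))  # leaves via the multicast interface
print(select_route(ntp_reply))   # follows the normal default route
```

The point of the model is that only the marked unicast PTP traffic is diverted; everything else on the PTP Bridge keeps using the main routing table.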

Now let's try sfptpd again on a client:

2015-11-26 15:00:42.366823: info: running as a daemon
2015-11-26 15:00:42.367083: info: creating PTP sync-module
2015-11-26 15:00:42.367240: info: PTP clock: local reference clock is system, PTP clock is system
2015-11-26 15:00:42.372742: info: using SO_TIMESTAMPNS software timestamps
2015-11-26 15:00:42.473143: notice: Now in state: PTP_LISTENING
2015-11-26 15:00:42.473285: info: creating NTP sync-module
2015-11-26 15:00:43.517501: info: New best master selected: 0000:0000:0000:0000(unknown)/1
2015-11-26 15:00:43.517540: notice: Now in state: PTP_SLAVE, Best master: 0000:0000:0000:0000(unknown)/1
2015-11-26 15:00:44.517268: info: received first Sync from Master
2015-11-26 15:00:44.517383: info: clock system: applying offset 35.999998423 seconds
2015-11-26 15:01:20.554657: info: clock phc4: applying offset 35.999978085 seconds
2015-11-26 15:01:20.554792: info: clock phc5: applying offset 35.999979747 seconds
2015-11-26 15:01:20.554903: info: clock phc6: applying offset 35.999979123 seconds
2015-11-26 15:01:21.498046: info: ignoring DelayResp because offset from master not valid
2015-11-26 15:01:21.498084: info: received first DelayResp from Master

That appears to be working; we're not missing any Delay Response messages from the Master Clock. However, the server time is wrong, very wrong in fact. It's over 30 seconds too fast. We can see that in the output above where the clocks are stepped forward with a +36 second offset.

In the year 2015, 35 / 36 seconds is a "magic" number in PTP land - it is the difference between International Atomic Time (TAI) and Coordinated Universal Time (UTC). Let's go into a little more theory to understand what this is.

TAI does not account for the slowing of the rotation of the earth whereas UTC does, by way of leap seconds. As of 30th June 2015, 26 leap seconds have been inserted into UTC on top of an initial 10 second offset, which means TAI is exactly 36 seconds ahead of UTC (it was 35 seconds before that date). International Atomic Time (TAI) is the time standard that all PTP Clocks run in. They don't run in UTC. You can see this on the PTP Bridge if you query the interface PHC:

[root@ptp-bridge ~]# date; phc_ctl /dev/ptp3 get
Fri Nov 27 12:46:36 UTC 2015
phc_ctl[956362.783]: clock time is 1448628432.159550826 or Fri Nov 27 12:47:12 2015
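The 36 second gap can be confirmed from the two values printed above. A minimal Python sketch of the arithmetic, using only the numbers from that output:

```python
from datetime import datetime, timezone

# Values copied from the `date; phc_ctl /dev/ptp3 get` output above.
system_utc = datetime(2015, 11, 27, 12, 46, 36, tzinfo=timezone.utc)
phc_seconds = 1448628432.159550826

# Whole-second difference between the PHC (TAI) and the system clock (UTC).
offset = int(phc_seconds) - int(system_utc.timestamp())
print(offset)  # 36


def tai_to_utc(tai_seconds, utc_offset=36):
    """Convert a PTP (TAI) timestamp to UTC; 36 s was the offset in late 2015."""
    return tai_seconds - utc_offset


# Applying the offset brings the PHC reading back to the system's UTC second.
print(int(tai_to_utc(phc_seconds)) == int(system_utc.timestamp()))
```

The default of 36 in tai_to_utc() is only valid for the period this post was written in; the offset changes every time a leap second is inserted.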

So the PTP wire protocol is in TAI, and it's the job of the PTP software implementation to translate from TAI to UTC by applying the UTC offset, which may or may not be communicated from the Grandmaster Clock.

Now back to our issue - we're 36 seconds out, which is a strong indication our system clock is being set to TAI time. There could be one of two things going wrong here:
  1. sfptpd is using Software Timestamping on ingress, which is in UTC, but the PTP packets are stamped with TAI. When comparing these timestamps sfptpd thinks the system clock is 36 seconds out.
  2. sfptpd and ptp4l are interpreting the UTC Offset Valid flag in different ways, and so sfptpd is not applying the UTC offset.

There is an option in the default sfptpd.conf template called ptp_utc_valid_handling (it is not mentioned in the Advanced User Guide). I think the comments say it all so will paste verbatim:

# Configures how PTP handles the UTC offset valid flag. The specification is
# ambigious in its description of the meaning of the UTC offset valid flag
# and this has resulted in varying different implementations. In most
# implementations, if the UTC offset valid flag is not set then the UTC offset
# is not used but in others, the UTC offset valid is an indcation that the
# master is completely confident that the UTC offset is correct. Various
# options are supported:
#    default  If UTCV is set use the UTC offset, otherwise do not use it
#    ignore   Do not used the UTCV flag - always apply the indicated UTC offset
#    prefer   Prefer GMs that have UTCV flag set above those that don't
#    require  Do not accept GMs that do not set UTCV
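My reading of the "default" and "ignore" options in those comments can be written down as a small Python sketch. This is an illustration of the comments only, not sfptpd's actual implementation; the function names are hypothetical, and the "prefer" and "require" options are left out because the comments only describe their effect on Grandmaster selection.

```python
def applied_utc_offset(mode, announced_offset, utcv_flag):
    """Seconds subtracted from PTP (TAI) time to get UTC, per the comments."""
    if mode == "default":
        # Use the announced offset only when the master sets UTCV.
        return announced_offset if utcv_flag else 0
    if mode == "ignore":
        # Always apply the announced offset, whatever the flag says.
        return announced_offset
    raise ValueError("mode not modelled in this sketch: " + mode)


def system_clock_error(mode, announced_offset, utcv_flag, true_offset=36):
    """How far ahead of UTC the system clock ends up, in seconds."""
    return true_offset - applied_utc_offset(mode, announced_offset, utcv_flag)


# A master announcing a 36 s offset without setting the UTCV flag:
print(system_clock_error("default", 36, utcv_flag=False))  # clock left in TAI
print(system_clock_error("ignore", 36, utcv_flag=False))   # clock in UTC
```

If the second explanation above is the right one, the first case would reproduce the 36 second error seen on the client.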

I can't get my hands on the IEEE 1588-2008 specification (you have to buy it) so I can't read about the flag myself. I still don't know whether what I'm seeing is a software timestamping issue or a UTC Offset Valid flag issue. It can be worked around though by setting ptp_utc_valid_handling to "ignore", which tells sfptpd to always apply the UTC offset:

What's Next?

We now have PTP time being translated from Layer 2 to Multicast through our PTP Bridge and being consumed on one of our servers. We know the current approach has design flaws that will affect the accuracy of PTP, which is important because we need to achieve a minimum level of accuracy under the coming legislation.

In the next post we will look at capturing statistics from PTP software in order to see how accurate the current design is.

1 comment:

  1. The complexity of MiFID II is not synchronising clocks. That's the easy part - albeit you need multiple clock sources. What makes MiFID II clock synchronisation challenging is the fact that you need to be able to verify traceability to UTC. You can't just sync your clocks and hope for the best.