Wednesday, October 5, 2016

How computer clocks work and how they are adjusted

A month or two ago I was asked by someone in our Operations team what clock synchronisation is and why we need to do it. I gave them a very basic few-sentence answer. That got me thinking that I never read an easy explanation when I myself got started in this area, and the terminology used can be confusing the first time you come across it. Below is a copy-paste out of our internal documentation where I attempt to explain computer clock synchronisation and the reason for it. The explanation is aimed at non-technical people; it skips over many intricacies and is "wrong" from certain points of view, but hopefully it describes the problem and solution accurately enough.

So here it is...

There is some terminology for time synchronisation that you have to get your head around (like Step, Slew, and Frequency), and to get that you have to understand how a "Clock" is implemented in a computer.

A computer only knows how to work with digital signals. Imagine a small device (called an Oscillator) where if you pass an electrical signal into it, it sends out pulses of electrical signals at constant intervals. If these intervals are small enough, we can approximate the passage of time by counting the number of signal pulses.

For the sake of easy mathematics let's assume that 6000 pulses means 1 minute has passed, so that is 100 pulses per second, or 100 Hertz (Hz). If we put one of these devices onto a motherboard, send an electrical signal through it, and count the number of pulses coming out, the computer will be able to know how much time has passed.

That's nice, but we need to know what the current time actually is. Knowing how many seconds have passed is not helpful if I don't even know what year it is. I can ask a clock that I trust what the real time is and just set my time to what it tells me. This is called a clock STEP - where the clock is forcibly set to a given value. Fantastic, my computer can now draw the correct time on my Desktop Clock.


Now imagine that like all computer components, the Oscillator is mass produced cheaply and never perfect. So even though it's supposed to send 100 pulses every second, my one actually only sends 90 pulses per second due to a manufacturing discrepancy, so my computer clock ends up running slow (when it counts 100 pulses, more than 1 second of real time has passed). In this case I can't trust the passage of time on my own machine. I can solve this problem by asking the trusted clock twice. After I step the clock the first time, if I wait what I think is 1 second and ask the same trusted time source what the time is, I'll be able to figure out that I am 1/10 of a second slow, or, my OFFSET or DRIFT is 0.1 seconds. From there I can figure out that I need to count 10 fewer pulses to get to one second. If I now "remember" that I need to count 10 fewer pulses and step the clock again... great, I have now compensated for my Oscillator defect and my desktop clock runs at the correct speed.


Now imagine that like all computer components, the Oscillator behaves differently under different conditions even though we don't want it to. Maybe if it gets hotter it sends pulses faster, and if it gets colder it sends pulses slower. Or maybe, due to more manufacturing faults, the pulse rate isn't really constant even though it's supposed to be.

This means that my +10 frequency adjustment is only correct for a single moment. Since I've got no idea how stable the Oscillator pulses are, in order to keep time as accurate as possible I have to constantly refer to a clock that I can trust and adjust the frequency as needed. Maybe every time I ask it turns out I'm between +0.1 and -0.1 seconds out, so my frequency adjustment range is +/- 10 pulses. I don't particularly want to step the clock each time though, that can be a bit brutal to the programs running on my computer - time would appear to be passing in "jumps". What I can do though is count more or fewer pulses to catch me up to where time is supposed to be, so the time will SLEW in the right direction.

For example, if I am 0.1 seconds behind and I count 10 fewer pulses, then after one second passes I will still be 0.1 seconds behind (I've compensated for the drift but not caught up). Instead I'm going to count 20 fewer pulses, so that 1 second from now I should be back level with my trusted clock. This is nicer to my computer programs than stepping the clock all the time; it is a smooth FREQUENCY ADJUSTMENT.
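If it helps to see the arithmetic, here is a throwaway Python sketch of the pretend 100 Hz oscillator above. The numbers and the idea of "declaring a second every N pulses" are just this post's simplification, nothing like how a real kernel disciplines its clock:

PULSES_PER_REAL_SECOND = 90   # the faulty oscillator really runs slow

def clock_reading(real_seconds, pulses_per_tick):
    """What my clock reads after real_seconds, if I declare one of my
    seconds every pulses_per_tick pulses."""
    pulses_seen = PULSES_PER_REAL_SECOND * real_seconds
    return pulses_seen / float(pulses_per_tick)

print(clock_reading(10, 100))  # 9.0   - uncorrected: a full second behind after 10s
print(clock_reading(10, 90))   # 10.0  - frequency adjusted: the drift is cancelled
print(clock_reading(1, 80))    # 1.125 - slewing: run fast for a moment to catch up,
                               #         then go back to counting 90 per second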

I have to make these adjustments all the time though because my Oscillator varies, so what generally ends up happening is I adjust too fast and overshoot the real time, so I adjust to slow down, then I undershoot and have to speed up again, and so on. If I were to draw my clock adjustments as a graph it would end up looking a bit like a "wave" pattern. It's not a big deal though because we're only talking about a couple of pulses each time, and my desktop clock is generally hovering around the real time.


I can now keep time accurately, but I have to be in constant communication with the trusted clock, and make constant frequency adjustments. Since I know my Oscillator can lose up to 1/10 of a second every second, that means my computer clock can run freely (or FREERUN) for 10 of my seconds before I could be 1 second out. Let's say I don't care about seconds, only minutes, so I can free run for at most 600 seconds (or 10 minutes) before I would have to assume my clock could be 1 minute inaccurate. Maybe I do updates once every 10 seconds then; that is enough to keep me happy.


One Oscillator pulse is 10 milliseconds in my pretend 100 Hz computer, so I can only accurately time stamp an event with 10 millisecond accuracy. There's no faster signal to measure by in my PC, so it's physically impossible to tell the difference between events that happen at 0.001 and 0.002 seconds. Luckily real computer oscillators have much higher frequencies than 100 Hertz; they are in the range of hundreds of MHz (nanosecond accuracy). The same problems still exist though - they need constant adjustment to keep them in line, otherwise they go free running off in some direction.

Wednesday, May 18, 2016

Presenting at ITSF's MiFID II Workshop

The International Timing & Sync Forum (ITSF) and Executive Industry Events (EIE) have put together a workshop on MiFID II UTC Traceability specifically for the Finance sector. There will be quite a lot of knowledge in the room: Wojciech Owczarek (the developer of PTPd) is presenting, and Neil Horlock of Credit Suisse, a fountain of information regarding the regulations and how an organisation may satisfy them, is chairing a session.

I will also be presenting my own findings in the MiFID II Clock Synchronisation space, specifically regarding PTP distribution designs and the monitoring around it, plus where I'd like to get to in the future. It should be a lot less Sales-y than other events, like STAC, so if you're in Fin Tech and want to come down for a chat, I encourage you to register.

Friday, April 8, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 7)

In this series we are attempting to solve a clock synchronisation problem to a degree of accuracy in order to satisfy MiFID II regulations, and we're trying to do it without spending a lot of money.

So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB, and
  4. Finished recording metrics
  5. Drawn lots of graphs with Grafana and found contention on our firewall
  6. Tried a dedicated firewall for PTP
The start of 2016 opened up a few new avenues for this project. Both Solarflare and the developer of PTPd showed an interest in what we were up to, and both suggested alternative designs to what we've tried so far. What I'll look at first though is a purely switched design, i.e. no firewall. This is for two reasons. First, the PTP Bridge's mean variance from the Grand Master is 100ns; if I can get the clients close to this, it's a good achievement. Second, I've had to return the firewall I loaned from our Network Engineering team, and I've had to find myself two more clients for testing with (in other words I need to rebuild most of my test bed).

Multicast out multiple interfaces

Here's an adjusted design diagram of what we're doing, emphasising that the bridge is now publishing individual streams of PTP onto more than one LAN:



With Puppet it's pretty easy to turn one ptp4l instance and one supervisord program into two. In fact the amount of configuration needed has reduced quite a lot. We don't have any multi-homed routing issues, so we don't need any firewall marking for policy based routing on the PTP Bridge. A Puppet run and restart of supervisord now has the PTP Bridge multicasting out two separate interfaces.
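The end result in supervisord is just two near-identical program blocks, roughly like the sketch below (the paths and the second interface name are illustrative, not the literal Puppet output):

; one ptp4l master instance per egress LAN
[program:ptp4l-p4p3]
command=/usr/sbin/ptp4l -f /etc/ptp4l-p4p3.conf -i p4p3
autorestart=true

[program:ptp4l-p4p4]
command=/usr/sbin/ptp4l -f /etc/ptp4l-p4p4.conf -i p4p4
autorestart=true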

My PTP clients are now receiving PTP multicast, but they are not getting any Delay Responses back:

2016-03-15 18:03:44.968187: warning: failed to receive DelayResp for DelayReq sequence number 0
2016-03-15 18:03:45.280687: warning: failed to receive DelayResp for DelayReq sequence number 1
2016-03-15 18:03:45.843188: warning: failed to receive DelayResp for DelayReq sequence number 2
2016-03-15 18:03:45.843201: warning: failed to receive DelayResp 3 times in hybrid mode. Reverting to multicast mode.

Another problem: a client on one network is getting the multicast from both interfaces. That should be impossible, as there shouldn't be any multicast routing or Rendezvous Point here, but our symptoms suggest there still is:

17:59:31.057579 IP 192.168.0.4.319 > 224.0.1.129.319: UDP, length 44
17:59:31.057626 IP 192.168.0.4.320 > 224.0.1.129.320: UDP, length 44
17:59:31.261173 IP 192.168.1.4.319 > 224.0.1.129.319: UDP, length 44
17:59:31.261227 IP 192.168.1.4.320 > 224.0.1.129.320: UDP, length 44

The double multicast feeds are messing with the client - it will only be able to Unicast back to one of them (the one on the local subnet), so the sfptpd daemon is rightly getting confused about who its master should be. The Best Master Clock algorithm won't help in this circumstance as the clients are only able to communicate with one of the Masters by design, and the BMC algorithm is probably telling them to use the wrong one.

A quick look around our older software firewall that we used a few blog posts ago shows that we've still got RP config on there and the firewall is doing IGMP Querying on various subnets. Turning this off stops the second flow of multicast, but it also cuts off our local flow; the switch has stopped forwarding multicast to the clients. The switch also says it knows nothing about any PTP multicast group membership.

We always turn on IGMP Snooping here, so the switch maintains a list of ports to forward multicast to rather than broadcast out all multicast packets on the local subnet. The issue above arises because there's no IGMP Querier on the subnet any more - it used to be the firewall but we've turned that off - so the switch doesn't "know" where to send the multicast any more. We have an old home-grown fix for this, a little daemon written in C that we call 'igmpqd' that periodically broadcasts out IGMP Queries to force all the clients to reply. If we configure and start igmpqd on the PTP Bridge, very soon we have our switch aware of multicast group membership and PTP is flowing again.
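A couple of quick host-side sanity checks are handy when debugging this kind of thing (the switch-side commands are vendor specific so I won't show them; swap bond1 for whichever interface your client uses):

# Which multicast groups has this host joined? 224.0.1.129 should be listed
ip maddr show dev bond1
cat /proc/net/igmp

# Are IGMP membership reports and queries actually on the wire?
tcpdump -i bond1 igmp -nn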

The graphs below show a few minutes of offset tolerance from both clients. The top set is PTP through the ASIC firewall from my last post, the bottom set is the switch-only design:



The results look good, ±2μs is the best I've seen so far. The results look even better if you take a look over several hours:



Once again the top graph through the ASIC firewall, and the bottom is our most recent results through just a switch. What's immediately apparent is the large reduction in spikes when we're not running through the firewall.

Before we jump to any conclusions (or go running to the Networking team with a big "I told you so" grin on our faces), let's remind ourselves that we've had to use two different clients from the ones we had before. While the clients themselves are the same hardware as each other, they are different to the previous pair of clients. They are in the same rack though, so they are not physically further away than the previous clients (think switch hops).

The last big thing that could be different is the way the clients were configured - without having the old clients in front of me now it's impossible to tell whether they were actually busy running something that was interfering with our results, rather than idle. Looking at the new results I now have a sneaking suspicion that my previous clients were in fact running some part of our production workload. The down side of doing this work on the cheap without a dedicated test bed is you risk running into these repeatability problems...

Anyway, we have some evidence (whether it is as credible as it looks or not) to support the theory that the switch-only design is "better". But why are we not a lot closer to the 100ns accuracy that the PTP Bridge is reporting it's getting from the Grand Master Clock?

Running with Wrong Assumptions

Last post I said "The PTP Bridge generally keeps a baseline of 100ns accuracy" and I then went running with the assumption that we should be able to achieve a similar accuracy on the clients. Turns out this idea was flawed. The 100ns as a value is correct, but this is a measurement based on the hardware timestamp of the ingress of the PTP packets from the GM. To then go and assume that it would take another 100ns to get to a PTP Client ignores quite a lot of the other moving parts. I've extended a diagram I made previously that showed how each of the PTP Bridge components interact to show exactly what is and what is not in that 100ns from GM to Bridge:




As you can see, the 100ns on ingress is only a small part of the system. It does not account for the time spent synchronising the ingress PTP Hardware Clock to the other interface PHCs, and it's during these steps that we can be hit by scheduling delays or other interruptions. The phc2sys program outputs statistics for this though, so let's look.

Here's a sample of the output from one of the "multicast" phc2sys processes, which is reading from the PHC that is receiving PTP from the grand master and writing to the PHC that is sending multicast to the client:

phc2sys[11830930.330]: phc offset       -11 s2 freq  +34431 delay   4959
phc2sys[11830931.330]: phc offset         1 s2 freq  +34439 delay   5327
phc2sys[11830932.331]: phc offset       -27 s2 freq  +34412 delay   5711
phc2sys[11830933.331]: phc offset         1 s2 freq  +34431 delay   5599
phc2sys[11830934.331]: phc offset        21 s2 freq  +34452 delay   4895
phc2sys[11830935.331]: phc offset      -120 s2 freq  +34317 delay   5695

Over time it's consistently around ±100ns for every phc2sys process running. Our experience with performance tuning our Java applications makes me distrust these numbers at first: they sound too small to be true. Looking at the phc2sys man page and code, the program takes a -N option which controls how many times it reads the master clock, and then chooses the values that were read the fastest, thus trying its best to reduce any scheduling or bus delays. So perhaps the numbers are more accurate than I originally thought, but I lack an independent reference time source in the PTP Bridge to verify them.
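For reference, the shape of the invocation is simple - the interface names below are placeholders for our ingress and egress ports, and -N is the knob described above:

# Discipline the egress PHC (facing the clients) from the ingress PHC
# (facing the grand master); read the source several times per update and
# keep the best readings. -O 0 because both PHCs carry the same timescale.
phc2sys -s p4p1 -c p4p3 -O 0 -N 10 -m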

Hardware Timestamping on a Client

I've wanted to try hardware timestamps on a client for a while now. It might also bring the design closer to sub-microsecond accuracy. There is a development branch of PTPd that supports the Linux PHC API on bonded interfaces, so we're going to give it a go. Building it into an RPM is pretty easy, there's an rpmbuild wrapper script provided. If you are going to try it out I would suggest you add --enable-runtime-debug to the ./configure lines in the RPM Spec file; this allows you to set various levels of debug logging in the configuration file.

To get this done very quickly I haven't done any Puppet work for PTPd, it's all configured by hand (gasp!). I will do it later when I try PTPd as the replacement software on the PTP Bridge. Turning off sfptpd and turning on ptpd is simple - just need to set which interface, the mode, and turn on stats logging, the rest of the defaults in /etc/ptpd.conf are fine:

ptpengine:interface=bond1

ptpengine:ip_mode=hybrid
global:statistics_file=/var/log/ptpd.stats

With the daemon running it is consuming multicast PTP from the PTP Bridge. There is a statistics file that reports the most recent metrics in a nice to read format, however at the moment I prefer to plot every data point, so will be looking at the statistics log, which is in CSV format:

2016-04-01 15:38:44.759881, slv, 001b21fffe6fa06c(unknown)/1,  0.000000000,  0.000012141,  0.000000000,  0.000024282, 550.827026367, S, 05623, 0.000000000, 0, 0.000000000, 0 0.000024282,  0.000000000

Parsing this file should be easy, it is comma separated and there's even a tail_csv plugin for collectd. Unfortunately tail_csv got added in collectd-5.3, and I've got the old CentOS 6 collectd-4.1. After finding a 5.5 RPM for CentOS 6 I wasn't able to get the tail_csv plugin working (Segfault) so gave up after a short amount of time.

I've been following the Influx Data projects, so wanted to see if Telegraf, their metrics collector, could do what I want. Unfortunately they also don't have a "tail"-like plugin (there is a feature request), only an Exec plugin, so it wouldn't quite work for what I wanted to do. So back to collectd's normal tail plugin. Writing and debugging non-Perl regular expressions for collectd turned out to be like pulling teeth - I seemed to be making what should be a simple problem unnecessarily difficult.

I have a stream of CSV lines that I want to extract certain fields from and then shoot them off to Influx; why I didn't think of Tail + Awk before I'll never know. I already have an Exec job in collectd that prints out the NTP offset every second; I can use the same approach to tail the PTPd stats log, send it to Awk for easy field splitting and then output the keys in collectd format that will get sent to Influx. Here's the Awk script:
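In rough form it's a shell wrapper around one Awk block; a trimmed-down sketch is below (the column positions follow the stats line above, while the collectd identifier and type/instance names are illustrative rather than exact):

#!/bin/bash
# Tail the PTPd stats CSV and emit collectd Exec-plugin PUTVAL lines,
# converting PTPd's seconds into nanoseconds as we go.
HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-1}"

tail -F /var/log/ptpd.stats | awk -F', *' -v host="$HOST" -v interval="$INTERVAL" '
{
    # $4 = one way delay, $5 = offset from master, $6 = slave to master (seconds)
    printf "PUTVAL \"%s/exec-ptpd_stats/tail_ns-delay\" interval=%s N:%d\n", host, interval, $4 * 1000000000
    printf "PUTVAL \"%s/exec-ptpd_stats/tail_ns-offset\" interval=%s N:%d\n", host, interval, $5 * 1000000000
    printf "PUTVAL \"%s/exec-ptpd_stats/tail_ns-s2m\" interval=%s N:%d\n", host, interval, $6 * 1000000000
    fflush()
}'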

It uses my custom Collectd types, and converts from PTPd's unit of seconds into nanoseconds. This is purely a cosmetic thing - Grafana is good at upscaling metrics (if you have 1000ms it will say 1s) but bad at downscaling (0.00000012s is hard to read).

There's a small side effect of using Collectd's Exec plugin: it wakes up every second, which ends up sending multiple data points per second all at the same time, rather than an even spacing of 4 Sync messages per second. This is not a huge problem because a lot of the Grafana graphs will do an implicit Mean() over a given time period, and my Tolerance graphs use Min() and Max() functions. If we ever look at the raw data though it would look rather strange. I keep saying I will move away from Collectd, but the streaming-style Exec plugin is just too easy to use at the moment.

Delay Messages, and PTP on cheap NICs

There's a problem indicated in the PTPd stats line above - the fourth column, One Way Delay, and the sixth column, Slave to Master, are always zero. Slave to Master communication is the Delay Request / Response messages, so there's something wrong with the Slave sending Unicast back to the Master (Unicast because we are in hybrid mode). This mode works with sfptpd on this machine, so architecturally there is no problem (no routing issues, etc). If we tcpdump we can see the Delay Request leave the Slave and we get a Delay Response from the Master, so there's no networking problem.

If we turn on runtime debugging (hence the --enable-runtime-debug RPM build option above), we see that PTPd is attempting to handle a Delay Response message, but it doesn't think it's waiting for a response so drops it:

2016-04-05 15:23:16.947792 ptpd[101366].bond242 (debug1)    (slv) ==> Handle DelayResp (9)
2016-04-05 15:23:16.947797 ptpd[101366].bond242 (debug3)    (slv) Timer DELAY_RECEIPT start requested at 20.   0 sec interval
2016-04-05 15:23:16.947804 ptpd[101366].bond242 (debug2)    (slv) timerStart:     Set timer DELAY_RECEIPT to 20.000000
2016-04-05 15:23:16.947811 ptpd[101366].bond242 (debug1)    (slv) Ignored DelayResp sequence 9 - wasn't waiting for one

This appears to be a software issue, so I got in touch with Wojciech Owczarek, the developer of PTPd. The first diagnosis is that PTPd had an issue reading the packet transmission timestamp of the Delay Request message, and thus was not expecting a Delay Response. This is almost always because PTPd was unable to read the TX hardware timestamp from the NIC, which in turn is almost always because the NIC did not provide a hardware timestamp for the Unicast Delay Request packet. This is strange, as "ethtool -T" says this Intel i350 GbE Controller supports hardware timestamps...

High-grade network cards designed for packet capture workloads (like Solarflare 7322s) have full hardware timestamping support, as in they will provide a NIC-generated timestamp for every packet. This obviously requires a bit more silicon than normal, plus drivers that support carrying the extra timestamp data alongside the packet data, and so these high-end cards are expensive. Many other server-grade NICs support hardware timestamps, but for PTP packets only, and the cards themselves enforce this.

Exactly what a certain model of card does can be difficult to determine - the Intel data sheets are largely beyond my ability to interpret. There is a good white paper that talks about the hardware timestamping capabilities of some Intel NICs in a sensible way (Section 6). One of the methods the cheaper NICs use to enforce hardware timestamps only on PTP packets is a hardware packet filter on certain MAC addresses. Since the Multicast address for PTP is a standard address, a NIC can filter on this and provide hardware timestamps just for PTP into a single register, and the driver can just read from that single register to get the packet timestamp. Since PTP is generally no more than a couple of messages a second, one register is enough. My guess is it's a great deal cheaper for NIC manufacturers to add a single register for PTP support, rather than implementing a high-grade "timestamp every packet" solution.

Wojciech's diagnosis was that this was what was happening here - the Intel i350 was not providing a hardware timestamp for the Unicast PTP Delay Request message because the packet didn't match the hardware filter. His suggestion was to change from hybrid mode back to multicast mode. This took all of seven seconds, and after a restart we now have stats about the Slave to Master offset and PTPd appears to be operating normally.

Interference

PTPd gives you a little bit more information than sfptpd. It will report the Master to Slave offset and the Slave to Master offset, which are the two offsets used to calculate One Way Delay (which is reported by sfptpd as Path Delay). While the values of these two metrics are largely useless when looked at individually, compared against each other over a period of time they can indicate a problem with PTP. After collecting metrics for a few minutes, an interesting pattern started to emerge regarding the Slave to Master offset.

The first graph below shows the Path Delay for Client1 (which is still running sfptpd for comparison); it is consistently around 31μs ±1μs, which is the number it has been at for months, so nothing surprising there:


The Master to Slave offset of Client2, now running PTPd, is pretty stable at 35μs. Note that the actual value compared to Client1 here is not important, this value is greatly dependent on network architecture, and it could be that Client2 is taking a few more switch port hops than Client1, depending on which interface is the active member of the bond:


The one that looks strange is Client2 Slave to Master offset. This should be roughly the same as Master to Slave, it's the same path across the switches, but here we see 30μs with spikes up to 50μs:


The Slave to Master offset looks strange, and because these intermediate offsets are part of what PTPd uses to calculate adjustments, our accuracy is slightly affected:


Client1 is ±2μs while Client2 is ±4μs - how is sfptpd with software timestamps performing better than PTPd with hardware timestamps?

The hint is in the variance in the Slave to Master offset. The graph shows us we're getting upwards of 20μs variance in either the egress/TX time from the Slave or the ingress/RX time on the Master, which are the two timestamps that would affect the Slave to Master offset. Client1 does not appear to be affected, so it's probably not the RX time on the Master but something to do with Client2.

With the server now suspect it didn't take long to find the problem. Somewhere during the development of the extra monitoring and trying out new versions I'd gotten Collectd into a Segfault-restart loop (it was crashing and respawning over and over). My NTP offset sanity check script is a Collectd Exec plugin that "always runs", and it looks like every time Collectd crashed and restarted it spawned a new NTP monitoring script. I had 140 instances of the script running, doing 140 NTP queries every second to an NTP server. This was the cause of the interference. You can probably guess at what time I killed the runaway scripts:




Client2 has now stabilised, but the accuracy is still nowhere near what I was expecting. In fact, it's on par with Client1 in almost every metric taken. Why are software timestamps equal to hardware timestamps?

Hardware Timestamps on a Client, Actually

Something was not right. When Wojciech first told me about the 2.3.3-hwtest branch of PTPd he also gave me a sample configuration file with a number of additional filtering options. I'd been using the defaults, which, aside from the extra filtering, were missing one very important option:

ptpengine:hardware_timestamping=y

I think you can guess what that does. Configuration problem fixed, PTPd is now using the Linux PHC API:

2016-04-07 11:26:05.863773 ptpd[58310].bond242 (info)      (init) bond242 is a bonded interface, current active slave: em2
2016-04-07 11:26:05.863787 ptpd[58310].bond242 (info)      (init) Underlying physical device: em2
2016-04-07 11:26:05.866849 ptpd[58310].bond242 (info)      (init) Hardware timestamping successfully enabled on em2
2016-04-07 11:26:05.867864 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Linux PHC clock driver em2 starting
2016-04-07 11:26:05.867888 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Successfully started Linux PHC clock driver em2 (/dev/ptp1) clock ID ffffff8b
2016-04-07 11:26:05.867895 ptpd[58310].bond242 (notice)    (init) clock: Clock em2 changed state from INIT to FREERUN
2016-04-07 11:26:05.868921 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Linux PHC clock driver em1 starting
2016-04-07 11:26:05.868942 ptpd[58310].bond242 (info)      (init) clock.linuxphc: Successfully started Linux PHC clock driver em1 (/dev/ptp0) clock ID ffffff83
2016-04-07 11:26:05.868955 ptpd[58310].bond242 (notice)    (init) clock: Clock em1 changed state from INIT to FREERUN

After running for a few minutes we're now looking a lot better. Our offset tolerance is ±200ns, the best we've seen so far:


If we look over a longer period of time, PTPd with hardware timestamps is generally a lot more stable, thanks both to the hardware timestamps and to the better outlier filtering it does:


Another interesting change in metrics is the Slave to Master and Master to Slave offsets, we've dropped from 30μs to 21.6μs:


By moving the point where we timestamp from software to ingress at the NIC, we've taken 10μs off the measured latency. From here you can observe that the time it takes to go from the NIC, through the kernel and into software is 10μs, which is comparable to the performance gain we see when using kernel bypass stacks like Solarflare Onload. Not very relevant for clock sync, but it's nice to have independent numbers line up.

This raises a question though: if hardware timestamping was not enabled when we had the problem with PTPd ignoring Delay Response messages, then why was PTPd failing to read the transmit timestamp of the Unicast message? It would mean it was failing to read the software timestamp of a Unicast packet. One for the ptpd project's GitHub issues, once this branch of code is merged. It does mean though that we can try Hybrid mode again... Yes, it works, so Unicast TX timestamps do work on the Intel i350 in this server and we're back to using a slightly less network-intensive mode of PTP.

Terminology and Language

The MiFID II regulations have forced the Finance sector into areas it may not have cared about before, so there are people, like myself, jumping into the unknown territory of clock synchronisation. One of the problems whenever you get started in new territory is finding what already exists. It is important to remember that other industries have solved these sorts of clock sync problems before, so there's a lot of learning that's already been done and standards that exist, and if we're not careful we'll end up reinventing the wheel and coming up with a whole new set of language for the same thing.

One example is what I've been graphing as Offset Tolerance, which is the minimum and maximum offset over a set time period, showing how bad the offset was for a given range of time. It turns out there is a well-defined term for this called Maximum Time Interval Error (MTIE), standardised by the International Telecommunication Union over a decade ago.

Guilty. I'll be renaming that, and will endeavour to use similar terms as our Telecom colleagues.
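Out of interest, a naive MTIE for a single observation window is easy to compute from the same offset samples we've been graphing. A toy Python sketch (the proper definition evaluates this over a whole range of window lengths, and there are much smarter algorithms than this brute-force loop):

def mtie(offsets, window):
    """Largest peak-to-peak (max - min) error seen in any span of `window` samples."""
    worst = 0.0
    for i in range(len(offsets) - window + 1):
        span = offsets[i:i + window]
        worst = max(worst, max(span) - min(span))
    return worst

samples = [120, -80, 40, 9500, 30, -60]   # offsets in nanoseconds
print(mtie(samples, 3))                    # 9580 - dominated by the spike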

What's Next?

I'm pretty happy with these results. ±200ns is the best we've seen, and it seems very stable; I think for our use case we don't need to try to go any lower at the moment. We've got a proof of concept design, although now I'd really like to go back and try the ASIC firewall again, as I suspect my previous results were tainted by system load. Most importantly we've got a PTP client that supports bonded interfaces and hardware timestamps on the model of NIC controller we have in some of our servers here.

There are a few things I think need to be tackled first before we start looking at a roll out:
  1. PTPd to sync Solarflare cards - we want to make sure we can receive accurate PTP over a cheap NIC and sync the clocks of expensive Solarflare capture cards in the same server.
  2. Failure monitoring and recovery - we've seen PTPd work well, what if it fails? It's infeasible to constantly watch hundreds of graphs - how do we alarm about it? Can we recover nicely or automatically? Can we somehow revert to NTP if we're in trouble?
  3. PTP Bridge Resiliency - the PTP Bridge currently uses a mix of linuxptp software (ptp4l and phc2sys), can we use PTPd to replace this and get a little resiliency from bonded interfaces? What if the Bridge dies?
Failure monitoring is probably the most important here before we can look at any sort of internal roll out. The PTP Bridge is missing some Puppet work though, so perhaps next will be a Puppet module for PTPd.

Wednesday, February 10, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 6)

In this series we are attempting to solve a clock synchronisation problem to a degree of accuracy in order to satisfy MiFID II regulations, and we're trying to do it without spending a lot of money.

So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB, and
  4. Finished recording metrics
  5. Drawn lots of graphs with Grafana and found contention on our firewall
In this post we will look at introducing a second ASIC based firewall to route PTP through so that PTP accuracy is not affected by our management firewall getting busy.

The New Design

Here is the adjusted design diagram, it's not that much more complicated:


Routing To Another Firewall

In order for this design to work every PTP Client must send its PTP traffic back through the PTP firewall, rather than the management firewall. Each client needs a routing rule to reflect this. There are similar changes needed on the PTP Bridge to send through the PTP firewall.
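On the client side this can be as simple as a host route towards the PTP Bridge via the PTP firewall (the addresses below are placeholders, not our real topology):

# Send traffic for the PTP Bridge via the dedicated PTP firewall instead of
# following the default route through the management firewall
ip route add 10.10.0.4/32 via 10.20.0.1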

With the new firewall in place and the routing changes applied, everything should "just work", but it doesn't. We see multicast leaving the PTP Bridge correctly:

[root@ptp-bridge ~]# tcpdump -i p4p3 multicast -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p4p3, link-type EN10MB (Ethernet), capture size 65535 bytes
16:10:07.519357 IP 192.168.0.4.319 > 224.0.1.129.319: UDP, length 44
16:10:07.519405 IP 192.168.0.4.320 > 224.0.1.129.320: UDP, length 44

Restart sfptpd on a client and we see it cannot subscribe to the multicast group:

2015-12-15 16:01:18.306138: info: using SO_TIMESTAMPNS software timestamps
2015-12-15 16:01:18.406259: notice: Now in state: PTP_LISTENING
2015-12-15 16:01:18.406353: info: creating NTP sync-module
2015-12-15 16:01:30.306177: warning: failed to receive Announce within 12.000 seconds

A packet trace shows that we see PIM announcements from the right firewall, and the IGMPv3 packets from sfptpd are leaving the client:

16:07:15.444343 IP 192.168.0.2 > 224.0.0.13: PIMv2, Hello, length 34
16:07:18.307942 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)
16:07:18.369939 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)
16:07:18.686936 IP 192.168.0.5 > 224.0.0.22: igmp v3 report, 2 group record(s)

Looking at the multicast groups on the new firewall, we don't have the PTP multicast addresses appearing as registered groups that the firewall can forward.

To cut a very, very long story short, this was a TTL problem compounded by two firewall brands behaving differently. The ptp4l process sets the TTL of its multicast packets to 1 to restrict them to the same subnet. Our management firewall accepted this and did the multicast forwarding anyway; the PTP firewall did not. Setting udp_ttl=2 for the ptp4l process that's doing the multicast fixed this. The lmaxexchange-linuxptp module has been changed to take this option now too.
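For reference, in the generated ptp4l configuration this is a single line (section layout as per stock linuxptp):

# allow the multicast to be routed one hop by the PTP firewall rather than
# dying on the local subnet
[global]
udp_ttl                 2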

The performance improvement moving from the software firewall to an ASIC firewall is quite apparent. Whereas before our "normal" range of variance was ±20μs (top image), we're now getting ±5μs (bottom image):


Our mean accuracy has improved substantially - at a glance the hardware change has saved 15% of the tolerance the regulations allow.

There are interesting patterns in the most recent data; regular and somewhat uniform spikes in client offset are seen on both clients. The graph below draws both client offsets together:



The spikes are recorded at almost exactly the same second, they last for a few seconds, they appear to come in groups of two and are generally over 9000ns in size. Identifying how regular these spikes are may indicate what the problem is. If they are perfectly regular on both clients, I would start to think it might be something on the firewall, like an administrative or housekeeping function.

We can use an InfluxDB query to get the raw data:

SELECT "time","value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND ("host" = 'client1') AND ( "value" > '0' OR "value" < '0' ) AND time >= '2016-02-05 09:43:19' AND value > 9000

Now with a little bit of Python we can work out the difference in time between spikes:
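Something along these lines - it just parses the timestamp at the start of each line of the saved query output and prints the gap since the previous spike (the exact column layout of my saved file is assumed here, so treat it as a sketch):

import sys
from datetime import datetime

# Reads lines whose first two fields are "YYYY-MM-DD HH:MM:SS" and prints the
# time difference between consecutive lines (field 4 of the output is the gap).
previous = None
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    try:
        current = datetime.strptime(" ".join(fields[:2]), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        continue  # skip headers or anything that isn't a data row
    if previous is not None:
        print("%s -> %s" % (current, current - previous))
    previous = current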

Unfortunately there's not as much uniformity in the spikes as I was hoping to see:

$ python diff.py < foo  | awk '{print $4}' | sort | uniq -c  | sort -n | tail
      6 0:01:08
      6 0:01:10
      6 0:01:14
      6 0:01:16
      8 0:01:00
      8 0:01:03
      8 0:01:17
      9 0:01:04
      9 0:01:05
     10 0:01:06

The most common time difference is around 60-70 seconds, which is between the individual spikes in a pair of two. The difference between the pairs of spikes themselves though is quite varied, anywhere from 2 minutes to 8 minutes apart. This doesn't mean it's not the firewall, but this data doesn't reliably pin the problem to some X minute recurring job in the internals of the firewall's operating system. The network engineers in my team are also quick to point out that this is an ASIC based firewall that's doing nothing but shifting a couple PTP packets per second, so firewall spikes of 20μs would be a very strange thing to be encountering. Before getting lost in a rabbit hole chasing performance spikes (something our engineers here know all too much about), there may be another way to mitigate this.

Attracting Attention

January was an interesting month for this project. We had several Solarflare representatives in to talk specifically about sfptpd, how it may be better utilised here and what feedback we could give to the Solarflare developers. A few days later I was tracked down on LinkedIn by Wojciech Owczarek, developer of the PTPd project. There are some very interesting features coming into PTPd, most of which we'll get to in a later blog post, but Wojciech did recommend turning up the PTP Sync and DelayReq message rates to give the filters more samples to work with. This is a very simple thing to adjust and measure, so let's see if adjusting these values smooths out our spikes to any degree.

Faster Message Rates

Changing the sync rate of the ptp4l process on the PTP Bridge requires a minor code addition to the lmaxexchange-linuxptp Puppet module, and at the same time I'll rename the repositories on GitHub to make it clearer that they are actually Puppet modules. Setting logSyncInterval to -2 means ptp4l will send out 4 Sync messages a second. The sfptpd daemons recognise this immediately:

2016-02-05 14:47:24.187549: info: received new Sync interval -2 from master (was 0)
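For reference, in ptp4l configuration terms the change the Puppet module renders is just one line (the value is a power of two, so -2 means 2^-2 seconds between Syncs):

[global]
logSyncInterval         -2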

After half an hour it appears as if the size of the spikes has lessened from upwards of 20μs to 15μs. Observing over a longer period appears to show the intensity of a spike has lowered slightly, see the before (top) and after (bottom) graphs below:


Changing the Delay Request rate of the sfptpd client is also simple. Setting ptp_delayreq_interval to -2 in the sfptpd configuration file means the client will send a DelayReq message 4 times a second. We also have two clients, so by changing only one we can compare the difference between the two.

An interesting pattern has emerged a day after changing one of our clients to send 4 Delay Requests per second; the spikes are no longer uniform across both clients as seen in the graph below:


This debunks the theory that it's the firewalls causing the jitter; each client is seeing its own variance at different intervals. One of our key performance guys, Mark Price, tells me 10-20μs is well within the range of Operating System jitter he's seen, so on an un-tuned system, even with no workload, there's a good chance this is what we are seeing.

A Different Architecture

Let's step back a bit and look at what we've found so far. Routing PTP through a firewall is possible but we are heavily affected by contention. We can reduce this by using a dedicated firewall, but we are still seeing 20μs spikes of variance, and there's a good chance it's from the Operating System. We could continue down the OS tuning route now - it's definitely something we'll have to do eventually - but it would be a big time investment and right now we're still experimenting with architecture and design. Finally, our Networking team want their spare ASIC firewall back, so I'm losing my dedicated PTP firewall. The PTP Bridge generally keeps a baseline of 100ns accuracy (excluding jitter), so perhaps it's time to look at switching-only designs.

In the very first post I mused about a design that didn't require a firewall, something that involved connecting the PTP Bridge to multiple separate management networks and having it send multiple independent PTP multicast streams. The idea had security implications - the PTP Bridge would be joined to networks that are usually kept apart, networks normally protected from each other by a firewall. With careful attention to detail these implications can be mitigated.

Alongside this, I've got new options in my choice of software. Solarflare are interested in us replacing the PTP Bridge component with a purely Solarflare approach - I'm told there is an undocumented feature that allows sfptpd to work in UDP Unicast mode, so it would be able to talk directly to our S300 Grand Master. There have also been some advances with PTPd that make it viable to replace linuxptp on the PTP Bridge, as well as to replace sfptpd on the PTP clients.

In the next post we'll remove the PTP firewall from the mix and look at purely switched designs.

Tuesday, January 12, 2016

Solving MiFID II Clock Synchronisation with minimum spend (part 5)

In this series we are attempting to solve a clock synchronisation problem to a degree of accuracy in order to satisfy MiFID II regulations. So far we have:
  1. Talked about the regulations and how we might solve this with Linux software
  2. Built a "PTP Bridge" with Puppet
  3. Started recording metrics with collectd and InfluxDB, and
  4. Finished recording metrics
In this post we will look at the data we've gathered to see how accurate my first design is, and also see if we can explain why NTP and PTP think the time is so different.

Here's a quick refresh on what the design looks like:


Minor Improvements

I've made some slight improvements to Collectd and InfluxDB since I last posted code examples (if the queries look different to previous posts, this is why). I figured out why the Tail plugin always came up as "tail_" in Influx - it's because I used a custom Collectd Type that InfluxDB didn't know about. To fix this I pointed InfluxDB at my custom Collectd types database file, and now the measurement names contain the unit. I also fixed the type used for remote NTP offset to be in microseconds. The measurements now look like this, which is a bit more descriptive:


Graphing PTP Client Metrics

I've got data from the PTP Bridge as well as two test servers I'm using as PTP Clients. Let's look at the offset of the clients, arguably the most important metric:



The queries for the above graphs are:

SELECT mean("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

Generally we are ±20μs offset from the PTP Bridge. Most of the time the PTP Bridge itself is ±200ns offset from the Grand Master Clock, however there are some spikes. Here is the PTP Bridge offset for the same time period:



A better way to visualise the offset variance is to plot the min() and max() on the same graph using a larger GROUP BY time period. The two queries for an sfptpd client would be:

SELECT min("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

SELECT max("value") AS "value" FROM "tail_ns" WHERE "instance" = 'sfptpd_stats' AND "type_instance" = 'offset-system' AND "host" = 'FQDN' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

The PTP Bridge queries are:

SELECT min("value") AS "value" FROM "tail_ns" WHERE "instance" = 'ptp-master' AND "type_instance" = 'offset' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

SELECT max("value") AS "value" FROM "tail_ns" WHERE "instance" = 'ptp-master' AND "type_instance" = 'offset' AND ( "value" > '0' OR "value" < '0' ) AND $timeFilter GROUP BY time($interval)

I find this makes the offset variance easier to read - it's the space between the lines. The image below contains the PTP Bridge (top) and the two clients:


It would be nice to be able to represent "offset variance" as a single number. I know the formula - absolute(Max) plus absolute(Min) - but I don't know how to do that in InfluxDB or Grafana.

If we ignore that one spike on the Master for now, our PTP Clients have about ±20μs variance. This is "OK", but not great. It's within the tolerance we need to hit to be compliant, however this is not the only step in the chain. Our financial software is what timestamps events, and this is subject to some amount of operating system jitter. A finger-in-the-air number from some of our developers about what jitter to expect from our software at the moment is in the area of ±40μs. Adding ±20μs PTP variance is ±60μs, which is getting uncomfortably near to our compliance limit of ±100μs.

The difference in offset variance is not just the firewall, although it's probably a large part of it; there are two other factors. First, there is no hardware timestamping on the clients receiving PTP multicast - the sfptpd software only supports hardware timestamping on Solarflare interfaces and we are using an existing management network that doesn't have Solarflare NICs. Second, we haven't tuned or isolated the PTP software so it's subject to operating system jitter. While the OS jitter is probably very small right now (the test clients are essentially idle machines) it will become more of an issue if we put PTP on our servers doing a normal workload.

The time period in the graphs above is a small 3-minute window I chose during a "good" period to show the data. When you look at the data over a larger period a much more serious problem emerges, one that is going to force our hand in terms of architecture changes.

The problem with re-using infrastructure

Here is a client's offset tolerance from 21:30 to 23:30:


In case you missed the scale of the graph, that's spikes of ±700μs in this 2 hour period. I won't hold you in suspense for too long - in short, it's network resource contention on the switch / firewall introducing variance to PTP. 22:00 UTC is a special time in the Finance world, it is 17:00 EST (5pm New York time) which is when the financial markets close in NY. Our platform does some house keeping work at this time, including several backup jobs. These backup jobs ship large amounts of data across our management networks, which is the same switch and firewall that PTP is running through.

To look at this problem in detail here are tiles of all metrics for both the PTP Bridge and PTP Clients for that time period (the top image is the PTP Bridge):



Eyeballing all graphs we see that the backup period affects both the PTP Bridge and the PTP Clients, but on different scales. The PTP Bridge's Offset Tolerance is interesting because it shows the switch has ±7μs of overhead when under load, but the PTP Client offsets spike up to hundreds of microseconds. This indicates that while the switch is also noticeably busy, the firewall is contributing the majority of this variance - makes sense, it's a software (CPU-based) firewall and it's also trying to ship gigabytes of backups around at the same time.

If we zoom in on one of these spikes we can see what is most likely a queuing backlog on the firewall:


You can see the Path Delay spike up and then drain out slowly. The offset smooths out reasonably quickly because the PTP formula is able to figure out the correct offset as long as the delay is reasonably stable. It's the initial spike which gives us all the trouble.

When zoomed in like this the spike may not seem such a big deal, after all the offset appears to settle back to normal in about 10-15 seconds. Our platform can process hundreds of thousands of client messages in that time period, so it's not a trivial amount of time in the Financial world, nor is it anywhere near compliant.

Rethinking Architecture

We know that it's contention on the firewall that's causing the majority of these issues. The switch is also affected, but it's not that big a contributor. When discussing this problem in the office the next simplest idea we came up with was to just route the PTP traffic through a different firewall. It wouldn't even need to be a very powerful device, as long as it wasn't congested during backup periods (we joked about using someone's home Netgear at one point).

We use ASIC-based firewalls in other areas where we care about speed, and the brand in question has a pretty inexpensive low-end model. Even though it's a break from our "don't spend any money" stance, the cost to buy these small firewalls is low enough that it's not that big a deal. We happen to have a similar unit available for testing, so in the next post we'll look at adding this ASIC firewall into the mix, configuring PTP to route through it, and the results we get from it.