Saturday, January 14, 2017

Leaping Seconds

Just before New Year 2017 a leap second was inserted into Coordinated Universal Time (UTC). At LMAX we had some luxury in how we handled it: January 1st is a public holiday and there's no trading, so we were free to do recovery if something didn't go according to plan. This blog post is an analysis of the results of various time synchronisation clients (NTP and PTP) using different methods to handle the leap second.

Some Research

Red Hat have a comprehensive article about the different clock synchronisation software they support on their Operating Systems and each one's capabilities. The section "Handling of the Leap Second" is especially worth a read to understand the various options and which ones would be applicable to you.

Since there's no financial trading on New Year's Day, this event became a real "live test" opportunity for us. We were able to consider all the available methods for correcting time. If the leap second had been inserted in the middle of the year (June 30th), chances are the next day would have been a working day (and in 2016 it was), and we'd have had fewer options to consider.

Our platform code assumes that time never goes backwards - it is always expected to go forwards at some rate. If it does go backwards, our application logic simply uses the last highest time it saw until the underlying clock source has progressed forwards again. In other words, our platform's view of time will "freeze" for one second if the clock is stepped back for one leap second.
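
As a rough sketch of that rule (hypothetical Python - the class name and numbers are invented for illustration, this isn't our actual platform code):

import time

class ForwardOnlyClock:
    """Wraps a clock source so observed time never goes backwards.

    If the underlying clock is stepped back (for example by a leap
    second), readers keep seeing the last highest value until the
    source moves past it again - time appears to "freeze".
    """

    def __init__(self, source=time.time):
        self._source = source
        self._highest = source()

    def now(self):
        current = self._source()
        if current > self._highest:
            self._highest = current
        return self._highest

# Simulating a clock stepped back by 1 second: the wrapper freezes
# at the last highest reading until the source catches up again.
readings = iter([100.0, 100.5, 99.5, 99.9, 100.6])
clock = ForwardOnlyClock(source=lambda: next(readings))
print([clock.now() for _ in range(4)])   # [100.5, 100.5, 100.5, 100.6]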

During trading hours this can be a problem. For previous leap seconds we've ignored the event and let NTP handle the clock drift naturally. The Red Hat page describes the clock being off for "hours" with this method; from our past experience it's more like days. Ideally we want clock synchronisation to recover rapidly and we want time to always progress forwards - the "client slew" method.

Most of our platform uses the tried and tested NTP daemon for clock synchronisation. The standard NTP daemon doesn't have a fast slewing option - only Chrony can do this. Upgrading to Chrony before the leap second event unfortunately wasn't an option for us, so our hand was forced to use the "daemon step" method for this leap second. We judged it safer than the kernel step method (less likely to trigger kernel bugs) but we knew our platform code needed to be tested heavily.

Some of our platform uses PTPd, and it's due to be rolled out more widely soon. PTPd's built-in help describes its leap second handling methods:

setting: clock:leap_second_handling (--clock:leap_second_handling)
   type: SELECT
  usage: Behaviour during a leap second event:
         accept: inform the OS kernel of the event
         ignore: do nothing - ends up with a 1-second offset which is then slewed
         step: similar to ignore, but steps the clock immediately after the leap second event
        smear: do not inform kernel, gradually introduce the leap second before the event
               by modifying clock offset (see clock:leap_second_smear_period)
options: accept ignore step smear
default: accept

Personally I was interested in knowing how quickly PTPd could bring the clock back into sync if we simply ignored the leap second and let its normal error correction mechanism slew the clock. This would probably be our preferred method if a leap second were introduced during trading hours.

NTP Planning and Expectations

The plan was to have NTP step and PTP ignore the leap second.

Telling NTPd to step the clock is simple - we just needed to remove the "-x" flag from ntpd - but we had to make sure our platform code would handle it. To do this we isolated one of our performance test environments and set up a fake stratum 1 NTP server by fudging a stratum 0 server. The configuration for this fake NTP server is:

restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict 127.127.1.0
restrict -6 ::1
restrict 10.101.0.0 mask 255.255.0.0 notrap nomodify
server 127.127.1.0
driftfile /var/lib/ntp/drift
fudge 127.127.1.0 stratum 0
leapfile /etc/ntp/leap-seconds.list

We set the fake NTP server's system clock to Dec 31st 23:45:00, force sync'd all performance machines to this NTP server, then started a performance run. This particular run generally takes 10 minutes to get going, so by 23:59:59 the environment would be running its normal performance load, which is a simulation based on real production traffic patterns. This is one of the best tests we can come up with to simulate what would happen if the leap second occurred during business hours.

This leap second test was repeated a number of times and, as expected, the timestamp 23:59:59.999 was used for the whole of the repeated 23:59:59 second. Once the clock moved on to 00:00:00 the exchange time progressed forward normally.

PTP Calculations

We wanted to test PTP slewing the clock post leap second, which is the method we'd be considering if the leap second occurred during trading hours. We know that NTP can take a long time to recover from a leap second. The built-in PTP configuration docs describe three options we set to slew the clock and improve recovery speed:

ptpengine:panic_mode=n
clock:leap_second_handling=ignore
clock:max_offset_ppm=1000

The first option stops the PTP daemon from entering panic mode, which can result in the daemon stepping the clock (we want to avoid steps).

The second option simply tells PTPd to ignore the leap second from upstream, which will begin the slewing process after the leap second event occurs.

The third option sets the maximum frequency shift of a software clock. It's measured in parts per million (ppm), where 1ppm is a shift of 1us per second. A value of 1000 means we should be able to recover the clock by 1ms every second, or 1000 seconds to recover from a full leap second.

There is also a default setting "clock:leap_second_pause_period=5" which makes the PTP daemon stop doing clock updates for 5 seconds before and 5 seconds after the leap second event, basically as a safety measure.

1000 seconds is 16 minutes and 40 seconds; adding the 5 second pause period, we estimated that our PTP disciplined server clocks should be back in sync by about 00:16:45 on January 1st.
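
As a sanity check on that arithmetic, a hypothetical back-of-the-envelope script (not part of our tooling):

# 1ppm = 1 microsecond of correction per second of real time.
LEAP_OFFSET_US = 1_000_000   # one leap second, in microseconds
MAX_SLEW_PPM = 1000          # clock:max_offset_ppm
PAUSE_S = 5                  # clock:leap_second_pause_period

recovery_s = LEAP_OFFSET_US / MAX_SLEW_PPM + PAUSE_S   # 1005.0 seconds
minutes, seconds = divmod(int(recovery_s), 60)
print(f"back in sync after {minutes}m {seconds}s")     # 16m 45s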

What Actually Happened: NTP

Overall, the actual leap second event was fine. For the NTP disciplined servers our testing paid off: as expected, our platform stopped processing for 1 second until real time caught up with its view of time. If we look at the clock drift of one of our NTP disciplined servers at this time, there's no perceptible clock drift after Sunday 00:00 (the scale of the graph is in microseconds):

[Graph: clock offset in microseconds of an NTP disciplined server, flat after Sunday 00:00]

A much more interesting graph comes from a non-production machine that didn't pick up the NTP configuration change to remove the "-x" flag. On this hardware NTPd ignored the leap second and disciplined the clock using its normal algorithms:

[Graph: clock offset of the server that kept the "-x" flag, slewing back slowly over many hours]

If you look at the X axis, it takes almost 12 hours for this NTP daemon to get remotely close to zero, and even after that it's not until Monday 12:00 that the system clock is within a 10ms offset. This behaviour fits our observations during the previous leap second - it took much longer to recover than we expected.

The ntpd man page notes that the maximum slew rate the Linux kernel allows is 500ppm, so it will take a minimum of 2000 seconds for NTP to correct 1 second of inaccuracy. What we're looking at here, though, is days. While we will be moving almost all platform servers to PTP, we will still use NTP in our estate, so I'd like to understand the above behaviour. We haven't done much research into improving NTP recovery times yet, but I'd be surprised if there isn't a way to tune the daemon to bring this down significantly.

A simpler option, of course, is to just replace ntpd with chronyd. Chrony supports a client slew method and, while I don't have any hard data, Red Hat describe chronyd's leap second recovery as taking "minutes".

What Actually Happened: PTP

I calculated it would take a little under 17 minutes for PTP to bring the clock back into sync. It actually took 45 minutes. When using PHC hardware timestamping the PTPd daemon manages several clocks. The master clock is the PHC of whatever interface the PTP signal is coming in over, and the system clock is a slave of that master clock. If the configured interface is also a bonded interface, then any non-active interfaces are also managed as slave clocks.

Slave clocks are synchronised from the master clock using the same algorithm and rate limits, but more importantly slave clocks are not synchronised until the master clock is stable (i.e. in the LOCKED state). So what actually happens is that the master clock - which in our graph below is the PHC device attached to interface em1 - synchronises its time to the upstream PTP master first, and only once it is in sync do the rest of the slave clocks in the server start to be disciplined:

[Graph: PTP clock offsets for em1, em2 and the system clock following the leap second]

This is why the offset of em1 begins to track back into sync a little after 00:01:00, while em2 and the system clock only begin to synchronise after 00:10:00, once em1 is LOCKED. Why are the NIC clocks synchronising faster than the system clock though?

PTPd has the "clock:max_offset_ppm_hardware" setting, which defaults to 2000ppm and is also the daemon's maximum. This means it will take 1000000/2000/60 = 8.33 minutes to correct one second of offset. The system clock, however, is a software clock whose rate is controlled by the "clock:max_offset_ppm" option, which we specifically set to the maximum value of 1000ppm. The system clock should have been recovering by 1ms every second, but it was actually taking 2 seconds to recover each 1ms, clearly seen in the slope of the graph if you zoom in (see below):

[Graph: zoomed view of the system clock offset, recovering at roughly 0.5ms per second]

It looks like our value of 1000 for "clock:max_offset_ppm" didn't do anything. Wojciech Owczarek provided the answer: it is a known issue with the version of PTPd we're running. Support for slewing the system clock faster than the kernel maximum of 500ppm isn't finished yet, but will be in the final version.
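
That cap goes a long way to explaining the 45 minutes. Here is my own rough reconstruction (back-of-the-envelope Python, assuming the system clock was effectively limited to the kernel's 500ppm):

LOCK_WAIT_S = 10 * 60     # em1 took roughly 10 minutes to reach LOCKED
OFFSET_US = 1_000_000     # one leap second of offset, in microseconds
EFFECTIVE_PPM = 500       # kernel maximum: ~1ms recovered per 2 seconds

slew_s = OFFSET_US / EFFECTIVE_PPM    # 2000 seconds, about 33 minutes
print((LOCK_WAIT_S + slew_s) / 60)    # ~43 minutes, close to the observed 45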

While it's not as fast as I'd predicted, PTPd recovery is a lot faster than our NTP recovery. We still want to know why our standard NTP recovery time is measured in days rather than hours, but that becomes less important if we move to Chrony for NTP.

Wednesday, October 5, 2016

How computer clocks work and how they are adjusted

A month or two ago someone in our Operations team asked me what clock synchronisation is and why we need to do it. I gave them a very basic few-sentence answer. That got me thinking that I never read an easy explanation when I myself got started in this area, and the terminology used can be confusing the first time you come across it. Below is a copy-paste out of our internal documentation where I attempt to explain computer clock synchronisation and the reason for it. The explanation is aimed at non-technical people; it skips over many intricacies and is "wrong" from certain points of view, but hopefully it describes the problem and solution accurately enough.

So here it is...

There is some terminology for time synchronisation that you have to get your head around (like Step, Slew, and Frequency), and to get that you have to understand how a "Clock" is implemented in a computer.

A computer only knows how to work with digital signals. Imagine a small device (called an Oscillator) which, if you pass an electrical signal into it, sends out pulses of electrical signals at constant intervals. If these intervals are small enough, we can approximate the passage of time by counting the number of signal pulses.

For the sake of easy mathematics let's assume that 6000 pulses means 1 minute has passed - that is 100 pulses per second, or 100 Hertz (Hz). If we put one of these devices onto a motherboard, send an electrical signal through it, and count the number of pulses coming out, the computer will be able to know how much time has passed.

That's nice, but we need to know what the current time actually is. Knowing how many seconds have passed is not helpful if I don't even know what year it is. I can ask a clock that I trust what the real time is and just set my time to what it tells me. This is called a clock STEP - the clock is forcibly set to a given value. Fantastic, my computer can now draw the correct time on my Desktop Clock.


Now imagine that, like all computer components, the Oscillator is mass produced cheaply and never perfect. So even though it's supposed to send 100 pulses every second, mine actually sends only 90 pulses per second due to a manufacturing discrepancy, and my computer clock ends up running slow (when it counts 100 pulses, more than 1 second of real time has passed). In this case I can't trust the passage of time on my own machine. I can solve this problem by asking the trusted clock twice. After I step the clock the first time, I wait what I think is 1 second, ask the same trusted time source what the time is, and figure out that I am 1/10 of a second slow - my OFFSET or DRIFT is 0.1 seconds. From there I can work out that I need to count 10 fewer pulses to make up one second. If I now "remember" that I need to count 10 fewer pulses and step the clock again... great, I have now compensated for my Oscillator defect and my desktop clock runs at the correct speed.


Now imagine that, like all computer components, the Oscillator behaves differently under different conditions even though we don't want it to. Maybe if it gets hotter it sends pulses faster, and if it gets colder it sends pulses slower. Or maybe, due to more manufacturing faults, the pulse rate isn't really constant even though it's supposed to be.

This means that my +10 frequency adjustment is only correct for a single moment. Since I've got no idea how stable the Oscillator pulses are, to keep time as accurate as possible I have to constantly refer back to a clock that I can trust and adjust the frequency as needed. Maybe every time I ask, it turns out I'm between +0.1 and -0.1 seconds out, so my frequency adjustment range is +/- 10 pulses. I don't particularly want to step the clock each time though - that can be a bit brutal to the programs running on my computer, as time would appear to be passing in "jumps". What I can do instead is count more or fewer pulses to catch up to where time is supposed to be, so that the time will SLEW in the right direction.

For example, if I am 0.1 seconds behind and I count 10 fewer pulses, then after one second passes I will still be 0.1 seconds behind (I've compensated for the drift but not caught up). Instead I'm going to count 20 fewer pulses, so that 1 second from now I should be at roughly the same point as my trusted clock. This is nicer to my computer programs than stepping the clock all the time; it is a smooth FREQUENCY ADJUSTMENT.
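
To make that concrete, here is a toy simulation (hypothetical Python using the invented numbers from this example):

REAL_RATE = 90   # actual pulses per real second (our faulty oscillator)

def offset_after(pulses_per_second, real_seconds=1, start_offset=0.1):
    """Return the toy clock's offset from real time after slewing."""
    clock = -start_offset                      # we start behind real time
    for _ in range(REAL_RATE * real_seconds):  # each real pulse...
        clock += 1.0 / pulses_per_second       # ...credits 1/N of a second
    return clock - real_seconds

print(offset_after(100, start_offset=0.0))  # ~ -0.1: uncorrected drift
print(offset_after(90))                     # ~ -0.1: drift fixed, no catch-up
print(offset_after(80))                     # ~ +0.025: caught up, slight overshoot

The small overshoot on the last line is exactly the "wave" pattern described next.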

I have to make these adjustments all the time though, because my Oscillator varies, so what generally ends up happening is I adjust too fast and overshoot the real time, so I adjust to slow down, then I undershoot, speed up again, and so on. If I were to draw my clock adjustments as a graph it would end up looking a bit like a "wave" pattern. It's not a big deal though, because we're only talking about a couple of pulses each time, and my desktop clock is generally hovering around the real time.


I can now keep time accurately, but I have to be in constant communication with the trusted clock and make constant frequency adjustments. Since I know my Oscillator can lose up to 1/10 of a second every second, my computer clock can run freely (or FREERUN) for 10 seconds before it could be 1 second out. Let's say I don't care about seconds, only minutes: then I can free run for at most 600 seconds (or 10 minutes) before I have to assume my clock could be 1 minute inaccurate. Maybe I do updates once every 10 seconds then - that is enough to keep me happy.


One Oscillator pulse is 10 milliseconds in my pretend computer, so I can only time stamp an event with 10 millisecond accuracy. There's no faster signal to measure by in my PC, so it's physically impossible to tell the difference between two events that happen between the same two pulses, say at 0.011 and 0.019 seconds. Luckily real computer oscillators have much higher frequencies than 100 Hertz - they are in the range of hundreds of MHz (nanosecond-scale accuracy). The same problems still exist though: they need constant adjustment to keep them in line, otherwise they go free running off in some direction.

Wednesday, May 18, 2016

Presenting at ITSF's MiFID II Workshop

The International Timing & Sync Forum (ITSF) and Executive Industry Events (EIE) have put together a workshop on MiFID II UTC traceability, specifically for the finance sector. There will be quite a lot of knowledge in the room: Wojciech Owczarek (the developer of PTPd) is presenting, and Neil Horlock of Credit Suisse, a fountain of information regarding the regulations and how an organisation may satisfy them, is chairing a session.

I will also be presenting my own findings in the MiFID II clock synchronisation space, specifically regarding PTP distribution designs and the monitoring around them, plus where I'd like to get to in the future. It should be a lot less sales-y than other events like STAC, so if you're in FinTech and want to come down for a chat, I encourage you to register.