Comments on Nazca Lines: Solving MiFID II Clock Synchronisation with minimum spend (part 3)

The "why" is a bit academic. We seem to...

2016-06-22T01:02:44.369+10:00

The "why" is a bit academic. We seem to get about +/-100ns of accuracy, but with occasional blips to +/-200 to 300ns. When that happens, I want to understand how sfptpd is correcting (and whether that correction is consistent across the servers).

Thought you might be interested - the following is what Solarflare support tells me:

"The frequency correction values are in parts-per-billion (PPB). The clock frequency doesn't matter, but the timescale is significant. One way to think of the parts-per-billion value is as a ns/s (nanoseconds per second) slew rate. It is typical for there to be a "base" correction to the oscillator to make its frequency in the correct range. Your freq-adj of around -640PPB = -0.64PPM (parts-per-million) is well within the 4.6PPM Free Run frequency stability for a Stratum 3 oscillator. Across these two readings, the freq-adj for phc0 is modified by -638.674 - (-640.160) = 1.486 ns/s so the slew rate to keep everything in sync is very modest."

We record sfptpd stats every second, so any ns/s value is refreshed every second, which means that for me each value shows the ns of slew applied during that second.
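
As a rough illustration of the arithmetic in that quote, the sketch below treats a PPB value as ns/s and multiplies it by the logging interval. The function name and the one-second default interval are my own assumptions for the example; nothing here comes from sfptpd itself.

```python
def ppb_to_ns(freq_adj_ppb: float, interval_s: float = 1.0) -> float:
    """Nanoseconds of slew produced by holding freq_adj_ppb for interval_s seconds."""
    # 1 part-per-billion of one second is 1e-9 s, i.e. 1 ns per second.
    return freq_adj_ppb * interval_s


# The two phc0 freq-adj readings quoted by Solarflare support.
first_ppb, second_ppb = -640.160, -638.674

# Total correction applied over one second at the second reading.
total_ns = ppb_to_ns(second_ppb)

# Change in slew rate between the two readings, as in the quote.
delta_ns = ppb_to_ns(second_ppb - first_ppb)

print(f"total: {total_ns:.3f} ns, change: {delta_ns:.3f} ns")
```

Note that the total includes the "base" correction for the oscillator, so the change between readings is probably the more interesting number when looking at blips.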

Completely not confident :-) PTP will not report o...

2016-06-17T19:47:42.730+10:00

Completely not confident :-) PTP will not report offset that it itself doesn't see, and I'm sure there's "hidden" delay in all sorts of places in this test infrastructure, as none of the layer 1 gear is PTP aware at the moment.

This holds true for frequency adjustment as well, because PTP will only slew the clock based on the offset that it thinks it knows about. Any freq-adj values are made in response to the calculated offset (even if this offset is wrong). So let's stick with your first scenario: a sudden +100ns gain in signal that stays constant. Using a P/I controller, the actual deltas of offset when graphed might look something like this:

http://ctms.engin.umich.edu/CTMS/Content/Suspension/Control/PID/html/Suspension_ControlPID_03.png

The freq-adj would be a similar-ish shape but opposite; PTP should be slowing down the oscillator to compensate for it thinking that the clock is now 100ns ahead of the master. When the offset is detected as being 0ns again, the freq-adj should be stable as well. The time PTP takes to get back in sync is the time during which the offset was not 0ns, which is also the time during which freq-adj was not stable. Since both the offset and freq-adj numbers come from within PTP, both are as trustworthy as each other, so using either offset or freq-adj to figure out how long before you are back to 0ns should give the same answer, in theory. Plugging in a magic "perfect" clock to compare with might show something very different, of course.
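
To make the shape a bit more concrete, here is a toy P/I loop reacting to a +100ns step. This is only a sketch: the gains `kp` and `ki`, the one-second interval and the function name are all made up for illustration, it ignores oscillator drift and the "base" correction, and it is not how sfptpd's servo is actually implemented or tuned.

```python
def simulate_pi_servo(step_ns=100.0, kp=0.3, ki=0.05, steps=60, dt=1.0):
    """Toy P/I servo: returns a list of (offset_ns, freq_adj_ppb) per sync interval."""
    offset = step_ns   # ns the clock believes it is ahead of the master
    integral = 0.0     # accumulated offset, in ns*s
    history = []
    for _ in range(steps):
        integral += offset * dt
        # Positive offset (clock ahead) gives a negative adjustment (slow down).
        freq_adj = -(kp * offset + ki * integral)   # ppb, i.e. ns/s
        history.append((offset, freq_adj))
        # Holding freq_adj for dt seconds moves the clock by freq_adj * dt ns.
        offset += freq_adj * dt
    return history


history = simulate_pi_servo()
for second, (offset, freq_adj) in enumerate(history[:10]):
    print(f"t={second:2d}s offset={offset:8.3f} ns freq-adj={freq_adj:8.3f} ppb")
```

With these made-up gains the offset overshoots slightly below zero before settling, and the freq-adj trace mirrors it with the opposite sign, which is the behaviour described above.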

Are you trying to figure out the offset at any arbitrary point in time? To explain the question...

You'll only get as many data points as your PTP sync interval allows, so let's assume once a second, and that the pulses land on the second boundary (.000000). Imagine the second sync pulse after the +100ns one, i.e. the next packet after the first time PTP adjusted the frequency. Let's assume PTP now thinks it is +60ns out, so in the second that's gone past it has made a certain frequency adjustment, and that adjustment has corrected 40ns of drift.

However you don't care what the offset is at 1.00000, you want to know what the offset was at 0.43716, so somewhere between +100ns and +60ns. Is that why you want freq-adj to nanos, so you can retrospectively figure out what the clock offset was *between* the frequency adjustments?
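
If that is the goal, a first approximation is a straight line between the two sync points. The sketch below assumes the freq-adj is held constant across the whole interval and that nothing else disturbs the clock in between, which is a simplification; the function name is mine.

```python
def offset_between_syncs(offset_start_ns: float, offset_end_ns: float,
                         fraction: float) -> float:
    """Linear estimate of the offset `fraction` (0..1) of the way through a sync interval."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    return offset_start_ns + (offset_end_ns - offset_start_ns) * fraction


# +100ns at the first sync, +60ns one second later, queried at 0.43716s.
estimate_ns = offset_between_syncs(100.0, 60.0, 0.43716)
print(f"estimated offset: {estimate_ns:.4f} ns")
```

For the numbers in this example that works out to roughly 82.5ns.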

How confident are you that each offset really is t...

2016-06-17T07:58:02.026+10:00

How confident are you that each offset really is the true offset from the grandmaster? The short answer is that I want to know how many nanoseconds PTP is actually adjusting the clock during each adjustment.

The longer explanation follows...

Offset is how inaccurate PTP thinks the host is from the PTP grandmaster. Freq-adj is the dispersion (correction) that PTP actually applies to try and bring the host back into sync. Assume you are perfectly in sync (0 ns offset), then the next packet is +100 ns offset. Now consider two scenarios:

1. If the packets continue at +100ns PTP will slew you forward using the freq-adj until the offset is 0 ns, but it won't jump your time in one interval. Knowing how freq-adj gets converted to nanoseconds answers how long it takes to get in sync and how aggressively PTP acts to make this happen.

2. If the packets immediately go back to 0 ns, perhaps the +100 ns packet was an "error". PTP has probably started to slew your time towards +100 ns and will need to slew your time back. Knowing how freq-adj converts to ns in this case allows you to know by how much your clock was inaccurate.
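
For the second scenario, one way to estimate how far the clock was dragged is to add up how much each freq-adj reading deviates from its steady-state value. This is a sketch only: the readings and the -640 ppb baseline below are hypothetical numbers, and it assumes each reading is held for a full one-second interval.

```python
def accumulated_slew_ns(freq_adj_ppb, base_ppb, interval_s=1.0):
    """Running total (ns) of slew applied relative to a steady-state base correction."""
    total = 0.0
    running = []
    for reading in freq_adj_ppb:
        # Deviation from the base in ppb (ns/s), held for interval_s seconds.
        total += (reading - base_ppb) * interval_s
        running.append(total)
    return running


# Hypothetical per-second readings around a single spurious offset sample.
readings = [-640.0, -675.0, -615.0, -630.0, -640.0]
running = accumulated_slew_ns(readings, base_ppb=-640.0)
peak_ns = max(running, key=abs)
print(running, peak_ns)
```

In this made-up trace the clock is pulled 35ns away from where it should be before PTP slews it back.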

Sounds tricky, as there will be a number of variab...

2016-06-17T07:12:11.094+10:00

Sounds tricky, as there will be a number of variables that are dragging the oscillator frequency slower and faster compared to pure wall clock time, so it'll be relative to those various unknowns. Reminds me of my boss' blog post about getting sanity out of CPU clock sources (https://www.lmax.com/blog/staff-blogs/2015/10/25/time-stamp-counters/) - have a read, because even though it's talking about CPUs, some of the core problems are probably going to be the same (CPU power levels, etc).

What exactly is your need to try to convert from frequency to nanos?

Any idea how to convert frequency adjustment (whic...

2016-06-17T06:17:44.093+10:00

Any idea how to convert frequency adjustment (which is in parts per billion) to nanoseconds? My initial thought was to divide the value by the CPU clock speed in MHz, but there isn't enough variance in the gps->system log lines for that calculation to be correct. On top of that, the freq_adj is always lower for the Solarflare ASIC to GPS, which makes me assume the Solarflare clock runs at a slower speed.