HardenedBSD/usr.sbin/xntpd/kernel
..
chuinit.c
clkinit.c
README.kern

Precision Time and Frequency Synchronization Using Modified Kernels

1. Introduction

This memo describes replacements for certain SunOS and Ultrix kernel
routines that manage the system clock and timer functions. They provide
improved accuracy and stability through the use of a disciplined clock
interface for use with the Network Time Protocol (NTP) or similar time-
synchronization protocol. In addition, for certain models of the
DECstation 5000 product line, the new routines provide improved
precision to +-1 microsecond (us) (SunOS 4.1.1 already does provide
precision to +-1 us). The current public NTP distribution cooperates
with these kernel routines to provide synchronization in principle to
within a microsecond, but in practice this is limited by the short-term
stability of the oscillator that drives the timer interrupt.

This memo describes the principles behind the design and operation of
the software. There are two versions of the software, one that operates
with the SunOS 4.1.1 kernel and the other that operates with the Ultrix
4.2a kernel (and probably the 4.3 kernel, although this has not been
tested). A detailed description of the variables and algorithms is given
in the hope that similar improvements can be incorporated in Unix
kernels for other machines. The software itself is not included in this
memo, since it involves licensed code. Detailed instructions on where to
obtain it for either SunOS or Ultrix will be given separately.

The principle function added to the SunOS and Ultrix kernels is to
change the way the system clock is controlled, in order to provide
precision time and frequency adjustments. Another function utilizes an
undocumented counter in the DECstation hardware to provide precise time
to the microsecond. This function can be used only with the DECstation
5000/240 and possibly others that use the same input/output chipset.

2. Design Principles

In order to understand how these routines work, it is useful to consider
how most Unix systems maintain the system clock. In the original design
a hardware timer interrupts the kernel at some fixed rate, such as 100
Hz in the SunOS kernel and 256 Hz in the Ultrix kernel. Since 256 does
not evenly divide the second in microseconds, the kernel inserts 64 us
once each second so that the system clock stays in step with real time.
The time returned by the gettimeofday() routine is thus characterized by
255 advances of 3906 us plus one of 3970 us.

Also in the original design it is possible to slew the system clock to a
new offset using the adjtime() system call. To do this the clock
frequency is changed by adding or subtracting a fixed amount (tickadj)
at each timer interrupt (tick) for a calculated number of ticks. Since
this calculation involves dividing the requested offset by tickadj, it
is possible to slew to a new offset with a precision only of tickadj,
which is usually in the neighborhood of 5 us, but sometimes much higher.

In order to maintain the system clock within specified bounds with this
scheme, it is necessary to call adjtime() on a regular basis. For
instance, let the bound be set at 100 us, which is a reasonable value
for NTP-synchronized hosts on a local network, and let the onboard
oscillator tolerance be 100 ppm, which is a reasonably conservative
assumption. This requires that adjtime() be called at intervals not
exceeding 1 second (s), which is in fact what the unmodified NTP
software daemon does.

In the modified kernel routines this scheme is replaced by another that
extends the low-order bits of the system clock to provide very precise
clock adjustments. At each timer interrupt a precisely calibrated time
adjustment is added to the composite time value and overflows handled as
required. The quantity to add is computed from the adjtime() call and,
in addition a frequency adjustment, which is automatically calculated
from previous time adjustments. This implementation operates as an
adaptive-parameter, first-order, type-II, phase-lock loop (PLL), which
in principle provides precision control of the system clock phase to
within +-1 us and frequency to within +-5 nanoseconds (ns) per day.

This PLL model is identical to the one implemented in NTP, except that
in NTP the software daemon has to simulate the PLL using only the
original adjtime() system call. The daemon is considerably complicated
by the need to parcel time adjustments at frequent intervals in order to
maintain the accuracy to specified bounds. The kernel routines do this
directly, allowing vast gobs of ugly daemon code to be avoided at the
expense of only a small amount of new code in the kernel. In fact, the
amount of code added to the kernel for the new scheme is about the
amount removed for the old scheme. The new adjtime() routine needs to be
called only as each new time update is determined, which in NTP occurs
at intervals of from 64 s to 1024 s. In addition, doing the frequency
correction in the kernel means that the system time runs true even if
the daemon were to cease operation or the network paths to the primary
reference source fail.

Note that the degree to which the adjtime() adjustment can be made is
limited to a specific maximum value, presently +-128 milliseconds (ms),
in order to achieve microsecond resolution. It is the intent in the
design that settimeofday() be used for changes in system time greater
than +-128 ms. It has been the Internet experience that the need to
change the system time in increments greater than +-128 milliseconds is
extremely rare and is usually associated with a hardware or software
malfunction. Nevertheless, the limit applies to each adjtime() call and
it is possible, but not recommended, that this routine is called at
intervals smaller than 64 seconds, which is the NTP lower limit.

For the most accurate and stable operation, adjtime() should be called
at specified intervals; however, the PLL is quite forgiving and neither
moderate loss of updates nor variations in the length of the interval is
serious. The current engineering parameters have been optimized for
intervals not greater than about 64 s. For larger intervals the PLL time
constant can be adjusted to optimize the dynamic response up to
intervals of 1024 s. Normally, this is automatically done by NTP. In any
case, if updates are suspended, the PLL coasts at the frequency last
determinated, which usually results in errors increasing only to a few
tens of milliseconds over a day.

The new code needs to know the initial frequency offset and time
constant for the PLL, and the daemon needs to know the current frequency
offset computed by the kernel for monitoring purposes. This is provided
by a small change in the second argument of the kernel adjtime() calling
sequence, which is documented later in this memo. Ordinarily, only the
daemon will call the adjtime() routine, so the modified calling sequence
is easily accommodated. Other than this change, the operation of
adjtime() is transparent to the original.

In the DECstation 5000/240 and possibly other models there happens to be
an undocumented hardware register that counts system bus cycles at a
rate of 25 MHz. The new kernel routines test for the CPU type and, in
the case of the '240, use this register to interpolate system time
between hardware timer interrupts. This results in a precision of +-1 us
for all time values obtained via the gettimeofday() system call. This
routine calls the kernel routine microtime(), which returns the actual
interpolated value, but does not change the kernel time variable.
Therefore, other kernel routines that access the kernel time variable
directly and do not call either gettimeofday() or microtime() will
continue their present behavior.

The new kernel routines include provisions for error statistics (maximum
error and estimated error), leap seconds and system clock status. These
are intended to support applications that need such things; however,
there are no applications other than the time-synchronization daemon
itself that presently use them. At issue is the manner in which these
data can be provided to application clients, such as new system calls
and data interfaces. While a proposed interface is described later in
this memo, it has not yet been implemented. This is an area for further
study.

While any time-synchronization daemon can in principle be modified to
use the new code, the most likely will be users of the xntp3
distribution of NTP. The code in the xntp3 distribution determines
whether the new kernel code is in use and automatically reconfigures as
required. When the new code is in use, the daemon reads the frequency
offset from a file and provides it and the initial time constant via
adjtime(). In subsequent calls to adjtime(), only the time adjustment
and time constant are affected. The daemon reads the frequency from the
kernel (returned as the second argument of adjtime()) at intervals of
one hour and writes it to the file.

3. Technical Description

Following is a technical description of how the new scheme works in
terms of the variables and algorithms involved. These components are
discussed as a distinct entity and do not involve coding details
specific to the Ultrix kernel. The algorithms involve only minor changes
to the system clock and interval timer routines, but do not in
themselves provide a conduit for application programs to learn the
system clock status or statistics of the time-synchronization process.
In a later section a number of new system calls are proposed to do this,
along with an interface specification.

The new scheme works like the companion simulator called kern.c and
included in this directory. This stand-alone simulator includes code
fragments identical to those in the modified kernel routines and
operates in the same way. The system clock is implemented in the kernel
using a set of variables and algorithms defined below and in the
simulator. The algorithms are driven by explicit calls from the
synchronization protocol as each time update is computed. The clock is
read and set using the gettimeofday() and settimeofday() system calls,
which operate in the same way as the originals, but return a status word
describing the state of the system clock.

Once the system clock has been set, the adjtime() system call is used to
provide periodic updates including the time offset and possibly
frequency offset and time constant. With NTP this occurs at intervals of
from 64 s to 1024 s, deending on the time constant value. The kernel
implements an adaptive-parameter, first-order, type-II, phase-lock loop
(PLL) in order to integrate this offset into the phase and frequency of
the system clock. The kernel keeps track of the time of the last update
and adjusts the maximum error to grow by an amount equal to the
oscillator frequency tolerance times the elapsed time since the last
update.

Occasionally, it is necessary to adjust the PLL parameters in response
to environmental conditions, such as leap-second warning and oscillator
stability observations. While the interface to do this has not yet been
implemented, proposals to to that are included in a later section. A
system call (setloop()) is used on such occasions to communicate these
data. In addition, a system call (getloop())) is used to extract these
data from the kernel for monitoring purposes.

All programs utilize the system clock status variable time_status, which
records whether the clock is synchronized, waiting for a leap second,
etc. The value of this variable is returned by each system call. It can
be set explicitly by the setloop() system call and implicitly by the
settimeofday() system call and in the timer-interrupt routine. Values
presently defined in the header file timex.h are as follows:

int time_status = TIME_BAD;     /* clock synchronization status */

#define TIME_UNS 0      /* unspecified or unknown */
#define TIME_OK 1       /* operation succeeded */
#define TIME_INS 1      /* insert leap second at end of current day */
#define TIME_DEL 2      /* delete leap second at end of current day */
#define TIME_OOP 3      /* leap second in progress */
#define TIME_BAD 4      /* system clock is not synchronized */
#define TIME_ADR -1     /* operation failed: invalid address */
#define TIME_VAL -2     /* operation failed: invalid argument */
#define TIME_PRV -3     /* operation failed: priviledged operation */

In case of a negative result code, the operation has failed; however,
some variables may have been modified before the error was detected.
Note that the new system calls never return a value of zero, so it is
possible to determine whether the old routines or the new ones are in
use. The syntax of the modified adjtime() is as follows:

/*
 * adjtime - adjuts system time
 */
#include <sys/timex.h>

int gettimexofday(tp, fiddle)

struct timeval *tp;     /* system time adjustment*/
struct timeval *fiddle; /* sneak path */

On entry the "timeval" sneak path is coded:

struct timeval {
        long tv_sec = time_constant;    /* time constant */
        long tv_usec = time_freq;       /* new frequency offset */
}

However, the sneak is ignored if fiddle is the null pointer and the new
frequency offset is ignored if zero.

The value returned on exit is the system clock status defined above. The
"timeval" sneak path is modified as follows:

struct timeval {
        long tv_sec = time_precision;   /* system clock precision */
        long tv_usec = time_freq;       /* current frequency offset */
}

3.1. Kernel Variables

The following variables are used by the new code:

long time_offset = 0;           /* time adjustment (us) */

This variable is used by the PLL to adjust the system time in small
increments. It is scaled by (1 << SHIFT_UPDATE) in binary microseconds.
The maximum value that can be represented is about +-130 ms and the
minimum value or precision is about one nanosecond.

long time_constant = SHIFT_TAU; /* pll time constant */

This variable determines the bandwidth or "stiffness" of the PLL. It is
used as a shift, with the effective value in positive powers of two. The
optimum value for this variable is equal to 1/64 times the update
interval. The default value SHIFT_TAU (0) corresponds to a PLL time
constant of about one hour or an update interval of about one minute,
which is appropriate for typical uncompensated quartz oscillators used
in most computing equipment. Values larger than four are not useful,
unless the local clock timebase is derived from a precision oscillator.

long time_tolerance = MAXFREQ;  /* frequency tolerance (ppm) */

This variable represents the maximum frequency error or tolerance of the
particular platform and is a property of the architecture. It is
expressed as a positive number greater than zero in parts-per-million
(ppm). The default MAXFREQ (100) is appropriate for conventional
workstations.

long time_precision = 1000000 / HZ; /* clock precision (us) */

This variable represents the maximum error in reading the system clock.
It is expressed as a positive number greater than zero in microseconds
and is usually based on the number of microseconds between timer
interrupts, in the case of the Ultrix kernel, 3906. However, in cases
where the time can be interpolated between timer interrupts with
microsecond resolution, the precision is specified as 1. This variable
is computed by the kernel for use by the time-synchronization daemon,
but is otherwise not used by the kernel.

struct timeval time_maxerror;   /* maximum error */

This variable represents the maximum error, expressed as a Unix timeval,
of the system clock. For NTP, it is computed as the synchronization
distance, which is equal to one-half the root delay plus the root
dispersion. It is increased by a small amount (time_tolerance) each
second to reflect the clock frequency tolerance. This variable is
computed by the time-synchronization daemon and the kernel for use by
the application program, but is otherwise not used by the kernel.

struct timeval time_esterror;   /* estimated error */

This variable represents the best estimate of the actual error,
expressed as a Unix timeval, of the system clock based on its past
behavior, together with observations of multiple clocks within the peer
group. This variable is computed by the time-synchronization daemon for
use by the application program, but is otherwise not used by the kernel.

The PLL itself is controlled by the following variables:

long time_phase = 0;            /* phase offset (scaled us) */
long time_freq = 0;             /* frequency offset (scaled ppm) */long
time_adj = 0;                   /* tick adjust (scaled 1 / HZ) */

These variables control the phase increment and the frequency increment
of the system clock at each tick of the clock. The time_phase variable
is scaled by (1 << SHIFT_SCALE) in binary microseconds, giving a minimum
value (time resolution) of 9.3e-10 us. The time_freq variable is scaled
by (1 << SHIFT_KF) in parts-per-million (ppm), giving it a maximum value
of about +-130 ppm and a minimum value (frequency resolution) of 6e-8
ppm. The time_adj variable is the actual phase increment in scaled
microseconds to add to time_phase once each tick. It is computed from
time_phase and time_freq once per second.

long time_reftime = 0;          /* time at last adjustment (s) */

This variable is the second's portion of the system time on the last
call to adjtime(). It is used to adjust the time_freq variable as the
time since the last update increases.

The HZ define establishes the timer interrupt frequency, 256 Hz for the
Ultrix kernel and 100 Hz for the SunOS kernel. The SHIFT_HZ define
expresses the same value as the nearest power of two in order to avoid
hardware multiply operations. These are the only parameters that need to
be changed for different timer interrupt rates.

#define HZ 256                  /* timer interrupt frequency (Hz) */
#define SHIFT_HZ 8              /* log2(HZ) */

The following defines establish the engineering parameters of the PLL
model. They are chosen for an initial convergence time of about an hour,
an overshoot of about seven percent and a final convergence time of
several hours, depending on initial frequency error.

#define SHIFT_KG 10             /* shift for phase increment */
#define SHIFT_KF 24             /* shift for frequency increment */
#define SHIFT_TAU 0             /* default time constant (shift) */

The SHIFT_SCALE define establishes the decimal point on the time_phase
variable which serves as a an extension to the low-order bits of the
system clock variable. The SHIFT_UPDATE define establishes the decimal
point of the phase portion of the adjtime() update. The FINEUSEC define
represents 1 us in scaled units.

#define SHIFT_SCALE 28          /* shift for scale factor */
#define SHIFT_UPDATE 14         /* shift for offset scale factor */
#define FINEUSEC (1 << SHIFT_SCALE) /* 1 us in scaled units */

The FINETUNE define represents the residual, in ppm, to be added to the
system clock variable in addition to the integral 1-us value given by
tick. This allows a systematic frequency offset in cases where the timer
interrupt frequency does not exactly divide the second in microseconds.

#define FINETUNE (1000000 - (1000000 / HZ) * HZ) /* frequency adjustment
                                 * for non-isochronous HZ (ppm) */

The following four defines establish the performance envelope of the
PLL, one to bound the maximum phase error, another to bound the maximum
frequency error and the last two to bound the minimum and maximum time
between updates. The intent of these bounds is to force the PLL to
operate within predefined limits in order to conform to the correctness
models assumed by time-synchronization protocols like NTP and DTSS. An
excursion which exceeds these bounds is clamped to the bound and
operation proceeds accordingly. In practice, this can occur only if
something has failed or is operating out of tolerance, but otherwise the
PLL continues to operate in a stable mode. Note that the MAXPHASE define
conforms to the maximum offset allowed in NTP before the system time is
reset, rather than incrementally adjusted.

#define MAXPHASE 128000         /* max phase error (us) */
#define MINSEC 64               /* min interval between updates (s) */
#define MAXFREQ 100             /* max frequency error (ppm) */
#define MAXSEC 1024             /* max interval between updates (s) */

3.2. Code Segments

The code segments illustrated in the simulator should make clear the
operations at various points in the code. These segments are not derived
from any licensed code. The hardupdate() fragment is called by adjtime()
to update the system clock phase and frequency. This is an
implementation of an adaptive-parameter, first-order, type-II phase-lock
loop. Note that the time constant is in units of powers of two, so that
multiplies can be done by simple shifts. The phase variable is computed
as the offset multiplied by the time constant. Then, the time since the
last update is computed and clamped to a maximum (for robustness) and to
zero if initializing. The offset is multiplied (sorry about the ugly
multiply) by the result and by the square of the time constant and then
added to the frequency variable. Finally, the frequency variable is
clamped not to exceed the tolerance. Note that all shifts are assumed to
be positive and that a shift of a signed quantity to the right requires
a litle dance.

With the defines given, the maximum time offset is determined by the
size in bits of the long type (32) less the SHIFT_UPDATE (14) scale
factor or 18 bits (signed). The scale factor is chosen so that there is
no loss of significance in later steps, which may involve a right shift
up to 14 bits. This results in a maximum offset of about +-130 ms. Since
the time_constant must be greater than or equal to zero, the maximum
frequency offset is determined by the SHIFT_KF (24) scale factor, or
about +-130 ppm. In the addition step the value of offset * mtemp is
represented in 18 + 10 = 28 bits, which will not overflow a long add.
There could be a loss of precision due to the right shift of up to eight
bits, since time_constant is bounded at four. This results in a net
worst-case frequency error of about 2^-16 us or well down into the
oscillator phase noise. While the time_offset value is assumed checked
before entry, the time_phase variable is an accumulator, so is clamped
to the tolerance on every call. This helps to damp transients before the
oscillator frequency has been determined, as well as to satisfy the
correctness assertions if the time-synchronization protocol comes
unstuck.

The hardclock() fragment is inserted in the hardware timer interrupt
routine at the point the system clock is to be incremented. The phase
adjustment (time_adj) is added to the clock phase (time_phase) and
tested for overflow of the microsecond. If an overflow occurs, the
microsecond (tick) in incremented or decremented.

The second_overflow() fragment is inserted at the point where the
microseconds field of the system time variable is being checked for
overflow. On rollover of the second the maximum error is increased by
the tolerance. The time offset is divided by the phase weight (SHIFT_KG)
and time constant. The time offset is then reduced by the result and the
result is scaled and becomes the value of the phase adjustment. The
phase adjustment is then corrected for the calculated frequency offset
and a fixed offset FINETUNE which is a property of the architecture. On
rollover of the day the leap-warning indicator is checked and the
apparent time adjusted +-1 s accordingly. The gettimeofday() routine
insures that the reported time is always monotonically increasing.

The simulator can be used to check the loop operation over the design
range of +-128 ms in time error and +-100 ppm in frequency error. This
confirms that no overflows occur and that the loop initially converges
in about 50-60 minutes for timer interrupt rates from 50 Hz to 1024 Hz.
The loop has a normal overshoot of about seven percent and a final
convergence time of several hours, depending on the initional frequency
error.

3.3. Leap Seconds

The leap-warning condition is determined by the synchronization protocol
(if remotely synchronized), by the timecode receiver (if available), or
by the operator (if awake). The time_status value must be set on the day
the leap event is to occur (30 June or 31 December) and is automatically
reset after the event. If the value is TIME_DEL, the kernel adds one
second to the system time immediately following second 23:59:58 and
resets time_status to TIME_OK. If the value is TIME_INS, the kernel
subtracts one second from the system time immediately following second
23:59:59 and resets time_status to TIME_OOP, in effect causing system
time to repeat second 59. Immediately following the repeated second, the
kernel resets time_status to TIME_OK.

Depending upon the system call implementation, the reported time during
a leap second may repeat (with a return code set to advertise that fact)
or be monotonically adjusted until system time "catches up" to reported
time. With the latter scheme the reported time will be correct before
and after the leap second, but freeze or slowly advance during the leap
second itself. However, Most programs will probably use the ctime()
library routine to convert from timeval (seconds, microseconds) format
to tm format (seconds, minutes,...). If this routine is modified to
inspect the return code of the gettimeofday() routine, it could simply
report the leap second as second 60.

To determine local midnight without fuss, the kernel simply finds the
residue of the time.tv_sec value mod 86,400, but this requires a messy
divide. Probably a better way to do this is to initialize an auxiliary
counter in the settimeofday() routine using an ugly divide and increment
the counter at the same time the time.tv_sec is incremented in the timer
interrupt routine. For future embellishment.

4. Proposed Application Program Interface

Most programs read the system clock using the gettimeofday() system
call, which returns the system time and time-zone data. In the modified
5000/240 kernel, the gettimeofday() routine calls the microtime()
routine, which interpolates between hardware timer interrupts to a
precision of +-1 microsecond. However, the synchronization protocol
provides additional information that will be of interest in many
applications. For some applications it is necessary to know the maximum
error of the reported time due to all causes, including those due to the
system clock reading error, oscillator frequency error and accumulated
errors due to intervening time servers on the path to a primary
reference source. However, for those protocols that adjust the system
clock frequency as well as the time offset, the errors expected in
actual use will almost always be much less than the maximum error.
Therefore, it is useful to report the estimated error, as well as the
maximum error.

It does not seem useful to provide additional details private to the
kernel and synchronization protocol, such as stratum, reference
identifier, reference timestamp and so forth. It would in principle be
possible for the application to independently evaluate the quality of
time and project into the future how long this time might be "valid."
However, to do that properly would duplicate the functionality of the
synchronization protocol and require knowledge of many mundane details
of the platform architecture, such as the tick value, reachability
status and related variables. Therefore, the application interface does
not reveal anything except the time, timezone and error data.

With respect to NTP, the data maintained by the protocol include the
roundtrip delay and total dispersion to the source of synchronization.
In terms of the above, the maximum error is computed as half the delay
plus the dispersion, while the estimated error is equal to the
dispersion. These are reported in timeval structures. A new system call
is proposed that includes all the data in the gettimeofday() plus the
two new timeval structures.

The proposed interface involves modifications to the gettimeofday(),
settimeofday() and adjtime() system calls, as well as new system calls
to get and set various system parameters. In order to minimize
confusion, by convention the new system calls are named with an "x"
following the "time"; e.g., adjtime() becomes adjtimex(). The operation
of the modified gettimexofday(), settimexofday() and adjtimex() system
calls is identical to that of their prototypes, except for the error
quantities and certain other side effects, as documented below. By
convention, a NULL pointer can be used in place of any argument, in
which case the argument is ignored.

The synchronization protocol daemon needs to set and adjust the system
clock and certain other kernel variables. It needs to read these
variables for monitoring purposes as well. The present list of these
include a subset of the variables defined previously:

long time_precision
long time_timeconstant
long time_tolerance
long time_freq
long time_status

/*
 * gettimexofday, settimexofday - get/set date and time
 */
#include <sys/timex.h>

int gettimexofday(tp, tzp, tmaxp, testp)

struct timeval *tp;     /* system time */
struct timezone *tzp;   /* timezone */
struct timeval *tmaxp;  /* maximum error */
struct timeval *testp;  /* estimated error */

The settimeofday() syntax is identical. Note that a call to
settimexofday() automatically results in the system being declared
unsynchronized (TIME_BAD return code), since the synchronization
condition can only be achieved by the synchronization daemon using an
internal or external primary reference source and the adjtimex() system
call.

/*
 * adjtimex - adjust system time
 */
#include <sys/timex.h>

int adjtimex(tp, tzp, freq, tc)

struct timeval *tp;     /* system time */
struct timezone *tzp;   /* timezone */
long freq;              /* frequency adjustment */
long tc;                /* time constant */

/*
 * getloop, setloop - get/set kernel time variables
 */
#include <sys/timex.h>

int getloop(code, argp)

int code;               /* operation code */
long *argp;             /* argument pointer */

The paticular kernal variables affected by these routines are selected
by the operation code. Values presently defined in the header file
timex.h are as follows:

#define TIME_PREC 1     /* precision (log2(sec)) */
#define TIME_TCON 2     /* time constant (log2(sec) */
#define TIME_FREQ 3     /* frequency tolerance  */
#define TIME_FREQ 4     /* frequency offset (scaled) */
#define TIME_STAT 5     /* status (see return codes) */

The getloop() syntax is identical.

Comments welcome, but very little support is available:

David L. Mills
Electrical Engineering Department
University of Delaware
Newark, DE 19716
302 831 8247 fax 302 831 4316
mills@udel.edu