PTP Hardware Clock: Sub-Nanosecond Timekeeping for Chiplet Systems
Introduction
Precision timekeeping is a foundational service in any distributed system. Whether synchronising Ethernet frames to a PTP grandmaster, timestamping die-to-die packet exchanges between chiplets, or scheduling time-critical hardware events, the system needs a clock that is accurate, capturable at multiple points simultaneously, and adjustable by both hardware servo loops and software without stopping.
The PTP Hardware Clock (PHC) is an IEEE 1588-aligned timestamp counter IP designed for embedded SoC and chiplet applications. It maintains a 110-bit timestamp (48-bit seconds + 30-bit nanoseconds + 32-bit sub-nanosecond fraction), provides four independent capture banks for simultaneous timestamping from different sources, integrates a dual-source hardware servo multiplexer for runtime-selectable clock discipline, and exposes a programmable alarm with pulse-per-second output.
At approximately 10,000 um^2 in TSMC 65nm (including the AHB-to-APB bridge), the PHC is small enough to instantiate on every chiplet in a multi-die system, giving each node its own high-resolution time reference that can be synchronised to a global master via TideLink PTP or IEEE 1588 Ethernet PTP.
This article describes the PHC's architecture, timestamp format, capture and servo mechanisms, register interface, and verification, developed as part of the SoC Labs platform.
Timestamp Architecture
Format
The PHC maintains a continuously incrementing timestamp with three fields:
```text
┌──────────────────┬──────────────────────┬──────────────────────────┐
│ seconds (48-bit) │ nanoseconds (30-bit) │ sub-nanoseconds (32-bit) │
│ 0 to 2^48 - 1    │ 0 to 999,999,999     │ fractional accumulator   │
└──────────────────┴──────────────────────┴──────────────────────────┘
  ~8.9 million years 1-second range         ~0.23 ps granularity
```

The 48-bit seconds field matches the IEEE 1588 PTP timestamp format. The 30-bit nanoseconds field counts from 0 to 999,999,999 and rolls over into the seconds field; 30 bits are enough because 2^30 = 1,073,741,824 exceeds 999,999,999. The 32-bit sub-nanosecond accumulator provides fractional precision, and its carry-out adds 1 ns to the nanoseconds field.
Per-Cycle Increment
Two registers control how much time advances each clock cycle:
- NS_INCR (8-bit): Integer nanoseconds added per cycle. Default 4 for a 250 MHz clock (1 cycle = 4 ns).
- NS_INCR_FRAC (32-bit): Sub-nanosecond fraction added per cycle, in units of 2^-32 ns. The accumulator carries 1 ns into the nanosecond field when it overflows.
This split representation allows the PHC to track non-integer clock periods. For example, a 200 MHz clock (5.0 ns period) uses NS_INCR=5, NS_INCR_FRAC=0, while a 233.33 MHz clock (4.2857 ns period) uses NS_INCR=4, NS_INCR_FRAC=0x49249249. That fraction is 2/7 ns truncated to 32 bits (0x49249249 / 2^32 ≈ 0.2857 ns), so on average the accumulator carries an extra nanosecond on 2 of every 7 cycles.
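Choosing the two values for an arbitrary clock amounts to splitting the period into an integer number of nanoseconds and a 32-bit binary fraction. The following is a minimal sketch. It assumes the frequency is given as a ratio num_hz / den, so that clocks such as 700/3 MHz are represented exactly. phc_calc_increment is a hypothetical helper and is not part of the phc.h driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the PHC driver): derive NS_INCR and
 * NS_INCR_FRAC for a clock of frequency num_hz / den Hz.
 * Period in ns = 1e9 * den / num_hz. The integer part goes to NS_INCR and
 * the remainder is scaled by 2^32 (one LSB of NS_INCR_FRAC = 2^-32 ns).
 * Returns 0 on success, -1 if the inputs are out of range.
 */
int phc_calc_increment(uint64_t num_hz, uint64_t den,
                       uint8_t *ns_incr, uint32_t *ns_incr_frac)
{
    if (num_hz == 0 || num_hz > UINT32_MAX || den == 0 || den > UINT32_MAX)
        return -1;                           /* keep the 64-bit maths in range */

    uint64_t period_num = 1000000000ULL * den;  /* period = period_num / num_hz ns */
    uint64_t whole_ns   = period_num / num_hz;
    uint64_t rem        = period_num % num_hz;  /* rem < num_hz < 2^32 */

    if (whole_ns > UINT8_MAX)
        return -1;                           /* NS_INCR is only 8 bits wide */

    *ns_incr      = (uint8_t)whole_ns;
    *ns_incr_frac = (uint32_t)((rem << 32) / num_hz);  /* truncates toward zero */
    return 0;
}
```

For example, phc_calc_increment(700000000, 3, ...) (233.33 MHz) yields NS_INCR = 4 and NS_INCR_FRAC = 0x49249249. A 250 MHz clock yields the default 4 and 0.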
Carries and Rollover
The increment logic uses a 33-bit adder for the sub-nanosecond accumulator (detecting carry into nanoseconds) and a 31-bit adder for nanoseconds (detecting second rollover at 999,999,999). On second rollover, the seconds field increments, nanoseconds wraps to the residual, and a pulse-per-second (PPS) output asserts for exactly one clock cycle.
Every cycle:

```text
sub_ns_sum      = sub_nanoseconds + ns_incr_frac    // 33-bit add
carry           = sub_ns_sum[32]                    // fractional overflow
sub_nanoseconds <= sub_ns_sum[31:0]
ns_sum          = nanoseconds + ns_incr + carry     // 31-bit add
if (ns_sum >= 1,000,000,000):
    seconds     <= seconds + 1
    nanoseconds <= ns_sum - 1,000,000,000
    pps         <= 1                                // one-cycle pulse
else:
    nanoseconds <= ns_sum
    pps         <= 0
```

Four Capture Banks
The PHC provides four independent capture banks, each atomically snapshotting the full 110-bit timestamp on its respective trigger. Each snapshot is a parallel register load that completes in the same clock cycle as its trigger, with no bus stalling or multi-cycle sequencing.
```text
                             ┌──────────────┐
CTRL[2] (APB write) ────────►│ SW Capture   │──► CAP_SECONDS/NS/FRAC        (0x020)
                             ├──────────────┤
hw_capture (external) ──────►│ HW Capture   │──► HW_CAP_SECONDS/NS/FRAC     (0x040)
                             ├──────────────┤
eth_rx_capture ─────────────►│ ETH RX Cap   │──► ETH_RX_CAP_SECONDS/NS/FRAC (0x060)
                             ├──────────────┤
eth_tx_capture ─────────────►│ ETH TX Cap   │──► ETH_TX_CAP_SECONDS/NS/FRAC (0x080)
                             └──────────────┘
```

Software Capture
Triggered by writing CAPTURE=1 (W1S) to the CTRL register via APB. The current timestamp is latched into the CAP_* registers, which firmware then reads. Used for periodic status checks, synchronisation verification, and diagnostic snapshots.
Hardware Capture
Triggered by an external hw_capture pulse, typically from the TideLink PTP autonomous servo. The captured timestamp appears in the HW_CAP_* registers, readable by both the local APB interface and the external servo source. This enables hardware servo loops to read timestamps without any APB bus transaction, eliminating software latency from the synchronisation loop.
Ethernet PTP Captures
Two dedicated capture banks for IEEE 1588 PTP frame timestamping:
- RX Capture: Triggered by eth_rx_capture when a PTP frame (EtherType 0x88F7) is received on an Ethernet MAC's MII interface. Records the ingress timestamp (t2 or t4 in PTP terminology).
- TX Capture: Triggered by eth_tx_capture when a PTP frame is transmitted. Records the egress timestamp (t1 or t3).
These triggers come from the PTP event detector in the Ethernet Subsystem, which parses MII nibbles at line speed and generates single-cycle pulses on PTP frame detection.
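The captured t1 to t4 values feed the standard IEEE 1588 delay request-response calculation. The following is a hedged sketch: the type and function names are hypothetical and are not the PHC driver API. It drops the sub-nanosecond fraction and assumes a symmetric path. It computes the offset and mean path delay from the four timestamps.

```c
#include <stdint.h>

/* Hypothetical timestamp type: seconds + nanoseconds (sub-ns dropped). */
typedef struct {
    uint64_t seconds;      /* 48 bits used */
    uint32_t nanoseconds;  /* 0 .. 999,999,999 */
} ptp_ts_t;

/* Returns b - a in nanoseconds; assumes the difference fits in int64_t. */
int64_t ptp_ts_diff_ns(const ptp_ts_t *b, const ptp_ts_t *a)
{
    int64_t ds  = (int64_t)b->seconds - (int64_t)a->seconds;
    int64_t dns = (int64_t)b->nanoseconds - (int64_t)a->nanoseconds;
    return ds * 1000000000LL + dns;
}

/* IEEE 1588 delay request-response: t1 = Sync egress (master),
   t2 = Sync ingress (slave), t3 = Delay_Req egress (slave),
   t4 = Delay_Req ingress (master). Assumes a symmetric path. */
void ptp_offset_delay(const ptp_ts_t *t1, const ptp_ts_t *t2,
                      const ptp_ts_t *t3, const ptp_ts_t *t4,
                      int64_t *offset_ns, int64_t *delay_ns)
{
    int64_t ms = ptp_ts_diff_ns(t2, t1);  /* master-to-slave: delay + offset */
    int64_t sm = ptp_ts_diff_ns(t4, t3);  /* slave-to-master: delay - offset */
    *offset_ns = (ms - sm) / 2;           /* slave clock minus master clock */
    *delay_ns  = (ms + sm) / 2;           /* mean one-way path delay */
}
```

Consider a slave running 1000 ns ahead over a 500 ns path. Then t2 - t1 = 1500 ns and t4 - t3 = -500 ns, which gives an offset of 1000 ns and a delay of 500 ns. Integer division truncates toward zero, which discards at most 0.5 ns.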
Capture Independence
All four banks operate independently. A hardware capture can occur simultaneously with an Ethernet RX capture without either being lost or corrupted. Each bank has its own set of registers, and each snapshot is a single-cycle parallel register load of the same timestamp value, so there is no arbitration or priority between captures.
Dual-Source Hardware Servo
The PHC integrates a two-source servo multiplexer, allowing the clock to be disciplined from either of two external sources at runtime:
```text
┌─────────────────────────┐
│ Source 0 (TideLink)     │
│   hw_capture_0          │────┐
│   hw_set_time_0         │    │     ┌───────────┐
│   hw_adj_ns_incr_frac_0 │    ├────►│   Servo   │──► Clock Core
└─────────────────────────┘    │     │    Mux    │    (set_time, adj_frac)
                               │     │           │
┌─────────────────────────┐    │     │  SRC_SEL  │◄── SERVO_CTRL[0]
│ Source 1 (HA1588)       │    │     └───────────┘
│   hw_capture_1          │────┘
│   hw_set_time_1         │
│   hw_adj_ns_incr_frac_1 │
└─────────────────────────┘
```
Source 0: TideLink PTP
The default source. The TideLink PTP subsystem performs autonomous inter-chiplet time synchronisation using hardware-timestamped PTP exchanges over the die-to-die link. It drives hw_capture_0 to snapshot the local time, reads back the captured values, computes offset and delay, and applies corrections via hw_set_time_0 (phase step) or hw_adj_ns_incr_frac_0 (frequency steering).
Source 1: HA1588 Hardware Servo
An alternative source for Ethernet-based time synchronisation. The HA1588 servo module (in the Ethernet Subsystem) periodically compares the HA1588 RTC time to the PHC time and applies corrections. This is used when the PHC should be disciplined from the Ethernet PTP grandmaster rather than from a neighbouring chiplet.
Runtime Switching
The servo source is selected by SERVO_CTRL[0] (SRC_SEL). Switching is immediate (combinational mux), and the clock continues running without interruption. Both sources always observe the same hw_cap_* outputs, so a source can monitor the clock even when it is not the active disciplining source.
Fractional Increment Arbitration
The NS_INCR_FRAC register can be written by both the APB interface (software) and the active hardware servo (hw_adj_valid). The PHC uses a last-writer-wins protocol: when the hardware servo asserts hw_adj_valid, it takes ownership of the fractional increment; when software writes a new value to NS_INCR_FRAC via APB, software reclaims ownership. This prevents the hardware servo's adjustments from being silently overwritten by stale software values, while still allowing software to override the servo when needed.
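The ownership rule can be captured in a few lines. This is a hypothetical behavioural model, not the RTL. It assumes the clock core uses the hardware value while the servo owns the increment, and the APB register value otherwise. It does not model an APB write and hw_adj_valid landing in the same cycle, because the text does not specify which takes priority.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of NS_INCR_FRAC ownership (illustrative names). */
typedef struct {
    uint32_t apb_frac;  /* value last written via APB */
    uint32_t hw_frac;   /* value last applied with hw_adj_valid */
    bool     hw_owns;   /* true after a hardware adjustment */
} phc_frac_arb_t;

/* Software write: update the register and reclaim ownership. */
void arb_apb_write(phc_frac_arb_t *s, uint32_t v)
{
    s->apb_frac = v;
    s->hw_owns  = false;
}

/* Hardware servo adjustment (hw_adj_valid): take ownership. */
void arb_hw_adjust(phc_frac_arb_t *s, uint32_t v)
{
    s->hw_frac = v;
    s->hw_owns = true;
}

/* Fractional increment actually used by the clock core. */
uint32_t arb_effective(const phc_frac_arb_t *s)
{
    return s->hw_owns ? s->hw_frac : s->apb_frac;
}
```

The stale APB value stays in its register after a hardware adjustment but is not used again until software writes a new value. This is the behaviour the last-writer-wins test scenario checks.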
Alarm and PPS
Programmable Alarm
The PHC includes a time-of-day alarm comparator:
- Load the target time into ALARM_SECONDS_LO/HI and ALARM_NANOSECONDS.
- Set ALARM_CTRL[0] = ARM to enable the comparator.
- When (seconds == alarm_seconds) && (nanoseconds >= alarm_nanoseconds), alarm_hit asserts.
- If ALARM_CTRL[1] = AUTO_DISARM, the alarm automatically disarms after firing (one-shot mode).
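The alarm drives STATUS[2] and the alarm_irq output (gated by INT_EN[1]). The target is specified with 1 ns resolution, but the comparator is evaluated once per clock cycle. The alarm therefore asserts on the first cycle whose timestamp is at or past the target within the target second, which enables precise hardware event scheduling.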
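The comparator can be written as a small software model. Doing so makes three consequences of the equality test on seconds explicit:

- An alarm whose target second has already passed never fires.
- Without AUTO_DISARM, alarm_hit stays asserted on every cycle for the rest of the target second.
- A target inside the last clock period of a second is never reached before the seconds field changes. For example, 999,999,998 ns with a 4 ns increment that only visits multiples of 4 would not fire.

This hypothetical C sketch models the comparator expression only. It does not model whether the RTL registers alarm_hit or latches it into STATUS[2], and the names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the alarm comparator (not the RTL). */
typedef struct {
    uint64_t alarm_seconds;
    uint32_t alarm_nanoseconds;
    bool     armed;        /* ALARM_CTRL[0] ARM */
    bool     auto_disarm;  /* ALARM_CTRL[1] AUTO_DISARM */
} phc_alarm_t;

/* Evaluate one clock cycle at time (seconds, nanoseconds); returns alarm_hit. */
bool phc_alarm_eval(phc_alarm_t *a, uint64_t seconds, uint32_t nanoseconds)
{
    bool hit = a->armed &&
               seconds == a->alarm_seconds &&
               nanoseconds >= a->alarm_nanoseconds;
    if (hit && a->auto_disarm)
        a->armed = false;  /* one-shot: fire once, then disarm */
    return hit;
}
```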
Pulse-Per-Second
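The PPS output asserts for exactly one clock cycle on every second rollover, in the cycle where the nanoseconds field wraps. With an increment that divides one second exactly (such as the default 4 ns), nanoseconds wraps to 0. Otherwise it wraps to the small residual described under Carries and Rollover. The PPS output drives: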
- The pps_out top-level output (for external test equipment or oscilloscope triggering)
- A sticky status bit STATUS[1] (cleared on register read)
- The pps_irq interrupt (gated by INT_EN[0])
Register Map
Core Configuration (0x000--0x01C)
| Offset | Name | Access | Description |
|---|---|---|---|
| 0x000 | CTRL | RW | [0] EN, [1] SET_TIME (W1S), [2] CAPTURE (W1S) |
| 0x004 | STATUS | RO | [0] RUNNING, [1] PPS (sticky, clear-on-read), [2] ALARM_HIT |
| 0x008 | NS_INCR | RW | [7:0] Integer ns per cycle (default 4) |
| 0x00C | NS_INCR_FRAC | RW | [31:0] Fractional sub-ns per cycle |
| 0x010 | SET_SECONDS_LO | RW | [31:0] Seconds to load (lower) |
| 0x014 | SET_SECONDS_HI | RW | [15:0] Seconds to load (upper) |
| 0x018 | SET_NANOSECONDS | RW | [29:0] Nanoseconds to load |
| 0x01C | INT_EN | RW | [0] PPS_IRQ_EN, [1] ALARM_IRQ_EN |
Software Capture + Alarm (0x020--0x03C)
| Offset | Name | Access | Description |
|---|---|---|---|
| 0x020 | CAP_SECONDS_LO | RO | Captured seconds (lower) |
| 0x024 | CAP_SECONDS_HI | RO | Captured seconds (upper) |
| 0x028 | CAP_NANOSECONDS | RO | Captured nanoseconds |
| 0x02C | CAP_NS_FRAC | RO | Captured sub-nanoseconds |
| 0x030 | ALARM_SECONDS_LO | RW | Alarm target seconds (lower) |
| 0x034 | ALARM_SECONDS_HI | RW | Alarm target seconds (upper) |
| 0x038 | ALARM_NANOSECONDS | RW | Alarm target nanoseconds |
| 0x03C | ALARM_CTRL | RW | [0] ARM, [1] AUTO_DISARM |
Hardware Capture (0x040--0x04C)
| Offset | Name | Access | Description |
|---|---|---|---|
| 0x040--0x04C | HW_CAP_* | RO | Hardware-captured seconds, nanoseconds, sub-ns |
Ethernet PTP Captures (0x060--0x08C)
| Offset | Name | Access | Description |
|---|---|---|---|
| 0x060--0x06C | ETH_RX_CAP_* | RO | Ethernet RX PTP capture |
| 0x080--0x08C | ETH_TX_CAP_* | RO | Ethernet TX PTP capture |
Servo Configuration (0x0A0--0x0A8)
| Offset | Name | Access | Description |
|---|---|---|---|
| 0x0A0 | SERVO_CTRL | RW | [0] SRC_SEL (0=TideLink, 1=HA1588), [1] HA1588_SERVO_EN |
| 0x0A4 | SYNC_INTERVAL | RW | [29:0] Sync period in nanoseconds (default about 1 s, i.e. ~1 Hz) |
| 0x0A8 | SERVO_STATUS | RO | [0] LOCKED, [1] PHASE_STEP_ACTIVE |
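For bring-up code that bypasses the driver, the register map is enough to capture time, set time and select the servo source directly. The sketch below makes three assumptions. First, CTRL[0] (EN) reads back as written. Second, the W1S bits act only on a written 1. Third, a capture bank is loaded before the following APB read (each access takes 2 APB cycles). The macro and function names are hypothetical; they are not the names in phc_apb_regs.generated.h. One design point matters: CTRL mixes the RW EN bit with W1S strobes, so each strobe is written together with the current EN value. Writing CAPTURE alone as 0x4 would also clear EN and stop the clock.

```c
#include <stdint.h>

/* Byte offsets from the register map, as 32-bit word indices (hypothetical names). */
#define PHC_CTRL            (0x000u / 4u)
#define PHC_SET_SECONDS_LO  (0x010u / 4u)
#define PHC_SET_SECONDS_HI  (0x014u / 4u)
#define PHC_SET_NANOSECONDS (0x018u / 4u)
#define PHC_CAP_SECONDS_LO  (0x020u / 4u)
#define PHC_CAP_SECONDS_HI  (0x024u / 4u)
#define PHC_CAP_NANOSECONDS (0x028u / 4u)
#define PHC_CAP_NS_FRAC     (0x02Cu / 4u)
#define PHC_SERVO_CTRL      (0x0A0u / 4u)

#define PHC_CTRL_EN         (1u << 0)
#define PHC_CTRL_SET_TIME   (1u << 1)  /* W1S */
#define PHC_CTRL_CAPTURE    (1u << 2)  /* W1S */
#define PHC_SERVO_SRC_SEL   (1u << 0)  /* 0 = TideLink, 1 = HA1588 */

typedef struct {
    uint64_t seconds;      /* 48-bit seconds */
    uint32_t nanoseconds;
    uint32_t ns_frac;
} phc_raw_ts_t;

/* Write a W1S strobe while preserving EN (writing 0 to EN would stop the clock). */
static void phc_ctrl_pulse(volatile uint32_t *regs, uint32_t w1s_bit)
{
    uint32_t en = regs[PHC_CTRL] & PHC_CTRL_EN;
    regs[PHC_CTRL] = en | w1s_bit;
}

/* Software capture, then read the CAP_* bank. */
void phc_raw_capture(volatile uint32_t *regs, phc_raw_ts_t *ts)
{
    phc_ctrl_pulse(regs, PHC_CTRL_CAPTURE);
    uint32_t lo = regs[PHC_CAP_SECONDS_LO];
    uint32_t hi = regs[PHC_CAP_SECONDS_HI] & 0xFFFFu;          /* upper 16 bits */
    ts->seconds     = ((uint64_t)hi << 32) | lo;
    ts->nanoseconds = regs[PHC_CAP_NANOSECONDS] & 0x3FFFFFFFu;  /* 30-bit field */
    ts->ns_frac     = regs[PHC_CAP_NS_FRAC];
}

/* Load SET_* registers, then strobe SET_TIME. */
void phc_raw_set_time(volatile uint32_t *regs, uint64_t seconds, uint32_t nanoseconds)
{
    regs[PHC_SET_SECONDS_LO]  = (uint32_t)seconds;
    regs[PHC_SET_SECONDS_HI]  = (uint32_t)(seconds >> 32) & 0xFFFFu;
    regs[PHC_SET_NANOSECONDS] = nanoseconds & 0x3FFFFFFFu;
    phc_ctrl_pulse(regs, PHC_CTRL_SET_TIME);
}

/* Change SRC_SEL while keeping HA1588_SERVO_EN (bit 1) untouched. */
void phc_raw_select_servo(volatile uint32_t *regs, unsigned src)
{
    uint32_t v = regs[PHC_SERVO_CTRL] & ~PHC_SERVO_SRC_SEL;
    regs[PHC_SERVO_CTRL] = v | (src ? PHC_SERVO_SRC_SEL : 0u);
}
```

On hardware, regs would be (volatile uint32_t *)PHC_BASE_ADDR. The same functions can also be exercised against an ordinary array, as a host-side unit test would do.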
Hardware Architecture
The PHC is structured as four SystemVerilog modules:
```text
PHC_AHB (top-level, AHB slave)
├── cmsdk_ahb_to_apb (CMSDK bridge)
└── phc (APB top-level, including the combinational servo source mux)
    ├── phc_apb_regs (register interface, alarm, interrupts)
    └── phc_clock_core (timestamp counter, 4 capture banks, PPS)
```
| Module | Lines | Description |
|---|---|---|
| phc_clock_core.sv | 235 | Timestamp counter, increment logic, 4 capture banks, PPS |
| phc_apb_regs.sv | 403 | APB register decode, alarm comparator, interrupt gating |
| phc.sv | 323 | APB top-level, servo source mux, signal routing |
| PHC_AHB.sv | 211 | AHB-to-APB bridge wrapper |
Estimated area (TSMC 65nm): ~10,000 um^2 including the CMSDK AHB-to-APB bridge. The clock core alone is approximately 3,000--4,000 um^2.
Software Driver
A CMSIS-compliant C driver (phc.h, phc.c) provides the firmware interface:
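```c
#include "phc.h"

void phc_example(void)
{
    phc_t clock;
    phc_init(&clock, PHC_BASE_ADDR);  // PHC_BASE_ADDR: platform-specific

    // Configure for 250 MHz: 4 ns per cycle, no fractional part
    phc_set_increment(&clock, 4, 0x00000000);
    phc_enable(&clock);

    // Set initial time to 1000 s, 0 ns
    phc_timestamp_t t = {.seconds_lo = 1000, .nanoseconds = 0};
    phc_set_time(&clock, &t);

    // Read current time via software capture
    phc_capture(&clock);
    phc_timestamp_t now;
    phc_read_capture(&clock, &now);

    // Set alarm for 5 seconds from now: copy every field of 'now' so the
    // upper seconds and nanoseconds are kept, then advance the lower seconds
    // (a carry out of seconds_lo is ignored in this example)
    phc_timestamp_t alarm = now;
    alarm.seconds_lo += 5;
    phc_set_alarm(&clock, &alarm);
    phc_arm_alarm(&clock);
    phc_enable_alarm_irq(&clock);

    // Select TideLink PTP (source 0) as servo source
    phc_set_servo_source(&clock, 0);
}
```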
Register definitions are auto-generated from SystemRDL (phc_apb_regs.rdl to phc_apb_regs.generated.h).
Integration
In the Ethernet-NanoSoClet
The PHC sits on the base tier's AHB interconnect, accessible to the host Cortex-M0 and via the external slave ports. It integrates with both the Ethernet subsystem and TideLink:
- Source 0 (TideLink PTP): The TideLink PTP subsystem drives hw_capture_0 during autonomous SYNC/DELAY_REQ exchanges, reads the captured time, computes the offset, and applies corrections via hw_adj_ns_incr_frac_0. This runs entirely in hardware, with no firmware involvement after initial configuration.
- Source 1 (HA1588): The HA1588 hardware servo in the Ethernet subsystem periodically compares the HA1588 RTC time to the PHC and applies corrections. This disciplines the PHC from the Ethernet network's PTP grandmaster.
- Ethernet captures: The PTP event detector in the Ethernet subsystem drives eth_rx_capture and eth_tx_capture on PTP frame detection, latching timestamps into the dedicated Ethernet capture banks.
Time Distribution Hierarchy
In a multi-chiplet system with Ethernet:
```text
Ethernet PTP Grandmaster
        │  (IEEE 1588 over MII)
        ▼
HA1588 RTC (Ethernet-NanoSoClet)
        │  (HA1588 servo, Source 1)
        ▼
PHC (Ethernet-NanoSoClet, Grandmaster)
        │  (TideLink PTP, Source 0)
        ▼
PHC (Codec-NanoSoClet, Subordinate)
        │  (TideLink PTP, Source 0)
        ▼
PHC (Sensor-NanoSoClet, Subordinate)
```
Each hop adds the TideLink PTP jitter budget (< 10 ns hardware, dependent on software servo loop performance). End-to-end accuracy from the Ethernet grandmaster to the leaf chiplet depends on the Ethernet PTP performance (typically sub-microsecond) plus the accumulated TideLink PTP jitter.
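The following is a rough worked bound on how the budget accumulates. It is an illustration, not a measured result. Let ε_E be the Ethernet PTP error at the HA1588 RTC and ε_H the error added by the HA1588 servo hop into the Ethernet-NanoSoClet PHC. Assume each TideLink hop adds at most 10 ns, with per-hop standard deviation σ_T.

For a leaf N TideLink hops below the Ethernet-NanoSoClet PHC, errors that align in the worst case add linearly. If the hop errors are instead independent and zero-mean, their standard deviations add in quadrature, so the spread grows as √N rather than N:

$$
|e_{\mathrm{leaf}}| \le |\varepsilon_E| + |\varepsilon_H| + N \times 10\ \mathrm{ns},
\qquad
\sigma_{\mathrm{leaf}} = \sqrt{\sigma_E^2 + \sigma_H^2 + N\,\sigma_T^2}
$$

For the Sensor-NanoSoClet in the hierarchy above, N = 2, so the TideLink contribution is at most 20 ns in the worst case.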
Verification
The PHC is verified through 99 cocotb tests across four hierarchy levels:
| Level | Testbench | Tests | Focus |
|---|---|---|---|
| 1 | phc_clock_core | 21 | Timestamp counter, increment, rollover, PPS, captures |
| 2 | phc_apb_regs | 46 | Register read/write, alarm, interrupts, defaults |
| 3 | phc | 20 | Integration: servo mux, cross-module interaction |
| 4 | PHC_AHB | 12 | AHB bridge, C driver, end-to-end register access |
Additional verification:
- Cross-IP tests (in ethernet-mac-ahb): 4 tests for end-to-end PTP sync from Ethernet MAC to PHC, 10 tests for HA1588 servo CDC handshake
- Formal: X-propagation analysis on all modules
- CDC: SpyGlass single-clock validation (PHC core is fully synchronous)
- CI pipeline: Lint, regression, synthesis (DC + RTLA, TSMC 65nm), coverage merge, dashboard generation
Key Test Scenarios
- Fractional carry accumulation: Verify that NS_INCR_FRAC = 0x80000000 (0.5 ns) produces a carry every other cycle.
- Second rollover and PPS: Set nanoseconds near 999,999,999, verify rollover, PPS pulse width (exactly 1 cycle), and seconds increment.
- Atomic capture: Trigger capture mid-rollover and verify the snapshot is consistent (no partial update).
- Servo source switching: Switch from source 0 to source 1 mid-operation and verify the clock continues without glitch.
- Last-writer-wins: Hardware servo writes NS_INCR_FRAC, then APB writes a different value; verify APB wins and retains ownership.
- Alarm one-shot: Arm alarm with AUTO_DISARM, verify it fires once and automatically disarms.
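The first two scenarios can be reproduced against a small software reference model. The sketch below is a hypothetical, bit-accurate C model of the per-cycle rules in the Carries and Rollover section. It models the 33-bit fractional add, its carry into nanoseconds, and the second rollover that raises PPS for one cycle. The type and function names are illustrative, and this is not the cocotb testbench's own model.

```c
#include <stdint.h>

/* Hypothetical reference model of the PHC counter state (not the RTL). */
typedef struct {
    uint64_t seconds;      /* 48-bit in hardware */
    uint32_t nanoseconds;  /* 0 .. 999,999,999 */
    uint32_t sub_ns;       /* 32-bit fraction, 1 LSB = 2^-32 ns */
} phc_model_t;

/* Advance one clock cycle; returns 1 when PPS would pulse in this cycle. */
int phc_model_tick(phc_model_t *t, uint8_t ns_incr, uint32_t ns_incr_frac)
{
    uint64_t sub_sum = (uint64_t)t->sub_ns + ns_incr_frac;  /* 33-bit add */
    uint32_t carry   = (uint32_t)(sub_sum >> 32);          /* bit 32 */
    uint32_t ns_sum  = t->nanoseconds + ns_incr + carry;

    t->sub_ns = (uint32_t)sub_sum;                          /* keep low 32 bits */
    if (ns_sum >= 1000000000u) {
        t->seconds     = (t->seconds + 1) & 0xFFFFFFFFFFFFULL;  /* 48-bit wrap */
        t->nanoseconds = ns_sum - 1000000000u;                  /* residual */
        return 1;
    }
    t->nanoseconds = ns_sum;
    return 0;
}
```

With NS_INCR = 4 and NS_INCR_FRAC = 0x80000000, four ticks from zero advance nanoseconds to 18: 16 ns of integer increments plus two carries, one every other cycle.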
Performance
| Metric | Value |
|---|---|
| Timestamp resolution | ~0.23 ps (32-bit sub-nanosecond) |
| Maximum time range | ~8.9 million years (48-bit seconds) |
| PPS accuracy | Exactly 1 clock cycle, on the seconds rollover |
| Capture latency | 0 cycles (combinational snapshot) |
| Servo switch latency | 0 cycles (combinational mux) |
| Register access latency | 2 APB cycles (via CMSDK bridge) |
| Alarm comparison | Combinational (same cycle as match) |
| Clock frequency range | 200--250 MHz (configurable increment) |
Comments
Integration within the Arm Architecture
Some work is needed to understand how this IP could act as the coordinating system clock in a reference design for chiplet-based systems within an Arm-based ecosystem.
The Arm Chiplet System Architecture defines two System Types, a hub-based model and a decentralised compute-fabric model, and time must be coordinated within both.
In Chapter 6, the role of the System Counter is defined. The system counter is specified by the Arm Architecture Reference Manual as a sub-component of the Generic Timer, a necessary component in an Arm system. It measures the passing of time in real-time, providing a uniform view of system time to all components in the Arm system that require it. This includes, in addition to the Generic Timer, trace timestamps and RAS telemetry.
In a system that is composed of chiplets the system counter is distributed over multiple chiplets. There is a primary system counter in a single chiplet that provides the source of the count, and secondary system counters in all other chiplets that require a view of system time. These secondary system counters are used for local distribution of the system count within their respective chiplet.
A system counter meets the requirements as specified by the Arm ARM Generic Timer and by the Arm Base System Architecture Clock and Timer Subsystem.
There are separate Arm Architecture Reference Manuals for M-class and A-class implementations.
The NanoSoC reference design is an M-class system. That architecture states:
The system timer, SysTick
Generated by the SysTick timer that is an integral component of an Armv7-M processor. SysTick is permanently enabled. An Armv7-M implementation must include a system timer, SysTick, that provides a simple, 24-bit clear-on-write, decrementing, wrap-on-zero counter with a flexible control mechanism.
The timer is clocked by a reference clock. Whether the reference clock is the processor clock or an external clock source is implementation defined. If an implementation uses an external clock, it must document the relationship between the processor clock and the external reference. This is required for system timing calibration, taking account of metastability, clock skew and jitter.
Global timestamping
When an implementation includes global timestamping, the ITM includes an external ITM timestamp interface, providing a 48-bit or 64-bit global timestamp count value and a clock change signal that the system asserts if there is a change in the ratio between the global timestamp clock frequency and the processor clock frequency.
This project will need to consider the interaction between the various time elements within an Arm based Chiplet system.