AHB Chiplet Communication (TideLink)
Introduction
For small M-class microcontroller SoCs, particularly those built around Arm Cortex-M0, M0+, and M3 processors, AHB is the standard on-chip bus interconnect. AHB is an inherently blocking transfer protocol: a bus manager must receive the response to its current transaction before it can issue the next. This works well for low-throughput, low-latency interconnects within a single die, but becomes problematic when the bus fabric must stretch across chiplets. Read transactions are especially concerning: the read data must return before the bus is relinquished, so when long latencies are involved (multiple hops between chiplets, clock domain crossings, serialisation delays, and so on), the entire bus stalls for the duration.
This project, called TideLink, extends the generic AXI chiplet controller, which is built around the open-source Wlink die-to-die link layer and adds runtime master/slave role selection and an I2C sideband. It uses Arm XHB500 AHB-to-AXI bridges to interface to the AXI-based inter-chiplet link. TideLink provides three independent communication paths that share a single die-to-die PHY, each solving a distinct class of chiplet communication problem: a transparent AHB bridge with a CAM-based address translator, a credit-based packet FIFO, and a Precision Time Protocol (PTP) clock synchronisation engine.
The AHB Blocking-Bus Problem
AHB is a blocking protocol. A manager must receive the response to its current transaction before issuing the next one. For read transactions over a chiplet link, this creates a severe stall:
- The host CPU issues an AHB read. The address phase completes in one cycle.
- The TideLink bridge must hold HREADY low while the request crosses the link to the remote slave, the remote slave processes it, and the read data returns across the link.
- For even a modest 100 ns one-way link latency at 100 MHz AHB (10 ns per cycle), the 200 ns round trip introduces a minimum 20-cycle stall per read, and the entire host AHB bus is frozen for the duration.
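The stall arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name and the optional `remote_service_ns` term are illustrative assumptions, and the model ignores bridge and serialisation overheads, so it gives a lower bound only.

```python
import math

def read_stall_cycles(one_way_latency_ns: float, ahb_clock_mhz: float,
                      remote_service_ns: float = 0.0) -> int:
    """Lower bound on the cycles HREADY is held low for one bridged read."""
    cycle_ns = 1000.0 / ahb_clock_mhz
    # The request crosses the link, is serviced, and the data crosses back.
    round_trip_ns = 2.0 * one_way_latency_ns + remote_service_ns
    return math.ceil(round_trip_ns / cycle_ns)

# 100 ns each way at 100 MHz (10 ns per cycle) is a 200 ns round trip.
print(read_stall_cycles(100.0, 100.0))  # 20
```

Every additional hop or clock domain crossing adds to the one-way latency, and the stall grows in direct proportion.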
AHB does define SPLIT and RETRY mechanisms that allow a slave to release the bus during a long-latency response, but these require arbiter support and are absent from Cortex-M bus matrices and virtually all existing Cortex-M peripherals. They are not a practical mitigation.
Writes are less critical — they can be buffered and issued as fire-and-forget — but read performance over a transparent AHB bridge degrades dramatically with link latency. For latency-sensitive or high-bandwidth use cases, transparent AHB bridging is insufficient on its own.
The Solution: A Three-Path Architecture
The TideLink project addresses the full range of chiplet communication requirements through three independent paths, all sharing a single die-to-die PHY and independently flow-controlled so that traffic on one path cannot starve or be starved by traffic on another.
Path 1 — Transparent AHB Bridge (Control-Plane Traffic)
An XHB500 AHB-to-AXI bridge converts AHB transactions to AXI, which natively supports outstanding (non-blocking) transactions. The AXI chiplet controller carries the transactions over the link. An XHB500 AXI-to-AHB bridge reconstructs AHB on the remote side. A CAM-based address translator remaps local bus addresses to remote address ranges using two independent translation channels, each with 8 programmable match/replace rules that can be configured at runtime via APB.
This path is used for control-plane access: configuration writes, memory-mapped access to remote peripherals, and debugging. Latency is acceptable for these use cases, and the programming model is completely transparent — a CPU or DMA engine on one chiplet issues AHB transactions that are forwarded and executed on the remote chiplet's bus fabric without any software awareness of the link.
Path 2 — Mailbox Packet FIFO (Data-Plane and Latency-Sensitive Traffic)
Rather than bridging AHB reads transparently, TideLink exposes a FIFO mailbox on each chiplet. Software on the sending side constructs a descriptor packet — specifying transaction type, source and destination chiplet IDs, addresses, burst length, and a transaction tag — and writes it word-by-word into a TX aperture. A dedicated Wlink flow-control (FC) node (data_id=0xa1, 48-bit width) carries the words across the link directly into the remote chiplet's receive FIFO. Software on the receiving side is interrupted when a complete packet arrives, pops the descriptor, performs the local transaction, and writes a response packet back through its own TX aperture.
This path eliminates bus stalling entirely. The CPU writes a handful of words to a local peripheral and is immediately free. The bus is never held waiting for a remote response. For read requests, the host CPU writes only a 4-word descriptor and continues executing — the remote CPU performs the local reads and sends the data back asynchronously as a response packet.
Path 3 — PTP Clock Synchronisation (Time-Plane Traffic)
In multi-chiplet systems, a common time reference is essential for coordinating events, timestamping data, and implementing distributed protocols. TideLink integrates a Precision Time Protocol subsystem that synchronises the Precision Hardware Clock (PHC) across chiplets using dedicated Wlink short packets (data_id=0x50 for SYNC, 0x51 for DELAY_REQ). This path bypasses the FC state machine entirely, with no credits, replay buffers, or CRC overhead, providing a 67% bandwidth reduction compared to long packets and tighter timing characteristics.
All three paths are necessary. The transparent bridge provides simple memory-mapped access with no software overhead. The mailbox provides scalable, bulk, interrupt-driven data movement without AHB bus stalling. The PTP path provides autonomous clock synchronisation. Together they cover the full range of chiplet communication requirements for Cortex-M class systems.
Relationship to Wlink
Wlink is a layered chiplet communication stack originally developed by Wavious (now open-source):
- Application layer: Protocol-specific nodes that convert bus transactions into Wlink packets. Wlink natively supports AXI, APB, and TileLink application nodes. AHB is not natively supported — this is TideLink's role.
- Link layer: Flow control (FC state machines), ECC (MIPI CSI/DSI SEC/DED), byte striping across lanes, TX/RX routing.
- PHY layer: Configurable — GPIO, SerDes, Bunch-of-Wires, or custom. Up to 256 asymmetric lanes. TideLink uses 8 GPIO lanes by default.
TideLink extends Wlink with two additional application-layer nodes. The mailbox path adds a dedicated FC node (data_id=0xa1, 48-bit) that provides the streaming valid/ready interface for FIFO data. The PTP path uses Wlink's native short packet mechanism (32-bit, with Hamming SEC/DED ECC) for low-latency timestamp exchange. The regular AHB bridge path uses the existing AXI application nodes within the chiplet controller. The Wlink instance is regenerated from Chisel source with the TideLink-specific FC node configuration.
TideLink wraps Wlink in a generic chiplet controller (axi_chiplet_controller) that adds runtime master/slave role selection via a strap pin and APB register, I2C sideband with independent master and slave cores for out-of-band configuration, and Wlink power-on-reset gating until the role is locked. This allows a single TideLink to serve as either endpoint in an asymmetric chiplet pair.
Architecture Detail
TideLink Top-Level Integration
The top-level module (tidelink_top) presents six AHB slave ports, one AHB master port, an APB configuration port, and dedicated interfaces for the PHC clock domain, chiplet controller role selection, and I2C sideband:
| Port | Direction | Purpose |
|---|---|---|
| ahb_sub | Slave | Regular AHB access to remote side (via XHB500, address-translated) |
| ahb_tx | Slave | TideLink TX aperture (direct to FC node, same aperture size as remote RX FIFO) |
| ahb_fifo | Slave | Local RX FIFO data window (pop received packets) |
| ahb_adr | Slave | Address translator configuration |
| ahb_ptp | Slave | PTP TX write port (CPU writes here to trigger PTP short-packet messages) |
| apb | Slave | Unified configuration port (0x0000–0x1FFF: Wlink controller, 0x2000–0x203F: TideLink config + PTP registers) |
| ahb_mng | Master | Incoming transactions from remote side (via XHB500) |
Additionally, it exposes:
- PHY pads for the die-to-die link (8 GPIO lanes by default).
- PHC clock domain interface: a full bidirectional interface comprising hardware capture trigger and timestamp outputs (to/from the external PHC), free-running PHC time inputs, PPS pulse, and phase-step/frequency-adjust outputs from the servo.
- Role selection: role_strap_i (external strap pin), role_is_master_o, role_locked_o.
- I2C sideband: tristate SCL/SDA pins plus an AXI slave port for CPU-initiated I2C master transactions.
- Five interrupt outputs: released_credits_irq, doorbell_irq, packet_committed_irq, ptp_irq, wlink_irq.
- Servo status: servo_locked output.
- General bus: 32-bit bidirectional interrupt forwarding across the link.
- Scan/DFT: scan mode, clock, shift, chain in/out.
Receive-Side FIFO Subsystem
The receive-side FIFO subsystem (tidelink_fifo_ahb) is the local mailbox buffer on each chiplet. It wraps:
- A 16 KB SRAM backing store with technology-specific implementations (FPGA, ASIC, or generic behavioural RTL).
- A FIFO controller (tidelink_fifo_ctrl) that manages circular read/write pointers, packet-boundary framing, and credit counting. The controller uses the first word of each incoming packet as a length field to detect packet boundaries, firing a packet_committed_irq interrupt when a complete packet has been received. The controller supports two write sources: standard AHB writes (2-phase protocol) and a direct FC write path that bypasses the AHB bus for single-cycle writes, doubling write throughput from the FC adapter.
- An APB register block (tidelink_apb_regs) for configuration (pair base address, credit release threshold), status (overrun, underrun, master error, packet committed), credit accumulators, doorbell, pair credit counter, and pass-through access to PTP, servo, and chiplet controller registers. The register block is organised into five regions:
  - Region 0 (FIFO configuration): pair base address, release threshold, packet word length, credit count, status, doorbell, flush control.
  - Region 1 (pair-side accumulators): released credits (write-accumulate/read-clear), doorbell response, pair credit counter with consume and enable registers.
  - Region 2 (PTP and hardware sync initiator): PTP control/status/RX payload pass-through, HW sync enable/interval/status.
  - Region 3 (servo configuration): mode (Grandmaster/Subordinate), PI gains (KP, KI), step threshold, status (locked, last delay, NS_INCR_FRAC), and mailbox write registers for incoming servo timestamps.
  - Region 4: chiplet controller register pass-through for Wlink configuration.
- A returner (tidelink_returner): a 3-channel priority-arbitrated AHB master that sends credit-release deltas and doorbell notifications back to the remote side. As the CPU reads data from the FIFO, freed word counts accumulate until they reach a configurable release threshold, at which point a credit delta is returned to the remote sender. The three channels are prioritised: credit release (highest), doorbell response, and reset handshake (lowest).
Credit accounting is handled automatically. The maximum credit count is derived from the SRAM size (4096 words for a 16 KB FIFO). Each packet costs its word length plus one (for the length word itself). Credits are decremented on write and incremented on read in a circular buffer scheme. Setting the release threshold to zero passes credits through immediately for backward compatibility.
FC Adapter
The FC adapter (tidelink_fc_adapter) bridges the AHB domain to the Wlink FC node. It handles traffic through a priority-arbitrated TX path and a stateless RX demultiplexer:
Transmit side: The adapter presents a write-only AHB slave (the TX aperture) through which the CPU writes packet words. Each 32-bit AHB write is combined with the AHB address offset to form a 48-bit FC word: 2 bits of packet type, 14 bits of address offset within the 16 KB aperture, and 32 bits of payload. The adapter also intercepts the returner's AHB master writes — credit deltas and doorbells — and re-encodes them as SIDEBAND FC packets on the same FC node, with the returner's target register offset carried in the address field. The PTP servo can also inject FC SIDEBAND packets carrying timestamp data directly to the remote side's mailbox registers. TX priority is: returner (highest) > servo SIDEBAND > TX aperture (lowest).
Receive side: The adapter accepts incoming 48-bit FC words and routes them by packet type: FIFO_DATA words are written directly to the FIFO data window via the direct FC write path (bypassing the AHB bus), and SIDEBAND words are routed to the APB configuration registers (targeting the appropriate mailbox, servo, or controller register). Each FC word is self-describing — it carries its own destination address and routing tag — so the RX path is entirely stateless.
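A short sketch of the 48-bit FC word may help make the stateless routing concrete. The field widths (2 + 14 + 32 bits) come from the description above; the bit ordering, the numeric type codes, and all function names are assumptions made for illustration.

```python
FIFO_DATA = 0   # assumed encoding of the 2-bit packet type
SIDEBAND = 1    # assumed encoding

def pack_fc_word(pkt_type: int, addr_offset: int, payload: int) -> int:
    """Build a 48-bit FC word: [47:46] type, [45:32] offset, [31:0] payload (assumed order)."""
    if not 0 <= pkt_type < (1 << 2):
        raise ValueError("packet type must fit in 2 bits")
    if not 0 <= addr_offset < (1 << 14):
        raise ValueError("offset must fit in 14 bits (16 KB aperture)")
    if not 0 <= payload < (1 << 32):
        raise ValueError("payload must fit in 32 bits")
    return (pkt_type << 46) | (addr_offset << 32) | payload

def unpack_fc_word(word: int):
    """Split a 48-bit FC word back into (type, offset, payload)."""
    return (word >> 46) & 0x3, (word >> 32) & 0x3FFF, word & 0xFFFFFFFF

def route_rx(word: int):
    """Stateless RX demux: the destination is decided by the word alone."""
    pkt_type, offset, payload = unpack_fc_word(word)
    if pkt_type == FIFO_DATA:
        return ("fifo", offset, payload)      # direct FC write into the RX FIFO
    if pkt_type == SIDEBAND:
        return ("apb", offset, payload)       # mailbox / servo / controller register
    return ("drop", offset, payload)          # unassigned type codes (assumed behaviour)
```

Because each word carries its own routing tag and offset, the receiver needs no per-packet state and words from the returner, servo, and TX aperture can interleave freely.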
Address Translator
The address translator (tidelink_addr_translator) provides APB-configurable address remapping for the transparent AHB bridge path. It contains two independent translation channels, each backed by 8 programmable CAM-based match/replace rules (parameterised via NUM_RULES). Each rule matches on the upper bits of the incoming address and replaces them with a configured output pattern, allowing software to map local address ranges to arbitrary remote address ranges. This CAM-based approach reduces register storage from 2,048 FFs (for a full 256-entry segment table) to approximately 169 FFs per channel, with no reduction in practical flexibility for typical chiplet address maps.
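The match/replace behaviour can be modelled in a few lines. This sketch assumes an 8-bit match field on the top address bits (consistent with a 256-entry segment table), lowest-index-wins priority, and pass-through on a miss; none of these details are confirmed by the text, and the names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

ADDR_BITS = 32
MATCH_BITS = 8                      # assumption: rules match address bits [31:24]
SHIFT = ADDR_BITS - MATCH_BITS

@dataclass
class Rule:
    valid: bool
    match: int                      # upper-bit pattern to look for
    replace: int                    # upper-bit pattern to substitute

def translate(addr: int, rules: List[Rule]) -> int:
    """Apply the first valid matching rule; leave the lower bits untouched."""
    upper = addr >> SHIFT
    lower = addr & ((1 << SHIFT) - 1)
    for rule in rules:              # assumption: lowest index has priority
        if rule.valid and rule.match == upper:
            return (rule.replace << SHIFT) | lower
    return addr                     # assumption: a miss passes through unchanged

# One channel with 8 rule slots, only the first programmed.
channel = [Rule(True, 0x60, 0x20)] + [Rule(False, 0, 0) for _ in range(7)]
print(hex(translate(0x6000_1234, channel)))  # 0x20001234
```

A full segment table would need one entry per possible upper-bit value; the CAM stores only the handful of ranges a real address map uses.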
Packetisation
TideLink packets are a software convention imposed on the raw FIFO word stream. The first word written to the TX aperture at address offset 0x0000 is a length field specifying the number of words that follow. The next three words form a descriptor header:
- Word 1: Packet type (RD_REQ, WR_REQ, RD_RSP, WR_RSP, ERROR), source and destination chiplet IDs (8-bit each), transaction tag (8-bit), status, and burst type.
- Word 2: 32-bit destination address on the remote chiplet.
- Word 3: Burst length and beat size.
- Words 4+: Data payload (for write requests and read responses).
Hardware is unaware of packet semantics — it transports each 32-bit word independently as a FIFO_DATA FC packet. The receiving CPU reconstructs the packet by reading the length word first, then popping the descriptor and payload from the local RX FIFO.
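As an illustration of the software convention, the following sketch builds the word list a sender would write to the TX aperture. The overall shape (length word, three header words, optional payload) follows the description above; the bit positions inside words 1 and 3, the numeric packet-type codes, and the function name are assumptions.

```python
from typing import List, Sequence

# Assumed numeric codes for the packet types named in the text.
PKT_TYPES = {"RD_REQ": 0, "WR_REQ": 1, "RD_RSP": 2, "WR_RSP": 3, "ERROR": 4}

def build_packet(pkt_type: str, src_id: int, dst_id: int, tag: int,
                 address: int, burst_len: int, beat_size: int,
                 payload: Sequence[int] = ()) -> List[int]:
    """Return [length, word1, word2, word3, payload...] as 32-bit words."""
    # Word 1 (assumed layout): type[31:28] src[27:20] dst[19:12] tag[11:4];
    # status and burst type would occupy [3:0] and are left at zero here.
    word1 = ((PKT_TYPES[pkt_type] << 28) | ((src_id & 0xFF) << 20)
             | ((dst_id & 0xFF) << 12) | ((tag & 0xFF) << 4))
    word2 = address & 0xFFFFFFFF
    # Word 3 (assumed layout): beat size[18:16], burst length[15:0].
    word3 = ((beat_size & 0x7) << 16) | (burst_len & 0xFFFF)
    body = [word1, word2, word3, *[w & 0xFFFFFFFF for w in payload]]
    # The length word counts the words that follow it.
    return [len(body), *body]

rd_req = build_packet("RD_REQ", src_id=1, dst_id=2, tag=0x5A,
                      address=0x2000_0000, burst_len=4, beat_size=2)
print(len(rd_req), rd_req[0])  # 4 3
```

A read request is therefore four words on the wire, while a write request grows by one word per payload beat.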
Write and Read Mechanisms
For a write request: The host CPU constructs a WR_REQ packet (descriptor plus data payload) and writes it word-by-word to the TX aperture. Each word is forwarded over the FC node to the remote FIFO. On arrival of the final word, packet_committed_irq fires on the device side. The device CPU pops the descriptor, performs the requested local AHB writes at the destination address, and optionally sends a WR_RSP acknowledgement back.
For a read request: The host CPU writes only the RD_REQ descriptor (4 words, no data payload) to the TX aperture and is immediately free — the AHB bus is never stalled waiting for remote data. The device CPU receives the descriptor, performs local AHB reads at the specified address, constructs an RD_RSP packet containing the descriptor header and read data, and writes it back through the device TX aperture. The RD_RSP traverses the link and arrives in the host RX FIFO, triggering packet_committed_irq on the host. The host CPU then pops the response data. This asynchronous round-trip avoids the AHB blocking-bus problem entirely: the host CPU was free for other work for the full duration of the remote read.
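The asynchronous read round trip can be walked through with a toy two-endpoint model. Everything here is a simplification for illustration: the `Chiplet` class, the reduced header layout, the type codes, and the use of a dictionary as remote memory are all assumptions, and interrupts are replaced by explicit function calls.

```python
from collections import deque

RD_REQ, RD_RSP = 0, 2                      # assumed type codes

class Chiplet:
    def __init__(self, memory=None):
        self.rx_fifo = deque()             # local RX FIFO
        self.memory = dict(memory or {})   # stand-in for the local AHB address space
        self.peer = None

    def tx(self, words):
        """TX aperture write: words land in the peer's RX FIFO; the caller never waits."""
        self.peer.rx_fifo.extend(words)

    def pop_packet(self):
        length = self.rx_fifo.popleft()    # length word counts the words that follow
        return [self.rx_fifo.popleft() for _ in range(length)]

def header(pkt_type, tag, address, burst_len):
    # Simplified three-word header: type/tag, address, burst length.
    return [(pkt_type << 28) | (tag << 4), address, burst_len]

def host_issue_read(host, tag, address, n_words):
    body = header(RD_REQ, tag, address, n_words)
    host.tx([len(body), *body])            # 4 words, then the host is free

def device_service(device):
    word1, address, n_words = device.pop_packet()
    if word1 >> 28 != RD_REQ:
        raise ValueError("expected RD_REQ")
    data = [device.memory.get(address + 4 * i, 0) for i in range(n_words)]
    body = header(RD_RSP, (word1 >> 4) & 0xFF, address, n_words) + data
    device.tx([len(body), *body])

def host_collect(host):
    word1, _address, _n_words, *data = host.pop_packet()
    return (word1 >> 4) & 0xFF, data       # the tag pairs the response with its request
```

The transaction tag is what lets the host match a late-arriving response to the request that caused it, since other work may have been issued in between.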
Precision Time Protocol (PTP) Subsystem
Motivation
In multi-chiplet systems, a common time reference is critical. In the reference deployment, one chiplet has Ethernet connectivity and synchronises to an external PTP Grandmaster via standard IEEE 1588. Other chiplets in the system have no direct Ethernet access. TideLink PTP propagates the disciplined time from the Ethernet-connected chiplet (acting as a local Grandmaster) to all other chiplets (Subordinates) over the die-to-die link, creating a two-level PTP hierarchy:
IEEE 1588 PTP TideLink PTP
(Ethernet) (Die-to-Die)
External PTP Grandmaster
│
│ Ethernet 1588
▼
Chiplet A (Grandmaster) ◄── Ethernet-connected
│
│ TideLink PTP (short packets 0x50/0x51)
▼
Chiplet B (Subordinate) ◄── No Ethernet
For multi-hop deployments, TideLink supports cascaded PTP synchronisation: a Subordinate that has converged can act as a Grandmaster to a further chiplet. The PHC_LOCK_GATE_EN parameter gates the hardware sync initiator on an external phc_locked_i signal, ensuring that a mid-chain chiplet does not begin forwarding SYNC messages until its own clock is locked to the upstream source.
Protocol
TideLink PTP implements a simplified two-message clock synchronisation protocol inspired by IEEE 1588. The exchange uses SYNC and DELAY_REQ messages carried as Wlink short packets (32 bits on wire: 8-bit ECC, 16-bit payload, 8-bit data_id). No follow-up messages are required because timestamps are captured in hardware at the exact moment of packet handshake.
The protocol flow is:
- Grandmaster sends SYNC: The PTP module waits for tx_router_idle (ensuring no other traffic is in the TX pipeline), then simultaneously asserts hw_capture (capturing timestamp t1 in the PHC) and transmits the SYNC short packet.
- Subordinate receives SYNC: The PTP module asserts hw_capture on receipt, capturing t2. An interrupt fires.
- Subordinate sends DELAY_REQ: Same idle-gated process, capturing t3.
- Grandmaster receives DELAY_REQ: Captures t4. An interrupt fires.
- Timestamp exchange: t1 and t4 are sent to the Subordinate (via the FC SIDEBAND path or the mailbox FIFO).
- Offset and delay computation:
  - offset = ((t2 - t1) - (t4 - t3)) / 2
  - delay = ((t2 - t1) + (t4 - t3)) / 2
The idle gating on the TX path is critical: by waiting until the Wlink TX router is idle before transmitting, the short packet enters the link layer with deterministic latency, eliminating arbitration jitter on transmit timestamps.
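A worked example of the offset and delay formulas, as a minimal sketch. The function name and the sample timestamps are illustrative; the formulas assume the link delay is the same in both directions, which is the usual assumption behind this two-way exchange.

```python
def ptp_offset_delay(t1: int, t2: int, t3: int, t4: int):
    """t1/t4 are Grandmaster timestamps, t2/t3 are Subordinate timestamps (ns)."""
    forward = t2 - t1          # link delay + clock offset
    reverse = t4 - t3          # link delay - clock offset
    offset = (forward - reverse) / 2
    delay = (forward + reverse) / 2
    return offset, delay

# Scenario: Subordinate clock runs 50 ns ahead, one-way link delay is 30 ns.
t1 = 1000                      # SYNC leaves the Grandmaster (Grandmaster time)
t2 = t1 + 30 + 50              # SYNC arrives (Subordinate time)
t3 = 2000                      # DELAY_REQ leaves the Subordinate (Subordinate time)
t4 = (t3 - 50) + 30            # DELAY_REQ arrives (Grandmaster time)
print(ptp_offset_delay(t1, t2, t3, t4))  # (50.0, 30.0)
```

Any asymmetry between the two directions appears directly as an offset error, which is why deterministic transmit latency matters.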
Hardware Sync Initiator
The PTP module includes a hardware sync initiator that autonomously generates periodic SYNC messages without CPU intervention. It uses the PHC time outputs to determine when to fire, maintains a target timestamp that advances by a configurable interval (matching IEEE 1588 logSyncInterval ranges from 128 Hz to 1/16 Hz), and auto-increments a 16-bit sequence number. The initiator shares the TX path with software-initiated messages and servo-initiated DELAY_REQ messages, with software having priority. When PHC_LOCK_GATE_EN=1, the initiator is gated on phc_locked_i, preventing SYNC emission until the local clock is stable — essential for multi-hop PTP chains.
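The interval and sequence-number behaviour can be sketched as follows. In IEEE 1588 the sync interval is 2^logSyncInterval seconds, so 128 Hz corresponds to -7 and 1/16 Hz to 4. The class and method names are hypothetical, and the model ignores TX arbitration with software and servo messages.

```python
NS_PER_S = 1_000_000_000

def sync_interval_ns(log_sync_interval: int) -> int:
    """Interval of 2**logSyncInterval seconds, in nanoseconds."""
    if not -7 <= log_sync_interval <= 4:
        raise ValueError("supported range is -7 (128 Hz) to 4 (1/16 Hz)")
    if log_sync_interval >= 0:
        return NS_PER_S << log_sync_interval
    return NS_PER_S >> -log_sync_interval

class SyncInitiator:
    """Fires when PHC time reaches a target that advances by a fixed interval."""

    def __init__(self, start_ns: int, log_sync_interval: int):
        self.interval_ns = sync_interval_ns(log_sync_interval)
        self.target_ns = start_ns + self.interval_ns
        self.seq = 0

    def tick(self, phc_now_ns: int, phc_locked: bool = True):
        """Return the sequence number if a SYNC fires now, else None."""
        if not phc_locked or phc_now_ns < self.target_ns:
            return None                       # lock gate closed, or not yet due
        fired = self.seq
        self.seq = (self.seq + 1) & 0xFFFF    # 16-bit sequence number wraps
        self.target_ns += self.interval_ns
        return fired

print(sync_interval_ns(-7))  # 7812500
```

Advancing the target by a fixed interval, rather than re-arming from the current time, keeps the long-term SYNC rate exact even if an individual message is delayed.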
Autonomous Hardware Servo
For applications requiring clock synchronisation without any CPU intervention, TideLink includes a fully autonomous hardware PTP servo (tidelink_ptp_servo). The servo operates in one of two modes:
- Grandmaster mode: Captures t1/t4 timestamps after each SYNC/DELAY_REQ exchange and sends them to the Subordinate via FC SIDEBAND packets (4 words per timestamp, written directly to the remote side's mailbox registers).
- Subordinate mode: Captures t2/t3 timestamps, receives t1/t4 from the Grandmaster via the SIDEBAND mailbox, computes offset and delay, autonomously triggers DELAY_REQ messages, and adjusts the local PHC.
Clock discipline uses a two-tier approach:
- Large offsets (exceeding a configurable step threshold, or seconds mismatch): Direct phase step via the PHC SET_TIME registers.
- Small offsets: A PI (proportional-integral) controller adjusts the PHC's NS_INCR_FRAC register to steer the clock frequency. The proportional and integral gains (KP, KI) are configurable via APB registers, with defaults of approximately 0.7 and 0.3 respectively in Q0.32 fixed-point representation.
The servo multiplication engine is parameterised: iterative mode (32-cycle, small area) or combinational mode (1-cycle, larger area). The servo exposes status registers including the last computed offset, last one-way delay, current NS_INCR_FRAC value, and a servo_locked indicator that is also brought out as a top-level output.
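The two-tier discipline can be sketched in software. This is a behavioural illustration only: the sign convention (positive offset means the local clock is ahead), the default step threshold, the integrator reset on a phase step, and the class name are assumptions, and the mapping from the correction value onto NS_INCR_FRAC is not modelled.

```python
Q = 32                                     # Q0.32: value = integer / 2**32

def to_q0_32(x: float) -> int:
    return int(x * (1 << Q))

class PiServo:
    def __init__(self, kp: int = to_q0_32(0.7), ki: int = to_q0_32(0.3),
                 step_threshold_ns: int = 1000):
        self.kp, self.ki = kp, ki
        self.step_threshold_ns = step_threshold_ns
        self.integral = 0                  # accumulated offset, ns

    def update(self, offset_ns: int):
        """Return ('step', ns) for a phase step or ('freq', adj) for frequency steering."""
        if abs(offset_ns) > self.step_threshold_ns:
            self.integral = 0              # assumption: restart the integrator
            return ("step", -offset_ns)
        self.integral += offset_ns
        # Integer offset times Q0.32 gain, then drop the 32 fractional bits.
        correction = (self.kp * offset_ns + self.ki * self.integral) >> Q
        return ("freq", -correction)

servo = PiServo()
print(servo.update(5000))   # ('step', -5000)
```

The fixed-point products here are what the parameterised multiplication engine would compute, either iteratively or combinationally.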
Clock Domain Crossing
The PHC may operate on a different clock from the AHB system clock. The CDC bridge (tidelink_phc_cdc) synchronises six signal paths between the two domains:
| Path | Direction | Width | Purpose | Mechanism |
|---|---|---|---|---|
| 1 | PHC → AHB | 110-bit | HW capture timestamps | Quasi-static snapshot |
| 2 | PHC → AHB | 78-bit | Free-running PHC time | Handshake snapshot |
| 3 | PHC → AHB | 1-bit | PPS pulse | Toggle-based pulse sync |
| 4 | AHB → PHC | 1-bit | HW capture trigger | Toggle-based pulse sync |
| 5 | AHB → PHC | 79-bit | Phase step command | Data + pulse handshake |
| 6 | AHB → PHC | 33-bit | Frequency adjust | Data + pulse handshake |
The module uses a configurable synchroniser chain depth (minimum 2 stages) and is safe for fully asynchronous clocks. When both clocks are the same (single-clock mode), the module can be bypassed via a BYPASS_CDC parameter, reducing cost from approximately 526 flip-flops to approximately 20.
Generic Chiplet Controller
TideLink wraps the Wlink die-to-die link layer in a generic chiplet controller (axi_chiplet_controller) that adds several integration features beyond raw link-layer transport:
- Runtime master/slave role selection: A strap pin (role_strap_i) determines whether the chiplet acts as master or slave. The role is locked at startup and exposed as role_is_master_o and role_locked_o. Wlink power-on-reset is gated until the role is locked, ensuring deterministic initialisation. Different APB register sets are exposed depending on the selected role.
- I2C sideband: Independent I2C master and slave cores with pin-muxed tristate I/O provide an out-of-band communication channel for configuration, recovery, and boot-time handshaking before the main link is active. The I2C master is accessible via an AXI slave port (s_i2c_axi_*).
- D2D reset output: d2d_reset_o allows one chiplet to hold the other in reset.
This abstraction allows a single TideLink RTL design to be instantiated identically on both sides of a chiplet link, with the role strap determining which side acts as master and which as slave.
Verification
TideLink has extensive verification infrastructure spanning cocotb (Python-based), UVM (SystemVerilog), formal (X-propagation), CDC (SpyGlass), and lint (Cadence HAL):
- cocotb: 296 tests across 13 environments covering the FIFO controller, returner, APB registers, FC adapter, address translator, iterative multiplier, AHB wrapper, paired system stress, PTP short-packet exchange, PTP servo operation, and full top-level loopback.
- UVM: 8 environments with 51 test files covering FIFO unit tests, FC adapter TX/RX paths, loopback integration, paired system stress (credit exhaustion, reset recovery, sideband stress, mixed traffic), PTP jitter stress characterisation (under concurrent AXI, mailbox, and general bus traffic), PTP convergence analysis (PI servo model, offset/drift/step-change recovery, long-term stability), and multi-hop PTP chain testing (lock propagation, force enable, step recovery).
- Formal: VC Formal X-propagation analysis on 5 modules (FIFO controller, returner, APB registers, FIFO wrapper, top-level FIFO integration).
- CDC: SpyGlass CDC analysis on tidelink_top with constraint and waiver files.
- Lint: Cadence HAL lint with standalone and CMSDK-dependent module targets.
- Code coverage (VCS): Line coverage exceeding 92% on the FIFO subsystem, with condition, branch, toggle, and FSM coverage actively tracked across all environments.
- CI/CD: A 9-stage GitLab CI pipeline runs lint, CDC, cocotb regression, UVM regression, C driver tests, synthesis (Design Compiler), coverage merging, and dashboard generation.
Known Architectural Limitations
The design has several documented limitations, including: no hardware credit underflow protection (software can write packets larger than available credits, causing counter wrap); a single-packet-in-flight limitation at the FIFO controller level; no AHB error response on FIFO overrun/underrun (errors are flagged in a status register but the bus transfer completes normally); no returner retry mechanism on bus errors; and RX-side PTP jitter from the Wlink receive pipeline that cannot be gated. These are documented in detail with severity classifications and recommended mitigations.
Additional hardware above the baseline Wlink
The starting point of the project was the Wavious open-source Wlink controller — an AXI-to-D2D link layer with a generic flow-control state machine (FCSM), CRC/ECC packet protection, and a clean separation between link-layer logic and PHY. What Wlink does not provide is a digital PHY: the upstream design assumes an analog SerDes block. Around the Wlink core, this TideLink project adds the following SoC Labs developed hardware:
- A digital GPIO PHY family (WavD2DGpio*). Eight single-ended data lanes per direction plus a forwarded clock pad, implemented entirely with standard logic. The transmitter latches the parallel data on the slow side of the link-layer clock and serialises it byte-by-byte; the receiver samples the incoming bus on the forwarded clock from the peer. The public Wavious tree contains only the analog SerDes hooks. This SoC Labs GPIO-based PHY design is being developed and verified as one of the optional PHY implementations.
- Per-lane phase calibration. Eight IDELAYE2 primitives (on FPGA; standard-cell delay lines on ASIC) per receive lane. A central calibrator FSM (tidelink_phy_align_calibrator) sweeps phase and slip across the search space, scores each (phase, slip) point by how long the per-lane comma-hunt holds lock, and latches the best-scoring pair when it completes. State diagram: S_IDLE → S_ARM → S_SWEEP → S_FINISH → S_DONE (with S_CANCEL and S_HOLD for re-trigger and peer-convergence paths).
- A T3A comma-hunt FSM per lane. Three states (S_SETTLE → S_HUNT → S_LOCKED) that walk the bit-slip space looking for a known comma symbol. This is what actually achieves byte alignment under the calibrator's phase choice.
- Bring-up coordination. A ROLE_CFG register tells each chiplet whether it is master or slave, with a role_locked strap that gates the calibrator's first trigger. A second swi_recal bit lets software re-trigger the calibrator if the first attempt does not stabilise.
- An AXI subsystem wrapper. The Wlink core ports are AXI; the chiplet boundary on either side is AHB-Lite (matching the Corstone-101 / BP210 fabric we share with other lab projects). We bridge with the Arm XHB500 AXI-to-AHB bridge, add asynchronous clock-crossing FIFOs (link-layer clock vs. system clock), and provide a Control and Status Register (CSR) aperture for software bring-up, lane status, and credit observability.
- A dedicated functional-channel (FC) node for time-sync packets that need to bypass the AXI-credit-gated path. This is what the precision-time-protocol (PTP) implementation uses to deliver sync and follow-up packets out of band of the main data fabric.
- TideChart, a separate protocol layer for dynamic chiplet-ID assignment, currently being prototyped in a sister repository. Without it, every chiplet ID is hard-strapped at deploy; with it, chiplets discover each other and negotiate addresses post-power-on.
Together these are roughly as much custom logic as the Wlink core itself.
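The calibrator's sweep-and-score strategy can be sketched as an exhaustive search. This is an illustration, not the FSM: `measure_lock_hold` is a hypothetical callback standing in for the per-lane comma-hunt lock timer, and the search-space sizes in the usage example (32 phase taps, 8 slip positions) are assumptions.

```python
from typing import Callable, Tuple

def calibrate_lane(measure_lock_hold: Callable[[int, int], int],
                   n_phase: int, n_slip: int) -> Tuple[int, int, int]:
    """Sweep every (phase, slip) point; return (best_phase, best_slip, best_score)."""
    best = (0, 0, -1)
    for phase in range(n_phase):
        for slip in range(n_slip):
            # Score = how long the comma-hunt held lock at this point.
            score = measure_lock_hold(phase, slip)
            if score > best[2]:
                best = (phase, slip, score)
    return best

# Synthetic lane whose eye is centred on phase tap 12 at bit-slip 3.
def fake_lane(phase: int, slip: int) -> int:
    return max(0, 10 - abs(phase - 12)) if slip == 3 else 0

print(calibrate_lane(fake_lane, n_phase=32, n_slip=8))  # (12, 3, 10)
```

Because the result is latched once at the end of the sweep, a noisy measurement at the wrong moment fixes a poor choice until the next re-trigger.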
Limitations of the digital GPIO PHY
The GPIO PHY is robust enough for the chiplet bring-up work we have been doing, but it has architectural limits that we have come to understand the hard way:
- Single-ended, source-synchronous, no equalisation. With eight data lanes and one forwarded clock per direction, the link is vulnerable to per-lane skew that an analog SerDes would normally absorb in CDR. On FPGA, ribbon-cable skew between the two PYNQ-Z2 boards is enough to require explicit per-lane delay tuning; on ASIC, package and routing skew will dominate.
- Byte alignment is per-sweep non-deterministic. Each calibration sweep picks a (phase, slip) pair per lane independently. In our bring-up runs we have observed that the same bitstream, deployed back-to-back to the same boards, will sometimes land in a working alignment and sometimes leave one direction misaligned. There is no continuous re-alignment loop in the baseline PHY: once the calibrator finishes, the choice is fixed until the next re-trigger.
- Asymmetric failure modes are common. Because each direction calibrates independently, it is entirely possible for master→slave bytes to align while slave→master bytes do not. We have observed this exact failure pattern on hardware: the slave receiving the master's credit-request packet, the slave responding with a credit-acknowledge packet, and the master never seeing the response, leaving the link-layer FCSM stuck at SEND_CREDITS1.
- The training-mode signal is overloaded. When training mode is high, the transmitter sends comma symbols and the receiver runs comma-hunt; when it drops, the link transitions to data. The hand-off is delicate: drop training too soon and the peer is still searching; hold it too long and the FCSM never starts exchanging credits. We added an S_HOLD state to the calibrator specifically to extend training while waiting for the peer, but tuning the hold duration is empirical.
- There is no in-band link-up indication. The local end can tell that its receiver has locked byte alignment, but it cannot tell that the peer's receiver has locked. Bring-up therefore relies on both ends independently reaching a "good enough" state and trusting that credit exchange will fail loudly if either side is not aligned.
These limits are characteristic of digital GPIO PHYs in general; they are the price we pay for not requiring an analog macro.
The bring-up saga
Bringing TideLink up on physical FPGA hardware — two PYNQ-Z2 boards wired together to act as a chiplet pair — turned out to be roughly an order of magnitude more work than simulation, and the lessons have shaped most of the v1 release.
The headline numbers, at the time of writing: seventeen full FPGA farm builds across two boards, with an eighteenth in flight. Each pair build takes ~45 minutes wall-clock with both boards built concurrently. Cumulative debug time across the team is easily measured in weeks.
A representative sample of the issues worked through:
- Wrong test oracle. For roughly twelve hours of debug we were polling a local receive FIFO (a CSR aperture inside one of the dies) and treating its contents as evidence that data had crossed the link. It hadn't — the FIFO populated from a separate path that did not cross the chiplet boundary. The actual peer-aperture address sat at a different base. The lesson here is uncomfortable but real: when an oracle is wrong, every subsequent hypothesis is contaminated.
- PYNQ PS bus wedges. The peer aperture and the link's transmit aperture both sit at addresses that, when written while the link is not yet up, hang the host PYNQ processor's bus indefinitely. Recovery requires a power cycle of the board. We have learned to write to those apertures only once the link-status registers have explicitly confirmed both directions are alive.
- A missing bring-up step. The software bring-up sequence wrote
slot0 = 0x3(recal + training-mode high) and thenslot0 = 0x1(recal cleared), but never wroteslot0 = 0x0to drop training-mode entirely. Combined with an Option-C software gate we had added (training-mode held high also holds the link-layer receiver in reset), the receiver framer never escaped reset post-bring-up. Adding the missingslot0 = 0x0was the single change that produced the first cross-die doorbell event after weeks of failed attempts. - A mid-word mux flip in the transmitter. The transmit PHY latched parallel data on every cycle, but the asynchronous mux that selected between training-pattern and live data could flip in the middle of a byte. The slave side would then byte-align to the wrong boundary and never recover. The fix was to delay the mux update until a word boundary — a one-line change, but only after several wrong hypotheses involving the slave's enable signal, P-state machine, and packet-ID matcher had been ruled out.
- The lottery problem. Even with all of the above fixed, we discovered through ILA captures that the calibrator's single sweep produces a non-deterministic outcome: sometimes both directions align, sometimes one does not. The original bring-up workaround was to force a second calibration cycle from software via the swi_recal bit, effectively buying a second lottery ticket. This works empirically. We are currently building an RTL fix that lets the T3A comma-hunt re-arm itself during steady-state training pulses, removing the lottery entirely.
- Tooling gaps. Vivado 2025.2 changed the semantics of several ILA properties (CAPTURE_MODE and MAX_DATA_DEPTH became read-only; mis-configured C_CLK_INPUT_FREQ_HZ on dbg_hub silently corrupts readback). We hit each of these. Where they were not documented, we documented them.
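The software side of the bring-up lessons above (the three slot0 writes, the second calibration sweep, and the rule of never touching the peer or transmit aperture before the link is up) can be sketched roughly as follows. This is a minimal illustration, not the project's actual driver: the register offsets, the link-status bit layout, the poll count, and the FakeBus stand-in are all hypothetical, and it assumes that repeating the slot0 sequence is what triggers a fresh calibration sweep.

```python
"""Sketch of a guarded link bring-up. All offsets and bit positions are hypothetical."""

SLOT0 = 0x00          # hypothetical offset of the slot0 control register
LINK_STATUS = 0x04    # hypothetical offset of a link-status register
TX_UP = 0x1           # hypothetical "transmit direction alive" bit
RX_UP = 0x2           # hypothetical "receive direction alive" bit
BOTH_UP = TX_UP | RX_UP


class LinkNotUp(RuntimeError):
    """Raised instead of touching an aperture that could wedge the host bus."""


def calibrate_once(bus):
    """Issue the three slot0 writes, including the final 0x0 that was once missing."""
    bus.write(SLOT0, 0x3)  # recal + training-mode high
    bus.write(SLOT0, 0x1)  # recal cleared, training-mode still high
    bus.write(SLOT0, 0x0)  # drop training-mode so the receiver leaves reset


def link_is_up(bus, polls=100):
    """Poll link status; real hardware would also want a delay between polls."""
    for _ in range(polls):
        if bus.read(LINK_STATUS) & BOTH_UP == BOTH_UP:
            return True
    return False


def bring_up(bus, max_sweeps=2):
    """Retry calibration up to max_sweeps times; return the sweep that succeeded."""
    for sweep in range(1, max_sweeps + 1):
        calibrate_once(bus)
        if link_is_up(bus):
            return sweep
    raise LinkNotUp(f"link not up after {max_sweeps} calibration sweeps")


def guarded_write(bus, aperture_offset, value):
    """Write to a link aperture only after both directions are confirmed alive."""
    if bus.read(LINK_STATUS) & BOTH_UP != BOTH_UP:
        raise LinkNotUp("refusing aperture write: link status not confirmed")
    bus.write(aperture_offset, value)


class FakeBus:
    """Stand-in for a memory-mapped bus; the link aligns on sweep `sweeps_needed`."""

    def __init__(self, sweeps_needed=2):
        self.sweeps_needed = sweeps_needed
        self.sweeps = 0
        self.regs = {SLOT0: 0, LINK_STATUS: 0}
        self.log = []

    def write(self, offset, value):
        self.log.append((offset, value))
        if offset == SLOT0 and value == 0x3:
            self.sweeps += 1
            self.regs[LINK_STATUS] = 0  # recalibrating, link considered down
        if offset == SLOT0 and value == 0x0 and self.sweeps >= self.sweeps_needed:
            self.regs[LINK_STATUS] = BOTH_UP
        self.regs[offset] = value

    def read(self, offset):
        return self.regs.get(offset, 0)


bus = FakeBus(sweeps_needed=2)
print("link up after sweep", bring_up(bus))
```

Any object exposing read(offset) and write(offset, value) can be passed in place of FakeBus. Raising an exception in guarded_write, rather than attempting the write and hoping, reflects the fact that a wedged bus is only recoverable by a power cycle.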
The honest summary is that the link works, but the path to "works" has been a continuous exercise in finding one more thing the baseline didn't handle.
Hardware iterations
Most of the bring-up work has been driven from a series of RTL patches on the chiplet wrapper rather than the Wlink core itself. Each iteration is named "L<n>" in our debug tree; the highlights:
- L1: Word-aligned mux latch in the transmit PHY. Prevents the train/data mux from flipping mid-byte; closes the mis-alignment class described above.
- L2: P-state controller modifications. Cleans up the link's reset and power-state sequencing across the AHB and link-layer clock domains.
- L4: first_short_pkt_seen latch. A sticky bit in the link-layer receiver that confirms the first valid short packet has been observed, used as a bring-up gate by software.
- L5: Packet-type whitelist on the receiver. Filters known-invalid packet IDs during training-to-data transition so a single garbage symbol does not poison the FCSM.
- L6 / L7: Calibrator FSM refinements. Added S_HOLD to extend training while the peer is still hunting; tuned dwell and sweep timing.
- L8: Selective re-training mask restricted to the specific FIFO update path that needed it; an earlier broader mask regressed the link from LINK_IDLE back to SEND_CREDITS, illustrating that "more masking" is not always safer.
- L9: Framer-stuck watchdog. An observability counter that fires if the link-layer receiver sits in the same state too long; informational only, used by ILA.
- L10: Receive-credit clamp + ILA observability. Bounds the credit field so the FCSM cannot deadlock on overflow, and exposes a wide set of mark_debug taps inside the FCSM and link-layer receiver.
- L11 (current): T3A_REARM_ON_TRAIN. A new module parameter that causes the per-lane comma-hunt to re-arm whenever training-mode is asserted, so the link can recover byte alignment within a single calibrator sweep instead of relying on a software-driven second sweep. This is the direct fix for the lottery problem identified through ILA analysis.
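The idea behind the L1 word-aligned mux latch can be illustrated with a small behavioural model. This is a sketch in Python rather than the project's RTL, and it assumes an 8-bit word and a single serial lane; the class and signal names are hypothetical. The requested select is only copied into the applied select when the bit counter sits at a word boundary, so a request that arrives mid-word takes effect at the start of the next word.

```python
class WordAlignedMux:
    """Behavioural model of a train/data mux whose select changes only on word boundaries."""

    def __init__(self, word_len=8):
        self.word_len = word_len   # assumed word length in bits
        self.bit_count = 0         # position within the current word
        self.sel_live = False      # select actually applied to the mux

    def step(self, want_live, train_bit, data_bit):
        """Advance one bit time and return the bit driven onto the lane."""
        if self.bit_count == 0:
            # Word boundary: the only point where the select may change.
            self.sel_live = want_live
        out = data_bit if self.sel_live else train_bit
        self.bit_count = (self.bit_count + 1) % self.word_len
        return out


def run(mux, switch_at, n_bits, train_bit=0, data_bit=1):
    """Drive the mux for n_bits, requesting live data from bit `switch_at` onward."""
    return [mux.step(i >= switch_at, train_bit, data_bit) for i in range(n_bits)]


# Request the switch at bit 3, in the middle of the first word.
lane = run(WordAlignedMux(word_len=8), switch_at=3, n_bits=16)
print(lane)
```

With the switch requested at bit 3, the first eight bits still carry the training pattern and live data starts cleanly at bit 8, which is the property the receiver's byte alignment depends on.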
Each iteration ships with a paired-die cocotb test suite that gates the FPGA farm build, so integration regressions are caught before they consume forty-five minutes of build time. That gate alone has saved many hours over the project.
Where the project is going
The immediate priorities are:
- Closing v1 ASIC tape-out. The chiplet pair is targeted at TSMC65 with a link-layer clock of roughly 100 MHz for the GPIO-based PHY. The FPGA rig is retained as a parallel bring-up and regression platform at 25 MHz. The L11 fix outlined above needs to demonstrate stable continuous data crossing on hardware before we accept the bitstream as a v1 candidate.
- PTP-over-chiplet validation. The precision hardware clock (PHC) subsystem uses a dedicated functional-channel node to exchange sync and follow-up packets across the link. Once the link is reliably up, the PTP soak tests will quantify offset stability and drift between the two chiplets. This is what makes a chiplet pair behave as a single time domain.
- TideChart deployment. The dynamic chiplet-ID protocol currently lives in a sister repository. Once it lands, the per-die strap-based ID assignment can become a fall-back rather than the primary path, and chiplet bring-up will scale beyond two dies.
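As a rough guide to what the PTP soak tests would measure, the standard IEEE 1588 offset and path-delay arithmetic is sketched below. The text above only names sync and follow-up packets; the delay-request leg (t3, t4), the function name, and the symmetric-path assumption are additions for illustration and may not match how the PHC subsystem closes the loop.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Standard two-step PTP arithmetic, assuming a symmetric link delay.

    t1: master transmit time of sync (carried in the follow-up packet)
    t2: slave receive time of sync, on the slave clock
    t3: slave transmit time of delay request, on the slave clock (assumed leg)
    t4: master receive time of delay request, on the master clock (assumed leg)
    Returns (offset of slave clock relative to master, one-way path delay).
    """
    master_to_slave = t2 - t1   # path delay + offset
    slave_to_master = t4 - t3   # path delay - offset
    offset = (master_to_slave - slave_to_master) / 2
    delay = (master_to_slave + slave_to_master) / 2
    return offset, delay


# Illustrative numbers in nanoseconds: slave clock 50 ns ahead, 10 ns one-way latency.
offset_ns, delay_ns = ptp_offset_and_delay(t1=1000, t2=1060, t3=2000, t4=1960)
print(offset_ns, delay_ns)
```

With these illustrative timestamps the function recovers a 50 ns offset and a 10 ns one-way delay. Any asymmetry between the two link directions would appear directly as offset error, which is one reason per-direction alignment matters for the time-domain goal.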
Beyond v1, the architectural questions we are most interested in are:
- Replacing the digital GPIO PHY with an analog SerDes for v2 to remove the per-sweep lottery and per-lane skew sensitivity entirely. The link layer is PHY-agnostic; the calibrator and IDELAY infrastructure go away in that world, replaced by clock and data recovery (CDR).
- More chiplets per pair. The current FCSM and address map are explicitly point-to-point; a switched topology requires a different credit-accounting model.
- Closer integration with the Corstone CPU subsystem, so that the chiplet boundary becomes invisible to system software running on the Cortex-M.
In summary, the status of the project at the end of May 2026: the link works, the workflow to bring it up is solid, and almost every remaining piece of work is about making the PHY's behaviour deterministic rather than negotiating with its lottery.
Comments
Milestones
It would be nice to have milestones with small reports to give a sense of time and progress through flows in this project.