Communication disruption roadmap #3283

lauckhart · 2026-02-24T21:43:57Z

This is an attempt to establish a framework for improving matter.js's response to Matter communication disruptions.

We focus here on errors that occur between two nodes that we previously considered to have a "healthy" network connection. Thus our use of the term "disruption" excludes problems with initial contact, which is a different problem space.

I created the taxonomy below and then asked Claude to assess our current implementation.

Dimensions

We divide the problem space along the following three axes.

Disruption event (DE)

This is an event we observe that may indicate a disruption.

1. MRP timeout on established session
2. Subscription stops reporting
3. All IPs expire
4. Active IP expires with alternatives available
5. Peer reports shutdown
6. Peer closes session

Response context (RC)

This is the state in which we may (or must) respond to a disruption.

1. Active exchange
2. Idle session (no active exchanges or subscriptions)
3. Active subscription
4. New interaction (includes exchange creation)

Disruption context (DC)

This is the context in which a disruption occurs.

1. Same session
2. Same IP
3. Same peer
4. Same subnet

RC vs. DC

We differentiate the RC from the DC because we may want to respond to a disruption even if it occurs in a separate context from the RC.

For example, if we have an active subscription (RC3), we may want to take action if we have an MRP timeout (DE1) on a different session on the same IP (DC2).

Note: DC4 (same subnet) is aspirational — there is currently no subnet-level disruption detection, but the dimension is included for completeness as a possible future consideration.
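
One way to read the RC/DC split is that the DC is the narrowest scope shared by the place where a disruption was observed and the place where we respond. The sketch below models only that idea. It is not matter.js code: `DisruptionContext`, `Locus` and `contextBetween` are hypothetical names, and the addresses are made up.

```typescript
// Hypothetical model of the DC dimension; none of these names exist in matter.js.
enum DisruptionContext {
    SameSession = 1, // DC1
    SameIp = 2, // DC2
    SamePeer = 3, // DC3
    SameSubnet = 4, // DC4 (aspirational)
}

// Where an event was observed, or where we are deciding how to respond.
interface Locus {
    sessionId: number;
    ip: string;
    peer: string;
    subnet: string;
}

// Returns the narrowest DC shared by the two loci, or undefined if they are unrelated.
function contextBetween(observed: Locus, responder: Locus): DisruptionContext | undefined {
    if (observed.peer === responder.peer && observed.sessionId === responder.sessionId) {
        return DisruptionContext.SameSession;
    }
    if (observed.ip === responder.ip) return DisruptionContext.SameIp;
    if (observed.peer === responder.peer) return DisruptionContext.SamePeer;
    if (observed.subnet === responder.subnet) return DisruptionContext.SameSubnet;
    return undefined;
}

// The example from the text: an MRP timeout (DE1) on one session, while an
// active subscription (RC3) lives on a different session on the same IP.
const mrpTimeout: Locus = { sessionId: 7, ip: "fe80::1", peer: "node-a", subnet: "fe80::/64" };
const subscription: Locus = { sessionId: 9, ip: "fe80::1", peer: "node-a", subnet: "fe80::/64" };
const sharedContext = contextBetween(mrpTimeout, subscription); // DisruptionContext.SameIp (DC2)
```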

Current behavior

This section summarizes how the code on the peer-connection branch currently responds to disruptions. See Appendix 1 for detailed analysis of detection mechanisms and propagation paths.

The most notable characteristic of the current implementation is its lack of nuance: most disruption events either propagate too broadly (DE1 kills all peer sessions) or too narrowly (DE2 only affects one subscription). There is no intermediate "degraded" state and no coordination between the different detection and recovery mechanisms.

DC4 (same subnet) is not represented below as no current code operates at subnet scope.

| DE | Detected by | Propagates to (DC) | RC1 (active exchange) | RC3 (subscription) | RC4 (new interaction) |
| --- | --- | --- | --- | --- | --- |
| 1. MRP timeout | `MessageExchange` | DC3 (all peer sessions) | Error thrown + all peer exchanges closed | All closed (no flush) | Error propagates to caller |
| 2. Sub timeout | `ClientSubscriptions` timer | DC1 (that subscription) | No effect | That sub closed; sustained re-subscribes | No effect |
| 3. All IPs expire | `PeerConnection` address events | DC3 (peer connection) | N/A (connection phase) | No direct effect | Blocks until reconnected |
| 4. IP expires | `PeerConnection` address events | DC2 (that IP attempt) | No direct effect | No direct effect | Other attempts continue |
| 5. Peer shutdown | DNS-SD / `Peers` | DC3 (all peer sessions) | All exchanges closed | Preserved (keepSubscriptions) | Must reconnect |
| 6. Peer closes session | `NodeSession` | DC1 (that session) | That session's exchanges closed | That session's subs closed | No effect on other sessions |

Notable gaps and asymmetries

1. DE1 is a sledgehammer: An MRP timeout on any single exchange force-closes all sessions for the peer. There is no attempt to test whether other sessions or IPs are still reachable.
2. DE2 does not inform other contexts: A subscription timeout only closes that specific subscription. It does not mark the session or peer as degraded, even though a subscription timeout likely indicates the same reachability problem that DE1 detects.
3. DE3/DE4 and established sessions are disconnected: IP expiration during connection establishment does not affect already-established sessions. An established session may continue operating on an IP that DNS-SD considers expired.
4. No cross-DE correlation: The system does not correlate events across disruption types. For example, an MRP timeout (DE1) followed by subscription timeout (DE2) on the same peer is not treated differently than either event alone.
5. No retry for reads: Read interactions are idempotent and could safely retry on transient failures, but currently do not. (Write and invoke are not idempotent, so retry is correctly left to higher levels.)
6. No gradual degradation: The system jumps from "fully connected" to "force-close everything" with no intermediate states (e.g., marking a session as suspect, preferring alternate IPs, or probing before destroying).
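
To make the "no retry for reads" gap concrete, here is a minimal sketch of a bounded retry wrapper for an idempotent operation. Everything in it is an assumption for illustration: `TransientError`, `isTransient` and `retryIdempotent` are hypothetical names, not matter.js APIs, and the attempt limit of 3 is arbitrary.

```typescript
// Hypothetical error type standing in for a transient communication failure.
class TransientError extends Error {
    readonly transient = true;
}

// Checks the marker property rather than using instanceof, so the check does
// not depend on how Error subclasses are compiled.
function isTransient(error: unknown): boolean {
    return typeof error === "object" && error !== null && (error as { transient?: unknown }).transient === true;
}

// Retries an idempotent operation (such as a read) a bounded number of times.
// Non-transient errors are rethrown immediately.
async function retryIdempotent<T>(operation: () => Promise<T>, maxAttempts = 3): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            if (!isTransient(error)) throw error;
            lastError = error;
        }
    }
    // Every attempt failed with a transient error; surface the last one.
    throw lastError;
}
```

Writes and invokes would deliberately not go through a wrapper like this, for the reason given in item 5 above.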

Future direction

| DE | Proposed changes | Planned changes |
| --- | --- | --- |
| 1. MRP timeout | | |
| 2. Sub timeout | | |
| 3. All IPs expire | | |
| 4. Active IP expires with alternatives available | | |
| 5. Peer shutdown | | |
| 6. Peer closes session | | |

Appendix 1: Details of current implementation

DE1: MRP timeout on established session

Detection: MessageExchange.#retransmitMessage increments a retransmission counter each time the MRP timer fires without an ack. When the counter reaches MRP.MAX_TRANSMISSIONS (5), the #sentMessageAckFailure callback fires, which throws PeerUnresponsiveError.
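
The counting logic described above can be sketched roughly as follows. This is a simplified stand-in, not the `MessageExchange` code: `RetransmissionCounter` and `TimerOutcome` are hypothetical names, and only the limit of 5 comes from the description.

```typescript
// Mirrors MRP.MAX_TRANSMISSIONS as described above; everything else is illustrative.
const MAX_TRANSMISSIONS = 5;

type TimerOutcome = "retransmit" | "peer-unresponsive";

class RetransmissionCounter {
    private count = 0;

    // Call each time the retransmission timer fires without an ack.
    // Reports whether to retransmit or to give up on the peer.
    onTimerFired(): TimerOutcome {
        this.count++;
        return this.count >= MAX_TRANSMISSIONS ? "peer-unresponsive" : "retransmit";
    }
}
```

In this sketch the caller would raise its equivalent of PeerUnresponsiveError when `onTimerFired` returns `"peer-unresponsive"`.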

Propagation path:

1. MessageExchange.send and MessageExchange.nextMessage catch errors caused by TransientPeerCommunicationError (parent of PeerUnresponsiveError) and call context.peerLost(exchange, cause).
2. ExchangeManager.#messageExchangeContextFor implements peerLost. For commissioned peers it calls SessionManager.handlePeerLoss(peerAddress, cause). For uncommissioned peers it calls session.handlePeerLoss directly on the individual session.
3. SessionManager.handlePeerLoss iterates all sessions for the peer address and calls session.handlePeerLoss on each, provided the session was created before the loss event.
4. NodeSession.handlePeerLoss sets isPeerLost = true and calls initiateForceClose, which closes the session and all its exchanges.
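
The fan-out in the last two steps can be illustrated with a small model. It is a sketch under assumptions, not the SessionManager implementation: `SessionLike` and `closeSessionsForLostPeer` are hypothetical, timestamps are plain millisecond numbers, and "created before the loss event" is interpreted as a strict comparison.

```typescript
// Hypothetical, heavily simplified session model.
interface SessionLike {
    id: number;
    peerAddress: string;
    createdAt: number; // ms timestamp
    isPeerLost: boolean;
    isClosed: boolean;
}

// Marks every open session for the peer that predates the loss as lost and
// closed. Sessions created after the loss event are left alone, since they
// may already be the result of a successful reconnect.
function closeSessionsForLostPeer(sessions: SessionLike[], peerAddress: string, lossTime: number): SessionLike[] {
    const affected = sessions.filter(
        s => s.peerAddress === peerAddress && s.createdAt < lossTime && !s.isClosed,
    );
    affected.forEach(s => {
        s.isPeerLost = true;
        s.isClosed = true;
    });
    return affected;
}
```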

Current DC scope: DC3 (same peer). A single MRP timeout on one session causes force-close of all sessions for that peer address, regardless of IP or session.

Effect on RC1 (active exchange): The exchange that triggered the timeout throws PeerUnresponsiveError to its caller. All other active exchanges on all sessions for the same peer are force-closed, causing ClosedError in their callers.

Effect on RC3 (active subscription): Server-side subscriptions on all sessions for the peer are closed without flush.

Effect on RC4 (new interaction): No special handling — the error propagates to the caller.

DE2: Subscription stops reporting

Detection: ClientSubscriptions runs a periodic timer. For each PeerSubscription, it sets a timeoutAt timestamp on first check and then compares against it on subsequent checks. The timeout is maxInterval + 2 * maxPeerResponseTime.

When timeoutAt is exceeded, PeerSubscription.timedOut is called, which logs and calls close() on the subscription.
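
A minimal sketch of that periodic check, assuming all durations are in milliseconds and that "exceeded" means strictly later than timeoutAt. `SubscriptionLike` and `checkSubscription` are hypothetical names, and the sketch leaves out how a report arriving from the peer re-arms the deadline.

```typescript
// Hypothetical, simplified view of a client-side subscription.
interface SubscriptionLike {
    maxInterval: number; // ms
    maxPeerResponseTime: number; // ms
    timeoutAt?: number; // set on the first check
    closed: boolean;
}

// One pass of the periodic timer for a single subscription.
function checkSubscription(sub: SubscriptionLike, now: number): void {
    if (sub.closed) return;
    if (sub.timeoutAt === undefined) {
        // First check: arm the deadline using the formula from the text.
        sub.timeoutAt = now + sub.maxInterval + 2 * sub.maxPeerResponseTime;
        return;
    }
    // Subsequent checks: close the subscription once the deadline has passed.
    if (now > sub.timeoutAt) sub.closed = true;
}
```

With maxInterval = 60000 and maxPeerResponseTime = 5000, a first check at t = 0 arms the deadline at t = 70000.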

For sustained subscriptions: SustainedSubscription.#run awaits a closed promise from the underlying PeerSubscription. When it resolves (due to timeout), and the sustained subscription is not aborted, it logs "Replacing subscription due to timeout" and loops back to re-subscribe. The retry schedule starts at 15s and backs off exponentially to a max of 1 hour, with 0.25 jitter factor.
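
The retry schedule can be approximated as below. Only the 15s start, the 1 hour cap and the 0.25 jitter factor come from the description. The doubling factor, the symmetric jitter and applying the cap before the jitter are assumptions, and `retryDelayMs` is a hypothetical helper.

```typescript
const INITIAL_DELAY_MS = 15 * 1000; // from the text
const MAX_DELAY_MS = 60 * 60 * 1000; // from the text
const JITTER_FACTOR = 0.25; // from the text
const BACKOFF_FACTOR = 2; // assumption: the text only says "exponentially"

// Delay before re-subscribe attempt `attempt` (0-based). `random` is
// injectable so the schedule can be tested deterministically.
function retryDelayMs(attempt: number, random: () => number = Math.random): number {
    const base = Math.min(INITIAL_DELAY_MS * Math.pow(BACKOFF_FACTOR, attempt), MAX_DELAY_MS);
    // Assumption: jitter spreads the delay symmetrically within ±25% of the base.
    const jitter = 1 + JITTER_FACTOR * (2 * random() - 1);
    return Math.round(base * jitter);
}
```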

Current DC scope: DC1 (same session). Only the specific PeerSubscription that timed out is closed. Other sessions and subscriptions for the same peer are not affected.

Effect on RC1 (active exchange): None — the subscription timeout does not close the session or its exchanges.

Effect on RC2 (idle session): None — the session remains open.

DE3: All IPs expire

Detection: This is handled during connection establishment in PeerConnection. The service.addressChanges async iterator emits "delete" events when addresses expire. When all discovered addresses expire:

- deleteAddress is called for each, which aborts the corresponding connection attempt
- maybeAttemptFallback is called, which checks if the peer has a last-known operational address (peer.descriptor.operationalAddress) and, if so, enqueues it as a fallback connection attempt
- If there is no operational address, scheduleAttempts blocks waiting for new addresses from DNS-SD discovery
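
The decision made on each "delete" event can be summarized in a small sketch. It models only the branching described above, not PeerConnection itself: `ConnectionState`, `NextStep` and `onAddressDeleted` are hypothetical names, and addresses are plain strings.

```typescript
// Hypothetical, simplified connection state.
interface ConnectionState {
    pendingAddresses: string[]; // addresses with an active connection attempt
    operationalAddress?: string; // last-known operational address, if any
}

type NextStep =
    | { kind: "continue"; remaining: string[] }
    | { kind: "fallback"; address: string }
    | { kind: "wait-for-discovery" };

// Handles expiry of one address and reports what the connection does next.
function onAddressDeleted(state: ConnectionState, address: string): NextStep {
    // Dropping the address stands in for aborting its connection attempt.
    state.pendingAddresses = state.pendingAddresses.filter(a => a !== address);
    if (state.pendingAddresses.length > 0) {
        return { kind: "continue", remaining: state.pendingAddresses };
    }
    if (state.operationalAddress !== undefined) {
        return { kind: "fallback", address: state.operationalAddress };
    }
    return { kind: "wait-for-discovery" };
}
```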

During connection (in PeerConnection.createExchange), the onSend callback sets peer.service.status.isReachable = false on the first MRP retransmission, which triggers active DNS-SD solicitation.

Current DC scope: DC3 (same peer). This affects the peer's connection state. However, existing established sessions are not directly affected by IP expiration — they continue to use their existing channel until they independently fail.

Effect on RC4 (new interaction): Exchange creation blocks until a new connection is established or the operation is aborted.

DE4: Active IP expires with alternatives available

Detection: Same mechanism as DE3 — service.addressChanges emits a "delete" for the specific address.

PeerConnection.deleteAddress aborts the connection attempt for that address only. Other active connection attempts to remaining addresses continue. pendingAddresses still has entries so maybeAttemptFallback does not activate.

Current DC scope: DC2 (same IP). Only the connection attempt to the expired address is affected.

DE5: Peer reports shutdown

Detection: Peers.#onShutdown is called (triggered by DNS-SD seeing the peer go offline). It calls SessionManager.handlePeerShutdown(peerAddress).

handlePeerShutdown calls the private #handlePeerLoss with keepSubscriptions: true and a PeerShutdownError cause. This iterates all sessions for the peer and force-closes them, but (unlike DE1) it preserves subscription state since the peer may support persistent subscriptions on restart.
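
The difference between the DE1 and DE5 paths comes down to one flag. The sketch below shows the idea only. `PeerSessionLike` and `closePeerSessions` are hypothetical, subscriptions are reduced to string ids, and returning the preserved ids is an illustrative choice rather than what matter.js does.

```typescript
// Hypothetical, simplified session with its server-side subscriptions.
interface PeerSessionLike {
    peerAddress: string;
    closed: boolean;
    subscriptions: string[];
}

// Force-closes every open session for the peer. With keepSubscriptions the
// subscription ids are returned for potential reuse instead of being dropped.
function closePeerSessions(
    sessions: PeerSessionLike[],
    peerAddress: string,
    options: { keepSubscriptions: boolean },
): string[] {
    const preserved: string[] = [];
    sessions.forEach(session => {
        if (session.peerAddress !== peerAddress || session.closed) return;
        session.closed = true;
        if (options.keepSubscriptions) preserved.push(...session.subscriptions);
        session.subscriptions = [];
    });
    return preserved;
}
```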

Current DC scope: DC3 (same peer). All sessions for the peer are closed.

Effect on RC1 (active exchange): All exchanges are force-closed, causing errors in callers.

Effect on RC3 (active subscription): Server-side subscriptions are preserved (keepSubscriptions: true), allowing potential reuse if the peer comes back with persistent subscription support.

DE6: Peer closes session

Detection: NodeSession.handlePeerClose is called when the peer sends a close message on the secure channel. This sets isPeerLost = true, emits on the closedByPeer observable, and then calls handlePeerLoss with a PeerInitiatedCloseError cause.

Unlike DE5 (shutdown via DNS-SD), this is a per-session event — the peer is closing a specific session, not announcing that the entire device is going offline.

Current DC scope: DC1 (same session). Only the session that received the close message is affected. handlePeerLoss is called directly on that session (not via SessionManager.handlePeerLoss), so other sessions for the same peer remain open.

Effect on RC1 (active exchange): All exchanges on the closed session are force-closed.

Effect on RC3 (active subscription): Subscriptions on the closed session are closed without flush.

Apollon77 (Maintainer) · 2026-02-25T15:37:32Z

That's an awesome overview. Let me comment from my perspective.

Regarding the "gaps and asymmetries":

Re 2. "DE2 does not inform other contexts": This is only true when looking at the exact code. Flow-wise, a subscription timeout directly triggers a new subscription attempt, which either succeeds, giving us a new subscription, or ends in an MRP timeout, so we end up at no. 1. This is fine for me.
Re 3. "DE3/DE4 and established sessions are disconnected": The peer connection process is fine. When we have no established session, the expiry of one or all IPs stops attempts to reach those IPs (which is OK), but we still try to connect to the last known address. That is the best we can do for now. It is also correct that established sessions are unaffected: as long as we have a working session (or assume we do) we are fine, and MDNS is also "just UDP/multicast", which can have issues of its own.
Re 4/5: Yes, and that's OK for me, or "not fully true, see above".

My opinions for the Future

DE1 "MRP timeout" is a balance to strike.
- Reasons for an MRP timeout are:
  - Device rebooted and the session is dead: In this case we need a new session. A new session can either be pushed by the device or we need to create it (Sigma 1...)
  - Device went offline:
    - We cannot detect this, so we basically need to fall back to the case "assume the device just rebooted" and start a CASE loop
    - We could start a plain network "ping" to devices (this says nothing about whether the Matter stack is available on the device, but could at least give some information)
    - We could do active MDNS queries for the device
  - Network is congested: Can we detect this? Not really. We can currently only make assumptions
  - We could keep internal counters of how many interactions are in progress (especially invokes) and so estimate whether we filled the network ourselves. We could also dynamically adjust MRP timings based on the number of parallel actions running per network
  - We could count "retried peers" within a recent time window and draw conclusions
    - and monitor success-after-error rates
  - In fact we will never be able to easily distinguish network congestion from "a room lost power"
  - The main topic is to distinguish "network issue" from "device offline issue"
  - Sigma 1 with all data is currently the only reliable and tested option to find out whether a device responds, but it is also a heavy action. Technically, a Sigma 1 with an empty payload is the easiest way, and both the SDK and matter.js answer with InvalidParam, but officially this is a bit outside the spec.
- So the main challenge is to determine whether we need a new session or not. That's basically the main point here
DE2 "Subscription timeout" is completely fine. As long as we have no reason to assume the device is offline, we assume it is online. A subscription timeout is basically like DE1 in my eyes, so the goal is to find out whether we need a new session or whether the session is still valid
DE3: When all IPs expire, this could be a reason to assume the device went offline, so we could probe the device even before a subscription timeout.
DE4: First do the same as for DE3. If we know alternative IPs, we should probe the current IP and, if that fails, probe the new IPs on the same session. If one works, we have an IP change and the session is still valid, so we just update the IP
DE5: This is easy. We know the peer rebooted (at least), so we can declare the session dead and start over after a short delay or similar
DE6: I do not expect this to happen. It will usually be a resource issue on the peer side, so the question is what we do. Until we see this case in reality, I would handle it the same as DE5 (with or without the delay, it is irrelevant).
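
The DE4 idea above (probe the current IP first, then the alternatives on the same session) could look roughly like this. It is a sketch of the proposal, not existing code: `Probe`, `ProbeResult` and `probeAfterIpExpiry` are hypothetical, and what a probe actually sends (an empty read, Sigma 1, ...) is left to the caller.

```typescript
// A probe resolves to true if the peer answered on the existing session via this IP.
type Probe = (ip: string) => Promise<boolean>;

type ProbeResult =
    | { kind: "session-valid"; ip: string; ipChanged: boolean }
    | { kind: "session-suspect" };

// Probes the current IP first, then each alternative in order.
async function probeAfterIpExpiry(currentIp: string, alternatives: string[], probe: Probe): Promise<ProbeResult> {
    if (await probe(currentIp)) {
        return { kind: "session-valid", ip: currentIp, ipChanged: false };
    }
    for (const ip of alternatives) {
        if (await probe(ip)) {
            // The session survived an IP change, so only the address needs updating.
            return { kind: "session-valid", ip, ipChanged: true };
        }
    }
    // Nothing answered; whether the session is still valid remains open.
    return { kind: "session-suspect" };
}
```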

So with the above in my eyes we could have two intermediate states for a peer:

Unsure if the peer has a valid session with us
Unsure if the peer is online (at all)

Flow wise it would be like:

We have known IPs (in worst case just the last operative IP) but no session
               |
               |
               |
               V                       
Try to find out if the device is there (Sigma 1)
               |                        A
               |                        |     (ok we decided session is definitively gone)
               |                        |
               |     We are unsure if the session is still valid (e.g. after MRP timeout or sub timeout, expired IPs)
               |                        A        Probe session, potentially "combined" probing with above step
               |                        |         (Jump over that if we got Shutdown/Session-close)
               |                        |                                        A
               V                        |                                        |
 We have a session with the device                                               |
               |                        |                                        |
               |                        |                                        |
               |                        |                                        | (just triggers this check,
               |                        |                                        |  for the Interaction it is too late)
               V                        V                                        |
        We have a session and subscription with the device          ||   Interaction fails
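
Read as a state machine, the flow above might be sketched like this. The state and event names are hypothetical labels for the boxes and arrows in the diagram, and the transitions are one interpretation of it rather than an agreed design.

```typescript
// "session-uncertain": unsure if the peer still has a valid session with us.
// "presence-uncertain": no usable session, unsure if the peer is online at all.
type PeerState = "connected" | "session-uncertain" | "presence-uncertain";

type PeerEvent =
    | "mrp-timeout"
    | "subscription-timeout"
    | "ips-expired"
    | "probe-succeeded"
    | "probe-failed"
    | "peer-shutdown"
    | "peer-closed-session"
    | "case-succeeded";

function nextPeerState(state: PeerState, event: PeerEvent): PeerState {
    switch (event) {
        case "peer-shutdown":
        case "peer-closed-session":
            // The session is known to be gone, so the session probe is skipped.
            return "presence-uncertain";
        case "mrp-timeout":
        case "subscription-timeout":
        case "ips-expired":
            return state === "connected" ? "session-uncertain" : state;
        case "probe-succeeded":
            return state === "session-uncertain" ? "connected" : state;
        case "probe-failed":
            return state === "session-uncertain" ? "presence-uncertain" : state;
        case "case-succeeded":
            // A completed CASE handshake (Sigma 1 ...) gives us a fresh session.
            return state === "presence-uncertain" ? "connected" : state;
        default:
            return state;
    }
}
```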

To probe, my latest commit to peerconnection uses an empty read (7-byte payload), which is currently only executed on subscription timeout.
Ping is not that easy from a Node.js/JS perspective, so we could add it as a "platform enhancement", but we cannot easily use real pings.
The smallest "ping" currently officially backed by the spec is a full Sigma1 (or at least one big enough that it makes no sense to send anything other than a full one).
