Replies: 1 comment
That's an awesome overview. Let me comment from my perspective. Regarding the "gaps and asymmetries":
My opinions for the Future
So with the above in my eyes we could have two intermediate states for a peer:
Flow wise it would be like:
This is an attempt to establish a framework for improving matter.js's response to Matter communication disruptions.
We focus here on errors that occur between two nodes that we previously considered to have a "healthy" network connection. Thus our use of the term "disruption" excludes problems with initial contact, which is a different problem space.
I created the taxonomy below and then asked Claude to assess our current implementation.
Dimensions
We divide the problem space into three dimensions along the following axes.
Disruption event (DE)
This is an event we observe that may indicate a disruption.
Response context (RC)
This is the state in which we may (or must) respond to a disruption.
Disruption context (DC)
This is the context in which a disruption occurs.
RC vs. DC
We differentiate the RC from the DC because we may want to respond to a disruption even if it occurs in a separate context from the RC.
For example, if we have an active subscription (RC3), we may want to take action if we have an MRP timeout (DE1) on a different session on the same IP (DC2).
Note: DC4 (same subnet) is aspirational — there is currently no subnet-level disruption detection, but the dimension is included for completeness as a possible future consideration.
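One way to make the three axes concrete is to treat an observation as a (DE, DC) pair and ask whether a given RC should react to it. The sketch below is hypothetical: the type names, the `policy` table, and `shouldReact` are illustrative assumptions rather than matter.js APIs, and the table encodes only the RC3/DE1/DC2 example given above.

```typescript
// Hypothetical model of the taxonomy; none of these names exist in matter.js.
type DisruptionEvent = "DE1" | "DE2" | "DE3" | "DE4" | "DE5" | "DE6";
type ResponseContext = "RC1" | "RC2" | "RC3" | "RC4";
type DisruptionContext = "DC1" | "DC2" | "DC3" | "DC4";

interface Observation {
    event: DisruptionEvent;
    // Where the disruption occurred, relative to the context that may respond.
    context: DisruptionContext;
}

// For each response context, the (event, context) pairs that warrant a reaction.
// Only the example from the text is filled in; the rest is deliberately empty.
const policy: Record<ResponseContext, Observation[]> = {
    RC1: [],
    RC2: [],
    RC3: [{ event: "DE1", context: "DC2" }],
    RC4: [],
};

function shouldReact(rc: ResponseContext, seen: Observation): boolean {
    return policy[rc].some(p => p.event === seen.event && p.context === seen.context);
}

// An active subscription reacts to an MRP timeout on another session on the same IP.
console.log(shouldReact("RC3", { event: "DE1", context: "DC2" })); // true
console.log(shouldReact("RC2", { event: "DE1", context: "DC2" })); // false
```

Keeping the policy as data rather than code would make it easy to review which combinations are handled and which are still open.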
Current behavior
This section summarizes how the code on the `peer-connection` branch currently responds to disruptions. See Appendix 1 for detailed analysis of detection mechanisms and propagation paths.
The most notable characteristic of the current implementation is its lack of nuance: most disruption events either propagate too broadly (DE1 kills all peer sessions) or too narrowly (DE2 only affects one subscription). There is no intermediate "degraded" state and no coordination between the different detection and recovery mechanisms.
DC4 (same subnet) is not represented below as no current code operates at subnet scope.
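| Disruption event | Detected by |
| --- | --- |
| DE1 | `MessageExchange` |
| DE2 | `ClientSubscriptions` timer |
| DE3 | `PeerConnection` address events |
| DE4 | `PeerConnection` address events |
| DE5 | `Peers` |
| DE6 | `NodeSession` |

Notable gaps and asymmetries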
DE1 is a sledgehammer: An MRP timeout on any single exchange force-closes all sessions for the peer. There is no attempt to test whether other sessions or IPs are still reachable.
DE2 does not inform other contexts: A subscription timeout only closes that specific subscription. It does not mark the session or peer as degraded, even though a subscription timeout likely indicates the same reachability problem that DE1 detects.
DE3/DE4 and established sessions are disconnected: IP expiration during connection establishment does not affect already-established sessions. An established session may continue operating on an IP that DNS-SD considers expired.
No cross-DE correlation: The system does not correlate events across disruption types. For example, an MRP timeout (DE1) followed by subscription timeout (DE2) on the same peer is not treated differently than either event alone.
No retry for reads: Read interactions are idempotent and could safely retry on transient failures, but currently do not. (Write and invoke are not idempotent, so retry is correctly left to higher levels.)
No gradual degradation: The system jumps from "fully connected" to "force-close everything" with no intermediate states (e.g., marking a session as suspect, preferring alternate IPs, or probing before destroying).
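To illustrate what an intermediate state could look like, here is a minimal sketch of a peer health state machine that marks a peer as suspect on the first timeout and only escalates after a failed probe or a second, correlated timeout. Everything in it is an assumption made for illustration: the `Health` and `Signal` names, the single "suspect" state, and the probe step are not part of matter.js.

```typescript
// Hypothetical sketch; "suspect" and probing are illustrative, not current matter.js behavior.
type Health = "healthy" | "suspect" | "lost";
type Signal = "mrpTimeout" | "subscriptionTimeout" | "probeSucceeded" | "probeFailed";

function nextHealth(current: Health, signal: Signal): Health {
    switch (current) {
        case "healthy":
            // A first timeout (DE1 or DE2) marks the peer as suspect instead of closing everything.
            return signal === "mrpTimeout" || signal === "subscriptionTimeout" ? "suspect" : "healthy";
        case "suspect":
            // A successful probe restores the peer. A failed probe or a second timeout
            // (for example DE1 followed by DE2) escalates to "lost".
            return signal === "probeSucceeded" ? "healthy" : "lost";
        case "lost":
            // Recovery from "lost" would go through reconnection, which is out of scope here.
            return "lost";
    }
}

const afterTimeout = nextHealth("healthy", "mrpTimeout");
console.log(afterTimeout); // suspect
console.log(nextHealth(afterTimeout, "probeSucceeded")); // healthy
console.log(nextHealth(afterTimeout, "subscriptionTimeout")); // lost
```

A pure transition function like this keeps the escalation rules in one place, so cross-DE correlation becomes a matter of which signals are fed into it.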
Future direction
Appendix 1: Details of current implementation
DE1: MRP timeout on established session
Detection: `MessageExchange.#retransmitMessage` increments a retransmission counter each time the MRP timer fires without an ack. When the counter reaches `MRP.MAX_TRANSMISSIONS` (5), the `#sentMessageAckFailure` callback fires, which throws `PeerUnresponsiveError`.
Propagation path:

1. `MessageExchange.send` and `MessageExchange.nextMessage` catch errors caused by `TransientPeerCommunicationError` (parent of `PeerUnresponsiveError`) and call `context.peerLost(exchange, cause)`.
2. `ExchangeManager.#messageExchangeContextFor` implements `peerLost`. For commissioned peers it calls `SessionManager.handlePeerLoss(peerAddress, cause)`. For uncommissioned peers it calls `session.handlePeerLoss` directly on the individual session.
3. `SessionManager.handlePeerLoss` iterates all sessions for the peer address and calls `session.handlePeerLoss` on each, provided the session was created before the loss event.
4. `NodeSession.handlePeerLoss` sets `isPeerLost = true` and calls `initiateForceClose`, which closes the session and all its exchanges.

Current DC scope: DC3 (same peer). A single MRP timeout on one session causes force-close of all sessions for that peer address, regardless of IP or session.
Effect on RC1 (active exchange): The exchange that triggered the timeout throws `PeerUnresponsiveError` to its caller. All other active exchanges on all sessions for the same peer are force-closed, causing `ClosedError` in their callers.
Effect on RC3 (active subscription): Server-side subscriptions on all sessions for the peer are closed without flush.
Effect on RC4 (new interaction): No special handling — the error propagates to the caller.
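The fan-out described for DE1 can be modeled in a few lines. This is a simplified sketch, not the matter.js implementation: `SessionModel`, `fanOutPeerLoss`, and the timestamp fields are stand-ins invented for illustration, and setting `isClosed` stands in for `initiateForceClose`.

```typescript
// Simplified model of the DE1 fan-out; all names here are illustrative stand-ins.
interface SessionModel {
    id: number;
    peerAddress: string;
    createdAt: number; // ms timestamp
    isPeerLost: boolean;
    isClosed: boolean;
}

// Marks and closes every session for the peer that existed before the loss was observed.
// Returns the ids of the sessions that were closed.
function fanOutPeerLoss(sessions: SessionModel[], peerAddress: string, lostAt: number): number[] {
    const closed: number[] = [];
    for (const session of sessions) {
        // Sessions for other peers, or created after the loss event, are left alone.
        if (session.peerAddress !== peerAddress || session.createdAt >= lostAt) continue;
        session.isPeerLost = true;
        session.isClosed = true; // stands in for initiateForceClose
        closed.push(session.id);
    }
    return closed;
}

const sessions: SessionModel[] = [
    { id: 1, peerAddress: "peer-A", createdAt: 100, isPeerLost: false, isClosed: false },
    { id: 2, peerAddress: "peer-A", createdAt: 200, isPeerLost: false, isClosed: false },
    { id: 3, peerAddress: "peer-A", createdAt: 900, isPeerLost: false, isClosed: false },
    { id: 4, peerAddress: "peer-B", createdAt: 100, isPeerLost: false, isClosed: false },
];

// One timeout observed at t=500 closes both older sessions for peer-A.
console.log(fanOutPeerLoss(sessions, "peer-A", 500)); // [ 1, 2 ]
```

The example makes the DC3 scope visible: the timeout occurred on one exchange, yet every pre-existing session for that peer is closed.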
DE2: Subscription stops reporting
Detection: `ClientSubscriptions` runs a periodic timer. For each `PeerSubscription`, it sets a `timeoutAt` timestamp on first check and then compares against it on subsequent checks. The timeout is `maxInterval + 2 * maxPeerResponseTime`.
When `timeoutAt` is exceeded, `PeerSubscription.timedOut` is called, which logs and calls `close()` on the subscription.
For sustained subscriptions: `SustainedSubscription.#run` awaits a `closed` promise from the underlying `PeerSubscription`. When it resolves (due to timeout), and the sustained subscription is not aborted, it logs "Replacing subscription due to timeout" and loops back to re-subscribe. The retry schedule starts at 15s and backs off exponentially to a max of 1 hour, with a jitter factor of 0.25.
Current DC scope: DC1 (same session). Only the specific `PeerSubscription` that timed out is closed. Other sessions and subscriptions for the same peer are not affected.
Effect on RC1 (active exchange): None. The subscription timeout does not close the session or its exchanges.
Effect on RC2 (idle session): None — the session remains open.
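The two timing rules for DE2 can be written out as small functions. This is a sketch under stated assumptions: the function and constant names are hypothetical, the backoff multiplier of 2 is assumed because the text only says "exponentially", and the jitter is assumed to add up to 25% on top of the base delay. The real matter.js retry schedule may differ on both points.

```typescript
// Timeout rule quoted above: maxInterval + 2 * maxPeerResponseTime (all values in ms here).
function subscriptionTimeoutMs(maxIntervalMs: number, maxPeerResponseTimeMs: number): number {
    return maxIntervalMs + 2 * maxPeerResponseTimeMs;
}

const INITIAL_DELAY_MS = 15_000; // 15 s, from the text
const MAX_DELAY_MS = 60 * 60 * 1000; // 1 hour, from the text
const JITTER_FACTOR = 0.25; // from the text
const BACKOFF_FACTOR = 2; // assumption: the text does not state the multiplier

// attempt is 0-based. random is injectable so the result is reproducible in tests.
function retryDelayMs(attempt: number, random: () => number = Math.random): number {
    const base = Math.min(INITIAL_DELAY_MS * Math.pow(BACKOFF_FACTOR, attempt), MAX_DELAY_MS);
    // Assumption: jitter adds between 0 and JITTER_FACTOR * base to the base delay.
    return Math.round(base + base * JITTER_FACTOR * random());
}

console.log(subscriptionTimeoutMs(60_000, 10_000)); // 80000
console.log(retryDelayMs(0, () => 0)); // 15000
console.log(retryDelayMs(1, () => 0)); // 30000
console.log(retryDelayMs(20, () => 0)); // 3600000
```

Under the assumed multiplier of 2, the schedule reaches the one-hour cap on the ninth attempt, since 15 s × 2^8 already exceeds 3600 s.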
DE3: All IPs expire
Detection: This is handled during connection establishment in `PeerConnection`. The `service.addressChanges` async iterator emits `"delete"` events when addresses expire. When all discovered addresses expire:

- `deleteAddress` is called for each, which aborts the corresponding connection attempt
- `maybeAttemptFallback` is called, which checks if the peer has a last-known operational address (`peer.descriptor.operationalAddress`) and, if so, enqueues it as a fallback connection attempt
- `scheduleAttempts` blocks waiting for new addresses from DNS-SD discovery

During connection (in `PeerConnection.createExchange`), the `onSend` callback sets `peer.service.status.isReachable = false` on the first MRP retransmission, which triggers active DNS-SD solicitation.
Current DC scope: DC3 (same peer). This affects the peer's connection state. However, existing established sessions are not directly affected by IP expiration; they continue to use their existing channel until they independently fail.
Effect on RC4 (new interaction): Exchange creation blocks until a new connection is established or the operation is aborted.
DE4: Active IP expires with alternatives available
Detection: Same mechanism as DE3: `service.addressChanges` emits a `"delete"` for the specific address. `PeerConnection.deleteAddress` aborts the connection attempt for that address only. Other active connection attempts to remaining addresses continue. `pendingAddresses` still has entries so `maybeAttemptFallback` does not activate.
Current DC scope: DC2 (same IP). Only the connection attempt to the expired address is affected.
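The difference between DE3 and DE4 comes down to whether any pending addresses remain after a deletion. The sketch below models only that decision. `AddressTracker` and its members are hypothetical names, not the internals of `PeerConnection`, and the addresses are taken from documentation ranges.

```typescript
// Hypothetical model of the address bookkeeping; the real PeerConnection internals differ.
type DeleteOutcome = "others-continue" | "fallback-enqueued" | "waiting-for-discovery";

class AddressTracker {
    pending = new Set<string>();
    fallbackQueue: string[] = [];
    lastKnownOperationalAddress: string | undefined;

    constructor(lastKnownOperationalAddress?: string) {
        this.lastKnownOperationalAddress = lastKnownOperationalAddress;
    }

    addAddress(address: string): void {
        this.pending.add(address);
    }

    deleteAddress(address: string): DeleteOutcome {
        // Removing the entry stands in for aborting that address's connection attempt.
        this.pending.delete(address);
        if (this.pending.size > 0) {
            return "others-continue"; // DE4: alternatives remain, no fallback
        }
        if (this.lastKnownOperationalAddress !== undefined) {
            // DE3: nothing left, so the last-known address is tried as a fallback.
            this.fallbackQueue.push(this.lastKnownOperationalAddress);
            return "fallback-enqueued";
        }
        return "waiting-for-discovery"; // DE3 with no fallback: block until DNS-SD reports addresses
    }
}

const tracker = new AddressTracker("2001:db8::99");
tracker.addAddress("2001:db8::1");
tracker.addAddress("192.0.2.10");
console.log(tracker.deleteAddress("2001:db8::1")); // others-continue
console.log(tracker.deleteAddress("192.0.2.10")); // fallback-enqueued
```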
DE5: Peer reports shutdown
Detection: `Peers.#onShutdown` is called (triggered by DNS-SD seeing the peer go offline). It calls `SessionManager.handlePeerShutdown(peerAddress)`. `handlePeerShutdown` calls the private `#handlePeerLoss` with `keepSubscriptions: true` and a `PeerShutdownError` cause. This iterates all sessions for the peer and force-closes them, but (unlike DE1) it preserves subscription state since the peer may support persistent subscriptions on restart.
Current DC scope: DC3 (same peer). All sessions for the peer are closed.
Effect on RC1 (active exchange): All exchanges are force-closed, causing errors in callers.
Effect on RC3 (active subscription): Server-side subscriptions are preserved (`keepSubscriptions: true`), allowing potential reuse if the peer comes back with persistent subscription support.
DE6: Peer closes session
Detection: `NodeSession.handlePeerClose` is called when the peer sends a close message on the secure channel. This sets `isPeerLost = true`, emits on the `closedByPeer` observable, and then calls `handlePeerLoss` with a `PeerInitiatedCloseError` cause.
Unlike DE5 (shutdown via DNS-SD), this is a per-session event: the peer is closing a specific session, not announcing that the entire device is going offline.
Current DC scope: DC1 (same session). Only the session that received the close message is affected. `handlePeerLoss` is called directly on that session (not via `SessionManager.handlePeerLoss`), so other sessions for the same peer remain open.
Effect on RC1 (active exchange): All exchanges on the closed session are force-closed.
Effect on RC3 (active subscription): Subscriptions on the closed session are closed without flush.
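The contrast between DE5 and DE6 is one of scope and of what happens to subscription state. A minimal sketch, assuming hypothetical names (`PeerSession`, `onPeerShutdown`, `onPeerClose`) that only mirror the behavior described above and do not correspond to the real matter.js signatures:

```typescript
// Illustrative model only; the real SessionManager and NodeSession APIs differ.
interface PeerSession {
    id: number;
    peerAddress: string;
    closed: boolean;
    subscriptionsKept: boolean;
}

function forceClose(session: PeerSession, keepSubscriptions: boolean): void {
    session.closed = true;
    session.subscriptionsKept = keepSubscriptions;
}

// DE5: DNS-SD reports the peer offline. Every session for the peer closes,
// but subscription state is preserved in case the peer restarts.
function onPeerShutdown(sessions: PeerSession[], peerAddress: string): void {
    for (const session of sessions) {
        if (session.peerAddress === peerAddress) forceClose(session, true);
    }
}

// DE6: the peer sends a close message on one session. Only that session closes,
// and its subscriptions are closed with it.
function onPeerClose(session: PeerSession): void {
    forceClose(session, false);
}

const open = (id: number): PeerSession => ({ id, peerAddress: "peer-A", closed: false, subscriptionsKept: false });

const de5 = [open(1), open(2)];
onPeerShutdown(de5, "peer-A");
console.log(de5.map(s => s.closed)); // [ true, true ]

const de6 = [open(1), open(2)];
onPeerClose(de6[0]);
console.log(de6.map(s => s.closed)); // [ true, false ]
```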