fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery#3693
Closed
fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery#3693
Conversation
…n PRs Intermittent CI failure on the matterjs controller variant of the CHIP Discovery test: step 7 returns 12 commissionables, step 8 (~3s later) returns empty. Insufficient instrumentation to determine whether the mDNS scanner cache was emptied between calls, addresses got pruned, or the Discovery callback chain dropped them. Add INFO logs at the relevant chokepoints: - CommissionableMdnsScanner: log cache size on entry, per-entry match decision (matched / rejected: no addresses / rejected: filter), waiter notifications for new arrivals, cache add and cache delete events, and final result size. - Discovery: log whether the scanner callback reused an existing ClientNode or created a new one (and whether find() hit an already-commissioned node). - LegacyControllerCommandHandler.handleDiscovery: log incoming findBy, raw result count, picked entry, and empty-return. Also temporarily extend the matterjs controller matrix gate in build-test.chip.yml to run on pull_request so this branch can capture fresh logs from CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds additional diagnostic logging around commissionable device discovery to help pinpoint an intermittent CHIP discovery failure in the matterjs controller CI job, and adjusts the CHIP workflow to run the matterjs controller variant on PRs to capture logs more frequently.
Changes:
- Add INFO-level logs in the legacy chip-testing controller discovery handler to capture request/response details and empty-result cases.
- Add INFO-level logs in the protocol mDNS commissionable scanner to trace cache iteration, cache add/delete, and waiter notifications.
- Add INFO-level logs in node controller discovery callback paths to trace whether a node is reused vs created.
- Enable the matterjs controller test matrix for
pull_requestevents in the CHIP build/test workflow.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| support/chip-testing/src/handler/LegacyControllerCommandHandler.ts | Adds discovery request/result logging and more structured context when selecting the “latest” discovery entry. |
| packages/protocol/src/mdns/CommissionableMdnsScanner.ts | Adds detailed discovery/cache diagnostic logs to understand cache eviction, address resolution, and callback delivery. |
| packages/node/src/behavior/system/controller/discovery/Discovery.ts | Adds diagnostic logs to differentiate node reuse vs new node creation during discovery callbacks. |
| .github/workflows/build-test.chip.yml | Expands controller matrix selection to include matterjs on PR events for CI debugging. |
…stic logs
Previous logging commit serialized the CommissionableDeviceIdentifiers
object via JSON.stringify, which throws "Do not know how to serialize
a BigInt" when the identifier carries a BigInt-typed field, breaking
every commissionable discovery on the matterjs CHIP test variant.
Replace with the field-key list (sufficient to know what was queried)
and replace the legacy handler's raw result dump with a slim
{id, D, CM} projection so neither path can hit a BigInt at log time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CHIP Discovery test on the matterjs controller reproducibly returned an empty result on the second discover-commissionables call within a step (step 8 "Check Hostname" failed with hostName: NoneType). Diagnostic logs added in the previous commit showed the scanner cache held all 13 entries but every entry was rejected by the addresses-length filter on the second call: 1st call: matched: 13, rejectedNoAddr: 0, resultSize: 13 2nd call: matched: 0, rejectedNoAddr: 13, resultSize: 0 Cache itself was intact (DnssdName.isDiscovered still true) but each entry's IpService.addresses had been pruned because the A/AAAA TTL on matter commissionable broadcasts is short (~4s) and elapses between unsolicited rebroadcasts. #cacheDevice solicited and armed an onAddresses observer when a device was first cached without addresses, but the observer deregistered itself after the first delivery, so a later expiry could not re-trigger it, and #startDiscovery solicits PTR records only. Extract the "solicit SRV-target A/AAAA + arm onAddresses observer" logic from #cacheDevice into a new idempotent helper #solicitAndArmAddresses, and call it from the cache iteration loop whenever a cached entry matches the identifier but has zero addresses. The current discovery's waiter is already registered, so when A/AAAA records resolve before the discovery timeout fires, #deliverDeviceIfResolved notifies the waiter and the device flows through the callback chain. Idempotent: skips re-arming if the onAddresses observer is already attached, so concurrent overlapping discoveries do not double-register. Diagnostic logs from the previous commit are kept so the next CI run verifies the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two consecutive matterjs CHIP runs on this branch passed cleanly with the A/AAAA re-solicit fix in place. Drop the diagnostic logs added across CommissionableMdnsScanner, Discovery, and LegacyControllerCommandHandler and revert the temporary pull_request gate in build-test.chip.yml so the branch contains only the surgical fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
Comment on lines
+201
to
206
| } else { | ||
| // Cached entry's A/AAAA records expired since the responder's last unsolicited broadcast | ||
| // (matter commissionable A/AAAA TTL is short). Solicit fresh records and arm onAddresses | ||
| // so this discovery's waiter is notified if addresses resolve before the timeout fires. | ||
| this.#solicitAndArmAddresses(cached); | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CommissionableMdnsScanner.findCommissionableDevicesContinuouslyreturns an empty result on the second consecutive discover-commissionables call within a CHIP test step on the matterjs controller, breaking the YAML constraint check (e.g.Discoverystep 8Check Hostnamefails withhostName: NoneType).Cause: the scanner's
#cacheretains the entry (itsDnssdName.isDiscoveredstays true), butIpService.addressesfor the entry has been pruned because matter commissionable A/AAAA TTL is short (~4s) and lapses between the responder's unsolicited rebroadcasts. The cache iteration infindCommissionableDevicesContinuouslyrejects entries withaddresses.length === 0.#cacheDevicedid solicit and arm anonAddressesobserver when a device was first cached without addresses, but that observer deregistered itself after the first delivery, so a subsequent A/AAAA expiry could not re-trigger it.#startDiscoveryonly solicits PTR records, not A/AAAA, so a fresh discovery window does nothing to refresh hostname records on already-cached devices.Fix
onAddressesobserver" logic out of#cacheDeviceinto a new idempotent helper#solicitAndArmAddresses(cached).#deliverDeviceIfResolvednotifies the waiter and the device flows through the callback chain into the result.cached.onAddressesis already attached, so concurrent overlapping discoveries do not double-register onipService.changed.The helper is also reused by
#cacheDeviceso the original initial-resolution path is preserved verbatim.Verification
matterjsCHIPDiscoverytest variant (CI runs reproducibly hit step-8 empty-result failure onmain)npm test -- -p packages/protocol— 736/736 pass🤖 Generated with Claude Code