Skip to content

fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery#3695

Merged
Apollon77 merged 6 commits intomainfrom
fix/commissionable-mdns-stale-addresses
May 6, 2026
Merged

fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery#3695
Apollon77 merged 6 commits intomainfrom
fix/commissionable-mdns-stale-addresses

Conversation

@Apollon77
Copy link
Copy Markdown
Collaborator

Replaces #3693 (closed automatically when the branch was renamed).

Summary

CommissionableMdnsScanner.findCommissionableDevicesContinuously returns an empty result on the second consecutive discover-commissionables call within a CHIP test step on the matterjs controller, breaking the YAML constraint check (e.g. Discovery step 8 Check Hostname fails with hostName: NoneType).

Cause: the scanner's #cache retains the entry (its DnssdName.isDiscovered stays true), but IpService.addresses for the entry has been pruned because matter commissionable A/AAAA TTL is short (~4s) and lapses between the responder's unsolicited rebroadcasts. The cache iteration in findCommissionableDevicesContinuously rejects entries with addresses.length === 0. #cacheDevice did solicit and arm an onAddresses observer when a device was first cached without addresses, but that observer deregistered itself after the first delivery, so a subsequent A/AAAA expiry could not re-trigger it. #startDiscovery only solicits PTR records, not A/AAAA, so a fresh discovery window does nothing to refresh hostname records on already-cached devices.

Fix

  • Extract the existing "solicit SRV-target A/AAAA + arm onAddresses observer" logic out of #cacheDevice into a new idempotent helper #solicitAndArmAddresses(cached).
  • Call it from the cache iteration loop whenever a cached entry matches the identifier but has zero addresses.
  • The current discovery's waiter is already registered, so when A/AAAA records resolve before the discovery timeout, #deliverDeviceIfResolved notifies the waiter and the device flows through the callback chain into the result.
  • Idempotent: skips re-arming if cached.onAddresses is already attached, so concurrent overlapping discoveries do not double-register on ipService.changed.

The helper is also reused by #cacheDevice so the original initial-resolution path is preserved verbatim.

Verification

  • Bug reproduced on matterjs CHIP Discovery test variant
  • Two back-to-back CI runs of this branch with the fix passed cleanly
  • npm test -- -p packages/protocol — 736/736 pass

Test plan

  • Build clean (npm run build -w @matter/protocol)
  • Format clean (npm run format-verify)
  • Protocol unit tests pass (npm test -- -p packages/protocol)
  • CHIP Discovery test passes on matterjs controller variant in CI

🤖 Generated with Claude Code

Apollon77 and others added 4 commits May 6, 2026 11:27
…n PRs

Intermittent CI failure on the matterjs controller variant of the CHIP
Discovery test: step 7 returns 12 commissionables, step 8 (~3s later)
returns empty. Insufficient instrumentation to determine whether the
mDNS scanner cache was emptied between calls, addresses got pruned, or
the Discovery callback chain dropped them.

Add INFO logs at the relevant chokepoints:

- CommissionableMdnsScanner: log cache size on entry, per-entry match
  decision (matched / rejected: no addresses / rejected: filter), waiter
  notifications for new arrivals, cache add and cache delete events,
  and final result size.
- Discovery: log whether the scanner callback reused an existing
  ClientNode or created a new one (and whether find() hit an
  already-commissioned node).
- LegacyControllerCommandHandler.handleDiscovery: log incoming findBy,
  raw result count, picked entry, and empty-return.

Also temporarily extend the matterjs controller matrix gate in
build-test.chip.yml to run on pull_request so this branch can capture
fresh logs from CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stic logs

Previous logging commit serialized the CommissionableDeviceIdentifiers
object via JSON.stringify, which throws "Do not know how to serialize
a BigInt" when the identifier carries a BigInt-typed field, breaking
every commissionable discovery on the matterjs CHIP test variant.

Replace with the field-key list (sufficient to know what was queried)
and replace the legacy handler's raw result dump with a slim
{id, D, CM} projection so neither path can hit a BigInt at log time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CHIP Discovery test on the matterjs controller reproducibly returned an
empty result on the second discover-commissionables call within a step
(step 8 "Check Hostname" failed with hostName: NoneType). Diagnostic
logs added in the previous commit showed the scanner cache held all 13
entries but every entry was rejected by the addresses-length filter on
the second call:

  1st call: matched: 13, rejectedNoAddr: 0, resultSize: 13
  2nd call: matched:  0, rejectedNoAddr: 13, resultSize:  0

Cache itself was intact (DnssdName.isDiscovered still true) but each
entry's IpService.addresses had been pruned because the A/AAAA TTL on
matter commissionable broadcasts is short (~4s) and elapses between
unsolicited rebroadcasts. #cacheDevice solicited and armed an
onAddresses observer when a device was first cached without addresses,
but the observer deregistered itself after the first delivery, so a
later expiry could not re-trigger it, and #startDiscovery solicits PTR
records only.

Extract the "solicit SRV-target A/AAAA + arm onAddresses observer"
logic from #cacheDevice into a new idempotent helper
#solicitAndArmAddresses, and call it from the cache iteration loop
whenever a cached entry matches the identifier but has zero addresses.
The current discovery's waiter is already registered, so when A/AAAA
records resolve before the discovery timeout fires, #deliverDeviceIfResolved
notifies the waiter and the device flows through the callback chain.

Idempotent: skips re-arming if the onAddresses observer is already
attached, so concurrent overlapping discoveries do not double-register.

Diagnostic logs from the previous commit are kept so the next CI run
verifies the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two consecutive matterjs CHIP runs on this branch passed cleanly with
the A/AAAA re-solicit fix in place.  Drop the diagnostic logs added
across CommissionableMdnsScanner, Discovery, and LegacyControllerCommandHandler
and revert the temporary pull_request gate in build-test.chip.yml so
the branch contains only the surgical fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 11:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a commissionable mDNS discovery regression where consecutive discovery calls can return empty results when cached entries remain discovered but their A/AAAA records have expired (short TTL), by re-soliciting host A/AAAA and re-arming address resolution for stale cached entries.

Changes:

  • Adjust cached-result iteration to re-solicit A/AAAA and re-arm address observation when a matching cached device has zero addresses.
  • Extract “solicit SRV-target A/AAAA + arm onAddresses observer” into a shared helper (#solicitAndArmAddresses) and reuse it from both the initial caching path and the stale-cache refresh path.

Comment thread packages/protocol/src/mdns/CommissionableMdnsScanner.ts
Cover the path fixed in the parent commit: a cached commissionable
device whose A record TTL has elapsed while SRV/TXT remain valid is
re-delivered on a subsequent findCommissionableDevicesContinuously
call once the responder replies to the re-solicited hostname query.

Verified: the test fails on the pre-fix code (expected 1 to equal 0
after the second discovery returns no devices) and passes after the
#solicitAndArmAddresses helper is reused from the cache iteration loop.

Addresses copilot-pull-request-reviewer review comment on PR #3695.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@Apollon77 Apollon77 merged commit 638f289 into main May 6, 2026
42 checks passed
@Apollon77 Apollon77 deleted the fix/commissionable-mdns-stale-addresses branch May 6, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants