Skip to content

fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery#3693

Closed
Apollon77 wants to merge 4 commits intomainfrom
debug/legacy-controller-discovery-logging
Closed

fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery#3693
Apollon77 wants to merge 4 commits intomainfrom
debug/legacy-controller-discovery-logging

Conversation

@Apollon77
Copy link
Copy Markdown
Collaborator

@Apollon77 Apollon77 commented May 6, 2026

Summary

CommissionableMdnsScanner.findCommissionableDevicesContinuously returns an empty result on the second consecutive discover-commissionables call within a CHIP test step on the matterjs controller, breaking the YAML constraint check (e.g. Discovery step 8 Check Hostname fails with hostName: NoneType).

Cause: the scanner's #cache retains the entry (its DnssdName.isDiscovered stays true), but IpService.addresses for the entry has been pruned because matter commissionable A/AAAA TTL is short (~4s) and lapses between the responder's unsolicited rebroadcasts. The cache iteration in findCommissionableDevicesContinuously rejects entries with addresses.length === 0. #cacheDevice did solicit and arm an onAddresses observer when a device was first cached without addresses, but that observer deregistered itself after the first delivery, so a subsequent A/AAAA expiry could not re-trigger it. #startDiscovery only solicits PTR records, not A/AAAA, so a fresh discovery window does nothing to refresh hostname records on already-cached devices.

Fix

  • Extract the existing "solicit SRV-target A/AAAA + arm onAddresses observer" logic out of #cacheDevice into a new idempotent helper #solicitAndArmAddresses(cached).
  • Call it from the cache iteration loop whenever a cached entry matches the identifier but has zero addresses.
  • The current discovery's waiter is already registered, so when A/AAAA records resolve before the discovery timeout, #deliverDeviceIfResolved notifies the waiter and the device flows through the callback chain into the result.
  • Idempotent: skips re-arming if cached.onAddresses is already attached, so concurrent overlapping discoveries do not double-register on ipService.changed.

The helper is also reused by #cacheDevice so the original initial-resolution path is preserved verbatim.

Verification

  • Bug reproduced on matterjs CHIP Discovery test variant (CI runs reproducibly hit step-8 empty-result failure on main)
  • Two back-to-back CI runs of this branch with the fix passed cleanly
  • npm test -- -p packages/protocol — 736/736 pass

🤖 Generated with Claude Code

…n PRs

Intermittent CI failure on the matterjs controller variant of the CHIP
Discovery test: step 7 returns 12 commissionables, step 8 (~3s later)
returns empty. Insufficient instrumentation to determine whether the
mDNS scanner cache was emptied between calls, addresses got pruned, or
the Discovery callback chain dropped them.

Add INFO logs at the relevant chokepoints:

- CommissionableMdnsScanner: log cache size on entry, per-entry match
  decision (matched / rejected: no addresses / rejected: filter), waiter
  notifications for new arrivals, cache add and cache delete events,
  and final result size.
- Discovery: log whether the scanner callback reused an existing
  ClientNode or created a new one (and whether find() hit an
  already-commissioned node).
- LegacyControllerCommandHandler.handleDiscovery: log incoming findBy,
  raw result count, picked entry, and empty-return.

Also temporarily extend the matterjs controller matrix gate in
build-test.chip.yml to run on pull_request so this branch can capture
fresh logs from CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 09:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds additional diagnostic logging around commissionable device discovery to help pinpoint an intermittent CHIP discovery failure in the matterjs controller CI job, and adjusts the CHIP workflow to run the matterjs controller variant on PRs to capture logs more frequently.

Changes:

  • Add INFO-level logs in the legacy chip-testing controller discovery handler to capture request/response details and empty-result cases.
  • Add INFO-level logs in the protocol mDNS commissionable scanner to trace cache iteration, cache add/delete, and waiter notifications.
  • Add INFO-level logs in node controller discovery callback paths to trace whether a node is reused vs created.
  • Enable the matterjs controller test matrix for pull_request events in the CHIP build/test workflow.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
support/chip-testing/src/handler/LegacyControllerCommandHandler.ts Adds discovery request/result logging and more structured context when selecting the “latest” discovery entry.
packages/protocol/src/mdns/CommissionableMdnsScanner.ts Adds detailed discovery/cache diagnostic logs to understand cache eviction, address resolution, and callback delivery.
packages/node/src/behavior/system/controller/discovery/Discovery.ts Adds diagnostic logs to differentiate node reuse vs new node creation during discovery callbacks.
.github/workflows/build-test.chip.yml Expands controller matrix selection to include matterjs on PR events for CI debugging.

Comment thread packages/protocol/src/mdns/CommissionableMdnsScanner.ts Outdated
Comment thread packages/node/src/behavior/system/controller/discovery/Discovery.ts Outdated
Comment thread .github/workflows/build-test.chip.yml Outdated
Apollon77 and others added 3 commits May 6, 2026 11:35
…stic logs

Previous logging commit serialized the CommissionableDeviceIdentifiers
object via JSON.stringify, which throws "Do not know how to serialize
a BigInt" when the identifier carries a BigInt-typed field, breaking
every commissionable discovery on the matterjs CHIP test variant.

Replace with the field-key list (sufficient to know what was queried)
and replace the legacy handler's raw result dump with a slim
{id, D, CM} projection so neither path can hit a BigInt at log time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CHIP Discovery test on the matterjs controller reproducibly returned an
empty result on the second discover-commissionables call within a step
(step 8 "Check Hostname" failed with hostName: NoneType). Diagnostic
logs added in the previous commit showed the scanner cache held all 13
entries but every entry was rejected by the addresses-length filter on
the second call:

  1st call: matched: 13, rejectedNoAddr: 0, resultSize: 13
  2nd call: matched:  0, rejectedNoAddr: 13, resultSize:  0

Cache itself was intact (DnssdName.isDiscovered still true) but each
entry's IpService.addresses had been pruned because the A/AAAA TTL on
matter commissionable broadcasts is short (~4s) and elapses between
unsolicited rebroadcasts. #cacheDevice solicited and armed an
onAddresses observer when a device was first cached without addresses,
but the observer deregistered itself after the first delivery, so a
later expiry could not re-trigger it, and #startDiscovery solicits PTR
records only.

Extract the "solicit SRV-target A/AAAA + arm onAddresses observer"
logic from #cacheDevice into a new idempotent helper
#solicitAndArmAddresses, and call it from the cache iteration loop
whenever a cached entry matches the identifier but has zero addresses.
The current discovery's waiter is already registered, so when A/AAAA
records resolve before the discovery timeout fires, #deliverDeviceIfResolved
notifies the waiter and the device flows through the callback chain.

Idempotent: skips re-arming if the onAddresses observer is already
attached, so concurrent overlapping discoveries do not double-register.

Diagnostic logs from the previous commit are kept so the next CI run
verifies the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two consecutive matterjs CHIP runs on this branch passed cleanly with
the A/AAAA re-solicit fix in place.  Drop the diagnostic logs added
across CommissionableMdnsScanner, Discovery, and LegacyControllerCommandHandler
and revert the temporary pull_request gate in build-test.chip.yml so
the branch contains only the surgical fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Apollon77 Apollon77 changed the title chore(chip-testing): add discovery diagnostic logs (matterjs CI debug) fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery May 6, 2026
@Apollon77 Apollon77 closed this May 6, 2026
@Apollon77 Apollon77 deleted the debug/legacy-controller-discovery-logging branch May 6, 2026 11:00
@Apollon77 Apollon77 requested a review from Copilot May 6, 2026 11:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment on lines +201 to 206
} else {
// Cached entry's A/AAAA records expired since the responder's last unsolicited broadcast
// (matter commissionable A/AAAA TTL is short). Solicit fresh records and arm onAddresses
// so this discovery's waiter is notified if addresses resolve before the timeout fires.
this.#solicitAndArmAddresses(cached);
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants