Skip to content

Commit 638f289

Browse files
Apollon77claude
andauthored
fix(protocol): re-solicit A/AAAA for stale cache entries on commissionable discovery (#3695)
* chore(chip-testing): add discovery diagnostic logs and run matterjs on PRs Intermittent CI failure on the matterjs controller variant of the CHIP Discovery test: step 7 returns 12 commissionables, step 8 (~3s later) returns empty. Insufficient instrumentation to determine whether the mDNS scanner cache was emptied between calls, addresses got pruned, or the Discovery callback chain dropped them. Add INFO logs at the relevant chokepoints: - CommissionableMdnsScanner: log cache size on entry, per-entry match decision (matched / rejected: no addresses / rejected: filter), waiter notifications for new arrivals, cache add and cache delete events, and final result size. - Discovery: log whether the scanner callback reused an existing ClientNode or created a new one (and whether find() hit an already-commissioned node). - LegacyControllerCommandHandler.handleDiscovery: log incoming findBy, raw result count, picked entry, and empty-return. Also temporarily extend the matterjs controller matrix gate in build-test.chip.yml to run on pull_request so this branch can capture fresh logs from CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(protocol): avoid BigInt-unsafe JSON.stringify in discovery diagnostic logs Previous logging commit serialized the CommissionableDeviceIdentifiers object via JSON.stringify, which throws "Do not know how to serialize a BigInt" when the identifier carries a BigInt-typed field, breaking every commissionable discovery on the matterjs CHIP test variant. Replace with the field-key list (sufficient to know what was queried) and replace the legacy handler's raw result dump with a slim {id, D, CM} projection so neither path can hit a BigInt at log time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(protocol): re-solicit A/AAAA for stale cache entries on discovery CHIP Discovery test on the matterjs controller reproducibly returned an empty result on the second discover-commissionables call within a step (step 8 "Check Hostname" failed with hostName: NoneType). Diagnostic logs added in the previous commit showed the scanner cache held all 13 entries but every entry was rejected by the addresses-length filter on the second call: 1st call: matched: 13, rejectedNoAddr: 0, resultSize: 13 2nd call: matched: 0, rejectedNoAddr: 13, resultSize: 0 Cache itself was intact (DnssdName.isDiscovered still true) but each entry's IpService.addresses had been pruned because the A/AAAA TTL on matter commissionable broadcasts is short (~4s) and elapses between unsolicited rebroadcasts. #cacheDevice solicited and armed an onAddresses observer when a device was first cached without addresses, but the observer deregistered itself after the first delivery, so a later expiry could not re-trigger it, and #startDiscovery solicits PTR records only. Extract the "solicit SRV-target A/AAAA + arm onAddresses observer" logic from #cacheDevice into a new idempotent helper #solicitAndArmAddresses, and call it from the cache iteration loop whenever a cached entry matches the identifier but has zero addresses. The current discovery's waiter is already registered, so when A/AAAA records resolve before the discovery timeout fires, #deliverDeviceIfResolved notifies the waiter and the device flows through the callback chain. Idempotent: skips re-arming if the onAddresses observer is already attached, so concurrent overlapping discoveries do not double-register. Diagnostic logs from the previous commit are kept so the next CI run verifies the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: remove diagnostic logs and PR-trigger gate after fix verified Two consecutive matterjs CHIP runs on this branch passed cleanly with the A/AAAA re-solicit fix in place. Drop the diagnostic logs added across CommissionableMdnsScanner, Discovery, and LegacyControllerCommandHandler and revert the temporary pull_request gate in build-test.chip.yml so the branch contains only the surgical fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(protocol): add regression for stale-cache A/AAAA re-solicit Cover the path fixed in the parent commit: a cached commissionable device whose A record TTL has elapsed while SRV/TXT remain valid is re-delivered on a subsequent findCommissionableDevicesContinuously call once the responder replies to the re-solicited hostname query. Verified: the test fails on the pre-fix code (expected 1 to equal 0 after the second discovery returns no devices) and passes after the #solicitAndArmAddresses helper is reused from the cache iteration loop. Addresses copilot-pull-request-reviewer review comment on PR #3695. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ccd892f commit 638f289

2 files changed

Lines changed: 150 additions & 21 deletions

File tree

packages/protocol/src/mdns/CommissionableMdnsScanner.ts

Lines changed: 44 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -190,11 +190,19 @@ export class CommissionableMdnsScanner implements Scanner {
190190
let callbackInvoked = false;
191191
for (const cached of this.#cache.values()) {
192192
const device = refreshAddresses(cached);
193-
if (matchesIdentifier(device, identifier) && device.addresses.length > 0) {
193+
if (!matchesIdentifier(device, identifier)) {
194+
continue;
195+
}
196+
if (device.addresses.length > 0) {
194197
seen.add(device.deviceIdentifier);
195198
result.push(device);
196199
callbackInvoked = true;
197200
callback(device);
201+
} else {
202+
// Cached entry's A/AAAA records expired since the responder's last unsolicited broadcast
203+
// (matter commissionable A/AAAA TTL is short). Solicit fresh records and arm onAddresses
204+
// so this discovery's waiter is notified if addresses resolve before the timeout fires.
205+
this.#solicitAndArmAddresses(cached);
198206
}
199207
}
200208

@@ -383,29 +391,44 @@ export class CommissionableMdnsScanner implements Scanner {
383391
// A/AAAA records may arrive after the initial SRV/TXT discovery;
384392
// defer notification until addresses become available.
385393
if (!this.#deliverDeviceIfResolved(cached)) {
386-
// SRV target hostname may have lost its A/AAAA records (TTL expired) while the instance
387-
// SRV/TXT was still valid. Solicit address records for all SRV target hostnames so we
388-
// don't wait for the next unsolicited broadcast to deliver the device.
389-
for (const record of name.records) {
390-
if (record.recordType !== DnsRecordType.SRV) {
391-
continue;
392-
}
393-
const hostname = this.#names.get(record.value.target);
394-
this.#names.solicitor.solicit({
395-
name: hostname,
396-
recordTypes: [DnsRecordType.A, DnsRecordType.AAAA],
397-
});
394+
this.#solicitAndArmAddresses(cached);
395+
}
396+
}
397+
398+
/**
399+
* Solicit A/AAAA records for the cached device's SRV target hostnames and arm an onAddresses observer
400+
* that delivers via {@link #deliverDeviceIfResolved} once addresses resolve. Used both when a device is
401+
* first cached without addresses AND when an active discovery encounters a cached entry whose A/AAAA
402+
* records have since expired (matter commissionable A/AAAA TTL is short and may lapse between the device's
403+
* unsolicited broadcasts). Idempotent: skips re-arming if onAddresses observer already attached.
404+
*/
405+
#solicitAndArmAddresses(cached: CachedDevice) {
406+
// SRV target hostname may have lost its A/AAAA records (TTL expired) while the instance
407+
// SRV/TXT was still valid. Solicit address records for all SRV target hostnames so we
408+
// don't wait for the next unsolicited broadcast to deliver the device.
409+
for (const record of cached.name.records) {
410+
if (record.recordType !== DnsRecordType.SRV) {
411+
continue;
398412
}
413+
const hostname = this.#names.get(record.value.target);
414+
this.#names.solicitor.solicit({
415+
name: hostname,
416+
recordTypes: [DnsRecordType.A, DnsRecordType.AAAA],
417+
});
418+
}
399419

400-
const onAddresses = () => {
401-
if (this.#deliverDeviceIfResolved(cached)) {
402-
this.#observers.off(ipService.changed, onAddresses);
403-
cached.onAddresses = undefined;
404-
}
405-
};
406-
cached.onAddresses = onAddresses;
407-
this.#observers.on(ipService.changed, onAddresses);
420+
if (cached.onAddresses !== undefined) {
421+
return;
408422
}
423+
424+
const onAddresses = () => {
425+
if (this.#deliverDeviceIfResolved(cached)) {
426+
this.#observers.off(cached.ipService.changed, onAddresses);
427+
cached.onAddresses = undefined;
428+
}
429+
};
430+
cached.onAddresses = onAddresses;
431+
this.#observers.on(cached.ipService.changed, onAddresses);
409432
}
410433

411434
#deliverDeviceIfResolved(cached: CachedDevice): boolean {

packages/protocol/test/mdns/CommissionableMdnsScannerTest.ts

Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -823,4 +823,110 @@ describe("CommissionableMdnsScanner", () => {
823823
await clientSocket.close();
824824
}
825825
});
826+
827+
it("re-solicits A/AAAA when cached entry's addresses expired before next discovery", async () => {
828+
// Regression: the matter commissionable A/AAAA TTL is short and may lapse between
829+
// a responder's unsolicited rebroadcasts. When SRV/TXT remain valid the cache entry
830+
// persists but its addresses go empty, and a follow-up findCommissionableDevicesContinuously
831+
// call must re-solicit A/AAAA so the entry is delivered once addresses come back.
832+
const simulator = new NetworkSimulator();
833+
const serverNetwork = new MockNetwork(simulator, SERVER_MAC, [SERVER_IPv4, SERVER_IPv6]);
834+
const clientNetwork = new MockNetwork(simulator, CLIENT_MAC, [CLIENT_IPv4, CLIENT_IPv6]);
835+
836+
const serverSocket = await MdnsSocket.create(serverNetwork);
837+
const clientSocket = await MdnsSocket.create(clientNetwork);
838+
const clientNames = new DnssdNames({
839+
socket: clientSocket,
840+
entropy: MockCrypto(0x0c),
841+
minTtl: Millis(0),
842+
});
843+
const scanner = new CommissionableMdnsScanner(clientNames);
844+
845+
try {
846+
const instanceQname = `${INSTANCE_ID}._matterc._udp.local`;
847+
const longTtl = Seconds(120);
848+
const shortAddrTtl = Seconds(2);
849+
850+
// Initial advertisement: SRV/TXT long-lived, A short-lived to model the real-world
851+
// mismatch where instance records survive past hostname A/AAAA TTL.
852+
await serverSocket.send({
853+
messageType: DnsMessageType.Response,
854+
answers: [
855+
{
856+
name: instanceQname,
857+
recordType: DnsRecordType.TXT,
858+
recordClass: DnsRecordClass.IN,
859+
ttl: longTtl,
860+
value: [`D=3840`, `CM=1`, `VP=4996+22`],
861+
},
862+
{
863+
name: instanceQname,
864+
recordType: DnsRecordType.SRV,
865+
recordClass: DnsRecordClass.IN,
866+
ttl: longTtl,
867+
value: { priority: 0, weight: 0, port: PORT, target: HOSTNAME },
868+
},
869+
{
870+
name: HOSTNAME,
871+
recordType: DnsRecordType.A,
872+
recordClass: DnsRecordClass.IN,
873+
ttl: shortAddrTtl,
874+
value: SERVER_IPv4,
875+
},
876+
],
877+
additionalRecords: [],
878+
});
879+
await MockTime.advance(10);
880+
881+
// Device cached with resolved addresses
882+
expect(scanner.getDiscoveredCommissionableDevices({ longDiscriminator: 3840 }).length).equals(1);
883+
884+
// Advance past A TTL — A pruned, SRV/TXT survive, cache entry stays but addresses empty
885+
await MockTime.advance(Seconds(3));
886+
await MockTime.advance(10);
887+
expect(scanner.getDiscoveredCommissionableDevices({ longDiscriminator: 3840 }).length).equals(0);
888+
889+
// Responder replies with fresh A when client re-solicits the hostname
890+
serverSocket.receipt.on(async message => {
891+
if (message.queries.find(q => q.name === HOSTNAME && q.recordType === DnsRecordType.A)) {
892+
await serverSocket.send({
893+
messageType: DnsMessageType.Response,
894+
answers: [
895+
{
896+
name: HOSTNAME,
897+
recordType: DnsRecordType.A,
898+
recordClass: DnsRecordClass.IN,
899+
ttl: longTtl,
900+
value: SERVER_IPv4,
901+
},
902+
],
903+
additionalRecords: [],
904+
});
905+
}
906+
});
907+
908+
// Second discovery: cache iteration encounters the zero-address entry, fires
909+
// #solicitAndArmAddresses, the server replies, onAddresses observer delivers the device.
910+
const found: CommissionableDevice[] = [];
911+
const identifier = { longDiscriminator: 3840 };
912+
const discoveryPromise = scanner.findCommissionableDevicesContinuously(
913+
identifier,
914+
device => found.push(device),
915+
Seconds(10),
916+
);
917+
await MockTime.resolve(Time.sleep("wait for re-solicit", Seconds(5)));
918+
919+
expect(found.length).equals(1);
920+
expect(found[0].deviceIdentifier).equals(INSTANCE_ID);
921+
expect(found[0].addresses.length).greaterThan(0);
922+
923+
scanner.cancelCommissionableDeviceDiscovery(identifier);
924+
await MockTime.resolve(discoveryPromise);
925+
} finally {
926+
await scanner.close();
927+
await clientNames.close();
928+
await serverSocket.close();
929+
await clientSocket.close();
930+
}
931+
});
826932
});

0 commit comments

Comments
 (0)