Problem
For a custom domain (an SNI that isn't a <app-id>.<base-domain> subdomain), the gateway looks up the target app via DNS on every connection, before the TLS handshake completes — so the delay shows up as handshake latency. The lookup is slow because it:
- Builds a new DNS resolver every call (AsyncResolver::tokio_from_system_conf()), so nothing is cached between connections and the record TTL is never used.
- Runs the primary and legacy TXT lookups with tokio::join! and waits for both, putting a slow/negative legacy lookup on the critical path.
This is worst when the gateway's resolver is slow — e.g. a CVM on QEMU user-mode (SLIRP) networking, where DNS is forwarded and uncached. Subdomain-routed apps skip the lookup and are unaffected.
Code
gateway/src/proxy/tls_passthough.rs, resolve_app_address() (called per connection from proxy_with_sni(), before tls_accept()):
let resolver = hickory_resolver::AsyncResolver::tokio_from_system_conf()?; // (1) new resolver every call
// ...
let (lookup, lookup_legacy) = tokio::join!( // (2) waits for BOTH; legacy is usually NXDOMAIN
    resolver.txt_lookup(txt_domain),
    resolver.txt_lookup(txt_domain_legacy),
);
Evidence
TLS-handshake time against one gateway, over loopback (no internet RTT), 18 samples each:
| SNI | pre-handshake work | median | max |
| --- | --- | --- | --- |
| custom domain | DNS lookup + handshake | 820 ms | 3373 ms |
| <app-id>.<base> subdomain | no DNS | 10 ms | 16 ms |
The ~810 ms gap is entirely the DNS step. Adding the missing legacy TXT record (so the second lookup isn't NXDOMAIN) dropped the median to ~499 ms — confirming the join!-on-both cost, but most of the delay is the per-connection uncached resolver.
Suggested fix
- Build the resolver once and reuse it (hickory caches by TTL).
- Cache resolved app-addresses by record TTL so steady-state connections skip DNS.
- Make the legacy lookup a fallback (only on primary miss), not a join! that always waits.
(1)+(2) should bring custom-domain connections down to the subdomain baseline.
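
A minimal sketch of the caching and fallback ideas above, using only the standard library. It is synchronous for brevity, whereas the real lookups are async calls on the hickory resolver. Every name in it (AddressCache, lookup_with_fallback, resolve_cached) is hypothetical and not taken from the gateway code, and the lookups are passed in as closures returning an address plus the record TTL, which is an assumption about how the TXT result would be reduced.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache of resolved app addresses, keyed by SNI.
/// Each entry expires when the TTL of the record it came from runs out.
pub struct AddressCache {
    entries: HashMap<String, (String, Instant)>,
}

impl AddressCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached address for `sni` if it has not expired at `now`.
    pub fn get(&self, sni: &str, now: Instant) -> Option<&str> {
        match self.entries.get(sni) {
            Some((addr, expires_at)) if now < *expires_at => Some(addr.as_str()),
            _ => None,
        }
    }

    /// Stores `addr` for `sni`, valid until `now + ttl`.
    pub fn insert(&mut self, sni: &str, addr: String, ttl: Duration, now: Instant) {
        self.entries.insert(sni.to_string(), (addr, now + ttl));
    }
}

/// Legacy lookup as a fallback: `legacy` is only invoked when `primary` misses,
/// so a slow or negative legacy answer never delays a primary hit.
pub fn lookup_with_fallback<P, L>(primary: P, legacy: L) -> Option<(String, Duration)>
where
    P: FnOnce() -> Option<(String, Duration)>,
    L: FnOnce() -> Option<(String, Duration)>,
{
    primary().or_else(legacy)
}

/// Resolves `sni` from the cache when possible; on a miss, runs the lookups
/// and caches the result for the TTL they report.
pub fn resolve_cached<P, L>(
    cache: &mut AddressCache,
    sni: &str,
    now: Instant,
    primary: P,
    legacy: L,
) -> Option<String>
where
    P: FnOnce() -> Option<(String, Duration)>,
    L: FnOnce() -> Option<(String, Duration)>,
{
    if let Some(addr) = cache.get(sni, now) {
        return Some(addr.to_string());
    }
    let (addr, ttl) = lookup_with_fallback(primary, legacy)?;
    cache.insert(sni, addr.clone(), ttl, now);
    Some(addr)
}
```

In the gateway the cache would need to be shared across connections (for example behind a lock), and negative results are not cached in this sketch, so an SNI with no TXT record would still trigger a lookup on every connection.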