gateway: custom-domain app lookup does an uncached DNS query on every connection, adding ~0.8–3s to the TLS handshake #736

@h4x3rotab

Problem

For a custom domain (an SNI that isn't a <app-id>.<base-domain> subdomain), the gateway looks up the target app via DNS on every connection, before the TLS handshake completes — so the delay shows up as handshake latency. The lookup is slow because it:

  1. Builds a new DNS resolver every call (AsyncResolver::tokio_from_system_conf()), so nothing is cached between connections and the record TTL is never used.
  2. Runs the primary and legacy TXT lookups with tokio::join! and waits for both, putting a slow/negative legacy lookup on the critical path.

This is worst when the gateway's resolver is slow — e.g. a CVM on QEMU user-mode (SLIRP) networking, where DNS is forwarded and uncached. Subdomain-routed apps skip the lookup and are unaffected.
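
The split between the two routing paths can be sketched as follows. This is a minimal illustration, not the gateway's actual parsing: `Route`, `classify_sni`, and the example domains are hypothetical, and it assumes the app label is a single DNS label directly under the base domain and that names are already lowercased.

```rust
/// How a connection would be routed, based only on its SNI.
#[derive(Debug, PartialEq)]
enum Route<'a> {
    /// `<app-id>.<base-domain>`: the app is read from the name, no DNS needed.
    Subdomain(&'a str),
    /// Anything else: the app has to be found through a TXT lookup.
    CustomDomain,
}

/// Hypothetical helper that separates the two cases described above.
fn classify_sni<'a>(sni: &'a str, base_domain: &str) -> Route<'a> {
    // Strip the base domain, then the dot that separates it from the label.
    let label = sni
        .strip_suffix(base_domain)
        .and_then(|rest| rest.strip_suffix('.'));
    match label {
        // Assumption: the app label is one non-empty DNS label.
        Some(l) if !l.is_empty() && !l.contains('.') => Route::Subdomain(l),
        _ => Route::CustomDomain,
    }
}

fn main() {
    let base = "apps.example.com";
    // Fast path: no DNS lookup before the handshake.
    println!("{:?}", classify_sni("abc123.apps.example.com", base));
    // Slow path: this is the case the issue is about.
    println!("{:?}", classify_sni("www.customer.org", base));
}
```

Only the `CustomDomain` branch reaches the per-connection DNS lookup, which is why subdomain-routed apps do not see the delay.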

Code

gateway/src/proxy/tls_passthough.rs, resolve_app_address() (called per connection from proxy_with_sni(), before tls_accept()):

let resolver = hickory_resolver::AsyncResolver::tokio_from_system_conf()?;  // (1) new resolver every call
// ...
let (lookup, lookup_legacy) = tokio::join!(   // (2) waits for BOTH; legacy is usually NXDOMAIN
    resolver.txt_lookup(txt_domain),
    resolver.txt_lookup(txt_domain_legacy),
);
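
A toy latency model may help show why (2) matters. It is a sketch using only the standard library, with hypothetical function names and illustrative durations that are not measurements from this issue: waiting on both lookups costs the slower of the two, while a fallback costs only the primary on a hit.

```rust
use std::time::Duration;

/// Latency when both lookups are awaited together, as with `tokio::join!`:
/// the slower lookup decides when the caller can continue.
fn join_latency(primary: Duration, legacy: Duration) -> Duration {
    primary.max(legacy)
}

/// Latency when the legacy lookup is only a fallback:
/// on a primary hit the legacy query is never issued.
fn fallback_latency(primary: Duration, legacy: Duration, primary_hit: bool) -> Duration {
    if primary_hit {
        primary
    } else {
        // Sequential: the legacy query starts after the primary has missed.
        primary + legacy
    }
}

fn main() {
    // Illustrative numbers only.
    let primary = Duration::from_millis(300);
    let legacy = Duration::from_millis(800); // e.g. a slow negative answer

    println!("join, primary hit:     {:?}", join_latency(primary, legacy));
    println!("fallback, primary hit: {:?}", fallback_latency(primary, legacy, true));
    println!("fallback, primary miss:{:?}", fallback_latency(primary, legacy, false));
}
```

The trade-off is that a sequential fallback is slower than `join!` when the primary record is missing. A variant that starts both queries but returns as soon as the primary succeeds would avoid that, at the cost of more complex code.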

Evidence

TLS-handshake time against one gateway, over loopback (no internet RTT), 18 samples each:

| SNI | Pre-handshake work | Median | Max |
|---|---|---|---|
| custom domain | DNS lookup + handshake | 820 ms | 3373 ms |
| <app-id>.<base> subdomain | no DNS | 10 ms | 16 ms |

The ~810 ms median gap (820 ms vs. 10 ms) is entirely the DNS step. Adding the missing legacy TXT record (so the second lookup isn't NXDOMAIN) dropped the median to ~499 ms, a saving of ~320 ms that confirms the join!-on-both cost. The remaining ~490 ms over the subdomain baseline is the per-connection uncached resolver, which is the larger share of the delay.

Suggested fix

  1. Build the resolver once and reuse it (hickory caches by TTL).
  2. Cache resolved app-addresses by record TTL so steady-state connections skip DNS.
  3. Make the legacy lookup a fallback (only on primary miss), not a join! that always waits.

(1)+(2) should bring steady-state custom-domain connections down to the subdomain baseline; the first connection after a cache miss or TTL expiry still pays for one DNS lookup.
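
Fixes (2) and (3) could be combined roughly as below. This is a standard-library sketch under stated assumptions, not a patch for the gateway: `AppAddressCache`, `Entry`, and `resolve` are hypothetical names, the two closures stand in for the primary and legacy TXT lookups, and reading the TTL from a real DNS response is left out. A shared gateway would also need to put the cache behind a lock.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A resolved app address plus the instant its DNS record expires.
struct Entry {
    app_address: String,
    expires_at: Instant,
}

/// Hypothetical per-SNI cache of resolved app addresses.
#[derive(Default)]
struct AppAddressCache {
    entries: HashMap<String, Entry>,
}

impl AppAddressCache {
    /// Returns the cached address if it has not expired at `now`.
    fn get(&self, sni: &str, now: Instant) -> Option<String> {
        self.entries
            .get(sni)
            .filter(|e| now < e.expires_at)
            .map(|e| e.app_address.clone())
    }

    /// Stores an address for `ttl`, replacing any older entry.
    fn insert(&mut self, sni: &str, app_address: String, ttl: Duration, now: Instant) {
        let expires_at = now + ttl;
        self.entries.insert(sni.to_owned(), Entry { app_address, expires_at });
    }
}

/// Resolves `sni`, consulting the cache first. `primary` and `legacy` stand in
/// for the two TXT lookups and return `(app_address, ttl)` on a hit.
fn resolve<P, L>(
    cache: &mut AppAddressCache,
    sni: &str,
    now: Instant,
    mut primary: P,
    mut legacy: L,
) -> Option<String>
where
    P: FnMut(&str) -> Option<(String, Duration)>,
    L: FnMut(&str) -> Option<(String, Duration)>,
{
    if let Some(addr) = cache.get(sni, now) {
        return Some(addr); // steady state: no DNS at all
    }
    // Legacy is a fallback: it runs only if the primary lookup misses.
    let (addr, ttl) = primary(sni).or_else(|| legacy(sni))?;
    cache.insert(sni, addr.clone(), ttl, now);
    Some(addr)
}

fn main() {
    let mut cache = AppAddressCache::default();
    let mut dns_queries = 0;
    let t0 = Instant::now();
    // Simulate three connections at t+0s, t+30s, and t+61s with a 60 s TTL.
    for offset_secs in [0u64, 30, 61] {
        let now = t0 + Duration::from_secs(offset_secs);
        let addr = resolve(
            &mut cache,
            "www.customer.org",
            now,
            |_: &str| {
                dns_queries += 1;
                Some(("app-1:443".to_string(), Duration::from_secs(60)))
            },
            |_: &str| None,
        );
        println!("t+{}s -> {:?}, DNS queries so far: {}", offset_secs, addr, dns_queries);
    }
}
```

In this sketch only the connections that find no live cache entry issue a DNS query, and the legacy closure is never called while the primary record exists.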
