Skip to content

fix(access-control-service): include port in computing unit pod URI and use Envoy Gateway for distributed CUs#5629

Open
aicam wants to merge 4 commits into
apache:mainfrom
aicam:fix/cu-pod-uri-missing-port
Open

fix(access-control-service): include port in computing unit pod URI and use Envoy Gateway for distributed CUs#5629
aicam wants to merge 4 commits into
apache:mainfrom
aicam:fix/cu-pod-uri-missing-port

Conversation

@aicam

@aicam aicam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Make the in-cluster address of a computing unit come from a single source of truth — the URI recorded when its pod is created — and ensure that URI is complete (includes the port). This lets the gateway route a user to a computing unit located anywhere it can reach (in the local cluster, another cluster, or an external host), instead of being limited to a reconstructed in-cluster address. See #5630.

Two related changes:

1. Include the port in the generated pod URI (computing-unit-managing-service)

KubernetesClient.generatePodURI builds the address stored as the computing unit's uri (via setUri in ComputingUnitManagingResource) and returned to clients as nodeAddresses. The pod's container listens on KubernetesConfig.computeUnitPortNumber (declared with withContainerPort(...) in the same file), but the generated URI omitted the port, so the persisted address was not directly connectable. The port is now appended:

s"...svc.cluster.local:${KubernetesConfig.computeUnitPortNumber}"

2. Route using the recorded URI (access-control-service)

AccessControlResource rebuilt the computing unit's address from KubernetesConfig on every authorization request, duplicating the construction logic in generatePodURI and pinning every CU to the local cluster. It now reads the URI recorded for the unit and returns it as the Host for the gateway to route to. If no URI has been recorded, the unit is not routable and the request is refused with 403 (no in-cluster fallback, per review).

Routing flow

The access-control service is the gateway's external authorizer; the Host it returns is the upstream Envoy forwards the (upgraded) connection to. Because that host comes from the unit's recorded URI, the same gateway can reach computing units in different locations:

flowchart LR
    FE["Frontend<br/>(/wsapi?cuid=N)"] --> GW["Envoy Gateway"]
    GW -. "ext-auth: authorize + get Host" .-> ACS["access-control-service"]
    ACS -- "read recorded uri for CU N" --> DB[("workflow_computing_unit")]
    ACS -- "Host = recorded uri<br/>(or 403 if none)" --> GW
    GW == "dynamic forward proxy<br/>to returned Host" ==> R{Where the CU lives}
    R --> CU1["In-cluster CU pod<br/>computing-unit-N...svc.cluster.local:port"]
    R --> CU2["CU in another cluster"]
    R --> CU3["External / remote CU host:port"]
Loading

Any related issues, documentation, discussions?

How was this PR tested?

On live deployment.
Screenshot from 2026-06-13 13-31-00

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

`KubernetesClient.generatePodURI` builds the in-cluster address that is
stored as the computing unit's `uri` (via `setUri` in
`ComputingUnitManagingResource` and returned to clients as
`nodeAddresses`). The pod's container listens on
`KubernetesConfig.computeUnitPortNumber` (set via `withContainerPort`),
but the generated URI omitted the port, so the stored address pointed at
the default port instead of the one the computing unit actually serves
on. Append the configured port so the persisted URI is directly
connectable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added fix platform Non-amber Scala service paths labels Jun 11, 2026
@codecov-commenter

codecov-commenter commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.59%. Comparing base (7e9cabf) to head (97ea514).
⚠️ Report is 17 commits behind head on main.

Files with missing lines Patch % Lines
...exera/service/resource/AccessControlResource.scala 55.55% 2 Missing and 2 partials ⚠️
.../apache/texera/service/util/KubernetesClient.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5629      +/-   ##
============================================
- Coverage     52.87%   52.59%   -0.29%     
+ Complexity     2509     2489      -20     
============================================
  Files          1073     1070       -3     
  Lines         41382    41322      -60     
  Branches       4444     4444              
============================================
- Hits          21882    21734     -148     
- Misses        18214    18307      +93     
+ Partials       1286     1281       -5     
Flag Coverage Δ *Carryforward flag
access-control-service 63.50% <55.55%> (-1.12%) ⬇️ Carriedforward from 83d2d70
agent-service 33.34% <ø> (-1.02%) ⬇️ Carriedforward from 83d2d70
amber 53.29% <ø> (-0.31%) ⬇️ Carriedforward from 83d2d70
computing-unit-managing-service 1.65% <0.00%> (ø)
config-service 56.71% <ø> (ø)
file-service 57.06% <ø> (ø)
frontend 47.05% <ø> (-0.25%) ⬇️ Carriedforward from 83d2d70
pyamber 90.72% <ø> (ø) Carriedforward from 83d2d70
python 90.75% <ø> (ø) Carriedforward from 83d2d70
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The access-control service rebuilt the computing unit's in-cluster
address from `KubernetesConfig` on every authorization request, which
duplicates the address-construction logic already in
`KubernetesClient.generatePodURI` and can drift from it (e.g. service
name vs. pool-name conventions). Read the URI persisted for the unit
(written by the managing service when the pod is created) and route to
it directly, so the routing target comes from a single source of truth.
Fall back to the previously constructed in-cluster address when no URI
has been recorded for the unit yet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aicam aicam changed the title fix: include port in computing unit pod URI fix: route computing units to their recorded pod URI (incl. port) Jun 11, 2026
@aicam aicam requested a review from bobbai00 June 11, 2026 21:21
@aicam aicam changed the title fix: route computing units to their recorded pod URI (incl. port) fix(access-control-service): include port in computing unit pod URI and use Envoy Gateway for distributed CUs Jun 11, 2026

@bobbai00 bobbai00 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Please include a diagram of frontend able to connect to CUs at different places through the envoy gateway

val workflowComputingUnitPoolName = KubernetesConfig.computeUnitPoolName
val workflowComputingUnitPoolNamespace = KubernetesConfig.computeUnitPoolNamespace
val workflowComputingUnitPoolPort = KubernetesConfig.computeUnitPortNumber
s"computing-unit-$cuidInt.$workflowComputingUnitPoolName-svc.$workflowComputingUnitPoolNamespace.svc.cluster.local:$workflowComputingUnitPoolPort"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this fallback logic. If no URI is presented, refuse the connection to CU.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 97ea514 — removed the fallback. When no URI is recorded for the computing unit the request is now refused with 403 FORBIDDEN (logged as Refusing CU <cuid>: no URI recorded) instead of reconstructing an in-cluster address. Also dropped the now-unused KubernetesConfig import.

Per review, drop the in-cluster address fallback in the access-control
service. A computing unit is routed only to the URI recorded for it; if
no URI has been recorded the unit is not routable, so the authorization
request is refused (403) instead of falling back to a reconstructed
in-cluster address. Also drops the now-unused KubernetesConfig import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aicam aicam requested a review from bobbai00 June 13, 2026 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix platform Non-amber Scala service paths

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(access-control-service): route computing units by their recorded URI to enable out-of-cluster CU distribution

3 participants