
fix(sandbox): restore GPU procfs baseline #1522

Merged
elezar merged 1 commit into main from fix/1486-gpu-sandbox-filesystem-policy/elezar
Jun 3, 2026


@elezar elezar commented May 22, 2026

Member

Summary

Restore CUDA GPU startup compatibility by promoting /proc from
filesystem_policy.read_only to filesystem_policy.read_write when /proc
is part of the active GPU runtime baseline.

This keeps the change intentionally narrow. The existing baseline enrichment
already places /proc in the GPU read-write baseline because CUDA writes
/proc/<pid>/task/<tid>/comm during initialization. The missing behavior was
that an existing read-only /proc entry caused enrichment to skip the
read-write baseline path. This PR restores that promotion and emits an
informational log message when it happens.

Broader handling for user-supplied policy conflicts and explicit baseline
conflict controls is left to follow-up work such as #1629.

Related Issue

Fixes #1486

Related follow-up: #1629

Changes

  • Promote /proc from read_only to read_write when the GPU read-write
    baseline requires it.
  • Preserve existing behavior for other read-only/read-write baseline conflicts.
  • Emit an informational log when /proc is promoted for GPU runtime
    compatibility.
  • Add a regression test covering GPU baseline enrichment without network
    policy.
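
The promotion described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FilesystemPolicy`, `enrich_gpu_baseline`, and the caller-supplied baseline list are hypothetical names, and only the `/proc` special case and the "skip on other conflicts" behavior are taken from the description.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the sandbox filesystem policy; the real
/// struct in openshell-sandbox may be shaped differently.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FilesystemPolicy {
    pub read_only: Vec<PathBuf>,
    pub read_write: Vec<PathBuf>,
}

/// Applies a GPU read-write baseline to `policy` and returns the paths
/// that were promoted from read_only to read_write, so the caller can
/// emit an informational log for each of them.
pub fn enrich_gpu_baseline(policy: &mut FilesystemPolicy, baseline_rw: &[&str]) -> Vec<PathBuf> {
    let mut promoted = Vec::new();
    for entry in baseline_rw {
        let path = Path::new(entry);
        if policy.read_write.iter().any(|p| p.as_path() == path) {
            continue; // already writable, nothing to do
        }
        let listed_read_only = policy.read_only.iter().any(|p| p.as_path() == path);
        if listed_read_only {
            if path != Path::new("/proc") {
                // Any other read-only/read-write conflict keeps the
                // previous behavior: the baseline entry is skipped.
                continue;
            }
            // /proc is the one path that gets promoted.
            policy.read_only.retain(|p| p.as_path() != path);
            promoted.push(path.to_path_buf());
        }
        policy.read_write.push(path.to_path_buf());
    }
    promoted
}
```

Returning the promoted paths instead of logging inside the function keeps the sketch free of a logging dependency; a caller would log once per returned path.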

Testing

  • mise exec -- cargo fmt --all
  • mise exec -- cargo test -p openshell-sandbox --lib baseline_tests -- --nocapture
  • mise run pre-commit completed Helm lint, Rust format, Rust check, Rust clippy, markdown lint, and license checks; python:proto failed in the parallel run because grpc_tools was missing after .venv recreation.
  • mise run python:proto

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture/docs updated (not applicable for this minimal runtime fix)

@elezar elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 22, 2026 13:47

@elezar elezar changed the base branch from main to fix/1486-gpu-enrichment-no-network/elezar May 22, 2026 14:06
Base automatically changed from fix/1486-gpu-enrichment-no-network/elezar to main May 27, 2026 08:20
@elezar elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from 96a1caa to 59e399a Compare May 27, 2026 09:02
@elezar elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from 12bde4d to d73e6de Compare May 28, 2026 19:22

@pimlock pimlock left a comment

Collaborator


LGTM with a few nits and questions.

Comment thread architecture/security-policy.md Outdated
Comment thread crates/openshell-sandbox/src/sandbox/linux/landlock.rs Outdated
Comment thread crates/openshell-sandbox/src/lib.rs Outdated
Comment thread crates/openshell-sandbox/src/lib.rs Outdated
Comment thread crates/openshell-sandbox/src/lib.rs Outdated
Comment thread crates/openshell-sandbox/src/lib.rs
Comment thread crates/openshell-sandbox/src/sandbox/linux/landlock.rs Outdated
@elezar elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch 2 times, most recently from 2f3b5b2 to a0171ff Compare June 1, 2026 18:29
@elezar elezar changed the title fix(sandbox): restore GPU filesystem baseline fix(sandbox): restore GPU procfs baseline Jun 1, 2026

elezar commented Jun 1, 2026

Member Author

Thanks for your initial review @pimlock. After the initial back and forth, I realised that there were a number of edge cases that I was not considering. I believe I was trying to detect user intent with insufficient signal and as such have updated this PR to ALWAYS promote /proc to read-write if GPUs are requested and instead capture explicit intent in #1629 as a follow-up. This PR would unblock the GPU-enabled tests, but I'm happy to continue iterating on it if required.

@elezar elezar requested a review from pimlock June 1, 2026 19:29
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from a0171ff to c828f23 Compare June 2, 2026 08:24

pimlock commented Jun 2, 2026

Collaborator

> Thanks for your initial review @pimlock. After the initial back and forth, I realised that there were a number of edge cases that I was not considering. I believe I was trying to detect user intent with insufficient signal and as such have updated this PR to ALWAYS promote /proc to read-write if GPUs are requested and instead capture explicit intent in #1629 as a follow-up. This PR would unblock the GPU-enabled tests, but I'm happy to continue iterating on it if required.

Thanks! I took a first pass at #1629 and I like the approach. I think it's great for the mechanism to be more explicit and exposing it through the policy makes sense, so the full picture of what's allowed is in the policy.

@elezar elezar merged commit 5102cb9 into main Jun 3, 2026
26 checks passed
@elezar elezar deleted the fix/1486-gpu-sandbox-filesystem-policy/elezar branch June 3, 2026 09:08
rodbutters added a commit to iamaible/OpenShell that referenced this pull request Jun 4, 2026
* fix(ci): eliminate image-tag race between concurrent workflows (#1413)

- Add publish-manifest input to docker-build.yml (default true); single-arch
  branch callers set it false so the merge job is skipped and the shared
  bare :SHA tag in GHCR is never written by branch workflows
- branch-kubernetes-e2e: retag :SHA-amd64 to :SHA before kind load so
  Helm's image.tag matches what is loaded in kind containerd
- branch-e2e: pass image-tag as :SHA-arm64 to e2e-test so the arch-specific
  GHCR tag is used directly without depending on the bare tag
- bare :SHA in GHCR is now written only by test-gpu.yml (multi-arch build),
  eliminating the last-writer-wins race across concurrent workflows

* test(server): cover service endpoint plaintext security (#1352)

* test(server): cover service endpoint plaintext security

* test(server): align tls test with from_files Option<&Path> signature

TlsAcceptor::from_files now accepts the client CA path as Option<&Path>
(per the require_client_auth refactor on main). Wrap the helper's CA
path in Some(...) so the new plaintext-service-http tests compile after
rebasing onto current main.

---------

Co-authored-by: Taylor Mutch <taylormutch@gmail.com>

* fix(cli): add auth and TLS support to completion client (#1489)

* fix(scripts): use portable lowercase in normalize_bool for Bash 3.2 (#1493)

* refactor(server): extract shared relay-await and sandbox-scan helpers (#1495)

* fix(sandbox): skip fork-exec socket ambiguity test on SELinux-enforcing hosts (#1449)

Exec'ing /bin/sleep (SELinux label bin_t) from a user_home_t test binary
causes /proc/<pid>/exe readlink to return ENOENT on SELinux-enforcing
hosts due to the cross-domain boundary. Skip the test at runtime when
getenforce reports Enforcing.

Also adds a ChildGuard drop guard for safe child cleanup on panic and
increases the exec-detection deadline from 2s to 5s.

Signed-off-by: Derek Carr <decarr@redhat.com>
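
A drop guard and runtime skip check of the kind mentioned in this commit might look like the sketch below. The name `ChildGuard` comes from the commit message; its fields, the helper `selinux_enforcing`, and the exact skip logic are assumptions made for illustration.

```rust
use std::process::{Child, Command};

/// Kills and reaps the wrapped child when dropped, which also covers
/// unwinding after a panic in the test body.
pub struct ChildGuard(pub Child);

impl Drop for ChildGuard {
    fn drop(&mut self) {
        // Both calls can fail if the child has already exited; that is
        // fine for cleanup purposes, so the errors are ignored.
        let _ = self.0.kill();
        let _ = self.0.wait();
    }
}

/// Returns true when `getenforce` is available and prints "Enforcing".
/// Hosts without the tool are treated as not enforcing (an assumption).
pub fn selinux_enforcing() -> bool {
    Command::new("getenforce")
        .output()
        .map(|out| String::from_utf8_lossy(&out.stdout).trim() == "Enforcing")
        .unwrap_or(false)
}
```

A test would typically return early when `selinux_enforcing()` is true, and otherwise wrap the spawned process in `ChildGuard` right after `spawn()` so that every exit path cleans it up.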

* fix(sandbox): allow first-label L7 host wildcards (#1304)

* fix(sandbox): allow first-label L7 host wildcards

* docs(sandbox): document L7 host wildcard contract + add OPA runtime tests

- Add Host Wildcards section to architecture/security-policy.md
  describing accepted (first-label *, **, intra-label *-X) and
  rejected (bare, TLD, non-first-label, recursive-in-label) forms,
  and noting that wildcards never cross '.' boundaries.
- Expand the policy-schema.mdx 'host' field description to reflect
  the same contract instead of only mentioning '*.example.com'.
- Add OPA runtime tests asserting '*-aiplatform.googleapis.com'
  matches 'us-central1-aiplatform.googleapis.com' and does not match
  'us-central1.aiplatform.googleapis.com' (cross-dot boundary). Locks
  validator/runtime alignment for intra-label wildcards.
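
The single-`*` part of this contract can be illustrated with a small matcher. This is only a sketch under stated assumptions: the project enforces the rule in its validator and OPA policy, not in this function; `first_label_wildcard_matches` is a hypothetical name; `**` is out of scope here; and treating `*` as "one or more characters" with case-insensitive comparison is an assumption.

```rust
/// Matches `host` against a pattern whose first label may contain one `*`.
/// The `*` stands for one or more characters inside that label only, so
/// it can never absorb a '.' and span two labels.
pub fn first_label_wildcard_matches(pattern: &str, host: &str) -> bool {
    // Split both sides into "first label" and "everything after the dot".
    let (Some((p_first, p_rest)), Some((h_first, h_rest))) =
        (pattern.split_once('.'), host.split_once('.'))
    else {
        return false; // bare, dot-free patterns or hosts never match
    };
    // All labels after the first must be identical.
    if !p_rest.eq_ignore_ascii_case(h_rest) {
        return false;
    }
    match p_first.split_once('*') {
        None => p_first.eq_ignore_ascii_case(h_first),
        Some((prefix, suffix)) => {
            if suffix.contains('*') {
                return false; // more than one `*` in the label is not handled
            }
            let h = h_first.to_ascii_lowercase();
            h.len() > prefix.len() + suffix.len()
                && h.starts_with(&prefix.to_ascii_lowercase())
                && h.ends_with(&suffix.to_ascii_lowercase())
        }
    }
}
```

Comparing the tail labels verbatim is what keeps `us-central1.aiplatform.googleapis.com` from matching `*-aiplatform.googleapis.com`: the extra dot changes the tail, so the wildcard never gets a chance to span it.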

* chore: update mise lockfile

* test(server): tolerate serialized inference upserts

---------

Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* feat(cli): add JSON/YAML output format to gateway list (#1500)

Add -o/--output flag to `openshell gateway list` matching the existing
sandbox list pattern, enabling machine-readable output for scripting.

Signed-off-by: Florent Benoit <fbenoit@redhat.com>

* refactor: deduplicate repeated patterns across crates (#1499)

Remove ~280 lines of duplicated code across 30 files in 5 areas:

- centered_rect: consolidate 5 identical TUI layout helpers into a
  single pub fn in openshell-tui/src/ui/mod.rs
- server test helpers: replace ~100 inline Store::connect() calls
  with local test_store() helpers; deduplicate test_server_state()
  in grpc/service.rs to use the shared test_support version
- rogue PKI: extract 20-line rogue CA+client cert generation block
  (duplicated in two integration tests) into generate_rogue_pki()
  in tests/common/mod.rs
- provider tests: replace 8 identical 28-line test modules with a
  single macro_rules! test_discovers_env_credential! invocation
- label constants: centralize openshell.ai/ container label keys
  in openshell-core::driver_utils; update Docker and Kubernetes
  drivers to import from there instead of redefining them locally

* fix(ci): resolve mirror gate statuses for fork PRs (#1504)

Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>

* fix(server): respect OPENSHELL_PODMAN_SOCKET env var in embedded driver (#1483)

The env var was only wired up via clap in the standalone
openshell-driver-podman binary. When the Podman driver runs embedded
in the gateway, config came exclusively from TOML deserialization and
the env var was never consulted. Apply it as a post-deserialization
override, matching the existing OPENSHELL_K8S_WORKSPACE_DEFAULT_STORAGE_SIZE
pattern.

Closes #1446
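
A post-deserialization override of this kind could be sketched as below. Only the variable name `OPENSHELL_PODMAN_SOCKET` comes from the commit message; `PodmanConfig`, `socket_path`, and both function names are hypothetical, and ignoring an empty value is an assumption.

```rust
/// Hypothetical subset of the embedded Podman driver configuration,
/// as it would look after TOML deserialization.
#[derive(Debug, Clone, PartialEq)]
pub struct PodmanConfig {
    pub socket_path: String,
}

/// Replaces the deserialized socket path when an override value is
/// present and non-empty; otherwise returns the config unchanged.
pub fn apply_socket_override(mut config: PodmanConfig, env_value: Option<String>) -> PodmanConfig {
    if let Some(path) = env_value.filter(|v| !v.is_empty()) {
        config.socket_path = path;
    }
    config
}

/// Reads the override from the process environment and applies it.
pub fn load_with_env(config: PodmanConfig) -> PodmanConfig {
    apply_socket_override(config, std::env::var("OPENSHELL_PODMAN_SOCKET").ok())
}
```

Passing the environment value in as an argument keeps the override logic testable without mutating process-wide state.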

* refactor(sandbox,driver-vm): Start moving to rustix (esp over libc unsafe) (#1505)

In the Rust ecosystem there are largely three ways to do system calls:

- raw libc
- nix
- rustix

Of the three, libc is almost all `unsafe` and really 95% of use
cases should be either nix or rustix. nix is the original one,
but after having looked at the code of both, I think rustix
is just better designed and organized. It's also reached 1.0,
whereas nix is still making semver-breaking changes (in fact
we're behind here in this project).

Now in practice, we have both *transitively* in the depchain
already, and that's true for quite a lot of projects.

But I think rustix is better, so let's add rustix as
a workspace dependency (process feature) and migrate
a few use cases to it - it's especially better than the raw
libc which is surprisingly widespread.

If we agree to do this, then many other calls can be ported.

Signed-off-by: Colin Walters <walters@verbum.org>

* fix(packaging): add upgrade migration docs and podman socket retry (#1507)

After #1415 ships, users upgrading from previous releases need guidance
on the gateway.env deprecation, port/bind/database path changes, and
the podman.socket restart requirement.

- docs(rpm): add 'Migrating from gateway.env' section to TROUBLESHOOTING
  covering backward compatibility, env-to-TOML key mapping, and three
  breaking changes (default port 8080->17670, bind address 0.0.0.0->127.0.0.1,
  database path move). Add podman.socket restart step to upgrade procedure.
- docs(rpm): add upgrade callout to CONFIGURATION.md pointing at migration
  section.
- fix(podman): retry PodmanComputeDriver ping up to 5 times with 2s delay
  to tolerate transient socket unavailability after package upgrades.
  The systemd unit uses Wants=podman.socket (not Requires) so the gateway
  can start while the socket is briefly re-activating after an RPM upgrade
  changes its unit file on disk.
- chore(rpm): update EnvironmentFile comment in RPM spec to explain
  backward-compatibility intent.

Signed-off-by: Adam Miller <admiller@redhat.com>
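
The bounded retry described for the Podman ping (up to 5 attempts, 2s apart) follows a common pattern, sketched here as a synchronous helper. `retry_with_delay` is a hypothetical name and the real driver may well be asynchronous; only the attempt count and delay are from the commit message.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` until it succeeds or `max_attempts` tries have been made,
/// sleeping `delay` between tries. At least one attempt is always made.
/// Returns the first success, or the error from the final attempt.
pub fn retry_with_delay<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

With the values from the commit message, a call would look like `retry_with_delay(5, Duration::from_secs(2), || ping())`, where `ping` stands in for whatever check the driver performs.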

* ci: deduplicate e2e workflows (#1512)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* feat(auth): per-sandbox authentication to gateway (#1404)

* docs(sandboxes): add policy advisor guide (#1480)

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* fix(docker): use host-gateway callbacks on macOS (#1516)

* ci(e2e): load single-arch images into kind (#1518)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* docs(rfc): add sandbox resource requirements proposal (#1360)

* docs(rfc): add sandbox resource requirements proposal

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* docs(rfc): finalize sandbox resource requirements

---------

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* ci(canary): keep helm jwt secret generation enabled (#1521)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* fix(cli): add json output for policy get (#1410)

* fix(cli): add json output for policy get

* test(cli): cover policy get full json output

* fix(cli): address policy get json clippy

---------

Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* feat(providers): derive discovery from profiles (#1503)

* feat(providers): derive discovery from profiles

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* fix(providers): keep v2 discovery profile-only

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* docs(providers): update providers v2 behavior

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* fix(providers): make github profile read-only

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

---------

Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* docs: update NemoClaw/OpenClaw references (#1529)

* ci: seed shared Rust caches from main (#1530)

* fix(release): build host Linux binaries with glibc floor (#1490)

* fix(homebrew): repair local driver bootstrap state (#1527)

* fix(homebrew): repair local driver bootstrap state

* fix(bootstrap): satisfy default SAN doc lint

* ci: install cargo-zigbuild from release binaries (#1533)

* fix(cli): propagate --gateway-insecure to OIDC auth flows (#1535)

Thread the gateway_insecure flag through gateway_add(), gateway_login(),
and all OIDC HTTP clients so that --gateway-insecure and
OPENSHELL_GATEWAY_INSECURE apply to OIDC discovery, token exchange, and
token refresh requests.

Previously, the flag only affected gRPC connections to the gateway. OIDC
HTTP clients (reqwest::get and http_client) always verified TLS
certificates, causing gateway registration and login to fail when the
OIDC issuer used a self-signed certificate (common on OpenShift with
edge-terminated routes).

Fixes #1534

Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>

* ci(release): smoke test rpm artifacts on fedora (#1558)

Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>

* chore(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#1554)

Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](https://github.com/docker/login-action/compare/4907a6ddec9925e35a0a9e82d7399ccc52663121...650006c6eb7dba73a995cc03b0b2d7f5ca915bee)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: 4.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(helm): add missing SPDX header to gateway-config template (#1545)

* chore(helm): add missing SPDX header to gateway-config template

* chore(scripts): remove helm templates from license header exclusions

The bypass had no known rationale. Removing it ensures the header
script covers deploy/helm/openshell/templates uniformly going forward.

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

---------

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* ci(release): skip python rpm in gateway smoke test (#1559)

Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>

* ci: pin azure/setup-helm and helm/kind-action to commit SHAs (#1544)

* ci: pin azure/setup-helm and helm/kind-action to commit SHAs

* chore(python): add py.typed marker for PEP 561 compliance

* ci: use full semver in pinned action version comments

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

---------

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* refactor: deduplicate shared code across ocsf builders and driver crates (#1526)

Extract repeated patterns into shared helpers:

- Add impl_builder_setters! macro to openshell-ocsf/builders that
  generates the identical severity(), status(), and message() setter
  methods present on all 7 OCSF event builders
- Add SandboxContext::apply_common_fields() to consolidate the
  four-line build() finalization (set_status, set_message, set_device,
  set_container) repeated in every builder
- Add driver_utils::sandbox_token_path() to centralize the XDG state
  path construction for sandbox JWT files used by both the Docker and
  Podman drivers
- Add driver_utils::build_capabilities_response() to eliminate the
  identical GetCapabilitiesResponse struct literal repeated across the
  Docker, Podman, and Kubernetes compute drivers

* fix(python): raise SandboxError instead of FileNotFoundError or KeyError (#1547)

* fix(python): raise SandboxError instead of FileNotFoundError or KeyError

* fix(python): suppress exception chaining in SandboxError raises

Add `from None` to both `raise SandboxError(...)` calls inside `except
FileNotFoundError` blocks to satisfy ruff B904.

* fix(scripts): replace mapfile with bash 3.2-compatible read loop in helm-k3s-local (#1539)

macOS ships bash 3.2 which lacks mapfile/readarray. Replace all three
occurrences in configure_ghcr_credentials, cluster_has_image, and
cluster_image_platform with a portable while-read loop, consistent
with the fix applied to docker-build-image.sh in #1334.

* docs: add macOS compiler troubleshooting (#1569)

Signed-off-by: Ann Marie Fred <afred@redhat.com>

* fix(gateway): configure local dev auth (#1575)

This makes it so you can run the dev gateway and sandbox with:

```
mise run gateway
# in another shell
mise run sandbox
```

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* docs: add Pi as supported sandbox (#1572)

* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split (#1412)

* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split

The old smoke script exercised an L7 PUT which hung because the denial
aggregator is only wired to L4 CONNECT denies, not L7 enforcement.

Add mechanistic-smoke.sh which triggers an L4 deny, waits for the
aggregator to flush, and asserts a pending chunk appears under
openshell rule get --status pending.

Document the intentional L4-only scope of the mechanistic mapper in
architecture/sandbox.md.

Fixes #1333

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* refactor(smoke): remove redundant variable inits and merge double step call

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* fix(smoke): wire mechanistic smoke into mise and guard TMP_DIR

- Initialize TMP_DIR before trap to prevent unbound variable on early exit
- Add e2e:mechanistic-smoke mise task with gateway setup
- Document mechanistic smoke in policy-advisor README

* test(proxy): verify L4 deny enqueues a DenialEvent

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* fix(proxy): remove unnecessary path qualifications in L4 denial smoke test

---------

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>

* docs(readme): whitespace (#1578)

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* fix(cli): replace outdated name reference (#1582)

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* fix(sandbox): probe Landlock before build, skip on unsupported kernels (#1585)

On kernels without Landlock (e.g. gVisor's sentry returns ENOSYS for
syscall 444), the previous best_effort path still logged "Applying
Landlock" + "Landlock ruleset built" events even though no enforcement
was happening. Probe at the top of `landlock::prepare` and short-circuit
with a single High-severity "Sandbox Unavailable" finding.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>

* fix(sandbox): decouple GPU baseline from network policy (#1524)

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* docs(kubernetes): note that Sandbox volumeClaimTemplates is immutable (#1543)

* fix(sandbox): use succinct endpoint denial reason (#1584)

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* feat(docker): add provisioning progress events (#1567)

* docs(kubernetes): add RBAC section to setup page (#1540)

Documents the ServiceAccount, Role, and ClusterRole created by the Helm
chart inline on the setup page, per reviewer feedback on #1250. Reflects
the current chart templates including pods/get for sandbox identity and
tokenreviews/create for projected token validation.

Closes #1018

* fix(sandbox): delegate PID limits to runtimes (#1497)

Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>

* fix(gateway): make readiness health checks dependency-aware (#1328)

* feat(gateway): add readiness probe metrics and test-only store close

Emit Prometheus readiness metrics for database probes (healthy gauge and
outcome-labeled latency histogram) with coverage in health HTTP tests.
Restrict Store::close behind test support cfg to prevent accidental runtime
pool shutdown under live traffic.

Signed-off-by: Adrien Langou <alangou@nvidia.com>

* test(e2e): add simple e2e test with kubernetes to test /readyz

Signed-off-by: Adrien Langou <alangou@nvidia.com>

---------

Signed-off-by: Adrien Langou <alangou@nvidia.com>

* fix(vm): scope rootfs cache by openshell version (#1587)

Signed-off-by: Drew Newberry <anewberry@nvidia.com>

* fix(cli): preserve symlinks during sandbox upload (#1595)

* fix(cli): preserve symlinks during sandbox upload

* docs(sandboxes): document upload symlink behavior

* fix(core): preserve SSH gateway default ports (#1602)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router (#1596)

* feat(server): per-handler gRPC auth annotations

Move scope, role, and auth-mode metadata to the handler definition site
via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained
SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and
ALLOWED_SANDBOX_METHODS tables are now generated from per-method
annotations on the tonic service impls, with canonical gRPC paths
derived from the service name and method name.

Adds a new openshell-server-macros proc-macro crate, an aggregator in
auth/method_authz.rs, and an exhaustiveness test that decodes the
protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and
verifies every RPC has an annotation.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server): rename `sandbox-secret` auth mode to `sandbox`

PR #1404 replaced the shared sandbox secret with per-sandbox
gateway-minted JWTs. A handler marked `sandbox` now authenticates as a
specific `Principal::Sandbox`, not as a holder of a shared credential.

Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and
`AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches
the post-#1404 identity model.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(server): enforce per-handler AuthMode at the router

Addresses review feedback on the per-handler auth-annotation work.

- Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous
  router only checked is_sandbox_callable() for Principal::Sandbox; user
  principals still flowed into AuthzPolicy::check() and bypassed the
  per-handler declaration. A user with `openshell:all` could therefore
  reach `sandbox`-only handlers like GetSandboxProviderEnvironment,
  ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even
  though their annotations said sandbox-only. Adds an
  is_user_callable() predicate and rejects User principals at the
  router for `sandbox` / `unauthenticated` methods.

- Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A
  second `auth`, `scope`, or `role` previously silently overwrote the
  first value; now it fails to compile.

- Regression tests: a unit test for is_user_callable() and a router
  test that proves a user with admin role + openshell:all cannot reach
  the nine sandbox-only handlers.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
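
The router-level check described in this commit can be pictured with the simplified model below. `AuthMode::Sandbox`, `Principal::Sandbox`, `is_user_callable()`, and `is_sandbox_callable()` are names from the commit messages; the `User` variants, the `router_allows` function, and the exact truth table are assumptions, and the real predicates are likely derived from the generated method tables.

```rust
/// Simplified auth modes; `User` is a hypothetical name for handlers
/// that go through the normal scope/role checks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    User,
    Sandbox,
    Unauthenticated,
}

/// Simplified caller identities.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Principal {
    User,
    Sandbox,
}

impl AuthMode {
    /// User principals may only reach handlers declared for users.
    pub fn is_user_callable(self) -> bool {
        matches!(self, AuthMode::User)
    }

    /// Sandbox principals may only reach handlers declared `sandbox`.
    pub fn is_sandbox_callable(self) -> bool {
        matches!(self, AuthMode::Sandbox)
    }
}

/// Decides at the router whether an authenticated principal may proceed
/// to the handler; scope and role checks would follow for users.
pub fn router_allows(principal: Principal, mode: AuthMode) -> bool {
    match principal {
        Principal::User => mode.is_user_callable(),
        Principal::Sandbox => mode.is_sandbox_callable(),
    }
}
```

The point of the sketch is the ordering: the declared auth mode is checked before any scope or role evaluation, so a broad scope such as `openshell:all` cannot reach a `sandbox`-only handler.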

* docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): drop standalone `rpc_auth` stub

The stub was a safety net that fired only when a method had
`#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it
required `rpc_auth` to be imported, which is why both call sites carried
`#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.

Drop the stub and the unused-import workaround. A missing `#[rpc_authz]`
now surfaces as rustc's standard "cannot find attribute `rpc_auth` in
this scope" — clear enough, and one fewer import + lint exception.

Addresses review comment on PR #1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* refactor(server-macros): emit fixed `AUTH_METADATA` const per service

The previous trait-derived const name turned `OpenShell` into
`OPEN_SHELL_AUTH_METADATA`, splitting the project name across an
underscore. Each impl already lives in its own module
(`crate::grpc::`, `crate::inference::`), so the module path is enough
to disambiguate between services — a fixed `AUTH_METADATA` name reads
more naturally.

Aggregator in `auth/method_authz.rs` now references
`crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA`
directly.

Addresses review comment on PR #1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment

OpenShell is one word; reference name in the doc should be
OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA.

Addresses review nit on PR #1596.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

---------

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* ci(snap): add snap release pipeline (#1600)

* docs: refresh landing terminal demo and apply NVIDIA fern theme (#1615)

- Extract landing-page terminal demo into a reusable <CommandTerminal />
  component with inline styles (no global CSS dependency)
- Animate a second command line cycling through claude/opencode/codex
  via @keyframes scoped inside the component
- Inline BadgeLinks layout styles so the component renders correctly
  without relying on .badge-links from main.css
- Add jsx.d.ts shim so editors do not flag the React global in component
  TSX files
- Switch fern instance to global-theme: nvidia with multi-source enabled
- Bump fern CLI to 5.40.0 and drop the basepath-aware experimental flag
- Register fern/components/ as a second mdx-components directory
- Remove the unused Adobe analytics script tag

* build(macos): remove unused import of tracing::warn (#1619)

Signed-off-by: Calum Murray <cmurray@redhat.com>

* chore: align .python-version with mise.toml (#1618)

Signed-off-by: Calum Murray <cmurray@redhat.com>

* feat(helm): add optional PostgreSQL backing store (#1579)

* feat(helm): add optional PostgreSQL backing store with Secret-based credentials

- Add postgres.enabled and postgres.deploy values to control database
  backend (SQLite vs PostgreSQL) and subchart deployment independently.
- Introduce db-secret.yaml template for Opaque Secret with assembled
  postgresql:// connection string injected via OPENSHELL_DB_URL env var.
- Add Bitnami PostgreSQL as optional subchart dependency keyed on
  postgres.deploy to prevent subchart deployment in external mode.
- Externalize JWT signing key file mode via sandboxJwt.secretDefaultMode
  with 0400 default matching upstream.
- Add validation guard for postgres.deploy=true without postgres.enabled.
- Add helm unit tests covering internal, external, URL-override, special
  character encoding, and misconfiguration error paths.
- Update README with Kubernetes and OpenShift install examples for
  bundled and external PostgreSQL configurations.
- Add helm dependency build to lint and unittest tasks.

* fix(helm): add database backend docs to README.md.gotmpl and regenerate

The helm-docs CI check failed because the Database backend section was
added directly to README.md instead of README.md.gotmpl. Move the
content to the template and regenerate so the check passes.

* fix(helm): use Secret-based DB credentials and support existingSecret

Replace the inline db-url stringData pattern with a proper Secret
containing individual fields plus a uri key.  When postgres.deploy=true
the Bitnami service-binding secret is referenced directly; when
deploy=false users can supply postgres.external.existingSecret to
bring their own Secret, or let the chart generate one from the external
field values.

Also restructures the README database section for clarity, adds
helm-unittest coverage for the new secret resolution paths, and
fixes a markdown lint issue in the root README.

* refactor(helm): move OpenShift e2e script to e2e/rust/ and add mise task

Move test-openshift-scenarios.sh from deploy/helm/openshell/ci/ to
e2e/rust/e2e-openshift.sh, matching the existing e2e script naming
convention. Register it as `e2e:openshift` in tasks/test.toml — not
wired into the `test` or `e2e` aggregates so it only runs on explicit
invocation against a live OpenShift cluster.

* feat(e2e): add database backend scenarios to Kubernetes e2e

Extend with-kube-gateway.sh with an optional multi-scenario loop gated
by OPENSHELL_E2E_KUBE_DB_SCENARIOS=1. When enabled, the script installs
the Helm chart three times — SQLite (default), bundled PostgreSQL, and
external PostgreSQL with existingSecret — running the full test suite
against each backend. When unset, existing single-install behavior is
unchanged.

Also adds helm dependency build before helm install, fixing CI failures
caused by the missing PostgreSQL subchart dependency.

* refactor(helm): simplify PostgreSQL config to two orthogonal controls

Replace postgres.deploy and postgres.external.* with two simple controls:
- postgres.enabled: deploy the bundled Bitnami PostgreSQL subchart
- server.externalDbSecret: name of a pre-existing Secret with a uri key

Delete db-secret.yaml — the chart no longer generates Secrets from
individual credential fields. Users either get the Bitnami service-binding
secret (bundled) or bring their own via server.externalDbSecret.

Add validation that postgres.serviceBindings.enabled must stay true
when using bundled PostgreSQL, preventing a confusing runtime failure.

* docs(config): update gateway config reference (#1624)

* feat(flake): add Nix development shell (#1592)

* feat(build): add simple nix flake with formatter for nix code

* feat(flake): setup rust toolchain, able to build and run unit tests

* feat(flake): add support for arm linux and macos

* feat(toolchain): add rust-src and rust-analyzer to the toolchain

* refactor(proto): move phase and current_policy_version into status (#1565)

* refactor(proto): move phase and current_policy_version into SandboxStatus

Move phase and current_policy_version from SandboxSpec into
SandboxStatus to correctly model mutable runtime state. Update all
callers in the gateway server, TUI, and Python SDK to read and write
these fields through SandboxStatus accessors.

Signed-off-by: Derek Carr <decarr@redhat.com>

* fix(server): preserve sandbox status on statusless driver updates

When a driver update arrives without a status payload (e.g. before
Kubernetes populates the status subresource), preserve the stored
phase, conditions, and current policy version instead of resetting
them. Adds a regression test covering the edge case.
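
The preservation rule can be illustrated with a minimal Python sketch. The real change lives in the Rust gateway server; the `Status` and `Sandbox` classes and the `apply_driver_update` helper below are hypothetical names used only to show the merge behavior described above.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Status:
    # Hypothetical stand-in for the runtime state held in SandboxStatus.
    phase: str
    conditions: Tuple[str, ...] = ()
    current_policy_version: int = 0


@dataclass(frozen=True)
class Sandbox:
    name: str
    status: Optional[Status] = None


def apply_driver_update(stored: Sandbox, update: Sandbox) -> Sandbox:
    """Merge a driver update, keeping the stored status when the update has none."""
    if update.status is None:
        # Statusless update: carry over phase, conditions, and policy version.
        return replace(update, status=stored.status)
    return update
```

An update that does carry a status replaces the stored one as usual; only the missing-status case falls back to what was already stored.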

Signed-off-by: Derek Carr <decarr@redhat.com>

---------

Signed-off-by: Derek Carr <decarr@redhat.com>

* feat(python-sdk): support OIDC Bearer auth on SandboxClient (#1621)

* feat(python-sdk): support OIDC Bearer auth on SandboxClient

PR #1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR #935 / PR #1404) were unreachable from
the SDK.

Adds:

- `bearer_token: str | Callable[[], str] | None` kwarg on
  `SandboxClient`. Static strings or zero-arg callables (the latter
  is invoked per RPC, so callers can drop in a refresh loop or
  token-file watcher without reconstructing the client). Composes
  with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
  `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
  Appends `authorization: Bearer <token>` to outgoing metadata.
  Implemented as an interceptor (not call credentials) so it works
  on both plaintext (`disableTls=true` dev) and TLS channels without
  `grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
  `key_path`) are now optional with `cert_path` / `key_path`
  required-together-or-not-at-all (enforced in `__post_init__`). This
  unlocks three transport profiles from one dataclass:
    * full mTLS (all three)
    * CA-only trust (`ca_path` only)
    * system roots (`TlsConfig()` — for OIDC gateways behind a
      public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
  `build_oidc_channel`:
    * For any `https://` gateway, always build a secure channel.
      Pick the strongest TLS profile available in `mtls/` (full
      mTLS → CA-only → system roots). No more `insecure_channel`
      fallback for HTTPS.
    * Gate OIDC bearer attachment on
      `metadata.json["auth_mode"] == "oidc"`. Matches
      `crates/openshell-cli/src/main.rs:132` and the TUI; a stale
      `oidc_token.json` next to a non-OIDC gateway no longer causes
      the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
  modeled on `google.oauth2.credentials.Credentials` and
  `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
  RPC; when stale, re-reads disk first (the CLI may have rotated
  the bundle), and only then exchanges the refresh_token against
  the IdP's token endpoint discovered via OIDC discovery
  (`/.well-known/openid-configuration`, cached after first call).
  Concurrent RPCs share a single refresh via `threading.Lock` (no
  IdP stampede). Honors refresh-token rotation. Surfaces IdP
  failures as `SandboxError` with the RFC 6749 error body included
  for diagnostics.

  Mirrors the Rust CLI's HTTP-policy posture from
  `crates/openshell-cli/src/oidc_auth.rs`:
    * `follow_redirects=False` so a 3xx during discovery can't
      steer us to an attacker-controlled token endpoint.
    * Discovery `issuer` is validated against the configured
      issuer; a discovery document claiming a different issuer is
      rejected, preventing the SDK from POSTing the refresh_token
      to a malicious endpoint.
    * `insecure: bool` flag plumbed through to httpx's
      `verify=` so self-signed-cert deployments work the same way
      they do in the Rust CLI.

  Built on `httpx` (chosen over `urllib` specifically for
  follow_redirects + verify control as kwargs). The OAuth2
  refresh-token grant itself (RFC 6749 §6) is one form-encoded
  POST — handled inline rather than via a dedicated OAuth library;
  tried `authlib`'s `OAuth2Client` first but it auto-injects an
  Authorization header on every request, which breaks the
  unauthenticated discovery GET.
- `_make_cluster_bearer_provider(..., auto_refresh=True,
  write_back=True, insecure=False)` factory. Defaults to the
  refresher path with write-back enabled; `auto_refresh=False`
  falls back to the read-only fail-closed behavior for callers that
  don't want the SDK to make outbound HTTP calls to the IdP.

  `write_back=True` is the default (changed from the first round of
  review): IdPs with refresh-token rotation (Keycloak with
  rotation, Entra in strict mode) invalidate the old refresh_token
  on each refresh, so an in-memory-only refresh would leave the
  on-disk bundle pointing at an invalidated value — any second
  process starting from disk would `invalid_grant`. With write-back
  enabled by default, the SDK keeps the shared cache consistent
  with the IdP.
- `from_active_cluster` exposes `auto_refresh`, `write_back`, and
  `insecure` kwargs (defaults: True / True / False). The
  high-level `Sandbox` context manager surfaces the same three
  kwargs and forwards them through, so callers using the wrapper
  have parity with `SandboxClient` for OIDC-protected gateways.
- `SandboxClient.close()` chains to a `_bearer_close` hook so the
  `_OidcRefresher`'s underlying `httpx.Client` is released
  deterministically instead of leaking sockets/FDs until GC runs
  `__del__`. Idempotent.
- `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID +
  random suffix) instead of a fixed `.oidc_token.json.tmp` path,
  so two writers racing on the same gateway directory don't
  trample each other's tmp content. Success path atomically
  replaces; failure path unlinks the orphan.
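
A minimal sketch of the write-back pattern from the last bullet, assuming a JSON bundle and a hypothetical helper name `write_bundle_atomically`; the actual `_write_to_disk` implementation may differ in detail.

```python
import json
import os
import tempfile
from pathlib import Path


def write_bundle_atomically(path: Path, bundle: dict) -> None:
    """Write `bundle` to `path` through a unique temp file in the same directory."""
    # mkstemp creates the file with mode 0600 and a random suffix; the PID in
    # the prefix is an illustrative choice, not necessarily the SDK's format.
    fd, tmp_name = tempfile.mkstemp(
        prefix=f".{path.name}.{os.getpid()}.", suffix=".tmp", dir=path.parent
    )
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(bundle, handle)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Failure path: remove the orphaned temp file, then re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Creating the temp file in the target's own directory keeps the final rename on one filesystem, which is what makes the replace atomic.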

OAuth2 refresh policy and write-back semantics deliberately mirror
what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):

| Library                        | Native refresh | Writes back   |
|--------------------------------|----------------|---------------|
| google-auth Credentials        | yes            | no            |
| botocore SSOTokenProvider      | yes            | yes           |
| openshell SandboxClient (here) | yes (opt-out)  | yes (opt-out) |

OpenShell sits between the two; chose write-back-by-default because
the rotation invariant matters more for our deployments than the
"CLI is the only writer" assumption that fits google-auth.

Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library —
the refresh grant is a single POST.

Tested:

- 42 sandbox_test.py tests pass (5 pre-existing + 37 new across
  the bearer interceptor, fail-closed provider, refresher
  behavior, TlsConfig validation, from_active_cluster auth ladder,
  security-review regressions, Sandbox-wrapper kwarg forwarding,
  and lifecycle / concurrency probes).
  `mise run test:python` → 47 passed total across the python
  suite.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
    * unauthenticated `Health` bypass works
    * admin + `openshell:all` reaches user-callable methods
    * reader (`sandbox:read`) denied on `CreateSandbox` by scope
    * admin + `openshell:all` denied on PR #1596 sandbox-only
      methods at the router (the new gate is honored from the SDK)
    * full provider CRUD lifecycle via the SDK
    * callable token provider rotates per RPC as expected
- Regression-probed against three pre-review security findings:
    * **Discovery issuer validation** — a discovery document
      claiming a different `issuer` than the configured one is
      rejected with a clear `SandboxError` before any refresh POST
      can reach the attacker-controlled endpoint.
    * **Redirect during discovery** — `follow_redirects=False` on
      the underlying httpx client means a 3xx during discovery
      surfaces as a SandboxError rather than silently chasing the
      redirect.
    * **Cross-process rotation** — a two-process simulation shows
      process B starting from disk and successfully refreshing
      with the rotated refresh_token, because process A's
      write-back updated the shared cache.
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
  re-read before refresh, OAuth2 exchange against the discovered
  token endpoint, refresh-token rotation, atomic write-back at
  0600 mode (default), default-on write_back proven by test,
  concurrent N-thread coordination (one refresh shared across 8
  threads), IdP failure surfaced with the error body, the
  client_credentials / no-refresh_token error path, issuer-
  mismatch rejection, redirect-during-discovery rejection,
  insecure flag plumbing.
- Lifecycle / concurrency regression tests added: `close()`
  invokes the `_bearer_close` hook (idempotent), the refresher's
  `httpx.Client` is marked closed after `SandboxClient.close()`,
  and 16 concurrent writers don't leave orphan tmp files behind
  while producing a valid final bundle. The `Sandbox` wrapper has
  direct forwarding tests proving `auto_refresh`, `write_back`,
  and `insecure` reach `from_active_cluster` (both explicit
  values and defaults).
- End-to-end against a real OpenShift + Keycloak cluster from
  inside a pod: real OIDC discovery against
  `keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant
  POST, atomic write-back of the rotated bundle at 0600, and a
  follow-up RPC reusing the freshly-rotated in-memory token —
  full round-trip in ~170ms.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing

_OidcRefresher.current_access_token() only adopted the on-disk
oidc_token.json when its access token was still fresh; otherwise it
refreshed using the in-memory bundle. With refresh-token rotation
enabled (Keycloak with rotation, Entra strict mode), this let a process
keep using an invalidated refresh_token:

1. Process A holds a stale in-memory bundle with refresh_token=r1.
2. Process B refreshes first and writes a rotated (r2) but now
   near-expiry bundle to disk.
3. Process A re-reads disk, sees the access token is not fresh, ignores
   the disk bundle, and POSTs the stale r1 — which the IdP has already
   invalidated, yielding invalid_grant.

Fix: when the cached bundle is stale, adopt the on-disk bundle if it was
refreshed more recently than ours, even when its access token is also
stale. "More recently" is decided by expires_at — a refresh mints a new
access token with a forward expiry alongside the rotated refresh_token,
so the later expiry carries the newest refresh_token. Comparing by
expiry (rather than unconditionally preferring disk) preserves the
write_back=False case, where the in-memory bundle has already rotated
past the on-disk copy and must not be clobbered. When the adopted
bundle's issuer differs, the cached token endpoint is reset so the
refresh re-discovers against the new issuer.

Adds regression tests for the cross-process rotation race and the
issuer-change re-discovery path.
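
The adoption rule can be sketched as a small pure function. This is an illustration under assumptions, not the SDK's code: `Bundle` and `pick_refresh_source` are hypothetical names, and `expires_at` is assumed to be a Unix timestamp in seconds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Bundle:
    # Hypothetical shape of the oidc_token.json contents.
    access_token: str
    refresh_token: str
    expires_at: float
    issuer: str


def pick_refresh_source(memory: Bundle, disk: Optional[Bundle]) -> Tuple[Bundle, bool]:
    """Return the bundle to refresh from and whether to reset the cached endpoint."""
    if disk is None or disk.expires_at <= memory.expires_at:
        # Memory is at least as new (for example write_back=False): keep it.
        return memory, False
    # Disk was refreshed more recently, so it carries the newest refresh_token.
    return disk, disk.issuer != memory.issuer
```

Comparing by expiry rather than always preferring disk keeps the in-memory bundle when it has already rotated past the on-disk copy.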

* fix(python-sdk): recover from invalid_grant on lost rotation race

The expiry-based disk re-read narrows but does not fully close the
cross-process refresh-token rotation race: two processes sharing a
gateway directory can both enter their refresh window, both POST their
copy of the refresh_token, and with rotation enabled the IdP invalidates
the loser's token (invalid_grant). Neither google-auth nor botocore
closes this window without an OS file lock; a Python-only flock would not
coordinate with the Rust CLI/TUI that also write oidc_token.json, so
locking is not worth its cost here.

Recover instead of prevent: distinguish an OAuth2 invalid_grant (the
refresh_token was rejected) from transport/5xx failures via a private
_InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If
a peer wrote a different refresh_token (it won the race), adopt and retry
with it — returning early if it is already fresh — so the loser succeeds
transparently instead of forcing a re-authenticate. If disk offers no new
token, the rejection is genuine and surfaces the re-authenticate hint as
before. The retry is single-shot; a second invalid_grant propagates.

Adds tests for the peer-rotation recovery and the genuine-rejection
(no-retry) paths.
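
The recover-instead-of-prevent flow can be sketched as follows. The names are hypothetical: `exchange` stands in for the refresh-token POST and `read_disk_refresh_token` for re-reading oidc_token.json. The early return for an already-fresh peer bundle is left out for brevity.

```python
from typing import Callable, Optional


class InvalidGrantError(Exception):
    """Illustrative stand-in for the private _InvalidGrantError."""


def refresh_with_recovery(
    refresh_token: str,
    exchange: Callable[[str], str],
    read_disk_refresh_token: Callable[[], Optional[str]],
) -> str:
    """Exchange a refresh token, retrying once with a peer-rotated token."""
    try:
        return exchange(refresh_token)
    except InvalidGrantError:
        peer_token = read_disk_refresh_token()
        if peer_token is None or peer_token == refresh_token:
            # Disk offers nothing new: the rejection is genuine.
            raise
        # Single-shot retry: a second invalid_grant propagates to the caller.
        return exchange(peer_token)
```

Transport and 5xx failures would raise a different exception type in this sketch, so they skip the retry and surface unchanged.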

---------

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>

* fix(helm): vendor chart dependencies before release packaging (#1627)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* fix(driver-podman): bind gateway to 0.0.0.0 in rootless mode (#1623)

Rootless Podman sandbox containers reach the host through pasta's local
connection bypass, which translates L2 frames to L4 host sockets. The
dev gateway script binds to 127.0.0.1 by default, which is not routable
through pasta. Auto-detect rootless mode and bind to 0.0.0.0 so sandbox
containers can connect to the gateway.

- Auto-detect rootless Podman in gateway.sh and export
  OPENSHELL_BIND_ADDRESS=0.0.0.0 when not explicitly set
- Add e2e:podman:rootless mise task and CI matrix entry to validate
  rootless Podman networking end-to-end
- CI creates a non-root user inside the privileged container to trigger
  Podman's rootless code paths (pasta, user namespace isolation)

Signed-off-by: Naveen Malik <nmalik@redhat.com>

* docs(providers): note that ANTHROPIC_API_KEY requires an API account, not a subscription (#1542)

* docs(providers): note that ANTHROPIC_API_KEY requires an API account, not a subscription

Anthropic subscription users authenticate via OAuth, not an API key,
causing a silent failure when creating the provider. Adds a Note callout
in the provider type table and quickstart guide directing subscription
users to generate an API key from console.anthropic.com.

Closes #620

* docs(providers): fix Note placement and remove subscription brand names

Move the Note callout in manage-providers.mdx to after the complete
provider type table so it does not break table rendering. Remove
subscription brand names from both Note callouts.

* fix(podman): avoid host-gateway on macOS machines (#1637)

Closes #1307

Default the Podman host gateway alias override to gvproxy's host-loopback IP on macOS while preserving host-gateway resolution on Linux. Wire the setting through Podman config, gateway TOML inheritance, and the standalone driver, and document the platform behavior.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* chore(vm): generalize crate for multi-device PCIe passthrough (#1573)

* generalize crate for multi-device PCIe passthrough

Signed-off-by: Patrick Riel <priel@nvidia.com>

* add adopt APIs that take over devices already bound to vfio-pci during restart reconciliation, without rebinding or mutating sysfs.

Signed-off-by: Patrick Riel <priel@nvidia.com>

* refactor(vfio): generalize GPU passthrough sysfs handling

Signed-off-by: Patrick Riel <priel@nvidia.com>

* fix(vfio): centralize vfio ID refcounting

Signed-off-by: Evan Lezar <elezar@nvidia.com>

---------

Signed-off-by: Patrick Riel <priel@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>

* fix(sandbox): trust exact declared private endpoints (#1560)

* fix(sandbox): trust exact declared private endpoints

* fix(sandbox): preserve advisor endpoint provenance

* fix(sandbox): repair advisor provenance lint failures

---------

Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* feat(policy): add agentic approval loop (#1528)

* fix(e2e): clean up temp files in sandbox-runner on exit (#1647)

* ci(kubernetes): add HA e2e workflow (#1598)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* ci(release): use bundled Z3 for macOS gateway build (#1658)

* fix(gateway): align package TLS bootstrap path (#1601)

* fix(gateway): align package TLS bootstrap path

Closes #1593

Default package-managed gateway services to a stable local TLS directory and use that same value for certificate generation and runtime startup.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* test(packaging): validate package asset paths exist

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* ci(e2e): pin mise in kubernetes job

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

---------

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* feat(tui): add PageUp/PageDown scrolling to all panes (#1656)

Add PageUp/PageDown key support to the policy, logs, and draft/rules
views. All three panes now scroll by one viewport height per keypress.

Also fix scroll_policy() clamping to stop at the last viewport of
content instead of the last line, preventing a blank-screen overshoot
on G and PageDown.
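
A minimal sketch of the clamping rule, written in Python for illustration. The TUI itself is Rust, and the function names here are hypothetical.

```python
def max_scroll_offset(total_lines: int, viewport_height: int) -> int:
    """Largest offset that still shows a full viewport of content."""
    return max(0, total_lines - viewport_height)


def page_down(offset: int, total_lines: int, viewport_height: int) -> int:
    # Advance one viewport, stopping at the last viewport rather than the last line.
    return min(offset + viewport_height, max_scroll_offset(total_lines, viewport_height))


def page_up(offset: int, viewport_height: int) -> int:
    # Move back one viewport without going above the first line.
    return max(0, offset - viewport_height)
```

Clamping to `total_lines - 1` instead would allow an offset whose viewport shows a single line followed by blank rows, which is the overshoot described above.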

Signed-off-by: Major Hayden <major@redhat.com>

* feat(telemetry): add anonymous opt-out OpenShell usage telemetry (#1433)

* feat(telemetry): add anonymous opt-out usage telemetry

Signed-off-by: Kirit93 <kthadaka@nvidia.com>

* Removed enums from schema

Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>

* Updated telemetry URL

Signed-off-by: Kirit93 <kthadaka@nvidia.com>

* ci(kubernetes): pin mise installer for e2e

---------

Signed-off-by: Kirit93 <kthadaka@nvidia.com>
Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>

* ci(release): gate helm/oci artifact publishing on release (#1662)

release-helm and tag-ghcr-release now depend on the release job.

This is to prevent a GHCR image or helm chart from being published when some
other aspect of the release fails.

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* ci(kubernetes): stabilize HA e2e setup (#1659)

* ci(kubernetes): pin mise in e2e workflow

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* ci(kubernetes): mirror postgres image for ha e2e

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* ci(kubernetes): reuse e2e workflow for ha

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

---------

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

* fix(gateway): place supervisor_image under podman driver TOML table (#1661)

The gateway.sh script appended supervisor_image after the
[openshell.gateway.gateway_jwt] table header, so TOML parsed it as a
gateway_jwt field. The Podman driver never saw the override and fell
back to the default ghcr.io/nvidia/openshell/supervisor:latest.
Move supervisor_image into [openshell.drivers.podman] where the driver
config deserializer expects it.

* refactor: deduplicate shared utilities across driver crates (#1660)

Move three duplicated definitions into openshell-core so every
consumer has a single canonical source:

- format_bytes: identical 14-line function existed in docker,
  kubernetes, and vm drivers. Moved to openshell-core::progress
  where all three already imported from.

- DEFAULT_SANDBOX_PIDS_LIMIT: i64 constant (2048) duplicated in
  docker driver and podman config. Moved to openshell-core::config
  alongside other shared defaults. Podman re-exports it for
  internal call-site compatibility.

- current_time_ms: secrets.rs in openshell-sandbox reimplemented
  the same logic as openshell-core::time::now_ms(). Remove the
  local copy and call now_ms() directly via the existing dep.

* fix(config): reject unknown fields in nested gateway config tables (#1666)

* fix(config): reject unknown fields in nested gateway config tables

The gateway TOML loader silently ignored keys placed under the wrong
table header. PR #1661 fixed one instance of this (supervisor_image
under [openshell.gateway.gateway_jwt]) but the root cause remained: the
nested gateway config tables did not deny unknown fields, so a misplaced
key was accepted and dropped instead of erroring.

Concretely, tasks/scripts/gateway.sh emitted `sandbox_namespace` right
after the [openshell.gateway.gateway_jwt] heredoc, so it landed inside
the gateway_jwt table rather than [openshell.gateway]. The k8s driver
already receives the namespace via [openshell.drivers.kubernetes], so
the stray line was dead config that parsed without complaint.

Changes:
- Add #[serde(deny_unknown_fields)] to the nested gateway config tables
  that are part of the config-file parse tree: TlsConfig, OidcConfig,
  MtlsAuthConfig, GatewayAuthConfig, GatewayJwtConfig.
- Remove the misplaced sandbox_namespace line from gateway.sh.
- Drop the unused Serialize/Deserialize derives from Config and
  ServiceRoutingConfig (see below).
- Add a regression test asserting a key under the wrong nested table is
  rejected.

* feat(kubernetes): support sandbox image pull secrets (#1671)

* refactor(driver): trim compute capability response (#1402)

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* feat(providers): add Google Vertex AI inference provider (#1568)

* feat(providers): add Google Vertex AI provider

Adds Vertex AI provider profiles, routing, credential refresh plumbing, CLI support, docs, and regression coverage. Keeps the related NETLINK_ROUTE seccomp allowance needed by Vertex client tooling that calls getifaddrs.

* docs: add Vertex AI sandbox usage for Claude Code and OpenCode

Cover the full end-to-end setup for running Claude Code and OpenCode
inside an OpenShell sandbox via inference.local with a Vertex AI backend:

- google-vertex-ai.mdx: add 'Use from a Sandbox' section with tabbed
  examples for Claude Code (--bare flag, no /v1 suffix) and OpenCode
  (/v1 suffix required). Add providers_v2_enabled prerequisite and
  --no-verify note for global region. Document policy proposals table
  covering metadata.google.internal (always blocked), downloads.claude.ai,
  and storage.googleapis.com.

- inference-routing.mdx: expand 'Use the Local Endpoint' section with
  tabbed examples for Claude Code, OpenCode, Python OpenAI SDK, and
  Python Anthropic SDK. Add notes explaining the /v1 path suffix
  difference between clients.

- supported-agents.mdx: update Claude Code and OpenCode rows to mention
  inference.local support and correct base URL requirements.

* fix: address vertex review findings

* test(sandbox): retry on spurious Ok in fork-exec ambiguity test

On arm64 under heavy CI load, the /proc fd scan in
find_socket_inode_owners can transiently miss the parent process's
socket fd entry, returning only the child as an owner. This causes
resolve_process_identity to return Ok (single owner, no ambiguity
check fires) instead of the expected ambiguous-ownership Err.

Extend the retry loop to also handle unexpected Ok results, mirroring
the existing retry for transient Err results. 10 retries at 50ms gives
a 500ms settling window, which is sufficient for procfs to stabilize
on loaded arm64 runners.
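
A generic sketch of the retry shape, in Python rather than the test's Rust. `retry_until`, `probe`, and `accept` are hypothetical names; the defaults mirror the 10 retries at 50ms mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until(
    probe: Callable[[], T],
    accept: Callable[[T], bool],
    retries: int = 10,
    delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `probe` until `accept` approves the result or the retries run out."""
    result = probe()
    for _ in range(retries):
        if accept(result):
            break
        # Not the expected outcome yet (transient Err or spurious Ok): wait and retry.
        sleep(delay_s)
        result = probe()
    return result
```

Passing `sleep` as a parameter lets a test substitute a recorder, so the retry logic can be checked without real delays.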

* fix: address vertex review regressions

* docs(router): clarify stream_response semantics for Vertex rawPredict routing

Document the three call sites of prepare_backend_request and their
stream_response values in a caller table:

- send_backend_request: false → :rawPredict (unary endpoint)
- send_backend_request_streaming: true → :streamRawPredict
- verify_backend_endpoint: explicitly false to probe the unary endpoint

Cross-reference the table from build_provider_url and
is_vertex_anthropic_rawpredict_route so the stream_response=true guard
in the suffix upgrade branch is understood in full context.

Also note that is_vertex_anthropic_rawpredict_route is a structural
predicate (model_in_path + anthropic_messages + :rawPredict suffix),
not a named-provider check, so any future provider with the same route
shape inherits the transforms automatically.

* fix: correct example paths in local-inference README (#1676)

* fix: correct example paths in local-inference README

* fix: correct example paths in local-inference routes.yaml

* ci(release): bring Fedora RPM canary to parity (#1688)

The Fedora RPM canary needs to exercise the install.sh user-service path, but a
GitHub Actions job container does not boot with systemd as PID 1. This means the
Fedora RPM canary was incomplete compared to the others.

With this change, we run Fedora as a nested privileged systemd container
instead, wait for systemd to become reachable, then start the root user manager
so systemctl --user works for the RPM gateway unit, achieving parity with the
other canary tests.

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* fix: update RFC link in agent-driven-policy-management README (#1677)

* feat(providers): add profile-backed policy visibility (#1640)

* chore: wip providers v2 tui and codex profile

* chore: wip effective policy get and codex profile

* chore: wip provider profiles and tui detail views

* feat(tui): annotate policy proposal review status

* ci(release): fix Ubuntu Snap canary install and registration (#1699)

Install the Snap built by the triggering Release Dev workflow by setting
merge-multiple: true on the artifact download. actions/download-artifact
otherwise extracts each artifact into its own subdirectory, leaving the
package at release/snap-linux-amd64/*.snap, so the install glob
./release/*.snap matched nothing. Merging flattens the artifact's contents
directly into release/ where the dangerous local snap install expects it.

Harden the Snap canary setup by enabling snapd.socket, waiting for snap
seeding (snap wait system seed.loaded), and running every step with strict
shell options (set -euo pipefail) so failures surface immediately.

Register the snapped gateway with the CLI as the documented local plaintext
snap-docker gateway, and print version and snap services, before running
openshell status so the canary verifies a configured and reachable gateway
instead of only the install.

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* feat(snap): add openshell.term desktop app (#1693)

Add a desktop launcher for the OpenShell TUI so users can launch
"openshell term" from their desktop environment application menu.

The change adds three files:
- snap/local/term.desktop: desktop entry file for the application launcher
- snap/local/icon.png: application icon (copied from snap store data)
- snapcraft.yaml: new "term" app entry that runs "openshell term"
  with home, network, ssh-keys, and system-observe plugs, plus install
  rules to stage the desktop file and icon under meta/gui/

The desktop file references the icon via ${SNAP} which is resolved
at runtime to the snap installation directory. The term app reuses
the same connection plugs as the main openshell app.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

* fix(sandbox): restore GPU procfs baseline (#1522)

Signed-off-by: Evan Lezar <elezar@nvidia.com>

* fix(gateway): try harder to detect Podman (#1536)

Auto-detection previously treated Podman as available only when the podman CLI
was visible on PATH. However, package manager services can run with a
restricted PATH, which lets Docker be selected even when a Podman API socket is
reachable. Additionally, podman may symlink /var/run/docker.sock to podman's
machine unix socket, which would be incorrectly detected as Docker. Worse
still: the podman machine may not even be running.

This replaces the Podman binary check with a functional HTTP probe against the
standard Podman socket paths. The probe requires /_ping to answer with a
Libpod-Api-Version header before treating the socket as Podman, which lets the
gateway select the embedded Podman driver only when the API is usable.
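
The probe can be sketched in Python with the standard library. The gateway's implementation is Rust, and the class and function names below are hypothetical. Requiring a 200 status is an assumption of this sketch; the description above only requires the Libpod-Api-Version header.

```python
import http.client
import socket
from typing import Mapping


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 2.0):
        super().__init__("localhost", timeout=timeout)
        self._socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._socket_path)
        self.sock = sock


def looks_like_podman(status: int, headers: Mapping[str, str]) -> bool:
    # Assumption: a usable API answers 200 and identifies itself as libpod.
    return status == 200 and any(
        name.lower() == "libpod-api-version" for name in headers
    )


def probe_podman_socket(socket_path: str) -> bool:
    """Return True only if /_ping on the socket answers like a Podman API."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        response = conn.getresponse()
        return looks_like_podman(response.status, dict(response.getheaders()))
    except (OSError, http.client.HTTPException):
        # Missing socket, stopped machine, or a non-HTTP peer: not usable.
        return False
    finally:
        conn.close()
```

A socket that answers `/_ping` without the header, such as a Docker daemon, is not treated as Podman, and a dangling socket path fails the connect and returns False.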

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* chore(mise): refresh tool lockfile (#1712)

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* ci(release): authenticate snap canary artifact download (#1711)

The Ubuntu Snap canary downloads its artifact from a different workflow run
(the triggering Release Dev run) via run-id. Cross-run downloads require
authentication, so pass github.token to actions/download-artifact.

Signed-off-by: Kris Hicks <khicks@nvidia.com>

* docs(container-gateway): fix Docker driver setup for containerized gateway (#1419)

The existing docs omitted or misstated several requirements when running
the gateway as a container with the Docker compute driver:

- OPENSHELL_GRPC_ENDPOINT is required; the Docker driver uses only the
  scheme (http/https) — host and port are substituted automatically with
  host.openshell.internal and the gateway's own bind port
- Supervisor binary must be extracted to a host path before starting the
  gateway; bind-mount sources are resolved by the host Docker daemon so
  the path must be identical inside and outside the gateway container
- Docker socket access requires adding the docker group (UID 1000 default)
- Port binding should remain 127.0.0.1; Docker driver adds a bridge
  listener automatically
- Add --server-san host.openshell.internal to generate-certs for mTLS
- Complete the mTLS docker run with all Docker driver requirements
- Add deploy/docker/gateway.toml — TOML config for the Docker driver
- Add deploy/docker/docker-compose.yml referencing the TOML
- Add docs/get-started/tutorials/docker-compose.mdx tutorial page
- Remote gateway registration instructions (--remote flag)

Address reviewer feedback:
- Move Docker Compose tutorials card to the bottom of the list
- Replace inline YAML snippet in Docker Compose section with a reference
  to deploy/docker/ to avoid drift
- Clarify OPENSHELL_DB_URL is safe in compose.yml (plain SQLite path,
  no credentials); the TOML block targets credential-bearing DSNs
- Note that ./ in source: resolves relative to the compose file directory
- Clarify that only the scheme from OPENSHELL_GRPC_ENDPOINT matters
- Add note that the tilde volume mount resolves to the same absolute
  path on both host and container

* refactor(server): deduplicate test helpers and grpc utilities (#1708)

Remove three groups of copy-pasted code in openshell-server:

1. grpc/mod.rs had a private current_time_ms() wrapper identical to the
   one already exported from persistence/mod.rs. Remove the duplicate
   and update the three grpc sub-modules (policy, sandbox, service) to
   import directly from crate::persistence.

2. test_store() was repeated verbatim in seven #[cfg(test)] blocks.
   Promote a single canonical version to persistence/mod.rs (cfg-gated)
   and replace all copies with crate::persistence::test_store() calls or
   a thin Arc wrapper in supervisor_session.

3. grpc_client_mtls() and build_tls_root() were copy-pasted across
   edge_tunnel_auth.rs and multiplex_tls_integration.rs. Move both into
   the existing tests/common/mod.rs shared module and import from there.

* fix(gateway): allow local sandbox jwt to not expire (#1721)

* fix(helm): create sandbox JWT secret when cert-manager is enabled (#1700)

* fix(helm): create sandbox JWT secret under cert-manager

The cert-manager install path (certManager.enabled=true,
pkiInitJob.enabled=false) left the gateway StatefulSet unable to start
because nothing created the openshell-jwt-keys Secret: cert-manager owns
TLS Secrets but does not mint the sandbox JWT signing key, and the
certgen hook only rendered when pkiInitJob.enabled was true.

Separate JWT signing-key provisioning from TLS PKI provisioning:

- certgen: add a --jwt-only mode that creates only the Opaque JWT
  signing Secret, for use when another controller owns TLS Secrets.
- certgen.yaml: render the hook when pkiInitJob.enabled OR
  certManager.enabled is true. cert-manager takes precedence and runs
  the hook with --jwt-only even if pkiInitJob.enabled remains true.
  Remove the mutual-exclusion failure between the two values.
- _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook
  and the StatefulSet mount.
- Update values, README, docs, architecture, and the
  debug-openshell-cluster skill to reflect the new precedence; the
  documented cert-manager install no longer needs pkiInitJob.enabled=false.

Closes #1691

* fix(helm): honor cert-manager precedence for client CA volume

The client CA volume logic treated pkiInitJob.enabled as proof that
built-in PKI owns the client CA. With cert-manager precedence now
allowing certManager.enabled=true alongside the default
pkiInitJob.enabled=true, that assumption mounts the server TLS cert
secret as the client CA and ignores
certManager.clientCaFromServerTlsSecret=false, which can break mTLS or
trust the wrong CA.

Gate the pkiInitJob.enabled term with (not certManager.enabled) in all
three client CA conditions (volume mount, volume definition, and secret
selection) so cert-manager owns TLS when enabled. Add a Helm test suite
covering built-in PKI, cert-manager shared CA, the regression config
(cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob),
and the no-client-CA case.
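
The gating can be illustrated with a boolean sketch in Rust. The commit only states that the pkiInitJob.enabled term is gated with (not certManager.enabled); the rest of the condition below, including how clientCaFromServerTlsSecret is combined, is an assumption made for illustration, and `server_tls_is_client_ca` is a hypothetical name.

```rust
/// Hypothetical model: is the server TLS secret mounted as the client CA?
///
/// Assumed shape of the condition:
///   (pkiInitJob.enabled AND NOT certManager.enabled)
///   OR (certManager.enabled AND certManager.clientCaFromServerTlsSecret)
pub fn server_tls_is_client_ca(
    pki_init_job_enabled: bool,
    cert_manager_enabled: bool,
    client_ca_from_server_tls_secret: bool,
) -> bool {
    // Built-in PKI only owns the client CA when cert-manager is off.
    let builtin_pki_owns_ca = pki_init_job_enabled && !cert_manager_enabled;
    // With cert-manager on, the explicit opt-in decides.
    let cert_manager_shares_ca = cert_manager_enabled && client_ca_from_server_tls_secret;
    builtin_pki_owns_ca || cert_manager_shares_ca
}
```

The regression config from the commit message (cert-manager enabled, clientCaFromServerTlsSecret=false, default pkiInitJob.enabled=true) evaluates to `false` in this sketch, whereas an ungated pkiInitJob.enabled term would have made it `true`.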

* feat(k8s-driver): add default_runtime_class_name config for sandbox pods (#1729)

Allow operators to configure a default Kubernetes runtimeClassName that
is applied to sandbox pods when the CreateSandbox request does not
specify one. This avoids requiring every API caller to explicitly set the
runtime class for clusters that always need a specific RuntimeClass
(e.g. kata-containers, nvidia).

The fallback is applied in the Kubernetes driver only — per-request
values still take priority, and an empty default (the built-in) preserves
existing behavior (field omitted, cluster default applies).
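
A minimal sketch of the fallback order, assuming the request value arrives as an optional string and the configured default is a plain string. The function name `effective_runtime_class` is hypothetical, and treating an empty per-request value as unset is also an assumption.

```rust
/// Resolve the runtimeClassName to put on a sandbox pod.
///
/// Returns `None` when the field should be omitted so the cluster default applies.
pub fn effective_runtime_class(
    requested: Option<&str>,
    default_runtime_class_name: &str,
) -> Option<String> {
    match requested {
        // A per-request value always wins.
        Some(name) if !name.is_empty() => Some(name.to_string()),
        // Otherwise fall back to the operator-configured default, if any.
        _ if !default_runtime_class_name.is_empty() => {
            Some(default_runtime_class_name.to_string())
        }
        // Empty default: keep the existing behavior and omit the field.
        _ => None,
    }
}
```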

* docs: add Hermes Agent to supported agents (#1735)

* fix(cli): roll back gateway registration when auth fails during gateway add (#1538)

* refactor: deduplicate shared driver and TUI helpers (#1741)

* feat(cli): support multiple --upload flags on sandbox create (#1635) (#1645)

Closes #1635

Signed-off-by: Philippe Martin <phmartin@redhat.com>

* updates for new containers

---------

Signed-off-by: Derek Carr <decarr@redhat.com>
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Signed-off-by: Colin Walters <walters@verbum.org>
Signed-off-by: Adam Miller <admiller@redhat.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
Signed-off-by: Ann Marie Fred <afred@redhat.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Signed-off-by: Calum Murray <cmurray@redhat.com>
Signed-off-by: Naveen Malik <nmalik@redhat.com>
Signed-off-by: Patrick Riel <priel@nvidia.com>
Signed-off-by: Major Hayden <major@redhat.com>
Signed-off-by: Kirit93 <kthadaka@nvidia.com>
Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Philippe Martin <phmartin@redhat.com>
Co-authored-by: Mesut Oezdil <114185853+mesutoezdil@users.noreply.github.com>
Co-authored-by: Drew Newberry <anewberry@nvidia.com>
Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
Co-authored-by: Seth Jennings <sjenning@redhat.com>
Co-authored-by: Florent BENOIT <fbenoit@redhat.com>
Co-authored-by: Eric Curtin <eric.curtin@docker.com>
Co-authored-by: Derek Carr <decarr@redhat.com>
Co-authored-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Co-authored-by: Piotr Mlocek <pmlocek@nvidia.com>
Co-authored-by: Russell Bryant <russell.bryant@gmail.com>
Co-authored-by: Colin Walters <walters@verbum.org>
Co-authored-by: Adam Miller <admiller@redhat.com>
Co-authored-by: Taylor Mutch <tmutch@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Adel Zaalouk <azaalouk@redhat.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ann Marie Fred <afred@redhat.com>
Co-authored-by: krishicks <kris@krishicks.com>
Co-authored-by: Vegard Stikbakke <vegard.stikbakke@gmail.com>
Co-authored-by: krishicks <khicks@nvidia.com>
Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Co-authored-by: alangou <alangou@nvidia.com>
Co-authored-by: Mrunal Patel <mrunalp@gmail.com>
Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Co-authored-by: Calum Murray <cmurray@redhat.com>
Co-authored-by: Saurabh Agarwal <sauagarw@redhat.com>
Co-authored-by: Simon Scatton <44714756+SDAChess@users.noreply.github.com>
Co-authored-by: Naveen Malik <nmalik@redhat.com>
Co-authored-by: Patrick Riel <71560045+cheese-head@users.noreply.github.com>
Co-authored-by: Alexander Watson <zredlined@users.noreply.github.com>
Co-authored-by: Major Hayden <major@mhtx.net>
Co-authored-by: Kirit Thadaka <kirit.thadaka@gmail.com>
Co-authored-by: Jesse Jaggars <jhjaggars@gmail.com>
Co-authored-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Co-authored-by: shannonsands <shannon.sands.1979@gmail.com>
Co-authored-by: Philippe Martin <feloy1@gmail.com>
rrhubenov pushed a commit to rrhubenov/OpenShell that referenced this pull request Jun 12, 2026
Signed-off-by: Evan Lezar <elezar@nvidia.com>
(cherry picked from commit 5102cb9)
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
derekwaynecarr pushed a commit that referenced this pull request Jun 15, 2026
…ork` subcrates. (#1650)

* refactor(sandbox): extract run_networking from run_sandbox

Lifts TLS state generation, network namespace setup, proxy startup,
bypass monitor spawn, and SSH-side proxy URL / netns FD computation
out of run_sandbox into a sibling async fn `run_networking` that
returns a Networking struct. The identity cache moves with it (only
consumed by the proxy). Entrypoint PID allocation moves just above
the call site so it can be passed in.

No behavior changes — same OCSF emits, same async order, same RAII
lifetimes for the proxy and bypass-monitor handles, now held by the
returned Networking value in run_sandbox's frame.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(sandbox): extract run_process and lift netns to run_sandbox

Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH
server, supervisor session, process spawn, OPA probe, policy poll loop,
denial aggregator, wait/exit) into a sibling async fn `run_process`.

Also moves network namespace creation out of `run_networking` into a new
`create_netns_for_proxy` helper invoked from `run_sandbox`, so
`run_networking` is purely the proxy component (OPA evaluation, TLS
interception, credential injection, inference routing, gRPC control
API). The netns is then borrowed into both `run_networking` and
`run_process`.

No behavior change.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* chore(workspace): scaffold openshell-supervisor-networking and openshell-supervisor-process crates

Add empty placeholder crates that subsequent commits will populate as the
sandbox decomposition proceeds. Both crates compile clean as part of the
workspace and are picked up automatically by the existing
`members = ["crates/*"]` glob.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift DenialEvent to openshell-core

The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side)
and the bypass monitor (process-side), and crosses the run_networking ->
run_process API boundary. Move it to openshell-core so the eventual
supervisor-networking and supervisor-process crates can both reference it
without depending on each other. DenialAggregator and the channel/flush
helpers stay in openshell-sandbox for now.

A thin `pub use openshell_core::DenialEvent;` re-export from
denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent`
call site resolving without further edits.
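
The re-export pattern can be sketched in a single file by letting modules stand in for crates. The `workspace` wrapper module and the `host`/`port` fields are hypothetical; the real DenialEvent fields are not shown in this commit.

```rust
pub mod workspace {
    /// Stand-in for the openshell-core crate, the new home of the type.
    pub mod openshell_core {
        #[derive(Debug, Clone, PartialEq, Eq)]
        pub struct DenialEvent {
            pub host: String, // hypothetical field
            pub port: u16,    // hypothetical field
        }
    }

    /// Stand-in for the openshell-sandbox crate.
    pub mod openshell_sandbox {
        pub mod denial_aggregator {
            // Thin re-export: the old path keeps resolving to the moved type.
            pub use super::super::openshell_core::DenialEvent;
        }
    }
}
```

Both paths name the same type, so a value built through the old path can be passed wherever the new path is expected.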

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift normalize_path to openshell-core

Move the lexical path-normalization helper from openshell-policy to
openshell-core::paths so it can be reached from crates that sit below
openshell-policy in the dependency graph. openshell-policy keeps its
existing public API via a `pub use` re-export, so all current call sites
(e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged.
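
For readers unfamiliar with the term, lexical normalization resolves `.` and `..` purely on the path text, without touching the filesystem or following symlinks. The sketch below shows one way to do that with the standard library. It is not the actual normalize_path implementation, whose exact rules are not shown here; `normalize_lexically` is a hypothetical name.

```rust
use std::path::{Component, Path, PathBuf};

/// Collapse `.` and `..` components without any filesystem access.
pub fn normalize_lexically(path: &Path) -> PathBuf {
    let mut out = PathBuf::new();
    for component in path.components() {
        match component {
            // `.` never changes the location.
            Component::CurDir => {}
            Component::ParentDir => {
                let last = out.components().next_back();
                let can_pop = matches!(last, Some(Component::Normal(_)));
                let at_root = matches!(last, Some(Component::RootDir | Component::Prefix(_)));
                if can_pop {
                    // `a/b/..` becomes `a`.
                    out.pop();
                } else if !at_root {
                    // A relative path may climb above its start: keep the `..`.
                    out.push("..");
                }
                // At the root, `..` is a no-op.
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}
```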

This is a prerequisite for lifting openshell-sandbox/src/policy.rs into
openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls
normalize_path, and lifting it as-is would cycle through openshell-policy.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift SandboxPolicy and friends to openshell-core

Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy,
ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode,
LandlockCompatibility, plus their Proto* TryFrom/From impls) to
openshell-core/src/policy.rs.

Both prospective supervisor leaves (networking and process) dispatch on
SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it
without depending on the other (or on the future orchestrator).

The From<ProtoFilesystemPolicy> impl now calls the in-crate
openshell_core::paths::normalize_path lifted in the previous commit, which
is what made this move cycle-free.

Update all crate::policy::* call sites in openshell-sandbox to
openshell_core::policy::*.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move child_env from openshell-sandbox

child_env (proxy_env_vars, tls_env_vars) is process-side only — consumed
by process.rs and ssh.rs. With the orchestrator staying in
openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf
crates, so process-only modules can land in
openshell-supervisor-process directly.

Add openshell-supervisor-process as a path dependency of
openshell-sandbox. Update process.rs and ssh.rs to import from
openshell_supervisor_process::child_env.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move skills from openshell-sandbox

Move the static skills installer (and its embedded resource directory)
out of openshell-sandbox into openshell-supervisor-process. The module
is process-side only — invoked once during sandbox start to drop
agent skill files into the workspace — and has no cross-leaf consumers.

Adds miette as a dependency and tempfile as a dev-dependency on
openshell-supervisor-process.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move mechanistic_mapper from openshell-sandbox

Move the mechanistic mapper (HTTP method/path → operation classifier
that derives policy proposals from connection summaries) out of
openshell-sandbox into openshell-supervisor-networking. Single internal
caller (run_policy_poll_loop in lib.rs) and only depends on
openshell-core + tracing — no cross-leaf entanglement.

First population of the openshell-supervisor-networking crate; adds
openshell-core and tracing as dependencies.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift procfs to openshell-core

Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner
resolution, file SHA256 hashing) from openshell-sandbox into
openshell-core. The module is consumed cross-leaf — by bypass_monitor
on the process side and by identity / proxy on the networking side —
so it has to sit below both leaves.
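
As background on the socket-owner step: each row of /proc/net/tcp carries the local address in hex, the owning UID, and a socket inode, and the inode can then be matched against the `socket:[inode]` links under /proc/<pid>/fd. The sketch below parses one IPv4 row. It is not the module's actual code; `TcpEntry` and `parse_proc_net_tcp_line` are hypothetical names, and the byte-order handling assumes a little-endian host.

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// The fields of a /proc/net/tcp row needed for owner resolution.
#[derive(Debug, PartialEq, Eq)]
pub struct TcpEntry {
    pub local: SocketAddrV4,
    pub uid: u32,
    pub inode: u64,
}

/// Parse one data row; returns `None` for the header or a malformed row.
pub fn parse_proc_net_tcp_line(line: &str) -> Option<TcpEntry> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    // Field 1 is `local_address` as HEXADDR:HEXPORT.
    let (addr_hex, port_hex) = fields.get(1)?.split_once(':')?;
    let raw = u32::from_str_radix(addr_hex, 16).ok()?;
    // Assumption: little-endian host, so the low byte is the first octet.
    let ip = Ipv4Addr::from(raw.to_le_bytes());
    let port = u16::from_str_radix(port_hex, 16).ok()?;
    // Field 7 is the owning uid, field 9 is the socket inode.
    let uid = fields.get(7)?.parse().ok()?;
    let inode = fields.get(9)?.parse().ok()?;
    Some(TcpEntry { local: SocketAddrV4::new(ip, port), uid, inode })
}
```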

Adds tracing, sha2, and hex as dependencies on openshell-core.
Updates the three call sites in openshell-sandbox to import from
openshell_core::procfs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move identity from openshell-sandbox

Move BinaryIdentityCache (path → SHA256 cache used to identify the
process behind an outbound connection) from openshell-sandbox into
openshell-supervisor-networking. The cache is consumed only by the
networking-side proxy and the orchestrator; with procfs already in
openshell-core there are no remaining cross-leaf dependencies.

Adds miette as a dependency and tempfile as a dev-dependency on
openshell-supervisor-networking. Adds a Default impl for
BinaryIdentityCache to satisfy clippy::new_without_default now that
the type is publicly exposed across crates.
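
The clippy::new_without_default lint fires when a public type has a zero-argument `new()` but no `Default` impl. A minimal sketch of the fix follows; the `digests` field and the `insert`/`get` methods are hypothetical stand-ins, not the real cache API.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Path to SHA256 (hex) cache. Internal layout is an assumption.
pub struct BinaryIdentityCache {
    digests: HashMap<PathBuf, String>,
}

impl BinaryIdentityCache {
    pub fn new() -> Self {
        Self { digests: HashMap::new() }
    }

    /// Hypothetical helper: remember a digest for a binary path.
    pub fn insert(&mut self, path: PathBuf, sha256_hex: String) {
        self.digests.insert(path, sha256_hex);
    }

    /// Hypothetical helper: look up a previously cached digest.
    pub fn get(&self, path: &Path) -> Option<&str> {
        self.digests.get(path).map(String::as_str)
    }
}

// Delegating to `new()` keeps the two constructors from drifting apart.
impl Default for BinaryIdentityCache {
    fn default() -> Self {
        Self::new()
    }
}
```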

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move agent-proposals flag from openshell-sandbox

Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the
test-only ProposalsFlagGuard out of openshell-sandbox into
openshell-supervisor-process::proposals. The flag is read only by the
process-side policy_local route handler and the orchestrator; lifting
it to openshell-core would have made core carry sandbox-owned runtime
state without buying anything.

The test-only ProposalsFlagGuard is still consumed from networking-side
l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands).
Expose it via a new optional `test-helpers` feature on
openshell-supervisor-process so test crates opt in explicitly without
pulling tokio sync primitives into production builds.

openshell-sandbox keeps its existing crate-private path
(`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports
so call sites and tests are unchanged.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift secrets to openshell-core

Move crates/openshell-sandbox/src/secrets.rs to
crates/openshell-core/src/secrets.rs so both supervisor leaves can reach
SecretResolver and the placeholder helpers without depending on
openshell-sandbox.

Add base64 to openshell-core deps (only stdlib + base64 are used).
Promote previously pub(crate) constructors and methods on
SecretResolver to pub since cross-crate callers (provider_credentials,
proxy/L7 tests) now name them across the crate boundary. Update import
paths in proxy.rs, l7/{rest,relay,websocket}.rs, and
provider_credentials.rs from crate::secrets to openshell_core::secrets.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift provider_credentials to openshell-core

Move crates/openshell-sandbox/src/provider_credentials.rs to
crates/openshell-core/src/provider_credentials.rs. Both supervisor
leaves now name ProviderCredentialState in their function signatures
(run_networking takes &ProviderCredentialState, run_process takes
ProviderCredentialState by value), and under Shape A leaves can't
depend on openshell-sandbox, so the type must live in openshell-core.

The orchestrator (run_sandbox in openshell-sandbox) remains the only
writer: it constructs ProviderCredentialState::from_environment and the
policy poll loop calls install_environment on credential rotation. Both
leaves stay pure readers via snapshot()/resolver().

Update import paths in proxy.rs, ssh.rs, and lib.rs from
crate::provider_credentials to openshell_core::provider_credentials.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* style: rustfmt import ordering

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(ocsf): move SandboxContext singleton from openshell-sandbox

Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback
+ getter from openshell-sandbox/src/lib.rs into a new
openshell-ocsf::ctx module. The type already lives in openshell-ocsf,
so its singleton lives next to it.

Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The
orchestrator (run_sandbox) now calls set_ctx during startup. The sandbox
crate keeps a `pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx;`
re-export so the 138 existing crate::ocsf_ctx() call sites resolve
unchanged.

When the sandbox modules themselves migrate into the leaf crates,
they'll import openshell_ocsf::ctx directly and the re-export goes
away.

Under Shape A neither leaf can depend on openshell-sandbox; both
already depend on openshell-ocsf to construct events, so this adds no
new dep edge.
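
A minimal sketch of the OnceLock plus LazyLock fallback shape, using only the standard library. The `sandbox_id` field, the `"unknown"` fallback value, and the `set_ctx` return type are assumptions; the real SandboxContext and function signatures may differ.

```rust
use std::sync::{LazyLock, OnceLock};

/// Stand-in for the OCSF sandbox context. The field is hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SandboxContext {
    pub sandbox_id: String,
}

// Written at most once, by the orchestrator during startup.
static CTX: OnceLock<SandboxContext> = OnceLock::new();

// Built lazily and only used when nobody has called `set_ctx` yet.
static FALLBACK: LazyLock<SandboxContext> = LazyLock::new(|| SandboxContext {
    sandbox_id: "unknown".to_string(),
});

/// Install the process-wide context; a second call is rejected.
pub fn set_ctx(ctx: SandboxContext) -> Result<(), SandboxContext> {
    CTX.set(ctx)
}

/// Read the context, falling back to the placeholder before startup wiring.
pub fn ctx() -> &'static SandboxContext {
    CTX.get().unwrap_or_else(|| &*FALLBACK)
}
```

The fallback means early callers (for example tests that never run the orchestrator) still get a usable context instead of a panic.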

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift grpc_client to openshell-core

Both prospective leaves (supervisor-networking and supervisor-process)
need CachedOpenShellClient, AuthedChannel, and the connect/fetch
helpers. Under Shape A the leaves cannot depend on openshell-sandbox,
so the type has to live below them. openshell-core already pulls in
tonic and miette; this enables tonic's channel/tls features and adds
tokio as a direct dep.

Updates all crate::grpc_client::* call sites in openshell-sandbox to
openshell_core::grpc_client::*. No re-export shim — the call-site
count was small enough to update directly.

See architecture/plans/sandbox-split-design-choices.md for the full
rationale and trade-offs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move denial_aggregator from openshell-sandbox

DenialAggregator and FlushableDenialSummary belong with the proxy and L7
layer that emit denials. Moves the file into openshell-supervisor-networking;
adds tokio as a regular dep there since DenialAggregator uses
tokio::sync::mpsc.

Drops the pub use openshell_core::DenialEvent re-export inside the moved
file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs,
and lib.rs to import openshell_core::DenialEvent directly.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move log_push from openshell-sandbox

LogPushLayer is a process-side tracing layer that streams sandbox logs
to the gateway via gRPC. Moves into openshell-supervisor-process; adds
openshell-core, openshell-ocsf, tokio-stream, tracing, and
tracing-subscriber as direct deps there.

Updates the only external call site (openshell-sandbox/src/main.rs) to
import from openshell_supervisor_process::log_push.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move bypass_monitor from openshell-sandbox

bypass_monitor reads /dev/kmsg for nftables drop log lines and emits
denial events. Pure process-side concern, called only from
run_networking which spawns it on the netns. Moves into
openshell-supervisor-process; all deps (openshell-core, openshell-ocsf,
tokio, tracing) were already declared there.

Replaces crate::ocsf_ctx() shim calls inside the moved file with
openshell_ocsf::ctx::ctx() — first leaf-side caller to import the OCSF
context singleton directly instead of going through openshell-sandbox's
re-export.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move debug_rpc from openshell-sandbox

debug_rpc is the CLI subcommand handler that exercises authenticated
gRPC calls (issue-token, refresh-token, get-config, etc.). Pure
process-side concern, called only from openshell-sandbox/main.rs.

Adds base64, hex, serde_json, sha2, and tonic (with channel/tls
features) as direct deps on openshell-supervisor-process. Updates the
single call site in main.rs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move supervisor_session from openshell-sandbox

supervisor_session opens a bidirectional gRPC stream that lets the
gateway initiate shells inside the sandbox. Pure process-side concern,
called only from run_process. Adds uuid as a direct dep on
openshell-supervisor-process.

Replaces crate::ocsf_ctx() shim calls inside the moved file with
openshell_ocsf::ctx::ctx() — same pattern as bypass_monitor.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): lift managed_children tracker from openshell-sandbox

The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children
(entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can
distinguish them from incidental zombies. Pure process-side concern,
moves to openshell_supervisor_process::managed_children with three
public fns: register, unregister, is_managed.
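
A minimal sketch of such a tracker with the three functions named above. The PID type (`u32`) and the use of a `Mutex<HashSet<_>>` behind a `LazyLock` are assumptions, not the module's confirmed layout.

```rust
use std::collections::HashSet;
use std::sync::{LazyLock, Mutex};

// Process-wide set of PIDs spawned by the supervisor.
static MANAGED_CHILDREN: LazyLock<Mutex<HashSet<u32>>> =
    LazyLock::new(|| Mutex::new(HashSet::new()));

/// Record a freshly spawned child (entrypoint or SSH session).
pub fn register(pid: u32) {
    MANAGED_CHILDREN.lock().unwrap().insert(pid);
}

/// Forget a child once its owner has collected the exit status.
pub fn unregister(pid: u32) {
    MANAGED_CHILDREN.lock().unwrap().remove(&pid);
}

/// The reaper asks this before reaping, so it skips children that
/// another task is already waiting on.
pub fn is_managed(pid: u32) -> bool {
    MANAGED_CHILDREN.lock().unwrap().contains(&pid)
}
```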

Updates lib.rs reaper, process.rs, and ssh.rs to call through the new
module path. Drops the now-unused HashSet import from lib.rs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move sandbox hardening from openshell-sandbox

Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox,
prepare/enforce, log_sandbox_readiness, top-level apply, and
apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/
to crates/openshell-supervisor-process/src/sandbox/.

Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both
eventual leaf crates (supervisor-networking and supervisor-process) read from
NetworkNamespace and its final home is decided when run_networking and
run_process are extracted.

Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with
direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs,
process.rs, and ssh.rs to import sandbox from openshell_supervisor_process
while keeping the netns import unchanged.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift proposals flag from openshell-supervisor-process

Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled
reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process
to openshell-core so both eventual leaf crates can read it without depending
on each other.

The flag is process-wide singleton state initialised once during sandbox
startup and read by both the policy.local route (networking-side) and the
skills installer (process-side) — same shape as openshell_ocsf::ctx.

Move the test-helpers Cargo feature alongside it: openshell-core gains the
feature, openshell-supervisor-process loses it, and openshell-sandbox's
dev-dependency now enables openshell-core/test-helpers. Update the sandbox
re-export shim to point at openshell_core::proposals.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(core): lift netns + nft_ruleset from openshell-sandbox

Move NetworkNamespace and the nft_ruleset bypass-rule generator from
crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/.
Both eventual leaf crates (supervisor-networking and supervisor-process) read
from NetworkNamespace, so it must live somewhere both can depend on without
violating the Shape A no-leaf-to-leaf rule.

Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx()
calls, matching the pattern used in already-migrated process modules. Update
super::nft_ruleset references inside netns to nft_ruleset since the module
is now a sibling sub-module of netns/mod.rs.

Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and
gate pub mod netns on target_os = "linux" since the implementation uses
netlink, ip(8), and namespace fds. Delete the now-empty sandbox/{mod.rs,
linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and
process.rs to point at openshell_core::netns.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move process.rs and ssh.rs from openshell-sandbox

Lift the entrypoint process spawn module and the embedded SSH server
module into openshell-supervisor-process. openshell-sandbox now
re-exports ProcessHandle/ProcessStatus and calls
openshell_supervisor_process::ssh::run_ssh_server directly.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move proxy, l7, opa, policy_local from openshell-sandbox

Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local
advisor API into openshell-supervisor-networking. Move accompanying data
files (sandbox-policy.rego), test fixtures (testdata/), and integration
tests (system_inference, websocket_upgrade). Sandbox lib.rs now references
these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr
is exposed as pub for the orchestrator call site.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(sandbox): hoist policy poll loop and denial aggregator into orchestrator

Move the symlink-resolver, policy poll loop, and denial-aggregator flush
spawns out of run_process and into run_sandbox so run_process no longer
needs OpaEngine, retained_proto, the local policy context, the sandbox
name, the gateway endpoint for telemetry, the OCSF flag, or the denial
receiver. These long-running orchestrator-owned tasks now live alongside
the other sandbox-startup wiring, matching the design log decision in
architecture/plans/sandbox-split-design-choices.md (Q5).

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move run_process from openshell-sandbox

Lift the workload supervision entry point (zombie reaper, SSH server
spawn, supervisor session, entrypoint child spawn, exit-with-timeout)
into its own module in openshell-supervisor-process. The orchestrator
in openshell-sandbox now calls openshell_supervisor_process::run::run_process
directly. With this move run_process names only types from openshell-core,
openshell-ocsf, openshell-supervisor-process itself, std, and tokio —
no openshell-supervisor-networking dependency.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move bypass_monitor from supervisor-process

Bypass detection is network-policy enforcement: it parses nftables LOG
entries from /dev/kmsg and emits OCSF NetworkActivity / DetectionFinding
events plus DenialEvents into the same channel the proxy feeds. Its
lifetime is tied to the network namespace, not to the workload child.
Moving it to openshell-supervisor-networking puts it next to the proxy
and the denial aggregator that consume its output, and unblocks moving
run_networking out of openshell-sandbox without a leaf-to-leaf dep.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move inference route helpers from openshell-sandbox

Move build_inference_context, partition_routes, bundle_to_resolved_routes,
spawn_route_refresh, the InferenceRouteSource enum, and the route refresh
interval helpers into a new openshell-supervisor-networking::inference_routes
module along with their unit tests. The orchestrator now calls into the
networking leaf for inference context construction; the leaf owns its own
route bundle resolution end-to-end.

The new module is named inference_routes to avoid colliding with the
existing l7::inference module, which handles request-time HTTP parsing
and pattern matching rather than route bundle setup.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-networking): move run_networking from openshell-sandbox

Move the Networking handle struct, run_networking, and the Linux-only
create_netns_for_proxy helper into a new openshell-supervisor-networking::run
module. The orchestrator in openshell-sandbox now invokes
openshell_supervisor_networking::run::{create_netns_for_proxy, run_networking}
and reads the Networking fields directly; the leaf owns the entire
networking-stack startup path (CA generation, proxy task, bypass monitor,
inference context, denial channel) end-to-end.

The Networking RAII handle fields (proxy, bypass_monitor) are now public
without leading underscores so the public API satisfies clippy's
pub_underscore_fields lint while still serving as drop guards held by the
orchestrator's frame.
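
The drop-guard role of those fields can be sketched as follows. `TaskGuard` is a hypothetical stand-in for the real proxy and bypass-monitor handle types; it only flips a flag on drop so the effect is observable.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical handle: signals its background task to stop when dropped.
pub struct TaskGuard {
    stopped: Arc<AtomicBool>,
}

impl TaskGuard {
    pub fn new(stopped: Arc<AtomicBool>) -> Self {
        Self { stopped }
    }
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        self.stopped.store(true, Ordering::SeqCst);
    }
}

/// Public fields without leading underscores satisfy pub_underscore_fields,
/// yet they still act as guards for as long as the struct is held.
pub struct Networking {
    pub proxy: TaskGuard,
    pub bypass_monitor: TaskGuard,
}
```

As long as the orchestrator keeps the `Networking` value in its frame, neither guard runs; dropping the struct tears both down.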

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* fix(workspace): align Cargo deps and call sites for split crates

The recent module lifts left three Linux-only gaps that the macOS host
workspace check skipped:

- openshell-core's netns module needs libc, tempfile, and nix on Linux,
  but only openshell-ocsf and uuid were carried over.
- openshell-supervisor-process's seccomp/landlock modules need landlock
  and seccompiler, which still lived on openshell-sandbox.
- openshell-sandbox's runtime_pid_limit branch referenced an unqualified
  process:: that pointed at the old in-crate module.

Move landlock/seccompiler to supervisor-process, add the missing core
deps, qualify the call sites, and drop sandbox deps that no longer have
runtime users (landlock, seccompiler, target-gated tempfile/uuid, the
unix libc/rustix block).

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): rename openshell-supervisor-networking to openshell-supervisor-network

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): own denial-aggregator flush end-to-end

Move the denial-aggregator spawn and flush_proposals_to_gateway out of
run_sandbox and into run_networking. The networking leaf already owns
every other input (proxy + bypass_monitor as producers, denial channel,
mechanistic_mapper, denial_aggregator) and already opens its own gRPC
connections (inference_routes, policy_local) — the orchestrator was the
only piece left straddling the boundary.

Networking now drives the full path: producers -> channel -> aggregator
-> flush -> gateway. Drops denial_rx from Networking; adds sandbox_name
to run_networking so SubmitPolicyAnalysis can resolve by sandbox name
(falls back to ID when unset). Same shape as log_push in the process leaf.
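
The producers, channel, and aggregator stages can be sketched with the standard library. The real code uses tokio::sync::mpsc; std::sync::mpsc and threads are swapped in here only to keep the example self-contained. The `DenialEvent` fields and the `run_pipeline` helper are hypothetical, and the final flush to the gateway is left out.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical event shape; the real DenialEvent fields are not shown here.
#[derive(Debug, Clone)]
pub struct DenialEvent {
    pub host: String,
    pub port: u16,
}

/// Aggregator stage: drain the channel and count denials per destination.
pub fn aggregate(rx: mpsc::Receiver<DenialEvent>) -> BTreeMap<(String, u16), u64> {
    let mut counts = BTreeMap::new();
    for event in rx {
        *counts.entry((event.host, event.port)).or_insert(0) += 1;
    }
    counts
}

/// Two producers (proxy and bypass monitor) feed one shared channel.
pub fn run_pipeline(
    proxy_events: Vec<DenialEvent>,
    bypass_events: Vec<DenialEvent>,
) -> BTreeMap<(String, u16), u64> {
    let (tx, rx) = mpsc::channel();
    let proxy_tx = tx.clone();
    let proxy = thread::spawn(move || {
        for event in proxy_events {
            proxy_tx.send(event).unwrap();
        }
    });
    let bypass = thread::spawn(move || {
        for event in bypass_events {
            tx.send(event).unwrap();
        }
    });
    proxy.join().unwrap();
    bypass.join().unwrap();
    // Both senders are dropped by now, so the receiver loop terminates.
    aggregate(rx)
}
```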

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): own symlink-resolution task

Move the OPA binary-symlink resolver out of run_sandbox and into
run_networking. The task probes /proc/<entrypoint_pid>/root/ until the
workload's mount namespace is accessible, then rebuilds the OPA engine
with resolved binary paths so policy rules match canonical names instead
of symlinks.

Both inputs (Arc<OpaEngine>, retained_proto) are networking-leaf concerns
and were already plumbed into run_networking; the entrypoint_pid Arc is
read lazily after the process leaf populates it. Adds retained_proto as
a parameter and spawns the resolver early in run_networking so the probe
loop starts before the proxy comes up.

Same shape as the denial-flush move: networking owns its own background
task end-to-end; the orchestrator stops hosting work that doesn't
conceptually belong to it.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move seccomp install into run_process

The supervisor seccomp prelude is part of "set up the workload-side
process tree", not part of orchestration. Move the call site from
run_sandbox into the top of run_process and drop the now-unused
re-export from openshell-sandbox::lib.

Timing is preserved: by the time the orchestrator calls run_process,
run_networking has already returned, so netns + nftables setup is
complete.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move check_runtime_pid_limit into run_process

The PID-limit precondition is process-side: it gates whether the workload
child can be spawned at all. Move the call from run_sandbox into the top
of run_process, alongside the seccomp prelude. Same shape as the seccomp
move — function already lives process-side, only the call site relocates.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move validate_sandbox_user to process crate

The sandbox-user check is a precondition for privilege-dropping the
workload child; it has no relevance to networking. Move the function
next to drop_privileges in openshell-supervisor-process::process and
call it from the top of run_process.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move prepare_filesystem to process crate

Creating and chowning read_write directories is workload-side
preparation, not orchestration. Move prepare_filesystem and its
prepare_read_write_path helper (plus tests) into
openshell-supervisor-process::process and call from run_process,
alongside validate_sandbox_user.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-process): move startup skill install into run_process

The eager initial-settings fetch + agent skill install is process-side:
the install materializes files the workload's filesystem sees. The
orchestrator still owns the AGENT_PROPOSALS_ENABLED OnceLock init
because the policy poll loop also reads it; only the early fetch and
install hop into run_process.

Behavior unchanged. Best-effort: any RPC or install failure is logged
but does not fail startup.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): own PolicyLocalContext construction

Move the PolicyLocalContext construction from run_sandbox into
run_networking. The orchestrator was building it solely to thread it into
the networking leaf and to share it with the policy poll loop; now
run_networking builds it from inputs it already takes (retained_proto,
openshell_endpoint, sandbox_name|sandbox_id) and exposes it on the
returned Networking struct.

The orchestrator's poll loop now grabs the Arc clone from
networking.policy_local_ctx, so the orchestrator no longer imports
openshell_supervisor_network::policy_local.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* feat(supervisor): add --mode flag to gate network/process leaves

Add a --mode flag (default "network,process") that selects which
supervisor leaves run in the current process. Two new shapes are
unlocked without splitting the binary:

  --mode=network             # network-only sidecar
  --mode=process             # process-only supervisor
  --mode=network,process     # combined (default; current behavior)

In network-only mode the orchestrator skips run_process and waits on
SIGINT/SIGTERM before tearing down the proxy. The entrypoint PID stays
at 0 for the lifetime of the process, which silently degrades the
proxy's binary-identity TOFU and the bypass monitor's PID enrichment;
this is correct in a split-pod topology where the workload's /proc
lives in another pod.

In process-only mode run_networking is skipped entirely. SSH sessions
get no proxy URL, no netns FD, and no CA paths, matching what a
split-pod consumer would expect when network enforcement is delegated
to a sidecar.

The policy poll loop continues to run unconditionally; its OPA-reload
and policy.local hooks already gate on the resources only present when
network is enabled, and the env-refresh / proposals-toggle hooks
remain active in process mode.
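
As a rough illustration of how a comma-separated mode value could be
turned into a set of enabled leaves, here is a minimal sketch. `Leaf`
and `parse_mode` are hypothetical names chosen for this example; the
actual flag parsing is not shown in this commit message.

```rust
use std::collections::BTreeSet;

/// Supervisor leaves that can run in the current process.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Leaf {
    Network,
    Process,
}

/// Parse a `--mode` value such as "network,process" into a set of leaves.
/// Unknown or empty entries are rejected rather than silently ignored.
pub fn parse_mode(value: &str) -> Result<BTreeSet<Leaf>, String> {
    let mut leaves = BTreeSet::new();
    for part in value.split(',').map(str::trim) {
        match part {
            "network" => {
                leaves.insert(Leaf::Network);
            }
            "process" => {
                leaves.insert(Leaf::Process);
            }
            other => return Err(format!("unknown supervisor mode: {other:?}")),
        }
    }
    Ok(leaves)
}
```

With a set like this, the orchestrator can gate each leaf on a
membership check (for example, skip run_process when `Leaf::Process` is
absent) instead of branching on the raw string.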

This is a step toward the RFC-0001 supervisor topology proposed in
issue #1305 by drew.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* style(supervisor-process): rustfmt long debug! line

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): pull DenialEvent down from core

DenialEvent is only emitted and consumed inside openshell-supervisor-network
(proxy, bypass monitor, denial aggregator). It never crossed the leaf
boundary, so the earlier lift to openshell-core was speculative. Move it
back into the network crate where its only callers live.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): pull procfs down from core

procfs was lifted to openshell-core under the assumption it would be
shared cross-leaf, but on the current branch all three callers
(bypass_monitor, identity, proxy) live in openshell-supervisor-network.
No file in openshell-supervisor-process imports it. Move the module to
the network crate and drop sha2/hex from openshell-core, which were
pulled in only for procfs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* style(supervisor-network): run cargo fmt

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* fix(supervisor-network): add libc dev-dependency for procfs tests

The procfs/bypass_monitor/proxy test modules use libc::{fork, exec,
fcntl, kill, waitpid} but the dep wasn't declared in this crate's
Cargo.toml. It was previously satisfied transitively when these
modules lived in openshell-core; the move left the test target
unable to resolve libc.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(sandbox): move denial aggregator to orchestrator

The denial aggregator and mechanistic mapper consume denial events
produced by the proxy and (subsequently) the bypass monitor. With both
supervisor leaves becoming pure producers of `DenialEvent`, the
consumer-side aggregation belongs in the orchestrator, not in either
leaf.

Move `denial_aggregator.rs` and `mechanistic_mapper.rs` from
`openshell-supervisor-network` to `openshell-sandbox` (the
orchestrator). The orchestrator now owns the unbounded denial channel:
it constructs `(tx, rx)`, hands `tx` to `run_networking` for the proxy
to clone, drains `rx` via the aggregator task, and runs the gateway
flush helper itself.

`run_networking`'s signature gains a `denial_tx` parameter and loses
its internal channel construction, aggregator spawn, and
`flush_proposals_to_gateway` helper. `DenialEvent` stays in
`openshell-supervisor-network` for now; a follow-up commit will lift
it to `openshell-core` alongside the bypass monitor relocation.
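
The ownership shape described above can be sketched as follows. This is
a simplified model, not the real code: it uses `std::sync::mpsc` and
threads in place of whatever async unbounded channel the crates use, and
the `DenialEvent` field, the event values, and the `aggregate` and
`orchestrate` helpers are assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Stand-in for the real `DenialEvent`; the field is illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DenialEvent {
    pub host: String,
}

/// Stand-in for a producer leaf: it receives only the sender half.
pub fn run_networking(denial_tx: Sender<DenialEvent>) -> JoinHandle<()> {
    thread::spawn(move || {
        for host in ["a.example", "b.example", "a.example"] {
            // A send error only means the consumer is gone; ignore it.
            let _ = denial_tx.send(DenialEvent { host: host.to_string() });
        }
        // `denial_tx` is dropped here, which lets the consumer finish.
    })
}

/// Stand-in for the aggregator: drain events until every sender is dropped.
pub fn aggregate(rx: Receiver<DenialEvent>) -> Vec<(String, usize)> {
    let mut counts = BTreeMap::new();
    for event in rx {
        *counts.entry(event.host).or_insert(0usize) += 1;
    }
    counts.into_iter().collect()
}

/// Orchestrator side: construct the channel, hand out `tx`, drain `rx`.
pub fn orchestrate() -> Vec<(String, usize)> {
    let (tx, rx) = channel();
    let producer = run_networking(tx);
    let totals = aggregate(rx);
    producer.join().expect("producer thread panicked");
    totals
}
```

Because the leaf only ever sees a sender, it cannot depend on how events
are aggregated or flushed, which is the decoupling this commit is after.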

* refactor(supervisor-process): pull bypass monitor down from network

`bypass_monitor` is process-isolation machinery: it tails the kernel
log via `dmesg --follow`, parses nftables LOG lines emitted from the
workload's network namespace, resolves PIDs via `/proc`, and emits
OCSF events plus optional `DenialEvent`s. None of this touches the
proxy, OPA, TLS, or any other supervisor-network state — it only
shared the denial channel because both feed the same aggregator.

Move `bypass_monitor.rs` from `openshell-supervisor-network` to
`openshell-supervisor-process` (as `bypass_monitor/mod.rs`). Spawn it
in `run_process` where the netns name and entrypoint PID are already
in scope. The orchestrator hands an extra `bypass_denial_tx` clone of
the denial channel sender to `run_process` for this purpose.

Lift `DenialEvent` from `openshell-supervisor-network` to
`openshell-core`. Both supervisor leaves now produce it, so it needs
a shared location that neither leaf depends on. This reverses an
earlier commit that pulled the type into the network leaf when it was
the only producer.

Copy the minimal subset of `/proc` parsers used by `bypass_monitor`
into a private `bypass_monitor::procfs` submodule. The alternative —
extracting a shared procfs crate — is a much larger refactor that
this commit does not need; supervisor-network's `procfs.rs` continues
to serve the proxy and identity cache.

* refactor(supervisor-process): derive ssh netns fd inside run_process

The ssh_netns_fd was computed in run_networking purely to forward it
through the Networking struct and back into run_process. supervisor-network
never read it. Move the derivation to run_process where the
NetworkNamespace handle is already in scope.

* refactor(supervisor-process): derive ssh proxy url inside run_process

The ssh_proxy_url was computed in run_networking purely to forward it
through the Networking struct and back into run_process. supervisor-network
never read it. Move the derivation to run_process where the
NetworkNamespace handle and SandboxPolicy are already in scope.

After this commit the Networking struct no longer carries any SSH-shaped
fields, and supervisor-network reads only host_ip from the netns (for the
proxy bind address).

* refactor(supervisor-network): take proxy bind ip directly instead of netns

run_networking only ever read host_ip from the netns it was passed (the
SSH plumbing reads moved to run_process in earlier commits). Replace the
NetworkNamespace parameter with a plain Option<IpAddr> the orchestrator
extracts. supervisor-network's run module no longer references the netns
type for any consumer, only for create_netns_for_proxy (which still lives
in this crate; relocates next).

* refactor(supervisor-process): move netns ownership out of core

Relocates the NetworkNamespace handle, nft ruleset builder, and
create_netns_for_proxy constructor into openshell-supervisor-process.
The orchestrator (openshell-sandbox) phantom-owns the RAII handle for
the duration of run_sandbox; supervisor-network no longer references
the type at all.
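
A minimal sketch of what holding an RAII handle for the duration of a
function looks like in Rust. The `NetworkNamespace` here is a toy
stand-in whose `Drop` only flips a flag, and `create` plus the
`run_sandbox` signature are hypothetical; the real type tears down the
namespace on drop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Toy stand-in for the real handle: teardown is modelled as a flag.
pub struct NetworkNamespace {
    torn_down: Arc<AtomicBool>,
}

impl NetworkNamespace {
    /// Hypothetical constructor used only in this example.
    pub fn create(torn_down: Arc<AtomicBool>) -> Self {
        Self { torn_down }
    }
}

impl Drop for NetworkNamespace {
    fn drop(&mut self) {
        // The real handle would delete the namespace here.
        self.torn_down.store(true, Ordering::SeqCst);
    }
}

/// The orchestrator keeps the handle alive without otherwise using it.
/// Returns whether the namespace was still alive while the body ran.
pub fn run_sandbox(torn_down: &Arc<AtomicBool>) -> bool {
    // A named `_netns` binding lives until the end of the function;
    // binding to a bare `_` would drop the handle immediately.
    let _netns = NetworkNamespace::create(Arc::clone(torn_down));
    // ... the leaves would run here ...
    !torn_down.load(Ordering::SeqCst)
}
```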

Drops uuid, libc, nix, openshell-ocsf, and tempfile from core's Linux
target deps (all were exclusive to netns). tempfile becomes a Linux
runtime dep on supervisor-process for nft ruleset application.

* chore(sandbox): prune leaf-only deps from orchestrator manifest

cargo-machete flagged 26 direct dependencies that were carried over
from the pre-split monolith and are no longer used by the orchestrator
itself: regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser,
openshell-router, anyhow, base64, bytes, flate2, glob, hex, hmac, nix,
rand_core, rustls-pemfile, serde, serde_yml, sha1, sha2, thiserror,
tokio-stream, uuid, webpki-roots.

These now live (transitively) in openshell-supervisor-network and
openshell-supervisor-process where they are actually consumed.

* chore(deps): prune unused deps from supervisor crates

- Drop unused `url` from openshell-supervisor-network.
- Mark `prost` and `prost-types` as cargo-machete-ignored in
  openshell-core: they have no source-level `use`, but the tonic-
  generated proto code references them via `::prost::Message` etc.
- openshell-supervisor-process is already clean.

* fix(supervisor-network): wait for entrypoint PID before symlink probe

The OPA symlink-resolution task reads entrypoint_pid once at the top
of the spawned closure. Because the spawn happens before run_process
publishes the workload PID, the load returns 0, the probe path bakes
in as /proc/0/root/, and the loop exhausts its retries against a path
that does not exist on Linux. The reload never fires, so policies that
whitelist symlinked binaries (e.g. /usr/bin/python3 → python3.11) get
silent denials when the workload execs the realpath.

Split the wait into two phases: 5s polling entrypoint_pid for a
non-zero value, then the existing 5s window probing /proc/<pid>/root/.
Distinct warn messages on each timeout so future debugging can tell
"PID never published" apart from "container fs never appeared".

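The two-phase wait can be sketched roughly as below. This is a
synchronous approximation under stated assumptions: the real task is
spawned asynchronously, `entrypoint_pid` is assumed to be an
`AtomicU32`, and `WaitError`, `poll_until`, `wait_for_workload_root`,
and the 10 ms poll interval are hypothetical. The `proc_root` parameter
exists only so the sketch can be exercised against a temporary
directory instead of `/proc`.

```rust
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Why the wait gave up; one variant per distinct warn message.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    PidNeverPublished,
    RootNeverAppeared(u32),
}

/// Call `check` repeatedly until it yields a value or `timeout` elapses.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}

/// Phase 1 waits for a non-zero PID; phase 2 waits for `<proc_root>/<pid>/root`.
pub fn wait_for_workload_root(
    entrypoint_pid: &AtomicU32,
    proc_root: &Path,
    timeout: Duration,
) -> Result<PathBuf, WaitError> {
    let interval = Duration::from_millis(10);

    // Phase 1: never build a probe path from PID 0.
    let pid = poll_until(timeout, interval, || {
        match entrypoint_pid.load(Ordering::Acquire) {
            0 => None,
            pid => Some(pid),
        }
    })
    .ok_or(WaitError::PidNeverPublished)?;

    // Phase 2: the path is built only after the PID is known.
    let root = proc_root.join(pid.to_string()).join("root");
    poll_until(timeout, interval, || root.exists().then(|| root.clone()))
        .ok_or(WaitError::RootNeverAppeared(pid))
}
```

Keeping the two failure cases as separate variants is what allows the
caller to log a different warning for each timeout.
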
* fix(sandbox): restore GPU procfs baseline (#1522)

Signed-off-by: Evan Lezar <elezar@nvidia.com>
(cherry picked from commit 5102cb9)
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* fix(supervisor-process): use renamed tonic tls-native-roots feature

Upstream renamed the tonic `tls` feature to `tls-native-roots`. The
supervisor-process Cargo.toml still referenced the old name, which broke
the workspace build after merging upstream.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): relocate token_grant and spiffe_endpoint

Upstream's SPIFFE-backed token grant feature landed in
crates/openshell-sandbox/src/. After the supervisor split, the L7
enforcement code in supervisor-network calls into token_grant, which
would require supervisor-network to depend back on sandbox.

Move token_grant.rs and spiffe_endpoint.rs into supervisor-network
where the only callers live, add the reqwest and spiffe deps to
supervisor-network's Cargo.toml, and drop them from sandbox.

Also fix two stale `openshell_core::proto::` self-references in
openshell-core (a pre-existing breakage that surfaced once the rest of
the merge compiled).

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* fix(supervisor-process): broaden Path import cfg to all unix targets

The `Path` import was gated on `cfg(any(test, target_os = "linux"))`,
but `prepare_read_write_path` is gated on `cfg(unix)` — broader. On
non-Linux unix the function still referenced `&std::path::Path`
explicitly, so upstream's qualified path was load-bearing.

After the supervisor split, lint runs on Linux where `Path` IS in
scope, so `unused_qualifications` fires. Broaden the import cfg to
match the function's cfg and use the bare `Path` name everywhere.
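
In outline, the fix makes the import gate and the function gate agree.
The sketch below shows only that shape; the function body is a
placeholder (the real helper also chowns the directory), and its
signature is an assumption.

```rust
// The import is gated on the same cfg as the function that uses it, so
// the bare `Path` name resolves on every target where the function is
// compiled and no fully qualified `std::path::Path` is needed.
#[cfg(unix)]
use std::path::Path;

/// Placeholder body: create the directory if it does not exist yet.
#[cfg(unix)]
pub fn prepare_read_write_path(path: &Path) -> std::io::Result<()> {
    std::fs::create_dir_all(path)
}
```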

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

---------

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>