fix(sandbox): restore GPU procfs baseline by elezar · Pull Request #1522 · NVIDIA/OpenShell

elezar · 2026-05-22T13:47:40Z

Summary

Restore CUDA GPU startup compatibility by promoting /proc from
filesystem_policy.read_only to filesystem_policy.read_write when /proc
is part of the active GPU runtime baseline.

This keeps the change intentionally narrow. The existing baseline enrichment
already places /proc in the GPU read-write baseline because CUDA writes
/proc/<pid>/task/<tid>/comm during initialization. The missing behavior was
that an existing read-only /proc entry caused enrichment to skip the
read-write baseline path. This PR restores that promotion and emits an
informational log message when it happens.
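
The promotion described above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FilesystemPolicy`, `enrich_gpu_read_write`, and the plain path vectors are hypothetical stand-ins. Only the `read_only`/`read_write` field names and the /proc rule come from the description.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the sandbox filesystem policy; the real
/// OpenShell types are not shown in this PR description.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FilesystemPolicy {
    pub read_only: Vec<PathBuf>,
    pub read_write: Vec<PathBuf>,
}

/// Merge a GPU read-write baseline into `policy` (illustrative only).
///
/// Returns the paths promoted from `read_only` to `read_write` so the
/// caller can emit an informational log for each of them.
pub fn enrich_gpu_read_write(
    policy: &mut FilesystemPolicy,
    baseline: &[PathBuf],
) -> Vec<PathBuf> {
    let mut promoted = Vec::new();
    for path in baseline {
        if policy.read_write.contains(path) {
            continue; // already writable, nothing to do
        }
        if policy.read_only.contains(path) {
            // Only /proc is promoted; any other read-only/read-write
            // conflict keeps the earlier behavior and is skipped.
            if path.as_path() != Path::new("/proc") {
                continue;
            }
            policy.read_only.retain(|p| p != path);
            promoted.push(path.clone());
        }
        policy.read_write.push(path.clone());
    }
    promoted
}
```

A caller would pass the active GPU baseline and log each returned path at info level. Returning the promoted paths rather than logging inside the helper keeps the sketch free of any logging dependency.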

Broader handling for user-supplied policy conflicts and explicit baseline
conflict controls is left to follow-up work such as #1629.

Related Issue

Fixes #1486

Related follow-up: #1629

Changes

- Promote /proc from read_only to read_write when the GPU read-write baseline requires it.
- Preserve existing behavior for other read-only/read-write baseline conflicts.
- Emit an informational log when /proc is promoted for GPU runtime compatibility.
- Add a regression test covering GPU baseline enrichment without network policy.

Testing

- mise exec -- cargo fmt --all
- mise exec -- cargo test -p openshell-sandbox --lib baseline_tests -- --nocapture
- mise run pre-commit: Helm lint, Rust format, Rust check, Rust clippy, markdown lint, and license checks completed; python:proto failed in the parallel run because grpc_tools was missing after the .venv was recreated.
- mise run python:proto

Checklist

- Follows Conventional Commits
- Commits are signed off (DCO)
- Architecture/docs updated (not applicable for this minimal runtime fix)

pimlock

LGTM with a few nits and questions.

elezar · 2026-06-01T19:29:39Z

Thanks for your initial review @pimlock. After the initial back and forth, I realised that there were a number of edge cases that I was not considering. I believe I was trying to detect user intent with insufficient signal and as such have updated this PR to ALWAYS promote /proc to read-write if GPUs are requested and instead capture explicit intent in #1629 as a follow-up. This PR would unblock the GPU-enabled tests, but I'm happy to continue iterating on it if required.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

pimlock · 2026-06-02T16:44:09Z

Thanks! I took a first pass at #1629 and I like the approach. I think it's great for the mechanism to be more explicit and exposing it through the policy makes sense, so the full picture of what's allowed is in the policy.

* fix(ci): eliminate image-tag race between concurrent workflows (#1413)
  - Add publish-manifest input to docker-build.yml (default true); single-arch branch callers set it false so the merge job is skipped and the shared bare :SHA tag in GHCR is never written by branch workflows
  - branch-kubernetes-e2e: retag :SHA-amd64 to :SHA before kind load so Helm's image.tag matches what is loaded in kind containerd
  - branch-e2e: pass image-tag as :SHA-arm64 to e2e-test so the arch-specific GHCR tag is used directly without depending on the bare tag
  - bare :SHA in GHCR is now written only by test-gpu.yml (multi-arch build), eliminating the last-writer-wins race across concurrent workflows
* test(server): cover service endpoint plaintext security (#1352)
  * test(server): cover service endpoint plaintext security
  * test(server): align tls test with from_files Option<&Path> signature
    TlsAcceptor::from_files now accepts the client CA path as Option<&Path> (per the require_client_auth refactor on main). Wrap the helper's CA path in Some(...) so the new plaintext-service-http tests compile after rebasing onto current main.
  ---------
  Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
* fix(cli): add auth and TLS support to completion client (#1489)
* fix(scripts): use portable lowercase in normalize_bool for Bash 3.2 (#1493)
* refactor(server): extract shared relay-await and sandbox-scan helpers (#1495)
* fix(sandbox): skip fork-exec socket ambiguity test on SELinux-enforcing hosts (#1449)
  Exec'ing /bin/sleep (SELinux label bin_t) from a user_home_t test binary causes /proc/<pid>/exe readlink to return ENOENT on SELinux-enforcing hosts due to the cross-domain boundary. Skip the test at runtime when getenforce reports Enforcing. Also adds a ChildGuard drop guard for safe child cleanup on panic and increases the exec-detection deadline from 2s to 5s.
  Signed-off-by: Derek Carr <decarr@redhat.com>
* fix(sandbox): allow first-label L7 host wildcards (#1304)
  * fix(sandbox): allow first-label L7 host wildcards
  * docs(sandbox): document L7 host wildcard contract + add OPA runtime tests
    - Add Host Wildcards section to architecture/security-policy.md describing accepted (first-label *, **, intra-label *-X) and rejected (bare, TLD, non-first-label, recursive-in-label) forms, and noting that wildcards never cross '.' boundaries.
    - Expand the policy-schema.mdx 'host' field description to reflect the same contract instead of only mentioning '*.example.com'.
    - Add OPA runtime tests asserting '*-aiplatform.googleapis.com' matches 'us-central1-aiplatform.googleapis.com' and does not match 'us-central1.aiplatform.googleapis.com' (cross-dot boundary). Locks validator/runtime alignment for intra-label wildcards.
  * chore: update mise lockfile
  * test(server): tolerate serialized inference upserts
  ---------
  Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(cli): add JSON/YAML output format to gateway list (#1500)
  Add -o/--output flag to `openshell gateway list` matching the existing sandbox list pattern, enabling machine-readable output for scripting.
  Signed-off-by: Florent Benoit <fbenoit@redhat.com>
* refactor: deduplicate repeated patterns across crates (#1499)
  Remove ~280 lines of duplicated code across 30 files in 5 areas:
  - centered_rect: consolidate 5 identical TUI layout helpers into a single pub fn in openshell-tui/src/ui/mod.rs
  - server test helpers: replace ~100 inline Store::connect() calls with local test_store() helpers; deduplicate test_server_state() in grpc/service.rs to use the shared test_support version
  - rogue PKI: extract 20-line rogue CA+client cert generation block (duplicated in two integration tests) into generate_rogue_pki() in tests/common/mod.rs
  - provider tests: replace 8 identical 28-line test modules with a single macro_rules! test_discovers_env_credential! invocation
  - label constants: centralize openshell.ai/ container label keys in openshell-core::driver_utils; update Docker and Kubernetes drivers to import from there instead of redefining them locally
* fix(ci): resolve mirror gate statuses for fork PRs (#1504)
  Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
* fix(server): respect OPENSHELL_PODMAN_SOCKET env var in embedded driver (#1483)
  The env var was only wired up via clap in the standalone openshell-driver-podman binary. When the Podman driver runs embedded in the gateway, config came exclusively from TOML deserialization and the env var was never consulted. Apply it as a post-deserialization override, matching the existing OPENSHELL_K8S_WORKSPACE_DEFAULT_STORAGE_SIZE pattern.
  Closes #1446
* refactor(sandbox,driver-vm): Start moving to rustix (esp over libc unsafe) (#1505)
  In the Rust ecosystem there's largely three ways to do system calls:
  - raw libc
  - nix
  - rustix
  Of the three, libc is almost all `unsafe` and really 95% of use cases should be either nix or rustix. nix is the original one, but after having looked at the code of both, I think rustix is just better designed and organized. It's also reached 1.0, whereas nix is still making semver-breaking changes (in fact we're behind here in this project).
  Now in practice, we have both *transitively* in the depchain already, and that's true for quite a lot of projects. But I think rustix is better, so let's add rustix as a workspace dependency (process feature) and migrate a few use cases to it - it's especially better than the raw libc which is surprisingly widespread. If we agree to do this, then many other calls can be ported.
  Signed-off-by: Colin Walters <walters@verbum.org>
* fix(packaging): add upgrade migration docs and podman socket retry (#1507)
  After #1415 ships, users upgrading from previous releases need guidance on the gateway.env deprecation, port/bind/database path changes, and the podman.socket restart requirement.
  - docs(rpm): add 'Migrating from gateway.env' section to TROUBLESHOOTING covering backward compatibility, env-to-TOML key mapping, and three breaking changes (default port 8080->17670, bind address 0.0.0.0->127.0.0.1, database path move). Add podman.socket restart step to upgrade procedure.
  - docs(rpm): add upgrade callout to CONFIGURATION.md pointing at migration section.
  - fix(podman): retry PodmanComputeDriver ping up to 5 times with 2s delay to tolerate transient socket unavailability after package upgrades. The systemd unit uses Wants=podman.socket (not Requires) so the gateway can start while the socket is briefly re-activating after an RPM upgrade changes its unit file on disk.
  - chore(rpm): update EnvironmentFile comment in RPM spec to explain backward-compatibility intent.
  Signed-off-by: Adam Miller <admiller@redhat.com>
* ci: deduplicate e2e workflows (#1512)
  Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(auth): per-sandbox authentication to gateway (#1404)
* docs(sandboxes): add policy advisor guide (#1480)
  Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* fix(docker): use host-gateway callbacks on macOS (#1516)
* ci(e2e): load single-arch images into kind (#1518)
  Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* docs(rfc): add sandbox resource requirements proposal (#1360)
  * docs(rfc): add sandbox resource requirements proposal
    Signed-off-by: Evan Lezar <elezar@nvidia.com>
  * docs(rfc): finalize sandbox resource requirements
  ---------
  Signed-off-by: Evan Lezar <elezar@nvidia.com>
* ci(canary): keep helm jwt secret generation enabled (#1521)
  Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* fix(cli): add json output for policy get (#1410)
  * fix(cli): add json output for policy get
  * test(cli): cover policy get full json output
  * fix(cli): address policy get json clippy
  ---------
  Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(providers): derive discovery from profiles (#1503)
  * feat(providers): derive discovery from profiles
    Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
  * fix(providers): keep v2 discovery profile-only
    Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
  * docs(providers): update providers v2 behavior
    Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
  * fix(providers): make github profile read-only
    Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
  ---------
  Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* docs: update NemoClaw/OpenClaw references (#1529)
* ci: seed shared Rust caches from main (#1530)
* fix(release): build host Linux binaries with glibc floor (#1490)
* fix(homebrew): repair local driver bootstrap state (#1527)
  * fix(homebrew): repair local driver bootstrap state
  * fix(bootstrap): satisfy default SAN doc lint
* ci: install cargo-zigbuild from release binaries (#1533)
* fix(cli): propagate --gateway-insecure to OIDC auth flows (#1535)
  Thread the gateway_insecure flag through gateway_add(), gateway_login(), and all OIDC HTTP clients so that --gateway-insecure and OPENSHELL_GATEWAY_INSECURE apply to OIDC discovery, token exchange, and token refresh requests.
  Previously, the flag only affected gRPC connections to the gateway. OIDC HTTP clients (reqwest::get and http_client) always verified TLS certificates, causing gateway registration and login to fail when the OIDC issuer used a self-signed certificate (common on OpenShift with edge-terminated routes).
  Fixes #1534
  Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>
* ci(release): smoke test rpm artifacts on fedora (#1558)
  Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
* chore(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#1554)
  Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0.
  - [Release notes](https://github.com/docker/login-action/releases)
  - [Commits](https://github.com/docker/login-action/compare/4907a6ddec9925e35a0a9e82d7399ccc52663121...650006c6eb7dba73a995cc03b0b2d7f5ca915bee)
  ---
  updated-dependencies:
  - dependency-name: docker/login-action
    dependency-version: 4.2.0
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* chore(helm): add missing SPDX header to gateway-config template (#1545)
  * chore(helm): add missing SPDX header to gateway-config template
  * chore(scripts): remove helm templates from license header exclusions
    The bypass had no known rationale. Removing it ensures the header script covers deploy/helm/openshell/templates uniformly going forward.
    Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
  ---------
  Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* ci(release): skip python rpm in gateway smoke test (#1559)
  Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
* ci: pin azure/setup-helm and helm/kind-action to commit SHAs (#1544)
  * ci: pin azure/setup-helm and helm/kind-action to commit SHAs
  * chore(python): add py.typed marker for PEP 561 compliance
  * ci: use full semver in pinned action version comments
    Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
  ---------
  Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* refactor: deduplicate shared code across ocsf builders and driver crates (#1526)
  Extract repeated patterns into shared helpers:
  - Add impl_builder_setters! macro to openshell-ocsf/builders that generates the identical severity(), status(), and message() setter methods present on all 7 OCSF event builders
  - Add SandboxContext::apply_common_fields() to consolidate the four-line build() finalization (set_status, set_message, set_device, set_container) repeated in every builder
  - Add driver_utils::sandbox_token_path() to centralize the XDG state path construction for sandbox JWT files used by both the Docker and Podman drivers
  - Add driver_utils::build_capabilities_response() to eliminate the identical GetCapabilitiesResponse struct literal repeated across the Docker, Podman, and Kubernetes compute drivers
* fix(python): raise SandboxError instead of FileNotFoundError or KeyError (#1547)
  * fix(python): raise SandboxError instead of FileNotFoundError or KeyError
  * fix(python): suppress exception chaining in SandboxError raises
    Add `from None` to both `raise SandboxError(...)` calls inside `except FileNotFoundError` blocks to satisfy ruff B904.
* fix(scripts): replace mapfile with bash 3.2-compatible read loop in helm-k3s-local (#1539)
  macOS ships bash 3.2 which lacks mapfile/readarray. Replace all three occurrences in configure_ghcr_credentials, cluster_has_image, and cluster_image_platform with a portable while-read loop, consistent with the fix applied to docker-build-image.sh in #1334.
* docs: add macOS compiler troubleshooting (#1569)
  Signed-off-by: Ann Marie Fred <afred@redhat.com>
* fix(gateway): configure local dev auth (#1575)
  This makes it so you can run the dev gateway and sandbox with:
  ```
  mise run gateway
  # in another shell
  mise run sandbox
  ```
  Signed-off-by: Kris Hicks <khicks@nvidia.com>
* docs: add Pi as supported sandbox (#1572)
* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split (#1412)
  * fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split
    The old smoke script exercised an L7 PUT which hung because the denial aggregator is only wired to L4 CONNECT denies, not L7 enforcement. Add mechanistic-smoke.sh which triggers an L4 deny, waits for the aggregator to flush, and asserts a pending chunk appears under openshell rule get --status pending. Document the intentional L4-only scope of the mechanistic mapper in architecture/sandbox.md.
    Fixes #1333
    Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
  * refactor(smoke): remove redundant variable inits and merge double step call
    Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
  * fix(smoke): wire mechanistic smoke into mise and guard TMP_DIR
    - Initialize TMP_DIR before trap to prevent unbound variable on early exit
    - Add e2e:mechanistic-smoke mise task with gateway setup
    - Document mechanistic smoke in policy-advisor README
  * test(proxy): verify L4 deny enqueues a DenialEvent
    Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
  * fix(proxy): remove unnecessary path qualifications in L4 denial smoke test
  ---------
  Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* docs(readme): whitespace (#1578)
  Signed-off-by: Kris Hicks <khicks@nvidia.com>
* fix(cli): replace outdated name reference (#1582)
  Signed-off-by: Kris Hicks <khicks@nvidia.com>
* fix(sandbox): probe Landlock before build, skip on unsupported kernels (#1585)
  On kernels without Landlock (e.g. gVisor's sentry returns ENOSYS for syscall 444), the previous best_effort path still logged "Applying Landlock" + "Landlock ruleset built" events even though no enforcement was happening. Probe at the top of `landlock::prepare` and short-circuit with a single High-severity "Sandbox Unavailable" finding.
  Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
* fix(sandbox): decouple GPU baseline from network policy (#1524)
  Signed-off-by: Evan Lezar <elezar@nvidia.com>
* docs(kubernetes): note that Sandbox volumeClaimTemplates is immutable (#1543)
* fix(sandbox): use succinct endpoint denial reason (#1584)
  Signed-off-by: Kris Hicks <khicks@nvidia.com>
* feat(docker): add provisioning progress events (#1567)
* docs(kubernetes): add RBAC section to setup page (#1540)
  Documents the ServiceAccount, Role, and ClusterRole created by the Helm chart inline on the setup page, per reviewer feedback on #1250. Reflects the current chart templates including pods/get for sandbox identity and tokenreviews/create for projected token validation.
  Closes #1018
* fix(sandbox): delegate PID limits to runtimes (#1497)
  Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
* fix(gateway): make readiness health checks dependency-aware (#1328)
  * feat(gateway): add readiness probe metrics and test-only store close
    Emit Prometheus readiness metrics for database probes (healthy gauge and outcome-labeled latency histogram) with coverage in health HTTP tests. Restrict Store::close behind test support cfg to prevent accidental runtime pool shutdown under live traffic.
    Signed-off-by: Adrien Langou <alangou@nvidia.com>
  * test(e2e): add simple e2e test with kubernetes to test /readyz
    Signed-off-by: Adrien Langou <alangou@nvidia.com>
  ---------
  Signed-off-by: Adrien Langou <alangou@nvidia.com>
* fix(vm): scope rootfs cache by openshell version (#1587)
  Signed-off-by: Drew Newberry <anewberry@nvidia.com>
* fix(cli): preserve symlinks during sandbox upload (#1595)
  * fix(cli): preserve symlinks during sandbox upload
  * docs(sandboxes): document upload symlink behavior
* fix(core): preserve SSH gateway default ports (#1602)
  Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router (#1596)
  * feat(server): per-handler gRPC auth annotations
    Move scope, role, and auth-mode metadata to the handler definition site via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and ALLOWED_SANDBOX_METHODS tables are now generated from per-method annotations on the tonic service impls, with canonical gRPC paths derived from the service name and method name.
    Adds a new openshell-server-macros proc-macro crate, an aggregator in auth/method_authz.rs, and an exhaustiveness test that decodes the protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and verifies every RPC has an annotation.
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  * refactor(server): rename `sandbox-secret` auth mode to `sandbox`
    PR #1404 replaced the shared sandbox secret with per-sandbox gateway-minted JWTs. A handler marked `sandbox` now authenticates as a specific `Principal::Sandbox`, not as a holder of a shared credential. Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and `AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches the post-#1404 identity model.
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  * fix(server): enforce per-handler AuthMode at the router
    Addresses review feedback on the per-handler auth-annotation work.
    - Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous router only checked is_sandbox_callable() for Principal::Sandbox; user principals still flowed into AuthzPolicy::check() and bypassed the per-handler declaration. A user with `openshell:all` could therefore reach `sandbox`-only handlers like GetSandboxProviderEnvironment, ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even though their annotations said sandbox-only. Adds an is_user_callable() predicate and rejects User principals at the router for `sandbox` / `unauthenticated` methods.
    - Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A second `auth`, `scope`, or `role` previously silently overwrote the first value; now it fails to compile.
    - Regression tests: a unit test for is_user_callable() and a router test that proves a user with admin role + openshell:all cannot reach the nine sandbox-only handlers.
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  * docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  * refactor(server-macros): drop standalone `rpc_auth` stub
    The stub was a safety net that fired only when a method had `#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it required `rpc_auth` to be imported, which is why both call sites carried `#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.
    Drop the stub and the unused-import workaround. A missing `#[rpc_authz]` now surfaces as rustc's standard "cannot find attribute `rpc_auth` in this scope" — clear enough, and one fewer import + lint exception.
    Addresses review comment on PR #1596.
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  * refactor(server-macros): emit fixed `AUTH_METADATA` const per service
    The previous trait-derived const name turned `OpenShell` into `OPEN_SHELL_AUTH_METADATA`, splitting the project name across an underscore. Each impl already lives in its own module (`crate::grpc::`, `crate::inference::`), so the module path is enough to disambiguate between services — a fixed `AUTH_METADATA` name reads more naturally.
    Aggregator in `auth/method_authz.rs` now references `crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA` directly.
    Addresses review comment on PR #1596.
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  * docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment
    OpenShell is one word; reference name in the doc should be OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA. Addresses review nit on PR #1596.
    Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
  ---------
  Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* ci(snap): add snap release pipeline (#1600)
* docs: refresh landing terminal demo and apply NVIDIA fern theme (#1615)
  - Extract landing-page terminal demo into a reusable <CommandTerminal /> component with inline styles (no global CSS dependency)
  - Animate a second command line cycling through claude/opencode/codex via @keyframes scoped inside the component
  - Inline BadgeLinks layout styles so the component renders correctly without relying on .badge-links from main.css
  - Add jsx.d.ts shim so editors do not flag the React global in component TSX files
  - Switch fern instance to global-theme: nvidia with multi-source enabled
  - Bump fern CLI to 5.40.0 and drop the basepath-aware experimental flag
  - Register fern/components/ as a second mdx-components directory
  - Remove the unused Adobe analytics script tag
* build(macos): remove unused import of tracing::warn (#1619)
  Signed-off-by: Calum Murray <cmurray@redhat.com>
* chore: align .python-version with mise.toml (#1618)
  Signed-off-by: Calum Murray <cmurray@redhat.com>
* feat(helm): add optional PostgreSQL backing store (#1579)
  * feat(helm): add optional PostgreSQL backing store with Secret-based credentials
    - Add postgres.enabled and postgres.deploy values to control database backend (SQLite vs PostgreSQL) and subchart deployment independently.
    - Introduce db-secret.yaml template for Opaque Secret with assembled postgresql:// connection string injected via OPENSHELL_DB_URL env var.
    - Add Bitnami PostgreSQL as optional subchart dependency keyed on postgres.deploy to prevent subchart deployment in external mode.
    - Externalize JWT signing key file mode via sandboxJwt.secretDefaultMode with 0400 default matching upstream.
    - Add validation guard for postgres.deploy=true without postgres.enabled.
    - Add helm unit tests covering internal, external, URL-override, special character encoding, and misconfiguration error paths.
    - Update README with Kubernetes and OpenShift install examples for bundled and external PostgreSQL configurations.
    - Add helm dependency build to lint and unittest tasks.
  * fix(helm): add database backend docs to README.md.gotmpl and regenerate
    The helm-docs CI check failed because the Database backend section was added directly to README.md instead of README.md.gotmpl. Move the content to the template and regenerate so the check passes.
  * fix(helm): use Secret-based DB credentials and support existingSecret
    Replace the inline db-url stringData pattern with a proper Secret containing individual fields plus a uri key. When postgres.deploy=true the Bitnami service-binding secret is referenced directly; when deploy=false users can supply postgres.external.existingSecret to bring their own Secret, or let the chart generate one from the external field values.
    Also restructures the README database section for clarity, adds helm-unittest coverage for the new secret resolution paths, and fixes a markdown lint issue in the root README.
  * refactor(helm): move OpenShift e2e script to e2e/rust/ and add mise task
    Move test-openshift-scenarios.sh from deploy/helm/openshell/ci/ to e2e/rust/e2e-openshift.sh, matching the existing e2e script naming convention. Register it as `e2e:openshift` in tasks/test.toml — not wired into the `test` or `e2e` aggregates so it only runs on explicit invocation against a live OpenShift cluster.
  * feat(e2e): add database backend scenarios to Kubernetes e2e
    Extend with-kube-gateway.sh with an optional multi-scenario loop gated by OPENSHELL_E2E_KUBE_DB_SCENARIOS=1. When enabled, the script installs the Helm chart three times — SQLite (default), bundled PostgreSQL, and external PostgreSQL with existingSecret — running the full test suite against each backend. When unset, existing single-install behavior is unchanged.
    Also adds helm dependency build before helm install, fixing CI failures caused by the missing PostgreSQL subchart dependency.
  * refactor(helm): simplify PostgreSQL config to two orthogonal controls
    Replace postgres.deploy and postgres.external.* with two simple controls:
    - postgres.enabled: deploy the bundled Bitnami PostgreSQL subchart
    - server.externalDbSecret: name of a pre-existing Secret with a uri key
    Delete db-secret.yaml — the chart no longer generates Secrets from individual credential fields. Users either get the Bitnami service-binding secret (bundled) or bring their own via server.externalDbSecret.
    Add validation that postgres.serviceBindings.enabled must stay true when using bundled PostgreSQL, preventing a confusing runtime failure.
* docs(config): update gateway config reference (#1624)
* feat(flake): add Nix development shell (#1592)
  * feat(build): add simple nix flake with formatter for nix code
  * feat(flake): setup rust toolchain, able to build and run unit tests
  * feat(flake): add support for arm linux and macos
  * feat(toolchain): add rust-src and rust-analyzer to the toolchain
* refactor(proto): move phase and current_policy_version into status (#1565)
  * refactor(proto): move phase and current_policy_version into SandboxStatus
    Move phase and current_policy_version from SandboxSpec into SandboxStatus to correctly model mutable runtime state. Update all callers in the gateway server, TUI, and Python SDK to read and write these fields through SandboxStatus accessors.
    Signed-off-by: Derek Carr <decarr@redhat.com>
  * fix(server): preserve sandbox status on statusless driver updates
    When a driver update arrives without a status payload (e.g. before Kubernetes populates the status subresource), preserve the stored phase, conditions, and current policy version instead of resetting them. Adds a regression test covering the edge case.
    Signed-off-by: Derek Carr <decarr@redhat.com>
  ---------
  Signed-off-by: Derek Carr <decarr@redhat.com>
* feat(python-sdk): support OIDC Bearer auth on SandboxClient (#1621)
  * feat(python-sdk): support OIDC Bearer auth on SandboxClient
    PR #1596 hardened the gateway side of the OIDC story; the Python SDK was the remaining gap — it only supported plaintext or mTLS, with no Bearer metadata anywhere. Deployments with OIDC enabled (the recommended posture since PR #935 / PR #1404) were unreachable from the SDK.
    Adds:
    - `bearer_token: str | Callable[[], str] | None` kwarg on `SandboxClient`. Static strings or zero-arg callables (the latter is invoked per RPC, so callers can drop in a refresh loop or token-file watcher without reconstructing the client). Composes with `tls` for OIDC-over-mTLS deployments.
    - `_BearerAuthInterceptor` implementing all four `grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types. Appends `authorization: Bearer <token>` to outgoing metadata. Implemented as an interceptor (not call credentials) so it works on both plaintext (`disableTls=true` dev) and TLS channels without `grpc.composite_channel_credentials`.
    - `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`, `key_path`) are now optional with `cert_path` / `key_path` required-together-or-not-at-all (enforced in `__post_init__`). This unlocks three transport profiles from one dataclass:
      * full mTLS (all three)
      * CA-only trust (`ca_path` only)
      * system roots (`TlsConfig()` — for OIDC gateways behind a public CA)
    - `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs` `build_oidc_channel`:
      * For any `https://` gateway, always build a secure channel. Pick the strongest TLS profile available in `mtls/` (full mTLS → CA-only → system roots). No more `insecure_channel` fallback for HTTPS.
      * Gate OIDC bearer attachment on `metadata.json["auth_mode"] == "oidc"`. Matches `crates/openshell-cli/src/main.rs:132` and the TUI; a stale `oidc_token.json` next to a non-OIDC gateway no longer causes the SDK to attach a bearer.
    - `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh modeled on `google.oauth2.credentials.Credentials` and `botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every RPC; when stale, re-reads disk first (the CLI may have rotated the bundle), and only then exchanges the refresh_token against the IdP's token endpoint discovered via OIDC discovery (`/.well-known/openid-configuration`, cached after first call). Concurrent RPCs share a single refresh via `threading.Lock` (no IdP stampede). Honors refresh-token rotation. Surfaces IdP failures as `SandboxError` with the RFC 6749 error body included for diagnostics.
      Mirrors the Rust CLI's HTTP-policy posture from `crates/openshell-cli/src/oidc_auth.rs`:
      * `follow_redirects=False` so a 3xx during discovery can't steer us to an attacker-controlled token endpoint.
      * Discovery `issuer` is validated against the configured issuer; a discovery document claiming a different issuer is rejected, preventing the SDK from POSTing the refresh_token to a malicious endpoint.
      * `insecure: bool` flag plumbed through to httpx's `verify=` so self-signed-cert deployments work the same way they do in the Rust CLI.
      Built on `httpx` (chosen over `urllib` specifically for follow_redirects + verify control as kwargs). The OAuth2 refresh-token grant itself (RFC 6749 §6) is one form-encoded POST — handled inline rather than via a dedicated OAuth library; tried `authlib`'s `OAuth2Client` first but it auto-injects an Authorization header on every request, which breaks the unauthenticated discovery GET.
    - `_make_cluster_bearer_provider(..., auto_refresh=True, write_back=True, insecure=False)` factory. Defaults to the refresher path with write-back enabled; `auto_refresh=False` falls back to the read-only fail-closed behavior for callers that don't want the SDK to make outbound HTTP calls to the IdP. `write_back=True` is the default (changed from the first round of review): IdPs with refresh-token rotation (Keycloak with rotation, Entra in strict mode) invalidate the old refresh_token on each refresh, so an in-memory-only refresh would leave the on-disk bundle pointing at an invalidated value — any second process starting from disk would `invalid_grant`. With write-back enabled by default, the SDK keeps the shared cache consistent with the IdP.
    - `from_active_cluster` exposes `auto_refresh`, `write_back`, and `insecure` kwargs (defaults: True / True / False). The high-level `Sandbox` context manager surfaces the same three kwargs and forwards them through, so callers using the wrapper have parity with `SandboxClient` for OIDC-protected gateways.
    - `SandboxClient.close()` chains to a `_bearer_close` hook so the `_OidcRefresher`'s underlying `httpx.Client` is released deterministically instead of leaking sockets/FDs until GC runs `__del__`. Idempotent.
    - `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID + random suffix) instead of a fixed `.oidc_token.json.tmp` path, so two writers racing on the same gateway directory don't trample each other's tmp content. Success path atomically replaces; failure path unlinks the orphan.
    OAuth2 refresh policy and write-back semantics deliberately mirror what the major Python SDKs do — see github.com/googleapis/google-auth-library-python (`Credentials`) and github.com/boto/botocore (`SSOTokenProvider`):
    | Library                        | Native refresh | Writes back   |
    |--------------------------------|----------------|---------------|
    | google-auth Credentials        | yes            | no            |
    | botocore SSOTokenProvider      | yes            | yes           |
    | openshell SandboxClient (here) | yes (opt-out)  | yes (opt-out) |
    OpenShell sits between the two; chose write-back-by-default because the rotation invariant matters more for our deployments than the "CLI is the only writer" assumption that fits google-auth.
    Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library — the refresh grant is a single POST.
    Tested:
    - 42 sandbox_test.py tests pass (5 pre-existing + 37 new across the bearer interceptor, fail-closed provider, refresher behavior, TlsConfig validation, from_active_cluster auth ladder, security-review regressions, Sandbox-wrapper kwarg forwarding, and lifecycle / concurrency probes). `mise run test:python` → 47 passed total across the python suite.
    - `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift: * unauthenticated `Health` bypass works * admin + `openshell:all` reaches user-callable methods * reader (`sandbox:read`) denied on `CreateSandbox` by scope * admin + `openshell:all` denied on PR #1596 sandbox-only methods at the router (the new gate is honored from the SDK) * full provider CRUD lifecycle via the SDK * callable token provider rotates per RPC as expected - Regression-probed against three pre-review security findings: * **Discovery issuer validation** — a discovery document claiming a different `issuer` than the configured one is rejected with a clear `SandboxError` before any refresh POST can reach the attacker-controlled endpoint. * **Redirect during discovery** — `follow_redirects=False` on the underlying httpx client means a 3xx during discovery surfaces as a SandboxError rather than silently chasing the redirect. * **Cross-process rotation** — a two-process simulation shows process B starting from disk and successfully refreshing with the rotated refresh_token, because process A's write-back updated the shared cache. - Refresher unit tests cover: cached-fresh fast path, disk-rotated re-read before refresh, OAuth2 exchange against the discovered token endpoint, refresh-token rotation, atomic write-back at 0600 mode (default), default-on write_back proven by test, concurrent N-thread coordination (one refresh shared across 8 threads), IdP failure surfaced with the error body, the client_credentials / no-refresh_token error path, issuer- mismatch rejection, redirect-during-discovery rejection, insecure flag plumbing. - Lifecycle / concurrency regression tests added: `close()` invokes the `_bearer_close` hook (idempotent), the refresher's `httpx.Client` is marked closed after `SandboxClient.close()`, and 16 concurrent writers don't leave orphan tmp files behind while producing a valid final bundle. 
The `Sandbox` wrapper has direct forwarding tests proving `auto_refresh`, `write_back`, and `insecure` reach `from_active_cluster` (both explicit values and defaults). - End-to-end against a real OpenShift + Keycloak cluster from inside a pod: real OIDC discovery against `keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant POST, atomic write-back of the rotated bundle at 0600, and a follow-up RPC reusing the freshly-rotated in-memory token — full round-trip in ~170ms. Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing _OidcRefresher.current_access_token() only adopted the on-disk oidc_token.json when its access token was still fresh; otherwise it refreshed using the in-memory bundle. With refresh-token rotation enabled (Keycloak with rotation, Entra strict mode), this let a process keep using an invalidated refresh_token: 1. Process A holds a stale in-memory bundle with refresh_token=r1. 2. Process B refreshes first and writes a rotated (r2) but now near-expiry bundle to disk. 3. Process A re-reads disk, sees the access token is not fresh, ignores the disk bundle, and POSTs the stale r1 — which the IdP has already invalidated, yielding invalid_grant. Fix: when the cached bundle is stale, adopt the on-disk bundle if it was refreshed more recently than ours, even when its access token is also stale. "More recently" is decided by expires_at — a refresh mints a new access token with a forward expiry alongside the rotated refresh_token, so the later expiry carries the newest refresh_token. Comparing by expiry (rather than unconditionally preferring disk) preserves the write_back=False case, where the in-memory bundle has already rotated past the on-disk copy and must not be clobbered. When the adopted bundle's issuer differs, the cached token endpoint is reset so the refresh re-discovers against the new issuer. 
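The expiry-based adoption rule described above can be sketched as follows. This is a minimal illustration rather than the SDK's code: `Bundle` and `pick_bundle` are hypothetical names, and only the `expires_at` comparison is taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for the contents of oidc_token.json."""

    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp at which the access token expires
    issuer: str


def pick_bundle(cached: Bundle, on_disk: Optional[Bundle]) -> Bundle:
    """Choose the bundle whose refresh_token should drive the next refresh.

    A refresh mints an access token with a later expiry alongside the rotated
    refresh_token, so the later expires_at is assumed to mark the newer bundle.
    """
    if on_disk is not None and on_disk.expires_at > cached.expires_at:
        return on_disk
    # Covers write_back=False, where memory has rotated past the disk copy.
    return cached
```

Comparing expiries instead of always preferring disk keeps a process that refreshed in memory without write-back from falling back to an older on-disk refresh_token.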
Adds regression tests for the cross-process rotation race and the issuer-change re-discovery path. * fix(python-sdk): recover from invalid_grant on lost rotation race The expiry-based disk re-read narrows but does not fully close the cross-process refresh-token rotation race: two processes sharing a gateway directory can both enter their refresh window, both POST their copy of the refresh_token, and with rotation enabled the IdP invalidates the loser's token (invalid_grant). Neither google-auth nor botocore close this window without an OS file lock; a Python-only flock would not coordinate with the Rust CLI/TUI that also write oidc_token.json, so locking is not worth its cost here. Recover instead of prevent: distinguish an OAuth2 invalid_grant (the refresh_token was rejected) from transport/5xx failures via a private _InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If a peer wrote a different refresh_token (it won the race), adopt and retry with it — returning early if it is already fresh — so the loser succeeds transparently instead of forcing a re-authenticate. If disk offers no new token, the rejection is genuine and surfaces the re-authenticate hint as before. The retry is single-shot; a second invalid_grant propagates. Adds tests for the peer-rotation recovery and the genuine-rejection (no-retry) paths. --------- Signed-off-by: Mrunal Patel <mrunalp@gmail.com> * fix(helm): vendor chart dependencies before release packaging (#1627) Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * fix(driver-podman): bind gateway to 0.0.0.0 in rootless mode (#1623) Rootless Podman sandbox containers reach the host through pasta's local connection bypass, which translates L2 frames to L4 host sockets. The dev gateway script binds to 127.0.0.1 by default, which is not routable through pasta. Auto-detect rootless mode and bind to 0.0.0.0 so sandbox containers can connect to the gateway. 
- Auto-detect rootless Podman in gateway.sh and export OPENSHELL_BIND_ADDRESS=0.0.0.0 when not explicitly set - Add e2e:podman:rootless mise task and CI matrix entry to validate rootless Podman networking end-to-end - CI creates a non-root user inside the privileged container to trigger Podman's rootless code paths (pasta, user namespace isolation) Signed-off-by: Naveen Malik <nmalik@redhat.com> * docs(providers): note that ANTHROPIC_API_KEY requires an API account, not a subscription (#1542) * docs(providers): note that ANTHROPIC_API_KEY requires an API account, not a subscription Anthropic subscription users authenticate via OAuth, not an API key, causing a silent failure when creating the provider. Adds a Note callout in the provider type table and quickstart guide directing subscription users to generate an API key from console.anthropic.com. Closes #620 * docs(providers): fix Note placement and remove subscription brand names Move the Note callout in manage-providers.mdx to after the complete provider type table so it does not break table rendering. Remove subscription brand names from both Note callouts. * fix(podman): avoid host-gateway on macOS machines (#1637) Closes #1307 Default the Podman host gateway alias override to gvproxy's host-loopback IP on macOS while preserving host-gateway resolution on Linux. Wire the setting through Podman config, gateway TOML inheritance, and the standalone driver, and document the platform behavior. Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * chore(vm): generalize crate for multi-device PCIe passthrough (#1573) * generalize crate for multi-device PCIe passthrough Signed-off-by: Patrick Riel <priel@nvidia.com> * add adopt apis which allow for devices already bound to vfio-pci during restart reconciliation, without rebinding or mutating sysfs. 
Signed-off-by: Patrick Riel <priel@nvidia.com> * refactor(vfio): generalize GPU passthrough sysfs handling Signed-off-by: Patrick Riel <priel@nvidia.com> * fix(vfio): centralize vfio ID refcounting Signed-off-by: Evan Lezar <elezar@nvidia.com> --------- Signed-off-by: Patrick Riel <priel@nvidia.com> Signed-off-by: Evan Lezar <elezar@nvidia.com> Co-authored-by: Evan Lezar <elezar@nvidia.com> * fix(sandbox): trust exact declared private endpoints (#1560) * fix(sandbox): trust exact declared private endpoints * fix(sandbox): preserve advisor endpoint provenance * fix(sandbox): repair advisor provenance lint failures --------- Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com> * feat(policy): add agentic approval loop (#1528) * fix(e2e): clean up temp files in sandbox-runner on exit (#1647) * ci(kubernetes): add HA e2e workflow (#1598) Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * ci(release): use bundled Z3 for macOS gateway build (#1658) * fix(gateway): align package TLS bootstrap path (#1601) * fix(gateway): align package TLS bootstrap path Closes #1593 Default package-managed gateway services to a stable local TLS directory and use that same value for certificate generation and runtime startup. Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * test(packaging): validate package asset paths exist Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * ci(e2e): pin mise in kubernetes job Signed-off-by: Taylor Mutch <taylormutch@gmail.com> --------- Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * feat(tui): add PageUp/PageDown scrolling to all panes (#1656) Add PageUp/PageDown key support to the policy, logs, and draft/rules views. All three panes now scroll by one viewport height per keypress. Also fix scroll_policy() clamping to stop at the last viewport of content instead of the last line, preventing a blank-screen overshoot on G and PageDown. 
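The clamping fix can be illustrated with a short sketch. The TUI itself is written in Rust; the Python below only mirrors the arithmetic, and `clamp_scroll`, `page_down`, and `page_up` are hypothetical names, not functions from the crate.

```python
def clamp_scroll(offset: int, total_lines: int, viewport_height: int) -> int:
    """Clamp a scroll offset so the last viewport of content stays visible."""
    # Stopping at total_lines - 1 would leave a single line on screen;
    # stopping at the last full viewport avoids the blank-screen overshoot.
    max_offset = max(0, total_lines - viewport_height)
    return min(max(0, offset), max_offset)


def page_down(offset: int, total_lines: int, viewport_height: int) -> int:
    """Scroll forward by one viewport height, clamped to the content."""
    return clamp_scroll(offset + viewport_height, total_lines, viewport_height)


def page_up(offset: int, total_lines: int, viewport_height: int) -> int:
    """Scroll back by one viewport height, clamped to the top."""
    return clamp_scroll(offset - viewport_height, total_lines, viewport_height)
```

With 100 lines and a 20-line viewport, the largest offset is 80, so a jump to the end shows lines 80 to 99 and not a nearly empty pane.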
Signed-off-by: Major Hayden <major@redhat.com> * feat(telemetry): add anonymous opt-out OpenShell usage telemetry (#1433) * feat(telemetry): add anonymous opt-out usage telemetry Signed-off-by: Kirit93 <kthadaka@nvidia.com> * Removed enums from schema Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com> * Updated telemetry URL Signed-off-by: Kirit93 <kthadaka@nvidia.com> * ci(kubernetes): pin mise installer for e2e --------- Signed-off-by: Kirit93 <kthadaka@nvidia.com> Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com> Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com> * ci(release): gate helm/oci artifact publishing on release (#1662) release-helm and tag-ghcr-release now depend on the release job. This is to prevent a GHCR image or helm chart from being published when some other aspect of the release fails. Signed-off-by: Kris Hicks <khicks@nvidia.com> * ci(kubernetes): stabilize HA e2e setup (#1659) * ci(kubernetes): pin mise in e2e workflow Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * ci(kubernetes): mirror postgres image for ha e2e Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * ci(kubernetes): reuse e2e workflow for ha Signed-off-by: Taylor Mutch <taylormutch@gmail.com> --------- Signed-off-by: Taylor Mutch <taylormutch@gmail.com> * fix(gateway): place supervisor_image under podman driver TOML table (#1661) The gateway.sh script appended supervisor_image after the [openshell.gateway.gateway_jwt] table header, so TOML parsed it as a gateway_jwt field. The Podman driver never saw the override and fell back to the default ghcr.io/nvidia/openshell/supervisor:latest. Move supervisor_image into [openshell.drivers.podman] where the driver config deserializer expects it. * refactor: deduplicate shared utilities across driver crates (#1660) Move three duplicated definitions into openshell-core so every consumer has a single canonical source: - format_bytes: identical 14-line function existed in docker, kubernetes, and vm drivers. 
Moved to openshell-core::progress where all three already imported from. - DEFAULT_SANDBOX_PIDS_LIMIT: i64 constant (2048) duplicated in docker driver and podman config. Moved to openshell-core::config alongside other shared defaults. Podman re-exports it for internal call-site compatibility. - current_time_ms: secrets.rs in openshell-sandbox reimplemented the same logic as openshell-core::time::now_ms(). Remove the local copy and call now_ms() directly via the existing dep. * fix(config): reject unknown fields in nested gateway config tables (#1666) * fix(config): reject unknown fields in nested gateway config tables The gateway TOML loader silently ignored keys placed under the wrong table header. PR #1661 fixed one instance of this (supervisor_image under [openshell.gateway.gateway_jwt]) but the root cause remained: the nested gateway config tables did not deny unknown fields, so a misplaced key was accepted and dropped instead of erroring. Concretely, tasks/scripts/gateway.sh emitted `sandbox_namespace` right after the [openshell.gateway.gateway_jwt] heredoc, so it landed inside the gateway_jwt table rather than [openshell.gateway]. The k8s driver already receives the namespace via [openshell.drivers.kubernetes], so the stray line was dead config that parsed without complaint. Changes: - Add #[serde(deny_unknown_fields)] to the nested gateway config tables that are part of the config-file parse tree: TlsConfig, OidcConfig, MtlsAuthConfig, GatewayAuthConfig, GatewayJwtConfig. - Remove the misplaced sandbox_namespace line from gateway.sh. - Drop the unused Serialize/Deserialize derives from Config and ServiceRoutingConfig (see below). - Add a regression test asserting a key under the wrong nested table is rejected. 
* feat(kubernetes): support sandbox image pull secrets (#1671) * refactor(driver): trim compute capability response (#1402) Signed-off-by: Evan Lezar <elezar@nvidia.com> * feat(providers): add Google Vertex AI inference provider (#1568) * feat(providers): add Google Vertex AI provider Adds Vertex AI provider profiles, routing, credential refresh plumbing, CLI support, docs, and regression coverage. Keeps the related NETLINK_ROUTE seccomp allowance needed by Vertex client tooling that calls getifaddrs. * docs: add Vertex AI sandbox usage for Claude Code and OpenCode Cover the full end-to-end setup for running Claude Code and OpenCode inside an OpenShell sandbox via inference.local with a Vertex AI backend: - google-vertex-ai.mdx: add 'Use from a Sandbox' section with tabbed examples for Claude Code (--bare flag, no /v1 suffix) and OpenCode (/v1 suffix required). Add providers_v2_enabled prerequisite and --no-verify note for global region. Document policy proposals table covering metadata.google.internal (always blocked), downloads.claude.ai, and storage.googleapis.com. - inference-routing.mdx: expand 'Use the Local Endpoint' section with tabbed examples for Claude Code, OpenCode, Python OpenAI SDK, and Python Anthropic SDK. Add notes explaining the /v1 path suffix difference between clients. - supported-agents.mdx: update Claude Code and OpenCode rows to mention inference.local support and correct base URL requirements. * fix: address vertex review findings * test(sandbox): retry on spurious Ok in fork-exec ambiguity test On arm64 under heavy CI load, the /proc fd scan in find_socket_inode_owners can transiently miss the parent process's socket fd entry, returning only the child as an owner. This causes resolve_process_identity to return Ok (single owner, no ambiguity check fires) instead of the expected ambiguous-ownership Err. Extend the retry loop to also handle unexpected Ok results, mirroring the existing retry for transient Err results. 
10 retries at 50ms give a 500ms settling window, which is sufficient for procfs to stabilize on loaded arm64 runners. * fix: address vertex review regressions * docs(router): clarify stream_response semantics for Vertex rawPredict routing Document the three call sites of prepare_backend_request and their stream_response values in a caller table: - send_backend_request: false → :rawPredict (unary endpoint) - send_backend_request_streaming: true → :streamRawPredict - verify_backend_endpoint: explicitly false to probe the unary endpoint Cross-reference the table from build_provider_url and is_vertex_anthropic_rawpredict_route so the stream_response=true guard in the suffix upgrade branch is understood in full context. Also note that is_vertex_anthropic_rawpredict_route is a structural predicate (model_in_path + anthropic_messages + :rawPredict suffix), not a named-provider check, so any future provider with the same route shape inherits the transforms automatically. * fix: correct example paths in local-inference README (#1676) * fix: correct example paths in local-inference README * fix: correct example paths in local-inference routes.yaml * ci(release): bring Fedora RPM canary to parity (#1688) The Fedora RPM canary needs to exercise the install.sh user-service path, but a GitHub Actions job container does not boot with systemd as PID 1. This means the Fedora RPM canary was incomplete compared to the others. With this change, we run Fedora as a nested privileged systemd container instead, wait for systemd to become reachable, then start the root user manager so systemctl --user works for the RPM gateway unit, achieving parity with the other canary tests.
Signed-off-by: Kris Hicks <khicks@nvidia.com> * fix: update RFC link in agent-driven-policy-management README (#1677) * feat(providers): add profile-backed policy visibility (#1640) * chore: wip providers v2 tui and codex profile * chore: wip effective policy get and codex profile * chore: wip provider profiles and tui detail views * feat(tui): annotate policy proposal review status * ci(release): fix Ubuntu Snap canary install and registration (#1699) Install the Snap built by the triggering Release Dev workflow by setting merge-multiple: true on the artifact download. actions/download-artifact otherwise extracts each artifact into its own subdirectory, leaving the package at release/snap-linux-amd64/*.snap, so the install glob ./release/*.snap matched nothing. Merging flattens the artifact's contents directly into release/ where the dangerous local snap install expects it. Harden the Snap canary setup by enabling snapd.socket, waiting for snap seeding (snap wait system seed.loaded), and running every step with strict shell options (set -euo pipefail) so failures surface immediately. Register the snapped gateway with the CLI as the documented local plaintext snap-docker gateway, and print version and snap services, before running openshell status so the canary verifies a configured and reachable gateway instead of only the install. Signed-off-by: Kris Hicks <khicks@nvidia.com> * feat(snap): add openshell.term desktop app (#1693) Add a desktop launcher for the OpenShell TUI so users can launch "openshell term" from their desktop environment application menu. 
The change adds three files: - snap/local/term.desktop: desktop entry file for the application launcher - snap/local/icon.png: application icon (copied from snap store data) - snapcraft.yaml: new "term" app entry that runs "openshell term" with home, network, ssh-keys, and system-observe plugs, plus install rules to stage the desktop file and icon under meta/gui/ The desktop file references the icon via ${SNAP} which is resolved at runtime to the snap installation directory. The term app reuses the same connection plugs as the main openshell app. Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com> * fix(sandbox): restore GPU procfs baseline (#1522) Signed-off-by: Evan Lezar <elezar@nvidia.com> * fix(gateway): try harder to detect Podman (#1536) Auto-detection previously treated Podman as available only when the podman CLI was visible on PATH. However, package manager services can run with a restricted PATH, which lets Docker be selected even when a Podman API socket is reachable. Additionally, podman may symlink /var/run/docker.sock to podman's machine unix socket, which would be incorrectly detected as Docker. Worse still: the podman machine may not even be running. This replaces the Podman binary check with a functional HTTP probe against the standard Podman socket paths. The probe requires /_ping to answer with a Libpod-Api-Version header before treating the socket as Podman, which lets the gateway select the embedded Podman driver only when the API is usable. Signed-off-by: Kris Hicks <khicks@nvidia.com> * chore(mise): refresh tool lockfile (#1712) Signed-off-by: Kris Hicks <khicks@nvidia.com> * ci(release): authenticate snap canary artifact download (#1711) The Ubuntu Snap canary downloads its artifact from a different workflow run (the triggering Release Dev run) via run-id. Cross-run downloads require authentication, so pass github.token to actions/download-artifact. 
Signed-off-by: Kris Hicks <khicks@nvidia.com> * docs(container-gateway): fix Docker driver setup for containerized gateway (#1419) The existing docs omitted or misstated several requirements when running the gateway as a container with the Docker compute driver: - OPENSHELL_GRPC_ENDPOINT is required; the Docker driver uses only the scheme (http/https) — host and port are substituted automatically with host.openshell.internal and the gateway's own bind port - Supervisor binary must be extracted to a host path before starting the gateway; bind-mount sources are resolved by the host Docker daemon so the path must be identical inside and outside the gateway container - Docker socket access requires adding the docker group (UID 1000 default) - Port binding should remain 127.0.0.1; Docker driver adds a bridge listener automatically - add --server-san host.openshell.internal to generate-certs for mTLS - Complete the mTLS docker run with all Docker driver requirements - Add deploy/docker/gateway.toml — TOML config for the Docker driver - Add deploy/docker/docker-compose.yml referencing the TOML - Add docs/get-started/tutorials/docker-compose.mdx tutorial page - Remote gateway registration instructions (--remote flag) Address reviewer feedback: - Move Docker Compose tutorials card to the bottom of the list - Replace inline YAML snippet in Docker Compose section with a reference to deploy/docker/ to avoid drift - Clarify OPENSHELL_DB_URL is safe in compose.yml (plain SQLite path, no credentials); the TOML block targets credential-bearing DSNs - Note that ./ in source: resolves relative to the compose file directory - Clarify that only the scheme from OPENSHELL_GRPC_ENDPOINT matters - Add note that the tilde volume mount resolves to the same absolute path on both host and container * refactor(server): deduplicate test helpers and grpc utilities (#1708) Remove three groups of copy-pasted code in openshell-server: 1. 
grpc/mod.rs had a private current_time_ms() wrapper identical to the one already exported from persistence/mod.rs. Remove the duplicate and update the three grpc sub-modules (policy, sandbox, service) to import directly from crate::persistence. 2. test_store() was repeated verbatim in seven #[cfg(test)] blocks. Promote a single canonical version to persistence/mod.rs (cfg-gated) and replace all copies with crate::persistence::test_store() calls or a thin Arc wrapper in supervisor_session. 3. grpc_client_mtls() and build_tls_root() were copy-pasted across edge_tunnel_auth.rs and multiplex_tls_integration.rs. Move both into the existing tests/common/mod.rs shared module and import from there. * fix(gateway): allow local sandbox jwt to not expire (#1721) * fix(helm): create sandbox JWT secret when cert-manager is enabled (#1700) * fix(helm): create sandbox JWT secret under cert-manager The cert-manager install path (certManager.enabled=true, pkiInitJob.enabled=false) left the gateway StatefulSet unable to start because nothing created the openshell-jwt-keys Secret: cert-manager owns TLS Secrets but does not mint the sandbox JWT signing key, and the certgen hook only rendered when pkiInitJob.enabled was true. Separate JWT signing-key provisioning from TLS PKI provisioning: - certgen: add a --jwt-only mode that creates only the Opaque JWT signing Secret, for use when another controller owns TLS Secrets. - certgen.yaml: render the hook when pkiInitJob.enabled OR certManager.enabled is true. cert-manager takes precedence and runs the hook with --jwt-only even if pkiInitJob.enabled remains true. Remove the mutual-exclusion failure between the two values. - _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook and the StatefulSet mount. - Update values, README, docs, architecture, and the debug-openshell-cluster skill to reflect the new precedence; the documented cert-manager install no longer needs pkiInitJob.enabled=false. 
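The precedence rule for the certgen hook can be summarized in a small sketch. The real logic lives in the Helm templates; `certgen_hook_args` is a hypothetical helper, and the only flag it mentions is the `--jwt-only` mode named above.

```python
from typing import List, Optional


def certgen_hook_args(
    pki_init_job_enabled: bool, cert_manager_enabled: bool
) -> Optional[List[str]]:
    """Return extra certgen arguments, or None when the hook is not rendered."""
    if cert_manager_enabled:
        # cert-manager owns the TLS Secrets, so only the JWT signing key is
        # created, even if pkiInitJob.enabled is still true.
        return ["--jwt-only"]
    if pki_init_job_enabled:
        # Built-in PKI: the hook creates TLS material and the JWT key.
        return []
    return None
```

Checking `cert_manager_enabled` first is what lets the documented cert-manager install leave `pkiInitJob.enabled` at its default.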
Closes #1691 * fix(helm): honor cert-manager precedence for client CA volume The client CA volume logic treated pkiInitJob.enabled as proof that built-in PKI owns the client CA. With cert-manager precedence now allowing certManager.enabled=true alongside the default pkiInitJob.enabled=true, that assumption mounts the server TLS cert secret as the client CA and ignores certManager.clientCaFromServerTlsSecret=false, which can break mTLS or trust the wrong CA. Gate the pkiInitJob.enabled term with (not certManager.enabled) in all three client CA conditions (volume mount, volume definition, and secret selection) so cert-manager owns TLS when enabled. Add a Helm test suite covering built-in PKI, cert-manager shared CA, the regression config (cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob), and the no-client-CA case. * feat(k8s-driver): add default_runtime_class_name config for sandbox pods (#1729) Allow operators to configure a default Kubernetes runtimeClassName that is applied to sandbox pods when the CreateSandbox request does not specify one. This avoids requiring every API caller to explicitly set the runtime class for clusters that always need a specific RuntimeClass (e.g. kata-containers, nvidia). The fallback is applied in the Kubernetes driver only — per-request values still take priority, and an empty default (the built-in) preserves existing behavior (field omitted, cluster default applies). 
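The fallback order for the runtime class can be sketched as below. The driver is written in Rust; this Python version only illustrates the precedence, and `effective_runtime_class` is a hypothetical name.

```python
from typing import Optional


def effective_runtime_class(
    requested: Optional[str], default_runtime_class_name: str = ""
) -> Optional[str]:
    """Pick the runtimeClassName for a sandbox pod.

    Returns None when the field should be omitted so that the cluster
    default applies.
    """
    if requested:
        # A per-request value always takes priority.
        return requested
    # An empty default (the built-in) preserves the existing behavior.
    return default_runtime_class_name or None
```

For example, with a configured default of `nvidia`, a request that names `kata-containers` still gets `kata-containers`, while a request that names nothing gets `nvidia`.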
* docs: add Hermes Agent to supported agents (#1735) * fix(cli): roll back gateway registration when auth fails during gateway add (#1538) * refactor: deduplicate shared driver and TUI helpers (#1741) * feat(cli): support multiple --upload flags on sandbox create (#1635) (#1645) Closes #1635 Signed-off-by: Philippe Martin <phmartin@redhat.com> * updates for new containers --------- Signed-off-by: Derek Carr <decarr@redhat.com> Signed-off-by: Florent Benoit <fbenoit@redhat.com> Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com> Signed-off-by: Colin Walters <walters@verbum.org> Signed-off-by: Adam Miller <admiller@redhat.com> Signed-off-by: Taylor Mutch <taylormutch@gmail.com> Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com> Signed-off-by: Evan Lezar <elezar@nvidia.com> Signed-off-by: Adel Zaalouk <azaalouk@redhat.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: mesutoezdil <mesudozdil@gmail.com> Signed-off-by: Ann Marie Fred <afred@redhat.com> Signed-off-by: Kris Hicks <khicks@nvidia.com> Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com> Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com> Signed-off-by: Adrien Langou <alangou@nvidia.com> Signed-off-by: Drew Newberry <anewberry@nvidia.com> Signed-off-by: Mrunal Patel <mrunalp@gmail.com> Signed-off-by: Calum Murray <cmurray@redhat.com> Signed-off-by: Naveen Malik <nmalik@redhat.com> Signed-off-by: Patrick Riel <priel@nvidia.com> Signed-off-by: Major Hayden <major@redhat.com> Signed-off-by: Kirit93 <kthadaka@nvidia.com> Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com> Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com> Signed-off-by: Philippe Martin <phmartin@redhat.com> Co-authored-by: Mesut Oezdil <114185853+mesutoezdil@users.noreply.github.com> Co-authored-by: Drew Newberry <anewberry@nvidia.com> Co-authored-by: Taylor Mutch <taylormutch@gmail.com> Co-authored-by: Seth Jennings <sjenning@redhat.com> Co-authored-by: Florent BENOIT 
<fbenoit@redhat.com> Co-authored-by: Eric Curtin <eric.curtin@docker.com> Co-authored-by: Derek Carr <decarr@redhat.com> Co-authored-by: mjamiv <142179942+mjamiv@users.noreply.github.com> Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com> Co-authored-by: Piotr Mlocek <pmlocek@nvidia.com> Co-authored-by: Russell Bryant <russell.bryant@gmail.com> Co-authored-by: Colin Walters <walters@verbum.org> Co-authored-by: Adam Miller <admiller@redhat.com> Co-authored-by: Taylor Mutch <tmutch@nvidia.com> Co-authored-by: Evan Lezar <elezar@nvidia.com> Co-authored-by: Adel Zaalouk <azaalouk@redhat.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ann Marie Fred <afred@redhat.com> Co-authored-by: krishicks <kris@krishicks.com> Co-authored-by: Vegard Stikbakke <vegard.stikbakke@gmail.com> Co-authored-by: krishicks <khicks@nvidia.com> Co-authored-by: Davanum Srinivas <davanum@gmail.com> Co-authored-by: alangou <alangou@nvidia.com> Co-authored-by: Mrunal Patel <mrunalp@gmail.com> Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com> Co-authored-by: Calum Murray <cmurray@redhat.com> Co-authored-by: Saurabh Agarwal <sauagarw@redhat.com> Co-authored-by: Simon Scatton <44714756+SDAChess@users.noreply.github.com> Co-authored-by: Naveen Malik <nmalik@redhat.com> Co-authored-by: Patrick Riel <71560045+cheese-head@users.noreply.github.com> Co-authored-by: Alexander Watson <zredlined@users.noreply.github.com> Co-authored-by: Major Hayden <major@mhtx.net> Co-authored-by: Kirit Thadaka <kirit.thadaka@gmail.com> Co-authored-by: Jesse Jaggars <jhjaggars@gmail.com> Co-authored-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com> Co-authored-by: shannonsands <shannon.sands.1979@gmail.com> Co-authored-by: Philippe Martin <feloy1@gmail.com>

Signed-off-by: Evan Lezar <elezar@nvidia.com> (cherry picked from commit 5102cb9) Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ork` subcrates. (#1650) * refactor(sandbox): extract run_networking from run_sandbox Lifts TLS state generation, network namespace setup, proxy startup, bypass monitor spawn, and SSH-side proxy URL / netns FD computation out of run_sandbox into a sibling async fn `run_networking` that returns a Networking struct. The identity cache moves with it (only consumed by the proxy). Entrypoint PID allocation moves just above the call site so it can be passed in. No behavior changes — same OCSF emits, same async order, same RAII lifetimes for the proxy and bypass-monitor handles, now held by the returned Networking value in run_sandbox's frame. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(sandbox): extract run_process and lift netns to run_sandbox Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH server, supervisor session, process spawn, OPA probe, policy poll loop, denial aggregator, wait/exit) into a sibling async fn `run_process`. Also moves network namespace creation out of `run_networking` into a new `create_netns_for_proxy` helper invoked from `run_sandbox`, so `run_networking` is purely the proxy component (OPA evaluation, TLS interception, credential injection, inference routing, gRPC control API). The netns is then borrowed into both `run_networking` and `run_process`. No behavior change. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * chore(workspace): scaffold openshell-supervisor-networking and openshell-supervisor-process crates Add empty placeholder crates that subsequent commits will populate as the sandbox decomposition proceeds. Both crates compile clean as part of the workspace and are picked up automatically by the existing `members = ["crates/*"]` glob. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift DenialEvent to openshell-core The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side) and the bypass monitor (process-side), and crosses the run_networking -> run_process API boundary. Move it to openshell-core so the eventual supervisor-networking and supervisor-process crates can both reference it without depending on each other. DenialAggregator and the channel/flush helpers stay in openshell-sandbox for now. A thin `pub use openshell_core::DenialEvent;` re-export from denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent` call site resolving without further edits. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift normalize_path to openshell-core Move the lexical path-normalization helper from openshell-policy to openshell-core::paths so it can be reached from crates that sit below openshell-policy in the dependency graph. openshell-policy keeps its existing public API via a `pub use` re-export, so all current call sites (e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged. This is a prerequisite for lifting openshell-sandbox/src/policy.rs into openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls normalize_path, and lifting it as-is would cycle through openshell-policy. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift SandboxPolicy and friends to openshell-core Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy, ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode, LandlockCompatibility, plus their Proto* TryFrom/From impls) to openshell-core/src/policy.rs. Both prospective supervisor leaves (networking and process) dispatch on SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it without depending on the other (or on the future orchestrator). 
The From<ProtoFilesystemPolicy> impl now calls the in-crate openshell_core::paths::normalize_path lifted in the previous commit, which is what made this move cycle-free. Update all crate::policy::* call sites in openshell-sandbox to openshell_core::policy::*. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move child_env from openshell-sandbox child_env (proxy_env_vars, tls_env_vars) is process-side only — consumed by process.rs and ssh.rs. With the orchestrator staying in openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf crates, so process-only modules can land in openshell-supervisor-process directly. Add openshell-supervisor-process as a path dependency of openshell-sandbox. Update process.rs and ssh.rs to import from openshell_supervisor_process::child_env. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move skills from openshell-sandbox Move the static skills installer (and its embedded resource directory) out of openshell-sandbox into openshell-supervisor-process. The module is process-side only — invoked once during sandbox start to drop agent skill files into the workspace — and has no cross-leaf consumers. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-process. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move mechanistic_mapper from openshell-sandbox Move the mechanistic mapper (HTTP method/path → operation classifier that derives policy proposals from connection summaries) out of openshell-sandbox into openshell-supervisor-networking. Single internal caller (run_policy_poll_loop in lib.rs) and only depends on openshell-core + tracing — no cross-leaf entanglement. First population of the openshell-supervisor-networking crate; adds openshell-core and tracing as dependencies. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift procfs to openshell-core Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner resolution, file SHA256 hashing) from openshell-sandbox into openshell-core. The module is consumed cross-leaf — by bypass_monitor on the process side and by identity / proxy on the networking side — so it has to sit below both leaves. Adds tracing, sha2, and hex as dependencies on openshell-core. Updates the three call sites in openshell-sandbox to import from openshell_core::procfs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move identity from openshell-sandbox Move BinaryIdentityCache (path → SHA256 cache used to identify the process behind an outbound connection) from openshell-sandbox into openshell-supervisor-networking. The cache is consumed only by the networking-side proxy and the orchestrator; with procfs already in openshell-core there are no remaining cross-leaf dependencies. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-networking. Adds a Default impl for BinaryIdentityCache to satisfy clippy::new_without_default now that the type is publicly exposed across crates. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move agent-proposals flag from openshell-sandbox Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the test-only ProposalsFlagGuard out of openshell-sandbox into openshell-supervisor-process::proposals. The flag is read only by the process-side policy_local route handler and the orchestrator; lifting it to openshell-core would have made core carry sandbox-owned runtime state without buying anything. The test-only ProposalsFlagGuard is still consumed from networking-side l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands). 
Expose it via a new optional `test-helpers` feature on openshell-supervisor-process so test crates opt in explicitly without pulling tokio sync primitives into production builds. openshell-sandbox keeps its existing crate-private path (`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports so call sites and tests are unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift secrets to openshell-core Move crates/openshell-sandbox/src/secrets.rs to crates/openshell-core/src/secrets.rs so both supervisor leaves can reach SecretResolver and the placeholder helpers without depending on openshell-sandbox. Add base64 to openshell-core deps (only stdlib + base64 are used). Promote previously pub(crate) constructors and methods on SecretResolver to pub since cross-crate callers (provider_credentials, proxy/L7 tests) now name them across the crate boundary. Update import paths in proxy.rs, l7/{rest,relay,websocket}.rs, and provider_credentials.rs from crate::secrets to openshell_core::secrets. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift provider_credentials to openshell-core Move crates/openshell-sandbox/src/provider_credentials.rs to crates/openshell-core/src/provider_credentials.rs. Both supervisor leaves now name ProviderCredentialState in their function signatures (run_networking takes &ProviderCredentialState, run_process takes ProviderCredentialState by value), and under Shape A leaves can't depend on openshell-sandbox, so the type must live in openshell-core. The orchestrator (run_sandbox in openshell-sandbox) remains the only writer: it constructs ProviderCredentialState::from_environment and the policy poll loop calls install_environment on credential rotation. Both leaves stay pure readers via snapshot()/resolver(). Update import paths in proxy.rs, ssh.rs, and lib.rs from crate::provider_credentials to openshell_core::provider_credentials. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * style: rustfmt import ordering Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(ocsf): move SandboxContext singleton from openshell-sandbox Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback + getter from openshell-sandbox/src/lib.rs into a new openshell-ocsf::ctx module. The type already lives in openshell-ocsf, so its singleton lives next to it. Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The orchestrator (run_sandbox) now calls set_ctx during startup. Sandbox keeps a pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx; re-export so the 138 existing crate::ocsf_ctx() call sites resolve unchanged. When the sandbox modules themselves migrate into the leaf crates, they'll import openshell_ocsf::ctx directly and the re-export goes away. Under Shape A neither leaf can depend on openshell-sandbox; both already depend on openshell-ocsf to construct events, so this adds no new dep edge. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift grpc_client to openshell-core Both prospective leaves (supervisor-networking and supervisor-process) need CachedOpenShellClient, AuthedChannel, and the connect/fetch helpers. Under Shape A the leaves cannot depend on openshell-sandbox, so the type has to live below them. openshell-core already pulls in tonic and miette; this enables tonic's channel/tls features and adds tokio as a direct dep. Updates all crate::grpc_client::* call sites in openshell-sandbox to openshell_core::grpc_client::*. No re-export shim — the call-site count was small enough to update directly. See architecture/plans/sandbox-split-design-choices.md for the full rationale and trade-offs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move denial_aggregator from openshell-sandbox DenialAggregator and FlushableDenialSummary belong with the proxy and L7 layer that emit denials. 
Moves the file into openshell-supervisor-networking; adds tokio as a regular dep there since DenialAggregator uses tokio::sync::mpsc. Drops the pub use openshell_core::DenialEvent re-export inside the moved file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs, and lib.rs to import openshell_core::DenialEvent directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move log_push from openshell-sandbox LogPushLayer is a process-side tracing layer that streams sandbox logs to the gateway via gRPC. Moves into openshell-supervisor-process; adds openshell-core, openshell-ocsf, tokio-stream, tracing, and tracing-subscriber as direct deps there. Updates the only external call site (openshell-sandbox/src/main.rs) to import from openshell_supervisor_process::log_push. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move bypass_monitor from openshell-sandbox bypass_monitor reads /dev/kmsg for nftables drop log lines and emits denial events. Pure process-side concern, called only from run_networking which spawns it on the netns. Moves into openshell-supervisor-process; all deps (openshell-core, openshell-ocsf, tokio, tracing) were already declared there. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — first leaf-side caller to import the OCSF context singleton directly instead of going through openshell-sandbox's re-export. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move debug_rpc from openshell-sandbox debug_rpc is the CLI subcommand handler that exercises authenticated gRPC calls (issue-token, refresh-token, get-config, etc.). Pure process-side concern, called only from openshell-sandbox/main.rs. Adds base64, hex, serde_json, sha2, and tonic (with channel/tls features) as direct deps on openshell-supervisor-process. Updates the single call site in main.rs. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move supervisor_session from openshell-sandbox supervisor_session opens a bidirectional gRPC stream that lets the gateway initiate shells inside the sandbox. Pure process-side concern, called only from run_process. Adds uuid as a direct dep on openshell-supervisor-process. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — same pattern as bypass_monitor. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): lift managed_children tracker from openshell-sandbox The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children (entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can distinguish them from incidental zombies. Pure process-side concern, moves to openshell_supervisor_process::managed_children with three public fns: register, unregister, is_managed. Updates lib.rs reaper, process.rs, and ssh.rs to call through the new module path. Drops the now-unused HashSet import from lib.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move sandbox hardening from openshell-sandbox Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox, prepare/enforce, log_sandbox_readiness, top-level apply, and apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/ to crates/openshell-supervisor-process/src/sandbox/. Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace and its final home is decided when run_networking and run_process are extracted. Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs, process.rs, and ssh.rs to import sandbox from openshell_supervisor_process while keeping the netns import unchanged. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift proposals flag from openshell-supervisor-process Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process to openshell-core so both eventual leaf crates can read it without depending on each other. The flag is process-wide singleton state initialised once during sandbox startup and read by both the policy.local route (networking-side) and the skills installer (process-side) — same shape as openshell_ocsf::ctx. Move the test-helpers Cargo feature alongside it: openshell-core gains the feature, openshell-supervisor-process loses it, and openshell-sandbox's dev-dependency now enables openshell-core/test-helpers. Update the sandbox re-export shim to point at openshell_core::proposals. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(core): lift netns + nft_ruleset from openshell-sandbox Move NetworkNamespace and the nft_ruleset bypass-rule generator from crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/. Both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace, so it must live somewhere both can depend on without violating the Shape A no-leaf-to-leaf rule. Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx() calls, matching the pattern used in already-migrated process modules. Update super::nft_ruleset references inside netns to nft_ruleset since the module is now a sibling sub-module of netns/mod.rs. Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and gate pub mod netns on target_os = "linux" since the implementation uses netlink, ip(8), and namespace fds. Delete the now-empty sandbox/{mod.rs, linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and process.rs to point at openshell_core::netns. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move process.rs and ssh.rs from openshell-sandbox Lift the entrypoint process spawn module and the embedded SSH server module into openshell-supervisor-process. openshell-sandbox now re-exports ProcessHandle/ProcessStatus and calls openshell_supervisor_process::ssh::run_ssh_server directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move proxy, l7, opa, policy_local from openshell-sandbox Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local advisor API into openshell-supervisor-networking. Move accompanying data files (sandbox-policy.rego), test fixtures (testdata/), and integration tests (system_inference, websocket_upgrade). Sandbox lib.rs now references these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr is exposed as pub for the orchestrator call site. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(sandbox): hoist policy poll loop and denial aggregator into orchestrator Move the symlink-resolver, policy poll loop, and denial-aggregator flush spawns out of run_process and into run_sandbox so run_process no longer needs OpaEngine, retained_proto, the local policy context, the sandbox name, the gateway endpoint for telemetry, the OCSF flag, or the denial receiver. These long-running orchestrator-owned tasks now live alongside the other sandbox-startup wiring, matching the design log decision in architecture/plans/sandbox-split-design-choices.md (Q5). Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move run_process from openshell-sandbox Lift the workload supervision entry point (zombie reaper, SSH server spawn, supervisor session, entrypoint child spawn, exit-with-timeout) into its own module in openshell-supervisor-process. 
The orchestrator in openshell-sandbox now calls openshell_supervisor_process::run::run_process directly. With this move run_process names only types from openshell-core, openshell-ocsf, openshell-supervisor-process itself, std, and tokio — no openshell-supervisor-networking dependency. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move bypass_monitor from supervisor-process Bypass detection is network-policy enforcement: it parses nftables LOG entries from /dev/kmsg and emits OCSF NetworkActivity / DetectionFinding events plus DenialEvents into the same channel the proxy feeds. Its lifetime is tied to the network namespace, not to the workload child. Moving it to openshell-supervisor-networking puts it next to the proxy and the denial aggregator that consume its output, and unblocks moving run_networking out of openshell-sandbox without a leaf-to-leaf dep. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move inference route helpers from openshell-sandbox Move build_inference_context, partition_routes, bundle_to_resolved_routes, spawn_route_refresh, the InferenceRouteSource enum, and the route refresh interval helpers into a new openshell-supervisor-networking::inference_routes module along with their unit tests. The orchestrator now calls into the networking leaf for inference context construction; the leaf owns its own route bundle resolution end-to-end. The new module is named inference_routes to avoid colliding with the existing l7::inference module, which handles request-time HTTP parsing and pattern matching rather than route bundle setup. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-networking): move run_networking from openshell-sandbox Move the Networking handle struct, run_networking, and the Linux-only create_netns_for_proxy helper into a new openshell-supervisor-networking::run module. 
The orchestrator in openshell-sandbox now invokes openshell_supervisor_networking::run::{create_netns_for_proxy, run_networking} and reads the Networking fields directly; the leaf owns the entire networking-stack startup path (CA generation, proxy task, bypass monitor, inference context, denial channel) end-to-end. The Networking RAII handle fields (proxy, bypass_monitor) are now public without leading underscores so the public API satisfies clippy's pub_underscore_fields lint while still serving as drop guards held by the orchestrator's frame. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * fix(workspace): align Cargo deps and call sites for split crates The recent module lifts left two Linux-only gaps that the macOS host workspace check skipped: - openshell-core's netns module needs libc, tempfile, and nix on Linux, but only openshell-ocsf and uuid were carried over. - openshell-supervisor-process's seccomp/landlock modules need landlock and seccompiler, which still lived on openshell-sandbox. - openshell-sandbox's runtime_pid_limit branch referenced an unqualified process:: that pointed at the old in-crate module. Move landlock/seccompiler to supervisor-process, add the missing core deps, qualify the call sites, and drop sandbox deps that no longer have runtime users (landlock, seccompiler, target-gated tempfile/uuid, the unix libc/rustix block). Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-network): rename openshell-supervisor-networking to openshell-supervisor-network Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-network): own denial-aggregator flush end-to-end Move the denial-aggregator spawn and flush_proposals_to_gateway out of run_sandbox and into run_networking. 
The networking leaf already owns every other input (proxy + bypass_monitor as producers, denial channel, mechanistic_mapper, denial_aggregator) and already opens its own gRPC connections (inference_routes, policy_local) — the orchestrator was the only piece left straddling the boundary. Networking now drives the full path: producers -> channel -> aggregator -> flush -> gateway. Drops denial_rx from Networking; adds sandbox_name to run_networking so SubmitPolicyAnalysis can resolve by sandbox name (falls back to ID when unset). Same shape as log_push in the process leaf. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-network): own symlink-resolution task Move the OPA binary-symlink resolver out of run_sandbox and into run_networking. The task probes /proc/<entrypoint_pid>/root/ until the workload's mount namespace is accessible, then rebuilds the OPA engine with resolved binary paths so policy rules match canonical names instead of symlinks. Both inputs (Arc<OpaEngine>, retained_proto) are networking-leaf concerns and were already plumbed into run_networking; the entrypoint_pid Arc is read lazily after the process leaf populates it. Adds retained_proto as a parameter and spawns the resolver early in run_networking so the probe loop starts before the proxy comes up. Same shape as the denial-flush move: networking owns its own background task end-to-end; the orchestrator stops hosting work that doesn't conceptually belong to it. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move seccomp install into run_process The supervisor seccomp prelude is part of "set up the workload-side process tree", not part of orchestration. Move the call site from run_sandbox into the top of run_process and drop the now-unused re-export from openshell-sandbox::lib. Timing is preserved: by the time the orchestrator calls run_process, run_networking has already returned, so netns + nftables setup is complete. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move check_runtime_pid_limit into run_process The PID-limit precondition is process-side: it gates whether the workload child can be spawned at all. Move the call from run_sandbox into the top of run_process, alongside the seccomp prelude. Same shape as the seccomp move — function already lives process-side, only the call site relocates. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move validate_sandbox_user to process crate The sandbox-user check is a precondition for privilege-dropping the workload child; it has no relevance to networking. Move the function next to drop_privileges in openshell-supervisor-process::process and call it from the top of run_process. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move prepare_filesystem to process crate Creating and chowning read_write directories is workload-side preparation, not orchestration. Move prepare_filesystem and its prepare_read_write_path helper (plus tests) into openshell-supervisor-process::process and call from run_process, alongside validate_sandbox_user. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-process): move startup skill install into run_process The eager initial-settings fetch + agent skill install is process-side: the install materializes files the workload's filesystem sees. The orchestrator still owns the AGENT_PROPOSALS_ENABLED OnceLock init because the policy poll loop also reads it; only the early fetch and install hop into run_process. Behavior unchanged. Best-effort: any RPC or install failure is logged but does not fail startup. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-network): own PolicyLocalContext construction Move the PolicyLocalContext construction from run_sandbox into run_networking. 
The orchestrator was building it solely to thread it into the networking leaf and to share it with the policy poll loop; now run_networking builds it from inputs it already takes (retained_proto, openshell_endpoint, sandbox_name|sandbox_id) and exposes it on the returned Networking struct. The orchestrator's poll loop now grabs the Arc clone from networking.policy_local_ctx, so the orchestrator no longer imports openshell_supervisor_network::policy_local. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * feat(supervisor): add --mode flag to gate network/process leaves Add a --mode flag (default "network,process") that selects which supervisor leaves run in the current process. Two new shapes are unlocked without splitting the binary: --mode=network # network-only sidecar --mode=process # process-only supervisor --mode=network,process # combined (default; current behavior) In network-only mode the orchestrator skips run_process and waits on SIGINT/SIGTERM before tearing down the proxy. The entrypoint PID stays at 0 for the lifetime of the process, which silently degrades the proxy's binary-identity TOFU and the bypass monitor's PID enrichment; this is correct in a split-pod topology where the workload's /proc lives in another pod. In process-only mode run_networking is skipped entirely. SSH sessions get no proxy URL, no netns FD, and no CA paths, matching what a split-pod consumer would expect when network enforcement is delegated to a sidecar. The policy poll loop continues to run unconditionally; its OPA-reload and policy.local hooks already gate on the resources only present when network is enabled, and the env-refresh / proposals-toggle hooks remain active in process mode. Closes a step toward the RFC-0001 supervisor topology proposed in issue #1305 by drew. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * style(supervisor-process): rustfmt long debug! 
line Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-network): pull DenialEvent down from core DenialEvent is only emitted and consumed inside openshell-supervisor-network (proxy, bypass monitor, denial aggregator). It never crossed the leaf boundary, so the earlier lift to openshell-core was speculative. Move it back into the network crate where its only callers live. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(supervisor-network): pull procfs down from core procfs was lifted to openshell-core under the assumption it would be shared cross-leaf, but on the current branch all three callers (bypass_monitor, identity, proxy) live in openshell-supervisor-network. No file in openshell-supervisor-process imports it. Move the module to the network crate and drop sha2/hex from openshell-core, which were pulled in only for procfs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * style(supervisor-network): run cargo fmt Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * fix(supervisor-network): add libc dev-dependency for procfs tests The procfs/bypass_monitor/proxy test modules use libc::{fork, exec, fcntl, kill, waitpid} but the dep wasn't declared in this crate's Cargo.toml. It was previously satisfied transitively when these modules lived in openshell-core; the move left the test target unable to resolve libc. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(sandbox): move denial aggregator to orchestrator The denial aggregator and mechanistic mapper consume denial events produced by the proxy and (subsequently) the bypass monitor. With both supervisor leaves becoming pure producers of `DenialEvent`, the consumer-side aggregation belongs in the orchestrator, not in either leaf. Move `denial_aggregator.rs` and `mechanistic_mapper.rs` from `openshell-supervisor-network` to `openshell-sandbox` (the orchestrator). 
The orchestrator now owns the unbounded denial channel: it constructs `(tx, rx)`, hands `tx` to `run_networking` for the proxy to clone, drains `rx` via the aggregator task, and runs the gateway flush helper itself. `run_networking`'s signature gains a `denial_tx` parameter and loses its internal channel construction, aggregator spawn, and `flush_proposals_to_gateway` helper.

`DenialEvent` stays in `openshell-supervisor-network` for now; a follow-up commit will lift it to `openshell-core` alongside the bypass monitor relocation.

* refactor(supervisor-process): pull bypass monitor down from network

`bypass_monitor` is process-isolation machinery: it tails the kernel log via `dmesg --follow`, parses nftables LOG lines emitted from the workload's network namespace, resolves PIDs via `/proc`, and emits OCSF events plus optional `DenialEvent`s. None of this touches the proxy, OPA, TLS, or any other supervisor-network state; it only shared the denial channel because both feed the same aggregator.

Move `bypass_monitor.rs` from `openshell-supervisor-network` to `openshell-supervisor-process` (as `bypass_monitor/mod.rs`). Spawn it in `run_process` where the netns name and entrypoint PID are already in scope. The orchestrator hands an extra `bypass_denial_tx` clone of the denial channel sender to `run_process` for this purpose.

Lift `DenialEvent` from `openshell-supervisor-network` to `openshell-core`. Both supervisor leaves now produce it, so it needs a shared location that neither leaf depends on. This reverses an earlier commit that pulled the type into the network leaf when it was the only producer.

Copy the minimal subset of `/proc` parsers used by `bypass_monitor` into a private `bypass_monitor::procfs` submodule. The alternative, extracting a shared procfs crate, is a much larger refactor that this commit does not need; supervisor-network's `procfs.rs` continues to serve the proxy and identity cache.
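The channel ownership described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses `std::sync::mpsc` and threads instead of the `tokio::sync::mpsc` channel and async tasks the commits mention, and the `DenialEvent` fields, the function bodies, and the event strings are hypothetical stand-ins.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical, simplified stand-in for the real `DenialEvent` type.
#[derive(Debug, Clone, PartialEq)]
struct DenialEvent {
    source: &'static str,
    detail: String,
}

/// Stand-in for the network leaf: a pure producer (the proxy side).
fn run_networking(denial_tx: Sender<DenialEvent>) -> JoinHandle<()> {
    thread::spawn(move || {
        let event = DenialEvent {
            source: "proxy",
            detail: "denied CONNECT example.com:443".to_string(),
        };
        denial_tx.send(event).expect("orchestrator dropped the receiver");
    })
}

/// Stand-in for the process leaf: the bypass monitor produces into the same channel.
fn run_process(bypass_denial_tx: Sender<DenialEvent>) -> JoinHandle<()> {
    thread::spawn(move || {
        let event = DenialEvent {
            source: "bypass_monitor",
            detail: "nftables drop log line".to_string(),
        };
        bypass_denial_tx.send(event).expect("orchestrator dropped the receiver");
    })
}

/// Stand-in for the orchestrator: it constructs the channel, hands a sender
/// to each leaf, and is the only consumer of the receiver.
fn run_sandbox() -> Vec<DenialEvent> {
    let (tx, rx) = channel();
    let network = run_networking(tx.clone());
    let process = run_process(tx); // the last sender moves into the process leaf
    network.join().expect("network leaf panicked");
    process.join().expect("process leaf panicked");
    // Every sender has been dropped, so draining the receiver terminates.
    let mut events: Vec<DenialEvent> = rx.iter().collect();
    events.sort_by_key(|e| e.source); // deterministic order for the demo only
    events
}

fn main() {
    for event in run_sandbox() {
        println!("{event:?}");
    }
}
```

Neither leaf names the other in this shape; both only see a `Sender<DenialEvent>`, which is why the event type has to live in a crate below both of them.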
* refactor(supervisor-process): derive ssh netns fd inside run_process

The ssh_netns_fd was computed in run_networking purely to forward it through the Networking struct and back into run_process. supervisor-network never read it. Move the derivation to run_process where the NetworkNamespace handle is already in scope.

* refactor(supervisor-process): derive ssh proxy url inside run_process

The ssh_proxy_url was computed in run_networking purely to forward it through the Networking struct and back into run_process. supervisor-network never read it. Move the derivation to run_process where the NetworkNamespace handle and SandboxPolicy are already in scope.

After this commit the Networking struct no longer carries any SSH-shaped fields, and supervisor-network reads only host_ip from the netns (for the proxy bind address).

* refactor(supervisor-network): take proxy bind ip directly instead of netns

run_networking only ever read host_ip from the netns it was passed (the SSH plumbing reads moved to run_process in earlier commits). Replace the NetworkNamespace parameter with a plain Option<IpAddr> the orchestrator extracts. supervisor-network's run module no longer references the netns type for any consumer, only for create_netns_for_proxy (which still lives in this crate; relocates next).

* refactor(supervisor-process): move netns ownership out of core

Relocates the NetworkNamespace handle, nft ruleset builder, and create_netns_for_proxy constructor into openshell-supervisor-process. The orchestrator (openshell-sandbox) phantom-owns the RAII handle for the duration of run_sandbox; supervisor-network no longer references the type at all.

Drops uuid, libc, nix, openshell-ocsf, and tempfile from core's Linux target deps (all were exclusive to netns). tempfile becomes a Linux runtime dep on supervisor-process for nft ruleset application.
* chore(sandbox): prune leaf-only deps from orchestrator manifest

cargo-machete flagged 26 direct dependencies that were carried over from the pre-split monolith and are no longer used by the orchestrator itself: regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, anyhow, base64, bytes, flate2, glob, hex, hmac, nix, rand_core, rustls-pemfile, serde, serde_yml, sha1, sha2, thiserror, tokio-stream, uuid, webpki-roots.

These now live (transitively) in openshell-supervisor-network and openshell-supervisor-process where they are actually consumed.

* chore(deps): prune unused deps from supervisor crates

- Drop unused `url` from openshell-supervisor-network.
- Mark `prost` and `prost-types` as cargo-machete-ignored in openshell-core: they have no source-level `use`, but the tonic-generated proto code references them via `::prost::Message` etc.
- openshell-supervisor-process is already clean.

* fix(supervisor-network): wait for entrypoint PID before symlink probe

The OPA symlink-resolution task reads entrypoint_pid once at the top of the spawned closure. Because the spawn happens before run_process publishes the workload PID, the load returns 0, the probe path bakes in as /proc/0/root/, and the loop exhausts its retries against a path that does not exist on Linux. The reload never fires, so policies that whitelist symlinked binaries (e.g. /usr/bin/python3 → python3.11) get silent denials when the workload exec's the realpath.

Split the wait into two phases: 5s polling entrypoint_pid for a non-zero value, then the existing 5s window probing /proc/<pid>/root/. Distinct warn messages on each timeout so future debugging can tell "PID never published" apart from "container fs never appeared".
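
The two-phase wait from the entrypoint PID fix above might look something like the sketch below. It is written under assumptions: the real task is presumably async and logs warnings, whereas this version blocks the calling thread and returns a distinct error per phase. `ProbeError`, `poll_until`, and `wait_for_workload_root` are hypothetical names, and the procfs root is a parameter only so the logic can be exercised outside a sandbox.

```rust
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// One variant per phase, so a timeout says which wait failed.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeError {
    /// Phase 1 timed out: "PID never published".
    PidNeverPublished,
    /// Phase 2 timed out: "container fs never appeared" for this PID.
    RootNeverAppeared(u32),
}

/// Call `check` until it yields a value or `timeout` elapses.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}

/// Wait for a non-zero PID first, and only then build and probe the root path.
pub fn wait_for_workload_root(
    entrypoint_pid: &AtomicU32,
    proc_root: &Path,
    phase_timeout: Duration,
    interval: Duration,
) -> Result<PathBuf, ProbeError> {
    // Phase 1: re-read the PID on every iteration instead of once up front.
    let pid = poll_until(phase_timeout, interval, || {
        match entrypoint_pid.load(Ordering::Acquire) {
            0 => None,
            pid => Some(pid),
        }
    })
    .ok_or(ProbeError::PidNeverPublished)?;

    // Phase 2: the path is derived from a real PID, so it can never be <proc>/0/root.
    let root = proc_root.join(pid.to_string()).join("root");
    poll_until(phase_timeout, interval, || root.exists().then(|| root.clone()))
        .ok_or(ProbeError::RootNeverAppeared(pid))
}
```

With the timeouts from the commit message, a caller would pass `Path::new("/proc")` and `Duration::from_secs(5)` for each phase, and map the two error variants to the two distinct warnings.
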
* fix(sandbox): restore GPU procfs baseline (#1522)

Signed-off-by: Evan Lezar <elezar@nvidia.com>
(cherry picked from commit 5102cb9)
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* fix(supervisor-process): use renamed tonic tls-native-roots feature

Upstream renamed the tonic `tls` feature to `tls-native-roots`. The supervisor-process Cargo.toml still referenced the old name, which broke the workspace build after merging upstream.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* refactor(supervisor-network): relocate token_grant and spiffe_endpoint

Upstream's SPIFFE-backed token grant feature landed in crates/openshell-sandbox/src/. After the supervisor split, the L7 enforcement code in supervisor-network calls into token_grant, which would require supervisor-network to depend back on sandbox. Move token_grant.rs and spiffe_endpoint.rs into supervisor-network where the only callers live, add the reqwest and spiffe deps to supervisor-network's Cargo.toml, and drop them from sandbox.

Also fix two stale `openshell_core::proto::` self-references in openshell-core (a pre-existing breakage that surfaced once the rest of the merge compiled).

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

* fix(supervisor-process): broaden Path import cfg to all unix targets

The `Path` import was gated on `cfg(any(test, target_os = "linux"))`, but `prepare_read_write_path` is gated on `cfg(unix)`, which is broader. On non-Linux unix the function still referenced `&std::path::Path` explicitly, so upstream's qualified path was load-bearing. After the supervisor split, lint runs on Linux where `Path` IS in scope, so `unused_qualifications` fires. Broaden the import cfg to match the function's cfg and use the bare `Path` name everywhere.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

---------

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
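
For readers arriving from the cherry-picked commit, the behaviour this PR restores (promoting `/proc` from `read_only` to `read_write` when the GPU read-write baseline requires it, while leaving other conflicts alone) can be sketched as below. This is an illustration under assumptions, not the code in `openshell-sandbox`: `FilesystemPolicy` is reduced to two path lists, `enrich_gpu_read_write_baseline` is a hypothetical name, the informational log is returned as strings rather than emitted, and removing the promoted entry from `read_only` is a guess at what "promote" means.

```rust
use std::path::{Path, PathBuf};

/// Simplified stand-in for the sandbox filesystem policy (assumed shape).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct FilesystemPolicy {
    pub read_only: Vec<PathBuf>,
    pub read_write: Vec<PathBuf>,
}

/// Merge the GPU read-write baseline into `policy`.
/// Returns the informational messages a real implementation would log.
pub fn enrich_gpu_read_write_baseline(
    policy: &mut FilesystemPolicy,
    baseline: &[&Path],
) -> Vec<String> {
    let mut notes = Vec::new();
    for &path in baseline {
        if policy.read_write.iter().any(|p| p.as_path() == path) {
            continue; // already writable, nothing to do
        }
        let in_read_only = policy.read_only.iter().any(|p| p.as_path() == path);
        if in_read_only {
            if path != Path::new("/proc") {
                // Other read-only/read-write conflicts keep the existing
                // behaviour: the read-only entry wins and the path is skipped.
                continue;
            }
            // Assumption: promotion moves the entry rather than duplicating it.
            policy.read_only.retain(|p| p.as_path() != path);
            notes.push(format!(
                "promoting {} to read_write for GPU runtime compatibility",
                path.display()
            ));
        }
        policy.read_write.push(path.to_path_buf());
    }
    notes
}
```

The special case is deliberately limited to `/proc`; the PR description defers broader conflict handling to the follow-up in #1629.
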

elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 22, 2026 13:47

elezar mentioned this pull request May 22, 2026

fix(sandbox): decouple GPU baseline from network policy #1524

Merged

6 tasks

elezar changed the base branch from main to fix/1486-gpu-enrichment-no-network/elezar May 22, 2026 14:06

Base automatically changed from fix/1486-gpu-enrichment-no-network/elezar to main May 27, 2026 08:20

elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from 96a1caa to 59e399a Compare May 27, 2026 09:02

elezar mentioned this pull request May 28, 2026

feat(gpu): derive sandbox access requirements from CDI specs #1606

Open

17 tasks

elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from 12bde4d to d73e6de Compare May 28, 2026 19:22

pimlock reviewed May 29, 2026

Comment thread crates/openshell-sandbox/src/sandbox/linux/landlock.rs Outdated

This was referenced May 29, 2026

feat(sandbox): narrow GPU procfs permissions and surface runtime additions #1628

Open

feat(policy): add runtime baseline conflict controls #1629

Draft

elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch 2 times, most recently from 2f3b5b2 to a0171ff Compare June 1, 2026 18:29

elezar changed the title ~~fix(sandbox): restore GPU filesystem baseline~~ fix(sandbox): restore GPU procfs baseline Jun 1, 2026

elezar requested a review from pimlock June 1, 2026 19:29

fix(sandbox): restore GPU procfs baseline

c828f23

Signed-off-by: Evan Lezar <elezar@nvidia.com>

elezar force-pushed the fix/1486-gpu-sandbox-filesystem-policy/elezar branch from a0171ff to c828f23 Compare June 2, 2026 08:24

pimlock approved these changes Jun 2, 2026

elezar merged commit 5102cb9 into main Jun 3, 2026
26 checks passed

elezar deleted the fix/1486-gpu-sandbox-filesystem-policy/elezar branch June 3, 2026 09:08

vtripathy mentioned this pull request Jun 3, 2026

--gpu sandboxes can't initialize CUDA — seccomp blocks memfd_create (CUDA error 304) #1696

Open

truffle-dev mentioned this pull request Jun 5, 2026

OpenShell GPU sandbox: CUDA cuInit fails under Landlock on Spark/GB10 despite nvidia-smi working NVIDIA/NemoClaw#4016

Open

2 tasks
