fix(sandbox): restore GPU procfs baseline #1522
🌿 Preview your docs: https://nvidia-preview-pr-1522.docs.buildwithfern.com/openshell
pimlock left a comment:
LGTM with a few nits and questions.
Thanks for your initial review, @pimlock. After the initial back and forth, I realised that there were a number of edge cases that I was not considering. I believe I was trying to detect user intent with insufficient signal, and as such I have updated this PR to ALWAYS promote
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Thanks! I took a first pass at #1629 and I like the approach. I think it's great for the mechanism to be more explicit and exposing it through the policy makes sense, so the full picture of what's allowed is in the policy.
* fix(ci): eliminate image-tag race between concurrent workflows (#1413)
- Add publish-manifest input to docker-build.yml (default true); single-arch
branch callers set it false so the merge job is skipped and the shared
bare :SHA tag in GHCR is never written by branch workflows
- branch-kubernetes-e2e: retag :SHA-amd64 to :SHA before kind load so
Helm's image.tag matches what is loaded in kind containerd
- branch-e2e: pass image-tag as :SHA-arm64 to e2e-test so the arch-specific
GHCR tag is used directly without depending on the bare tag
- bare :SHA in GHCR is now written only by test-gpu.yml (multi-arch build),
eliminating the last-writer-wins race across concurrent workflows
* test(server): cover service endpoint plaintext security (#1352)
* test(server): cover service endpoint plaintext security
* test(server): align tls test with from_files Option<&Path> signature
TlsAcceptor::from_files now accepts the client CA path as Option<&Path>
(per the require_client_auth refactor on main). Wrap the helper's CA
path in Some(...) so the new plaintext-service-http tests compile after
rebasing onto current main.
---------
Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
* fix(cli): add auth and TLS support to completion client (#1489)
* fix(scripts): use portable lowercase in normalize_bool for Bash 3.2 (#1493)
* refactor(server): extract shared relay-await and sandbox-scan helpers (#1495)
* fix(sandbox): skip fork-exec socket ambiguity test on SELinux-enforcing hosts (#1449)
Exec'ing /bin/sleep (SELinux label bin_t) from a user_home_t test binary
causes /proc/<pid>/exe readlink to return ENOENT on SELinux-enforcing
hosts due to the cross-domain boundary. Skip the test at runtime when
getenforce reports Enforcing.
Also adds a ChildGuard drop guard for safe child cleanup on panic and
increases the exec-detection deadline from 2s to 5s.
Signed-off-by: Derek Carr <decarr@redhat.com>
* fix(sandbox): allow first-label L7 host wildcards (#1304)
* fix(sandbox): allow first-label L7 host wildcards
* docs(sandbox): document L7 host wildcard contract + add OPA runtime tests
- Add Host Wildcards section to architecture/security-policy.md
describing accepted (first-label *, **, intra-label *-X) and
rejected (bare, TLD, non-first-label, recursive-in-label) forms,
and noting that wildcards never cross '.' boundaries.
- Expand the policy-schema.mdx 'host' field description to reflect
the same contract instead of only mentioning '*.example.com'.
- Add OPA runtime tests asserting '*-aiplatform.googleapis.com'
matches 'us-central1-aiplatform.googleapis.com' and does not match
'us-central1.aiplatform.googleapis.com' (cross-dot boundary). Locks
validator/runtime alignment for intra-label wildcards.
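The wildcard contract described above (first-label wildcards only, never crossing a '.' boundary) can be sketched roughly as follows. This is an illustrative Python model, not the project's OPA or validator code: `host_matches` is a hypothetical helper, the choice that `*` needs at least one character is an assumption, and the `**` form is deliberately left out.

```python
import re


def host_matches(pattern: str, host: str) -> bool:
    """Label-wise host matching; hypothetical helper, not the real policy engine."""
    if "**" in pattern:
        # Recursive wildcards are outside the scope of this sketch.
        raise ValueError("'**' is not modelled here")
    p_labels = pattern.lower().split(".")
    h_labels = host.lower().split(".")
    # Comparing label counts is what keeps '*' from spanning a '.' boundary.
    if len(p_labels) != len(h_labels):
        return False
    for index, (p, h) in enumerate(zip(p_labels, h_labels)):
        if "*" in p:
            if index != 0:
                return False  # wildcards are only accepted in the first label
            # Assumption: '*' stands for one or more non-dot characters.
            regex = "[^.]+".join(re.escape(part) for part in p.split("*"))
            if re.fullmatch(regex, h) is None:
                return False
        elif p != h:
            return False
    return True
```

With this model, `*-aiplatform.googleapis.com` matches `us-central1-aiplatform.googleapis.com` but not `us-central1.aiplatform.googleapis.com`, which mirrors the two cases named in the commit message.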
* chore: update mise lockfile
* test(server): tolerate serialized inference upserts
---------
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(cli): add JSON/YAML output format to gateway list (#1500)
Add -o/--output flag to `openshell gateway list` matching the existing
sandbox list pattern, enabling machine-readable output for scripting.
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
* refactor: deduplicate repeated patterns across crates (#1499)
Remove ~280 lines of duplicated code across 30 files in 5 areas:
- centered_rect: consolidate 5 identical TUI layout helpers into a
single pub fn in openshell-tui/src/ui/mod.rs
- server test helpers: replace ~100 inline Store::connect() calls
with local test_store() helpers; deduplicate test_server_state()
in grpc/service.rs to use the shared test_support version
- rogue PKI: extract 20-line rogue CA+client cert generation block
(duplicated in two integration tests) into generate_rogue_pki()
in tests/common/mod.rs
- provider tests: replace 8 identical 28-line test modules with a
single macro_rules! test_discovers_env_credential! invocation
- label constants: centralize openshell.ai/ container label keys
in openshell-core::driver_utils; update Docker and Kubernetes
drivers to import from there instead of redefining them locally
* fix(ci): resolve mirror gate statuses for fork PRs (#1504)
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
* fix(server): respect OPENSHELL_PODMAN_SOCKET env var in embedded driver (#1483)
The env var was only wired up via clap in the standalone
openshell-driver-podman binary. When the Podman driver runs embedded
in the gateway, config came exclusively from TOML deserialization and
the env var was never consulted. Apply it as a post-deserialization
override, matching the existing OPENSHELL_K8S_WORKSPACE_DEFAULT_STORAGE_SIZE
pattern.
Closes #1446
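A post-deserialization override of the kind described above might look like the following Python sketch. The real fix lives in the Rust gateway; `apply_env_overrides` and the `socket_path` field name are hypothetical, and only the environment variable name comes from the commit message.

```python
import os

ENV_VAR = "OPENSHELL_PODMAN_SOCKET"  # name taken from the commit message


def apply_env_overrides(config: dict, environ=os.environ) -> dict:
    """Return a copy of the already-deserialized config with the env override applied."""
    merged = dict(config)
    socket = environ.get(ENV_VAR)
    if socket:  # unset or empty leaves the TOML-derived value in place
        merged["socket_path"] = socket  # hypothetical field name
    return merged
```

Passing `environ` as a parameter keeps the override testable without mutating the process environment.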
* refactor(sandbox,driver-vm): Start moving to rustix (esp over libc unsafe) (#1505)
In the Rust ecosystem there are largely three ways to do system calls:
- raw libc
- nix
- rustix
Of the three, libc is almost all `unsafe` and really 95% of use
cases should be either nix or rustix. nix is the original one,
but after having looked at the code of both, I think rustix
is just better designed and organized. It's also reached 1.0,
whereas nix is still making semver-breaking changes (in fact
we're behind here in this project).
Now in practice, we have both *transitively* in the depchain
already, and that's true for quite a lot of projects.
But I think rustix is better, so let's add rustix as
a workspace dependency (process feature) and migrate
a few use cases to it - it's especially better than the raw
libc which is surprisingly widespread.
If we agree to do this, then many other calls can be ported.
Signed-off-by: Colin Walters <walters@verbum.org>
* fix(packaging): add upgrade migration docs and podman socket retry (#1507)
After #1415 ships, users upgrading from previous releases need guidance
on the gateway.env deprecation, port/bind/database path changes, and
the podman.socket restart requirement.
- docs(rpm): add 'Migrating from gateway.env' section to TROUBLESHOOTING
covering backward compatibility, env-to-TOML key mapping, and three
breaking changes (default port 8080->17670, bind address 0.0.0.0->127.0.0.1,
database path move). Add podman.socket restart step to upgrade procedure.
- docs(rpm): add upgrade callout to CONFIGURATION.md pointing at migration
section.
- fix(podman): retry PodmanComputeDriver ping up to 5 times with 2s delay
to tolerate transient socket unavailability after package upgrades.
The systemd unit uses Wants=podman.socket (not Requires) so the gateway
can start while the socket is briefly re-activating after an RPM upgrade
changes its unit file on disk.
- chore(rpm): update EnvironmentFile comment in RPM spec to explain
backward-compatibility intent.
Signed-off-by: Adam Miller <admiller@redhat.com>
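The bounded retry described for the Podman ping (up to 5 attempts, 2s apart) follows a common pattern, sketched here in Python. `ping_with_retry` is a hypothetical name, the choice of `OSError` as the transient error is an assumption, and whether the real driver also sleeps after the final failure is not stated in the source; this sketch does not.

```python
import time


def ping_with_retry(ping, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call ping() until it succeeds or the attempts are exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return ping()
        except OSError as err:  # assumption: socket unavailability surfaces as OSError
            last_error = err
            if attempt < attempts:
                sleep(delay_s)  # wait before the next attempt, not after the last
    raise last_error
```

Injecting `sleep` lets a test observe the delays without actually waiting.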
* ci: deduplicate e2e workflows (#1512)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(auth): per-sandbox authentication to gateway (#1404)
* docs(sandboxes): add policy advisor guide (#1480)
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* fix(docker): use host-gateway callbacks on macOS (#1516)
* ci(e2e): load single-arch images into kind (#1518)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* docs(rfc): add sandbox resource requirements proposal (#1360)
* docs(rfc): add sandbox resource requirements proposal
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* docs(rfc): finalize sandbox resource requirements
---------
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* ci(canary): keep helm jwt secret generation enabled (#1521)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* fix(cli): add json output for policy get (#1410)
* fix(cli): add json output for policy get
* test(cli): cover policy get full json output
* fix(cli): address policy get json clippy
---------
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(providers): derive discovery from profiles (#1503)
* feat(providers): derive discovery from profiles
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* fix(providers): keep v2 discovery profile-only
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* docs(providers): update providers v2 behavior
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* fix(providers): make github profile read-only
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
---------
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* docs: update NemoClaw/OpenClaw references (#1529)
* ci: seed shared Rust caches from main (#1530)
* fix(release): build host Linux binaries with glibc floor (#1490)
* fix(homebrew): repair local driver bootstrap state (#1527)
* fix(homebrew): repair local driver bootstrap state
* fix(bootstrap): satisfy default SAN doc lint
* ci: install cargo-zigbuild from release binaries (#1533)
* fix(cli): propagate --gateway-insecure to OIDC auth flows (#1535)
Thread the gateway_insecure flag through gateway_add(), gateway_login(),
and all OIDC HTTP clients so that --gateway-insecure and
OPENSHELL_GATEWAY_INSECURE apply to OIDC discovery, token exchange, and
token refresh requests.
Previously, the flag only affected gRPC connections to the gateway. OIDC
HTTP clients (reqwest::get and http_client) always verified TLS
certificates, causing gateway registration and login to fail when the
OIDC issuer used a self-signed certificate (common on OpenShift with
edge-terminated routes).
Fixes #1534
Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>
* ci(release): smoke test rpm artifacts on fedora (#1558)
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
* chore(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#1554)
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](https://github.com/docker/login-action/compare/4907a6ddec9925e35a0a9e82d7399ccc52663121...650006c6eb7dba73a995cc03b0b2d7f5ca915bee)
---
updated-dependencies:
- dependency-name: docker/login-action
dependency-version: 4.2.0
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* chore(helm): add missing SPDX header to gateway-config template (#1545)
* chore(helm): add missing SPDX header to gateway-config template
* chore(scripts): remove helm templates from license header exclusions
The bypass had no known rationale. Removing it ensures the header
script covers deploy/helm/openshell/templates uniformly going forward.
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
---------
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* ci(release): skip python rpm in gateway smoke test (#1559)
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
* ci: pin azure/setup-helm and helm/kind-action to commit SHAs (#1544)
* ci: pin azure/setup-helm and helm/kind-action to commit SHAs
* chore(python): add py.typed marker for PEP 561 compliance
* ci: use full semver in pinned action version comments
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
---------
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* refactor: deduplicate shared code across ocsf builders and driver crates (#1526)
Extract repeated patterns into shared helpers:
- Add impl_builder_setters! macro to openshell-ocsf/builders that
generates the identical severity(), status(), and message() setter
methods present on all 7 OCSF event builders
- Add SandboxContext::apply_common_fields() to consolidate the
four-line build() finalization (set_status, set_message, set_device,
set_container) repeated in every builder
- Add driver_utils::sandbox_token_path() to centralize the XDG state
path construction for sandbox JWT files used by both the Docker and
Podman drivers
- Add driver_utils::build_capabilities_response() to eliminate the
identical GetCapabilitiesResponse struct literal repeated across the
Docker, Podman, and Kubernetes compute drivers
* fix(python): raise SandboxError instead of FileNotFoundError or KeyError (#1547)
* fix(python): raise SandboxError instead of FileNotFoundError or KeyError
* fix(python): suppress exception chaining in SandboxError raises
Add `from None` to both `raise SandboxError(...)` calls inside `except
FileNotFoundError` blocks to satisfy ruff B904.
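As a minimal illustration of the `from None` pattern mentioned above, the sketch below translates a `FileNotFoundError` into a domain error and suppresses the chained traceback. The `SandboxError` class here is a stand-in and `read_token` is a hypothetical function, not the SDK's code.

```python
class SandboxError(Exception):
    """Stand-in for the SDK's error type."""


def read_token(path: str) -> str:
    """Read a token file, reporting a missing file as SandboxError."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    except FileNotFoundError:
        # `from None` sets __suppress_context__, so the traceback shows only
        # the SandboxError rather than "During handling of the above exception...".
        raise SandboxError(f"token file not found: {path}") from None
```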
* fix(scripts): replace mapfile with bash 3.2-compatible read loop in helm-k3s-local (#1539)
macOS ships bash 3.2 which lacks mapfile/readarray. Replace all three
occurrences in configure_ghcr_credentials, cluster_has_image, and
cluster_image_platform with a portable while-read loop, consistent
with the fix applied to docker-build-image.sh in #1334.
* docs: add macOS compiler troubleshooting (#1569)
Signed-off-by: Ann Marie Fred <afred@redhat.com>
* fix(gateway): configure local dev auth (#1575)
This makes it so you can run the dev gateway and sandbox with:
```
mise run gateway
# in another shell
mise run sandbox
```
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* docs: add Pi as supported sandbox (#1572)
* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split (#1412)
* fix(sandbox): add mechanistic smoke test for L4 deny and document the L4/L7 split
The old smoke script exercised an L7 PUT which hung because the denial
aggregator is only wired to L4 CONNECT denies, not L7 enforcement.
Add mechanistic-smoke.sh which triggers an L4 deny, waits for the
aggregator to flush, and asserts a pending chunk appears under
openshell rule get --status pending.
Document the intentional L4-only scope of the mechanistic mapper in
architecture/sandbox.md.
Fixes #1333
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* refactor(smoke): remove redundant variable inits and merge double step call
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* fix(smoke): wire mechanistic smoke into mise and guard TMP_DIR
- Initialize TMP_DIR before trap to prevent unbound variable on early exit
- Add e2e:mechanistic-smoke mise task with gateway setup
- Document mechanistic smoke in policy-advisor README
* test(proxy): verify L4 deny enqueues a DenialEvent
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* fix(proxy): remove unnecessary path qualifications in L4 denial smoke test
---------
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
* docs(readme): whitespace (#1578)
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* fix(cli): replace outdated name reference (#1582)
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* fix(sandbox): probe Landlock before build, skip on unsupported kernels (#1585)
On kernels without Landlock (e.g. gVisor's sentry returns ENOSYS for
syscall 444), the previous best_effort path still logged "Applying
Landlock" + "Landlock ruleset built" events even though no enforcement
was happening. Probe at the top of `landlock::prepare` and short-circuit
with a single High-severity "Sandbox Unavailable" finding.
Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
* fix(sandbox): decouple GPU baseline from network policy (#1524)
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* docs(kubernetes): note that Sandbox volumeClaimTemplates is immutable (#1543)
* fix(sandbox): use succinct endpoint denial reason (#1584)
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* feat(docker): add provisioning progress events (#1567)
* docs(kubernetes): add RBAC section to setup page (#1540)
Documents the ServiceAccount, Role, and ClusterRole created by the Helm
chart inline on the setup page, per reviewer feedback on #1250. Reflects
the current chart templates including pods/get for sandbox identity and
tokenreviews/create for projected token validation.
Closes #1018
* fix(sandbox): delegate PID limits to runtimes (#1497)
Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
* fix(gateway): make readiness health checks dependency-aware (#1328)
* feat(gateway): add readiness probe metrics and test-only store close
Emit Prometheus readiness metrics for database probes (healthy gauge and
outcome-labeled latency histogram) with coverage in health HTTP tests.
Restrict Store::close behind test support cfg to prevent accidental runtime
pool shutdown under live traffic.
Signed-off-by: Adrien Langou <alangou@nvidia.com>
* test(e2e): add simple e2e test with kubernetes to test /readyz
Signed-off-by: Adrien Langou <alangou@nvidia.com>
---------
Signed-off-by: Adrien Langou <alangou@nvidia.com>
* fix(vm): scope rootfs cache by openshell version (#1587)
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
* fix(cli): preserve symlinks during sandbox upload (#1595)
* fix(cli): preserve symlinks during sandbox upload
* docs(sandboxes): document upload symlink behavior
* fix(core): preserve SSH gateway default ports (#1602)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(server): declare gRPC auth (mode + scope + role) at the handler, enforce at the router (#1596)
* feat(server): per-handler gRPC auth annotations
Move scope, role, and auth-mode metadata to the handler definition site
via #[rpc_authz] + #[rpc_auth] proc macros. The previously hand-maintained
SCOPED_METHODS, ADMIN_METHODS, UNAUTHENTICATED_METHODS, and
ALLOWED_SANDBOX_METHODS tables are now generated from per-method
annotations on the tonic service impls, with canonical gRPC paths
derived from the service name and method name.
Adds a new openshell-server-macros proc-macro crate, an aggregator in
auth/method_authz.rs, and an exhaustiveness test that decodes the
protobuf FileDescriptorSet (now emitted by openshell-core/build.rs) and
verifies every RPC has an annotation.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* refactor(server): rename `sandbox-secret` auth mode to `sandbox`
PR #1404 replaced the shared sandbox secret with per-sandbox
gateway-minted JWTs. A handler marked `sandbox` now authenticates as a
specific `Principal::Sandbox`, not as a holder of a shared credential.
Rename `auth = "sandbox-secret"` to `auth = "sandbox"` and
`AuthMode::SandboxSecret` to `AuthMode::Sandbox` so the name matches
the post-#1404 identity model.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* fix(server): enforce per-handler AuthMode at the router
Addresses review feedback on the per-handler auth-annotation work.
- Router-level enforcement of #[rpc_auth] auth mode (HIGH). The previous
router only checked is_sandbox_callable() for Principal::Sandbox; user
principals still flowed into AuthzPolicy::check() and bypassed the
per-handler declaration. A user with `openshell:all` could therefore
reach `sandbox`-only handlers like GetSandboxProviderEnvironment,
ReportPolicyStatus, PushSandboxLogs, and SubmitPolicyAnalysis even
though their annotations said sandbox-only. Adds an
is_user_callable() predicate and rejects User principals at the
router for `sandbox` / `unauthenticated` methods.
- Proc macro now errors on duplicate keys in #[rpc_auth(...)] (LOW). A
second `auth`, `scope`, or `role` previously silently overwrote the
first value; now it fails to compile.
- Regression tests: a unit test for is_user_callable() and a router
test that proves a user with admin role + openshell:all cannot reach
the nine sandbox-only handlers.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
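The router-level check described above can be modelled with a small Python sketch. This is not the Rust implementation: the `AUTH_MODES` table, the `"user"` mode label, and the `route` function with its return strings are hypothetical, while the method names and the `is_user_callable` / `is_sandbox_callable` predicates are named in the commit messages.

```python
# Hypothetical per-method auth modes; in the real code these come from
# #[rpc_auth] annotations on the handlers.
AUTH_MODES = {
    "GetSandboxProviderEnvironment": "sandbox",
    "ReportPolicyStatus": "sandbox",
    "PushSandboxLogs": "sandbox",
    "SubmitPolicyAnalysis": "sandbox",
    "Health": "unauthenticated",
    "CreateSandbox": "user",  # assumed label for user-callable methods
}


def is_user_callable(method: str) -> bool:
    # Unknown methods fail closed.
    return AUTH_MODES.get(method) == "user"


def is_sandbox_callable(method: str) -> bool:
    return AUTH_MODES.get(method) == "sandbox"


def route(principal_kind: str, method: str) -> str:
    """Reject a principal whose kind does not match the handler's declared mode."""
    if principal_kind == "user" and not is_user_callable(method):
        return "PERMISSION_DENIED"
    if principal_kind == "sandbox" and not is_sandbox_callable(method):
        return "PERMISSION_DENIED"
    return "CONTINUE"  # scope and role checks would run after this point
```

The point of the sketch is ordering: the principal kind is checked against the declared mode before any scope or role policy runs, so a broad scope such as `openshell:all` cannot reach a sandbox-only handler.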
* docs(server): finish renaming sandbox-secret to sandbox in method_authz doc comments
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* refactor(server-macros): drop standalone `rpc_auth` stub
The stub was a safety net that fired only when a method had
`#[rpc_auth(...)]` without an enclosing `#[rpc_authz]`. Triggering it
required `rpc_auth` to be imported, which is why both call sites carried
`#[allow(unused_imports)] use openshell_server_macros::{rpc_auth, rpc_authz};`.
Drop the stub and the unused-import workaround. A missing `#[rpc_authz]`
now surfaces as rustc's standard "cannot find attribute `rpc_auth` in
this scope" — clear enough, and one fewer import + lint exception.
Addresses review comment on PR #1596.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* refactor(server-macros): emit fixed `AUTH_METADATA` const per service
The previous trait-derived const name turned `OpenShell` into
`OPEN_SHELL_AUTH_METADATA`, splitting the project name across an
underscore. Each impl already lives in its own module
(`crate::grpc::`, `crate::inference::`), so the module path is enough
to disambiguate between services — a fixed `AUTH_METADATA` name reads
more naturally.
Aggregator in `auth/method_authz.rs` now references
`crate::grpc::AUTH_METADATA` and `crate::inference::AUTH_METADATA`
directly.
Addresses review comment on PR #1596.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* docs(server-macros): fix typo in AUTH_METADATA_CONST doc comment
OpenShell is one word; reference name in the doc should be
OPENSHELL_AUTH_METADATA, not OPEN_SHELL_AUTH_METADATA.
Addresses review nit on PR #1596.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
---------
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* ci(snap): add snap release pipeline (#1600)
* docs: refresh landing terminal demo and apply NVIDIA fern theme (#1615)
- Extract landing-page terminal demo into a reusable <CommandTerminal />
component with inline styles (no global CSS dependency)
- Animate a second command line cycling through claude/opencode/codex
via @keyframes scoped inside the component
- Inline BadgeLinks layout styles so the component renders correctly
without relying on .badge-links from main.css
- Add jsx.d.ts shim so editors do not flag the React global in component
TSX files
- Switch fern instance to global-theme: nvidia with multi-source enabled
- Bump fern CLI to 5.40.0 and drop the basepath-aware experimental flag
- Register fern/components/ as a second mdx-components directory
- Remove the unused Adobe analytics script tag
* build(macos): remove unused import of tracing::warn (#1619)
Signed-off-by: Calum Murray <cmurray@redhat.com>
* chore: align .python-version with mise.toml (#1618)
Signed-off-by: Calum Murray <cmurray@redhat.com>
* feat(helm): add optional PostgreSQL backing store (#1579)
* feat(helm): add optional PostgreSQL backing store with Secret-based credentials
- Add postgres.enabled and postgres.deploy values to control database
backend (SQLite vs PostgreSQL) and subchart deployment independently.
- Introduce db-secret.yaml template for Opaque Secret with assembled
postgresql:// connection string injected via OPENSHELL_DB_URL env var.
- Add Bitnami PostgreSQL as optional subchart dependency keyed on
postgres.deploy to prevent subchart deployment in external mode.
- Externalize JWT signing key file mode via sandboxJwt.secretDefaultMode
with 0400 default matching upstream.
- Add validation guard for postgres.deploy=true without postgres.enabled.
- Add helm unit tests covering internal, external, URL-override, special
character encoding, and misconfiguration error paths.
- Update README with Kubernetes and OpenShift install examples for
bundled and external PostgreSQL configurations.
- Add helm dependency build to lint and unittest tasks.
* fix(helm): add database backend docs to README.md.gotmpl and regenerate
The helm-docs CI check failed because the Database backend section was
added directly to README.md instead of README.md.gotmpl. Move the
content to the template and regenerate so the check passes.
* fix(helm): use Secret-based DB credentials and support existingSecret
Replace the inline db-url stringData pattern with a proper Secret
containing individual fields plus a uri key. When postgres.deploy=true
the Bitnami service-binding secret is referenced directly; when
deploy=false users can supply postgres.external.existingSecret to
bring their own Secret, or let the chart generate one from the external
field values.
Also restructures the README database section for clarity, adds
helm-unittest coverage for the new secret resolution paths, and
fixes a markdown lint issue in the root README.
* refactor(helm): move OpenShift e2e script to e2e/rust/ and add mise task
Move test-openshift-scenarios.sh from deploy/helm/openshell/ci/ to
e2e/rust/e2e-openshift.sh, matching the existing e2e script naming
convention. Register it as `e2e:openshift` in tasks/test.toml — not
wired into the `test` or `e2e` aggregates so it only runs on explicit
invocation against a live OpenShift cluster.
* feat(e2e): add database backend scenarios to Kubernetes e2e
Extend with-kube-gateway.sh with an optional multi-scenario loop gated
by OPENSHELL_E2E_KUBE_DB_SCENARIOS=1. When enabled, the script installs
the Helm chart three times — SQLite (default), bundled PostgreSQL, and
external PostgreSQL with existingSecret — running the full test suite
against each backend. When unset, existing single-install behavior is
unchanged.
Also adds helm dependency build before helm install, fixing CI failures
caused by the missing PostgreSQL subchart dependency.
* refactor(helm): simplify PostgreSQL config to two orthogonal controls
Replace postgres.deploy and postgres.external.* with two simple controls:
- postgres.enabled: deploy the bundled Bitnami PostgreSQL subchart
- server.externalDbSecret: name of a pre-existing Secret with a uri key
Delete db-secret.yaml — the chart no longer generates Secrets from
individual credential fields. Users either get the Bitnami service-binding
secret (bundled) or bring their own via server.externalDbSecret.
Add validation that postgres.serviceBindings.enabled must stay true
when using bundled PostgreSQL, preventing a confusing runtime failure.
* docs(config): update gateway config reference (#1624)
* feat(flake): add Nix development shell (#1592)
* feat(build): add simple nix flake with formatter for nix code
* feat(flake): setup rust toolchain, able to build and run unit tests
* feat(flake): add support for arm linux and macos
* feat(toolchain): add rust-src and rust-analyzer to the toolchain
* refactor(proto): move phase and current_policy_version into status (#1565)
* refactor(proto): move phase and current_policy_version into SandboxStatus
Move phase and current_policy_version from SandboxSpec into
SandboxStatus to correctly model mutable runtime state. Update all
callers in the gateway server, TUI, and Python SDK to read and write
these fields through SandboxStatus accessors.
Signed-off-by: Derek Carr <decarr@redhat.com>
* fix(server): preserve sandbox status on statusless driver updates
When a driver update arrives without a status payload (e.g. before
Kubernetes populates the status subresource), preserve the stored
phase, conditions, and current policy version instead of resetting
them. Adds a regression test covering the edge case.
Signed-off-by: Derek Carr <decarr@redhat.com>
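The status-preserving merge described above can be sketched as follows, using plain dictionaries in place of the protobuf messages. `merge_driver_update` and the record layout are hypothetical; only the field names `phase` and `current_policy_version` come from the commit message.

```python
def merge_driver_update(stored: dict, update: dict) -> dict:
    """Apply a driver update, keeping the stored status if the update has none."""
    merged = {**stored, **update}
    if update.get("status") is None:
        # A statusless update must not reset phase, conditions, or policy version.
        merged["status"] = stored.get("status")
    return merged
```

For example, a stored record with `{"phase": "Ready", "current_policy_version": 3}` keeps that status when the incoming update carries `status: None`.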
---------
Signed-off-by: Derek Carr <decarr@redhat.com>
* feat(python-sdk): support OIDC Bearer auth on SandboxClient (#1621)
* feat(python-sdk): support OIDC Bearer auth on SandboxClient
PR #1596 hardened the gateway side of the OIDC story; the Python SDK
was the remaining gap — it only supported plaintext or mTLS, with no
Bearer metadata anywhere. Deployments with OIDC enabled (the
recommended posture since PR #935 / PR #1404) were unreachable from
the SDK.
Adds:
- `bearer_token: str | Callable[[], str] | None` kwarg on
`SandboxClient`. Static strings or zero-arg callables (the latter
is invoked per RPC, so callers can drop in a refresh loop or
token-file watcher without reconstructing the client). Composes
with `tls` for OIDC-over-mTLS deployments.
- `_BearerAuthInterceptor` implementing all four
`grpc.{Unary,Stream}{Unary,Stream}ClientInterceptor` types.
Appends `authorization: Bearer <token>` to outgoing metadata.
Implemented as an interceptor (not call credentials) so it works
on both plaintext (`disableTls=true` dev) and TLS channels without
`grpc.composite_channel_credentials`.
- `TlsConfig` ergonomics: all three fields (`ca_path`, `cert_path`,
`key_path`) are now optional with `cert_path` / `key_path`
required-together-or-not-at-all (enforced in `__post_init__`). This
unlocks three transport profiles from one dataclass:
* full mTLS (all three)
* CA-only trust (`ca_path` only)
* system roots (`TlsConfig()` — for OIDC gateways behind a
public CA)
- `from_active_cluster` mirrors `crates/openshell-tui/src/lib.rs`
`build_oidc_channel`:
* For any `https://` gateway, always build a secure channel.
Pick the strongest TLS profile available in `mtls/` (full
mTLS → CA-only → system roots). No more `insecure_channel`
fallback for HTTPS.
* Gate OIDC bearer attachment on
`metadata.json["auth_mode"] == "oidc"`. Matches
`crates/openshell-cli/src/main.rs:132` and the TUI; a stale
`oidc_token.json` next to a non-OIDC gateway no longer causes
the SDK to attach a bearer.
- `_OidcRefresher` — thread-safe, in-process native OAuth2 refresh
modeled on `google.oauth2.credentials.Credentials` and
`botocore.tokens.SSOTokenProvider`. Lazily checks expiry on every
RPC; when stale, re-reads disk first (the CLI may have rotated
the bundle), and only then exchanges the refresh_token against
the IdP's token endpoint discovered via OIDC discovery
(`/.well-known/openid-configuration`, cached after first call).
Concurrent RPCs share a single refresh via `threading.Lock` (no
IdP stampede). Honors refresh-token rotation. Surfaces IdP
failures as `SandboxError` with the RFC 6749 error body included
for diagnostics.
Mirrors the Rust CLI's HTTP-policy posture from
`crates/openshell-cli/src/oidc_auth.rs`:
* `follow_redirects=False` so a 3xx during discovery can't
steer us to an attacker-controlled token endpoint.
* Discovery `issuer` is validated against the configured
issuer; a discovery document claiming a different issuer is
rejected, preventing the SDK from POSTing the refresh_token
to a malicious endpoint.
* `insecure: bool` flag plumbed through to httpx's
`verify=` so self-signed-cert deployments work the same way
they do in the Rust CLI.
Built on `httpx` (chosen over `urllib` specifically for
follow_redirects + verify control as kwargs). The OAuth2
refresh-token grant itself (RFC 6749 §6) is one form-encoded
POST — handled inline rather than via a dedicated OAuth library;
tried `authlib`'s `OAuth2Client` first but it auto-injects an
Authorization header on every request, which breaks the
unauthenticated discovery GET.
- `_make_cluster_bearer_provider(..., auto_refresh=True,
write_back=True, insecure=False)` factory. Defaults to the
refresher path with write-back enabled; `auto_refresh=False`
falls back to the read-only fail-closed behavior for callers that
don't want the SDK to make outbound HTTP calls to the IdP.
`write_back=True` is the default (changed from the first round of
review): IdPs with refresh-token rotation (Keycloak with
rotation, Entra in strict mode) invalidate the old refresh_token
on each refresh, so an in-memory-only refresh would leave the
on-disk bundle pointing at an invalidated value — any second
process starting from disk would `invalid_grant`. With write-back
enabled by default, the SDK keeps the shared cache consistent
with the IdP.
- `from_active_cluster` exposes `auto_refresh`, `write_back`, and
`insecure` kwargs (defaults: True / True / False). The
high-level `Sandbox` context manager surfaces the same three
kwargs and forwards them through, so callers using the wrapper
have parity with `SandboxClient` for OIDC-protected gateways.
- `SandboxClient.close()` chains to a `_bearer_close` hook so the
`_OidcRefresher`'s underlying `httpx.Client` is released
deterministically instead of leaking sockets/FDs until GC runs
`__del__`. Idempotent.
- `_OidcRefresher._write_to_disk` uses `tempfile.mkstemp` (PID +
random suffix) instead of a fixed `.oidc_token.json.tmp` path,
so two writers racing on the same gateway directory don't
trample each other's tmp content. Success path atomically
replaces; failure path unlinks the orphan.
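The temp-file-then-replace pattern described in the last bullet can be sketched as follows. This is a minimal illustration, not the SDK's `_write_to_disk`; the function name `write_bundle_atomically` and the temp-file prefix are hypothetical, and a POSIX filesystem is assumed.

```python
import json
import os
import tempfile


def write_bundle_atomically(path: str, bundle: dict) -> None:
    """Write `bundle` as JSON to `path` via a unique temp file and an atomic rename.

    Hypothetical helper for illustration; not the SDK's actual function.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 0600 and a unique random name, so
    # concurrent writers in the same directory never share a temp path.
    fd, tmp_path = tempfile.mkstemp(prefix=".oidc_token.", suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(bundle, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Failure path: remove the orphaned temp file, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the temp file lives in the destination directory, the final rename never crosses a filesystem boundary, which is what keeps the replace atomic.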
OAuth2 refresh policy and write-back semantics deliberately mirror
what the major Python SDKs do — see
github.com/googleapis/google-auth-library-python (`Credentials`)
and github.com/boto/botocore (`SSOTokenProvider`):
| Library                        | Native refresh | Writes back   |
|--------------------------------|----------------|---------------|
| google-auth Credentials        | yes            | no            |
| botocore SSOTokenProvider      | yes            | yes           |
| openshell SandboxClient (here) | yes (opt-out)  | yes (opt-out) |
OpenShell sits between the two; chose write-back-by-default because
the rotation invariant matters more for our deployments than the
"CLI is the only writer" assumption that fits google-auth.
Adds `httpx>=0.27` as a runtime dependency. No new OAuth2 library —
the refresh grant is a single POST.
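To illustrate why no OAuth2 library is needed, here is a sketch of the form-encoded body for the refresh-token grant (RFC 6749 §6). The helper name `build_refresh_request` is hypothetical, and including `client_id` in the body is an assumption that fits public clients; the SDK's actual request may differ.

```python
from urllib.parse import urlencode


def build_refresh_request(refresh_token: str, client_id: str):
    """Return (headers, body) for an RFC 6749 section 6 refresh-token grant.

    Hypothetical helper; `client_id` in the body is an assumption for
    public clients that do not authenticate with a secret.
    """
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    body = urlencode(
        {
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            "client_id": client_id,
        }
    ).encode("ascii")
    return headers, body
```

The returned pair can be handed to any HTTP client as the headers and content of a single POST to the discovered token endpoint.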
Tested:
- 42 sandbox_test.py tests pass (5 pre-existing + 37 new across
the bearer interceptor, fail-closed provider, refresher
behavior, TlsConfig validation, from_active_cluster auth ladder,
security-review regressions, Sandbox-wrapper kwarg forwarding,
and lifecycle / concurrency probes).
`mise run test:python` → 47 passed total across the python
suite.
- `mise run python:lint` (ruff) clean.
- End-to-end against a Keycloak-protected gateway on OpenShift:
* unauthenticated `Health` bypass works
* admin + `openshell:all` reaches user-callable methods
* reader (`sandbox:read`) denied on `CreateSandbox` by scope
* admin + `openshell:all` denied on PR #1596 sandbox-only
methods at the router (the new gate is honored from the SDK)
* full provider CRUD lifecycle via the SDK
* callable token provider rotates per RPC as expected
- Regression-probed against three pre-review security findings:
* **Discovery issuer validation** — a discovery document
claiming a different `issuer` than the configured one is
rejected with a clear `SandboxError` before any refresh POST
can reach the attacker-controlled endpoint.
* **Redirect during discovery** — `follow_redirects=False` on
the underlying httpx client means a 3xx during discovery
surfaces as a SandboxError rather than silently chasing the
redirect.
* **Cross-process rotation** — a two-process simulation shows
process B starting from disk and successfully refreshing
with the rotated refresh_token, because process A's
write-back updated the shared cache.
- Refresher unit tests cover: cached-fresh fast path, disk-rotated
re-read before refresh, OAuth2 exchange against the discovered
token endpoint, refresh-token rotation, atomic write-back at
0600 mode (default), default-on write_back proven by test,
concurrent N-thread coordination (one refresh shared across 8
threads), IdP failure surfaced with the error body, the
client_credentials / no-refresh_token error path, issuer-
mismatch rejection, redirect-during-discovery rejection,
insecure flag plumbing.
- Lifecycle / concurrency regression tests added: `close()`
invokes the `_bearer_close` hook (idempotent), the refresher's
`httpx.Client` is marked closed after `SandboxClient.close()`,
and 16 concurrent writers don't leave orphan tmp files behind
while producing a valid final bundle. The `Sandbox` wrapper has
direct forwarding tests proving `auto_refresh`, `write_back`,
and `insecure` reach `from_active_cluster` (both explicit
values and defaults).
- End-to-end against a real OpenShift + Keycloak cluster from
inside a pod: real OIDC discovery against
`keycloak.keycloak.svc.cluster.local:8080`, refresh-token grant
POST, atomic write-back of the rotated bundle at 0600, and a
follow-up RPC reusing the freshly-rotated in-memory token —
full round-trip in ~170ms.
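The discovery issuer check exercised by the regression probes above can be sketched like this. It is a minimal illustration under stated assumptions: `SandboxError` here is a stand-in class, `token_endpoint_from_discovery` is a hypothetical name, and exact string comparison of issuers is assumed.

```python
class SandboxError(Exception):
    """Stand-in for the SDK's error type (hypothetical definition)."""


def token_endpoint_from_discovery(configured_issuer: str, document: dict) -> str:
    """Return the token endpoint only if the document's issuer matches ours."""
    claimed = document.get("issuer")
    # Assumption: issuers are compared as exact strings, with no normalization.
    if claimed != configured_issuer:
        raise SandboxError(
            f"discovery issuer {claimed!r} does not match configured "
            f"issuer {configured_issuer!r}"
        )
    endpoint = document.get("token_endpoint")
    if not isinstance(endpoint, str) or not endpoint:
        raise SandboxError("discovery document has no token_endpoint")
    return endpoint
```

Rejecting the document before reading `token_endpoint` means a mismatched issuer can never influence where the refresh_token is POSTed.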
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* fix(python-sdk): adopt newer on-disk OIDC bundle before refreshing
_OidcRefresher.current_access_token() only adopted the on-disk
oidc_token.json when its access token was still fresh; otherwise it
refreshed using the in-memory bundle. With refresh-token rotation
enabled (Keycloak with rotation, Entra strict mode), this let a process
keep using an invalidated refresh_token:
1. Process A holds a stale in-memory bundle with refresh_token=r1.
2. Process B refreshes first and writes a rotated (r2) but now
near-expiry bundle to disk.
3. Process A re-reads disk, sees the access token is not fresh, ignores
the disk bundle, and POSTs the stale r1 — which the IdP has already
invalidated, yielding invalid_grant.
Fix: when the cached bundle is stale, adopt the on-disk bundle if it was
refreshed more recently than ours, even when its access token is also
stale. "More recently" is decided by expires_at — a refresh mints a new
access token with a forward expiry alongside the rotated refresh_token,
so the later expiry carries the newest refresh_token. Comparing by
expiry (rather than unconditionally preferring disk) preserves the
write_back=False case, where the in-memory bundle has already rotated
past the on-disk copy and must not be clobbered. When the adopted
bundle's issuer differs, the cached token endpoint is reset so the
refresh re-discovers against the new issuer.
Adds regression tests for the cross-process rotation race and the
issuer-change re-discovery path.
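The expiry-based adoption rule can be sketched as a small pure function. The name `adopt_newer_bundle` and the dict keys `expires_at` and `issuer` are assumptions chosen to mirror the description above, not the SDK's internal representation.

```python
from typing import Optional, Tuple


def adopt_newer_bundle(memory: dict, disk: Optional[dict]) -> Tuple[dict, bool]:
    """Pick the bundle to refresh from, plus whether discovery must be redone.

    Hypothetical helper: the later `expires_at` is taken to carry the newest
    refresh_token, as the commit message explains.
    """
    if disk is None or disk["expires_at"] <= memory["expires_at"]:
        # Covers write_back=False: memory has already rotated past disk.
        return memory, False
    # Disk was refreshed more recently; reset the cached token endpoint if
    # the issuer changed so the next refresh re-discovers.
    issuer_changed = disk.get("issuer") != memory.get("issuer")
    return disk, issuer_changed
```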
* fix(python-sdk): recover from invalid_grant on lost rotation race
The expiry-based disk re-read narrows but does not fully close the
cross-process refresh-token rotation race: two processes sharing a
gateway directory can both enter their refresh window, both POST their
copy of the refresh_token, and with rotation enabled the IdP invalidates
the loser's token (invalid_grant). Neither google-auth nor botocore
closes this window without an OS file lock; a Python-only flock would not
coordinate with the Rust CLI/TUI that also write oidc_token.json, so
locking is not worth its cost here.
Recover instead of prevent: distinguish an OAuth2 invalid_grant (the
refresh_token was rejected) from transport/5xx failures via a private
_InvalidGrantError, and on invalid_grant re-read oidc_token.json once. If
a peer wrote a different refresh_token (it won the race), adopt and retry
with it — returning early if it is already fresh — so the loser succeeds
transparently instead of forcing a re-authenticate. If disk offers no new
token, the rejection is genuine and surfaces the re-authenticate hint as
before. The retry is single-shot; a second invalid_grant propagates.
Adds tests for the peer-rotation recovery and the genuine-rejection
(no-retry) paths.
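The recover-instead-of-prevent flow can be sketched as follows. This is an illustration only: `InvalidGrantError` stands in for the private `_InvalidGrantError`, and `refresh_with_recovery`, `exchange`, `read_disk`, and `now` are hypothetical names for the injected pieces.

```python
class InvalidGrantError(Exception):
    """The IdP rejected the refresh_token (OAuth2 error "invalid_grant")."""


def refresh_with_recovery(bundle, exchange, read_disk, now):
    """Refresh once; on invalid_grant, re-read disk and retry at most once.

    `exchange(refresh_token)` returns a new bundle or raises InvalidGrantError.
    `read_disk()` returns the on-disk bundle or None.
    """
    try:
        return exchange(bundle["refresh_token"])
    except InvalidGrantError:
        peer = read_disk()
        if peer is None or peer["refresh_token"] == bundle["refresh_token"]:
            # Disk offers no new token: the rejection is genuine.
            raise
        if peer["expires_at"] > now:
            # The peer's access token is already fresh; no second POST needed.
            return peer
        # Single-shot retry: a second invalid_grant propagates to the caller.
        return exchange(peer["refresh_token"])
```

Transport and 5xx failures are deliberately not caught here, so only a rejected refresh_token triggers the disk re-read.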
---------
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
* fix(helm): vendor chart dependencies before release packaging (#1627)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* fix(driver-podman): bind gateway to 0.0.0.0 in rootless mode (#1623)
Rootless Podman sandbox containers reach the host through pasta's local
connection bypass, which translates L2 frames to L4 host sockets. The
dev gateway script binds to 127.0.0.1 by default, which is not routable
through pasta. Auto-detect rootless mode and bind to 0.0.0.0 so sandbox
containers can connect to the gateway.
- Auto-detect rootless Podman in gateway.sh and export
OPENSHELL_BIND_ADDRESS=0.0.0.0 when not explicitly set
- Add e2e:podman:rootless mise task and CI matrix entry to validate
rootless Podman networking end-to-end
- CI creates a non-root user inside the privileged container to trigger
Podman's rootless code paths (pasta, user namespace isolation)
Signed-off-by: Naveen Malik <nmalik@redhat.com>
* docs(providers): note that ANTHROPIC_API_KEY requires an API account, not a subscription (#1542)
* docs(providers): note that ANTHROPIC_API_KEY requires an API account, not a subscription
Anthropic subscription users authenticate via OAuth, not an API key,
causing a silent failure when creating the provider. Adds a Note callout
in the provider type table and quickstart guide directing subscription
users to generate an API key from console.anthropic.com.
Closes #620
* docs(providers): fix Note placement and remove subscription brand names
Move the Note callout in manage-providers.mdx to after the complete
provider type table so it does not break table rendering. Remove
subscription brand names from both Note callouts.
* fix(podman): avoid host-gateway on macOS machines (#1637)
Closes #1307
Default the Podman host gateway alias override to gvproxy's host-loopback IP on macOS while preserving host-gateway resolution on Linux. Wire the setting through Podman config, gateway TOML inheritance, and the standalone driver, and document the platform behavior.
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* chore(vm): generalize crate for multi-device PCIe passthrough (#1573)
* generalize crate for multi-device PCIe passthrough
Signed-off-by: Patrick Riel <priel@nvidia.com>
* add adopt APIs so that devices already bound to vfio-pci can be adopted during restart reconciliation, without rebinding or mutating sysfs.
Signed-off-by: Patrick Riel <priel@nvidia.com>
* refactor(vfio): generalize GPU passthrough sysfs handling
Signed-off-by: Patrick Riel <priel@nvidia.com>
* fix(vfio): centralize vfio ID refcounting
Signed-off-by: Evan Lezar <elezar@nvidia.com>
---------
Signed-off-by: Patrick Riel <priel@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
* fix(sandbox): trust exact declared private endpoints (#1560)
* fix(sandbox): trust exact declared private endpoints
* fix(sandbox): preserve advisor endpoint provenance
* fix(sandbox): repair advisor provenance lint failures
---------
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* feat(policy): add agentic approval loop (#1528)
* fix(e2e): clean up temp files in sandbox-runner on exit (#1647)
* ci(kubernetes): add HA e2e workflow (#1598)
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* ci(release): use bundled Z3 for macOS gateway build (#1658)
* fix(gateway): align package TLS bootstrap path (#1601)
* fix(gateway): align package TLS bootstrap path
Closes #1593
Default package-managed gateway services to a stable local TLS directory and use that same value for certificate generation and runtime startup.
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* test(packaging): validate package asset paths exist
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* ci(e2e): pin mise in kubernetes job
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
---------
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* feat(tui): add PageUp/PageDown scrolling to all panes (#1656)
Add PageUp/PageDown key support to the policy, logs, and draft/rules
views. All three panes now scroll by one viewport height per keypress.
Also fix scroll_policy() clamping to stop at the last viewport of
content instead of the last line, preventing a blank-screen overshoot
on G and PageDown.
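The clamping fix can be illustrated with a short sketch. The real `scroll_policy()` lives in the Rust TUI and its signature is not shown here, so the Python below, including the names `clamp_scroll` and `page_down`, is a hypothetical rendering of the idea only.

```python
def clamp_scroll(offset: int, total_lines: int, viewport_height: int) -> int:
    """Clamp a scroll offset so the last viewport of content stays on screen."""
    # Stop at the last full viewport, not at the last line, so jumping to
    # the end never shows a mostly blank screen.
    max_offset = max(0, total_lines - viewport_height)
    return max(0, min(offset, max_offset))


def page_down(offset: int, total_lines: int, viewport_height: int) -> int:
    """Scroll by one viewport height, clamped to the valid range."""
    return clamp_scroll(offset + viewport_height, total_lines, viewport_height)
```

With 100 lines and a 20-line viewport, the largest valid offset is 80; clamping to 99 instead would leave 19 blank rows, which is the overshoot the commit describes.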
Signed-off-by: Major Hayden <major@redhat.com>
* feat(telemetry): add anonymous opt-out OpenShell usage telemetry (#1433)
* feat(telemetry): add anonymous opt-out usage telemetry
Signed-off-by: Kirit93 <kthadaka@nvidia.com>
* Removed enums from schema
Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>
* Updated telemetry URL
Signed-off-by: Kirit93 <kthadaka@nvidia.com>
* ci(kubernetes): pin mise installer for e2e
---------
Signed-off-by: Kirit93 <kthadaka@nvidia.com>
Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
* ci(release): gate helm/oci artifact publishing on release (#1662)
release-helm and tag-ghcr-release now depend on the release job.
This is to prevent a GHCR image or helm chart from being published when some
other aspect of the release fails.
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* ci(kubernetes): stabilize HA e2e setup (#1659)
* ci(kubernetes): pin mise in e2e workflow
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* ci(kubernetes): mirror postgres image for ha e2e
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* ci(kubernetes): reuse e2e workflow for ha
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
---------
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
* fix(gateway): place supervisor_image under podman driver TOML table (#1661)
The gateway.sh script appended supervisor_image after the
[openshell.gateway.gateway_jwt] table header, so TOML parsed it as a
gateway_jwt field. The Podman driver never saw the override and fell
back to the default ghcr.io/nvidia/openshell/supervisor:latest.
Move supervisor_image into [openshell.drivers.podman] where the driver
config deserializer expects it.
* refactor: deduplicate shared utilities across driver crates (#1660)
Move three duplicated definitions into openshell-core so every
consumer has a single canonical source:
- format_bytes: identical 14-line function existed in docker,
kubernetes, and vm drivers. Moved to openshell-core::progress
where all three already imported from.
- DEFAULT_SANDBOX_PIDS_LIMIT: i64 constant (2048) duplicated in
docker driver and podman config. Moved to openshell-core::config
alongside other shared defaults. Podman re-exports it for
internal call-site compatibility.
- current_time_ms: secrets.rs in openshell-sandbox reimplemented
the same logic as openshell-core::time::now_ms(). Remove the
local copy and call now_ms() directly via the existing dep.
* fix(config): reject unknown fields in nested gateway config tables (#1666)
* fix(config): reject unknown fields in nested gateway config tables
The gateway TOML loader silently ignored keys placed under the wrong
table header. PR #1661 fixed one instance of this (supervisor_image
under [openshell.gateway.gateway_jwt]) but the root cause remained: the
nested gateway config tables did not deny unknown fields, so a misplaced
key was accepted and dropped instead of erroring.
Concretely, tasks/scripts/gateway.sh emitted `sandbox_namespace` right
after the [openshell.gateway.gateway_jwt] heredoc, so it landed inside
the gateway_jwt table rather than [openshell.gateway]. The k8s driver
already receives the namespace via [openshell.drivers.kubernetes], so
the stray line was dead config that parsed without complaint.
Changes:
- Add #[serde(deny_unknown_fields)] to the nested gateway config tables
that are part of the config-file parse tree: TlsConfig, OidcConfig,
MtlsAuthConfig, GatewayAuthConfig, GatewayJwtConfig.
- Remove the misplaced sandbox_namespace line from gateway.sh.
- Drop the unused Serialize/Deserialize derives from Config and
ServiceRoutingConfig (see below).
- Add a regression test asserting a key under the wrong nested table is
rejected.
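The effect of `#[serde(deny_unknown_fields)]` can be approximated in Python to show the behavior change. This is an analogy, not the gateway's Rust code: the field `ttl_seconds` and the helper `load_strict` are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class GatewayJwtConfig:
    # Hypothetical field; the real struct's fields are not listed here.
    ttl_seconds: int = 3600


def load_strict(cls, table: dict):
    """Build `cls` from a parsed table, rejecting any key it does not declare."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(table) - known)
    if unknown:
        raise ValueError(f"unknown field(s) in {cls.__name__}: {', '.join(unknown)}")
    return cls(**table)
```

Under this rule a stray `sandbox_namespace` key inside the gateway_jwt table fails loudly at load time instead of being silently dropped.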
* feat(kubernetes): support sandbox image pull secrets (#1671)
* refactor(driver): trim compute capability response (#1402)
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* feat(providers): add Google Vertex AI inference provider (#1568)
* feat(providers): add Google Vertex AI provider
Adds Vertex AI provider profiles, routing, credential refresh plumbing, CLI support, docs, and regression coverage. Keeps the related NETLINK_ROUTE seccomp allowance needed by Vertex client tooling that calls getifaddrs.
* docs: add Vertex AI sandbox usage for Claude Code and OpenCode
Cover the full end-to-end setup for running Claude Code and OpenCode
inside an OpenShell sandbox via inference.local with a Vertex AI backend:
- google-vertex-ai.mdx: add 'Use from a Sandbox' section with tabbed
examples for Claude Code (--bare flag, no /v1 suffix) and OpenCode
(/v1 suffix required). Add providers_v2_enabled prerequisite and
--no-verify note for global region. Document policy proposals table
covering metadata.google.internal (always blocked), downloads.claude.ai,
and storage.googleapis.com.
- inference-routing.mdx: expand 'Use the Local Endpoint' section with
tabbed examples for Claude Code, OpenCode, Python OpenAI SDK, and
Python Anthropic SDK. Add notes explaining the /v1 path suffix
difference between clients.
- supported-agents.mdx: update Claude Code and OpenCode rows to mention
inference.local support and correct base URL requirements.
* fix: address vertex review findings
* test(sandbox): retry on spurious Ok in fork-exec ambiguity test
On arm64 under heavy CI load, the /proc fd scan in
find_socket_inode_owners can transiently miss the parent process's
socket fd entry, returning only the child as an owner. This causes
resolve_process_identity to return Ok (single owner, no ambiguity
check fires) instead of the expected ambiguous-ownership Err.
Extend the retry loop to also handle unexpected Ok results, mirroring
the existing retry for transient Err results. 10 retries at 50ms gives
a 500ms settling window, which is sufficient for procfs to stabilize
on loaded arm64 runners.
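The retry shape described above, 10 retries at 50ms for a 500ms settling window, can be sketched generically. The real test is Rust; `retry_until` and its parameters are hypothetical names used for illustration.

```python
import time


def retry_until(probe, is_expected, retries=10, delay_s=0.05, sleep=time.sleep):
    """Call `probe` until `is_expected(result)` holds or retries run out.

    Any unexpected result is retried, whether it is a transient error or a
    spurious success. Worst case sleeps retries * delay_s (500ms by default).
    """
    result = probe()
    for _ in range(retries):
        if is_expected(result):
            break
        sleep(delay_s)
        result = probe()
    return result
```

The caller still asserts on the returned value, so a result that never settles fails the test rather than being masked.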
* fix: address vertex review regressions
* docs(router): clarify stream_response semantics for Vertex rawPredict routing
Document the three call sites of prepare_backend_request and their
stream_response values in a caller table:
- send_backend_request: false → :rawPredict (unary endpoint)
- send_backend_request_streaming: true → :streamRawPredict
- verify_backend_endpoint: explicitly false to probe the unary endpoint
Cross-reference the table from build_provider_url and
is_vertex_anthropic_rawpredict_route so the stream_response=true guard
in the suffix upgrade branch is understood in full context.
Also note that is_vertex_anthropic_rawpredict_route is a structural
predicate (model_in_path + anthropic_messages + :rawPredict suffix),
not a named-provider check, so any future provider with the same route
shape inherits the transforms automatically.
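The suffix upgrade guarded by `stream_response` can be sketched as below. The router itself is Rust; `upgrade_vertex_suffix` is a hypothetical name and the URL in the usage note is a placeholder, with only the `:rawPredict` and `:streamRawPredict` suffixes taken from the commit message.

```python
RAW_PREDICT = ":rawPredict"
STREAM_RAW_PREDICT = ":streamRawPredict"


def upgrade_vertex_suffix(url: str, stream_response: bool) -> str:
    """Switch a :rawPredict route to :streamRawPredict for streaming calls only."""
    if stream_response and url.endswith(RAW_PREDICT):
        return url[: -len(RAW_PREDICT)] + STREAM_RAW_PREDICT
    # Unary requests and endpoint verification keep the unary suffix.
    return url
```

For example, `upgrade_vertex_suffix("https://example.invalid/models/m:rawPredict", True)` yields the streaming route, while passing `False` (as the verification probe does) leaves the URL unchanged.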
* fix: correct example paths in local-inference README (#1676)
* fix: correct example paths in local-inference README
* fix: correct example paths in local-inference routes.yaml
* ci(release): bring Fedora RPM canary to parity (#1688)
The Fedora RPM canary needs to exercise the install.sh user-service path, but
a GitHub Actions job container does not boot with systemd as PID 1. This means
the Fedora RPM canary was incomplete compared to the others.
With this change, we run Fedora as a nested privileged systemd container
instead, wait for systemd to become reachable, then start the root user manager
so systemctl --user works for the RPM gateway unit, achieving parity with the
other canary tests.
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* fix: update RFC link in agent-driven-policy-management README (#1677)
* feat(providers): add profile-backed policy visibility (#1640)
* chore: wip providers v2 tui and codex profile
* chore: wip effective policy get and codex profile
* chore: wip provider profiles and tui detail views
* feat(tui): annotate policy proposal review status
* ci(release): fix Ubuntu Snap canary install and registration (#1699)
Install the Snap built by the triggering Release Dev workflow by setting
merge-multiple: true on the artifact download. actions/download-artifact
otherwise extracts each artifact into its own subdirectory, leaving the
package at release/snap-linux-amd64/*.snap, so the install glob
./release/*.snap matched nothing. Merging flattens the artifact's contents
directly into release/ where the dangerous local snap install expects it.
Harden the Snap canary setup by enabling snapd.socket, waiting for snap
seeding (snap wait system seed.loaded), and running every step with strict
shell options (set -euo pipefail) so failures surface immediately.
Register the snapped gateway with the CLI as the documented local plaintext
snap-docker gateway, and print version and snap services, before running
openshell status so the canary verifies a configured and reachable gateway
instead of only the install.
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* feat(snap): add openshell.term desktop app (#1693)
Add a desktop launcher for the OpenShell TUI so users can launch
"openshell term" from their desktop environment application menu.
The change adds three files:
- snap/local/term.desktop: desktop entry file for the application launcher
- snap/local/icon.png: application icon (copied from snap store data)
- snapcraft.yaml: new "term" app entry that runs "openshell term"
with home, network, ssh-keys, and system-observe plugs, plus install
rules to stage the desktop file and icon under meta/gui/
The desktop file references the icon via ${SNAP} which is resolved
at runtime to the snap installation directory. The term app reuses
the same connection plugs as the main openshell app.
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
* fix(sandbox): restore GPU procfs baseline (#1522)
Signed-off-by: Evan Lezar <elezar@nvidia.com>
* fix(gateway): try harder to detect Podman (#1536)
Auto-detection previously treated Podman as available only when the podman CLI
was visible on PATH. However, package manager services can run with a
restricted PATH, which lets Docker be selected even when a Podman API socket is
reachable. Additionally, podman may symlink /var/run/docker.sock to podman's
machine unix socket, which would be incorrectly detected as Docker. Worse
still: the podman machine may not even be running.
This replaces the Podman binary check with a functional HTTP probe against the
standard Podman socket paths. The probe requires /_ping to answer with a
Libpod-Api-Version header before treating the socket as Podman, which lets the
gateway select the embedded Podman driver only when the API is usable.
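A functional probe of this kind can be sketched with the standard library. The gateway's probe is Rust; the class and function names below are hypothetical, and the socket path passed in is whatever candidate the caller wants to test.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket (hypothetical helper)."""

    def __init__(self, socket_path: str, timeout: float = 2.0):
        super().__init__("localhost", timeout=timeout)
        self._socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._socket_path)
        self.sock = sock


def has_libpod_header(headers) -> bool:
    """True if the response headers include Libpod-Api-Version (any case)."""
    return any(name.lower() == "libpod-api-version" for name in headers)


def probe_podman_socket(socket_path: str) -> bool:
    """GET /_ping on the socket; treat it as Podman only if the header is present."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        response = conn.getresponse()
        return has_libpod_header(dict(response.getheaders()))
    except (OSError, http.client.HTTPException):
        # Missing socket, stopped machine, or a non-HTTP peer: not usable.
        return False
    finally:
        conn.close()
```

A Docker daemon answering `/_ping` on the same path would lack the Libpod header, so a symlinked `docker.sock` is classified by what actually responds rather than by its filename.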
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* chore(mise): refresh tool lockfile (#1712)
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* ci(release): authenticate snap canary artifact download (#1711)
The Ubuntu Snap canary downloads its artifact from a different workflow run
(the triggering Release Dev run) via run-id. Cross-run downloads require
authentication, so pass github.token to actions/download-artifact.
Signed-off-by: Kris Hicks <khicks@nvidia.com>
* docs(container-gateway): fix Docker driver setup for containerized gateway (#1419)
The existing docs omitted or misstated several requirements when running
the gateway as a container with the Docker compute driver:
- OPENSHELL_GRPC_ENDPOINT is required; the Docker driver uses only the
scheme (http/https) — host and port are substituted automatically with
host.openshell.internal and the gateway's own bind port
- Supervisor binary must be extracted to a host path before starting the
gateway; bind-mount sources are resolved by the host Docker daemon so
the path must be identical inside and outside the gateway container
- Docker socket access requires adding the docker group (UID 1000 default)
- Port binding should remain 127.0.0.1; Docker driver adds a bridge
listener automatically
- add --server-san host.openshell.internal to generate-certs for mTLS
- Complete the mTLS docker run with all Docker driver requirements
- Add deploy/docker/gateway.toml — TOML config for the Docker driver
- Add deploy/docker/docker-compose.yml referencing the TOML
- Add docs/get-started/tutorials/docker-compose.mdx tutorial page
- Remote gateway registration instructions (--remote flag)
Address reviewer feedback:
- Move Docker Compose tutorials card to the bottom of the list
- Replace inline YAML snippet in Docker Compose section with a reference
to deploy/docker/ to avoid drift
- Clarify OPENSHELL_DB_URL is safe in compose.yml (plain SQLite path,
no credentials); the TOML block targets credential-bearing DSNs
- Note that ./ in source: resolves relative to the compose file directory
- Clarify that only the scheme from OPENSHELL_GRPC_ENDPOINT matters
- Add note that the tilde volume mount resolves to the same absolute
path on both host and container
* refactor(server): deduplicate test helpers and grpc utilities (#1708)
Remove three groups of copy-pasted code in openshell-server:
1. grpc/mod.rs had a private current_time_ms() wrapper identical to the
one already exported from persistence/mod.rs. Remove the duplicate
and update the three grpc sub-modules (policy, sandbox, service) to
import directly from crate::persistence.
2. test_store() was repeated verbatim in seven #[cfg(test)] blocks.
Promote a single canonical version to persistence/mod.rs (cfg-gated)
and replace all copies with crate::persistence::test_store() calls or
a thin Arc wrapper in supervisor_session.
3. grpc_client_mtls() and build_tls_root() were copy-pasted across
edge_tunnel_auth.rs and multiplex_tls_integration.rs. Move both into
the existing tests/common/mod.rs shared module and import from there.
* fix(gateway): allow local sandbox jwt to not expire (#1721)
* fix(helm): create sandbox JWT secret when cert-manager is enabled (#1700)
* fix(helm): create sandbox JWT secret under cert-manager
The cert-manager install path (certManager.enabled=true,
pkiInitJob.enabled=false) left the gateway StatefulSet unable to start
because nothing created the openshell-jwt-keys Secret: cert-manager owns
TLS Secrets but does not mint the sandbox JWT signing key, and the
certgen hook only rendered when pkiInitJob.enabled was true.
Separate JWT signing-key provisioning from TLS PKI provisioning:
- certgen: add a --jwt-only mode that creates only the Opaque JWT
signing Secret, for use when another controller owns TLS Secrets.
- certgen.yaml: render the hook when pkiInitJob.enabled OR
certManager.enabled is true. cert-manager takes precedence and runs
the hook with --jwt-only even if pkiInitJob.enabled remains true.
Remove the mutual-exclusion failure between the two values.
- _helpers.tpl: add openshell.sandboxJwtSecretName, shared by the hook
and the StatefulSet mount.
- Update values, README, docs, architecture, and the
debug-openshell-cluster skill to reflect the new precedence; the
documented cert-manager install no longer needs pkiInitJob.enabled=false.
Closes #1691
* fix(helm): honor cert-manager precedence for client CA volume
The client CA volume logic treated pkiInitJob.enabled as proof that
built-in PKI owns the client CA. With cert-manager precedence now
allowing certManager.enabled=true alongside the default
pkiInitJob.enabled=true, that assumption mounts the server TLS cert
secret as the client CA and ignores
certManager.clientCaFromServerTlsSecret=false, which can break mTLS or
trust the wrong CA.
Gate the pkiInitJob.enabled term with (not certManager.enabled) in all
three client CA conditions (volume mount, volume definition, and secret
selection) so cert-manager owns TLS when enabled. Add a Helm test suite
covering built-in PKI, cert-manager shared CA, the regression config
(cert-manager + clientCaFromServerTlsSecret=false + default pkiInitJob),
and the no-client-CA case.
* feat(k8s-driver): add default_runtime_class_name config for sandbox pods (#1729)
Allow operators to configure a default Kubernetes runtimeClassName that
is applied to sandbox pods when the CreateSandbox request does not
specify one. This avoids requiring every API caller to explicitly set the
runtime class for clusters that always need a specific RuntimeClass
(e.g. kata-containers, nvidia).
The fallback is applied in the Kubernetes driver only — per-request
values still take priority, and an empty default (the built-in) preserves
existing behavior (field omitted, cluster default applies).
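The precedence described above reduces to a small fallback function. The driver is Rust; `effective_runtime_class` is a hypothetical name, and `None` here stands for leaving `runtimeClassName` out of the pod spec.

```python
from typing import Optional


def effective_runtime_class(requested: Optional[str], default: str = "") -> Optional[str]:
    """Resolve the runtimeClassName for a sandbox pod.

    Per-request value wins; otherwise the operator default applies; an empty
    default means the field is omitted and the cluster default takes over.
    """
    if requested:
        return requested
    if default:
        return default
    return None
```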
* docs: add Hermes Agent to supported agents (#1735)
* fix(cli): roll back gateway registration when auth fails during gateway add (#1538)
* refactor: deduplicate shared driver and TUI helpers (#1741)
* feat(cli): support multiple --upload flags on sandbox create (#1635) (#1645)
Closes #1635
Signed-off-by: Philippe Martin <phmartin@redhat.com>
* updates for new containers
---------
Signed-off-by: Derek Carr <decarr@redhat.com>
Signed-off-by: Florent Benoit <fbenoit@redhat.com>
Signed-off-by: Piotr Mlocek <pmlocek@nvidia.com>
Signed-off-by: Colin Walters <walters@verbum.org>
Signed-off-by: Adam Miller <admiller@redhat.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
Signed-off-by: Ann Marie Fred <afred@redhat.com>
Signed-off-by: Kris Hicks <khicks@nvidia.com>
Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
Signed-off-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Drew Newberry <anewberry@nvidia.com>
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Signed-off-by: Calum Murray <cmurray@redhat.com>
Signed-off-by: Naveen Malik <nmalik@redhat.com>
Signed-off-by: Patrick Riel <priel@nvidia.com>
Signed-off-by: Major Hayden <major@redhat.com>
Signed-off-by: Kirit93 <kthadaka@nvidia.com>
Signed-off-by: Kirit Thadaka <kthadaka@nvidia.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Philippe Martin <phmartin@redhat.com>
Co-authored-by: Mesut Oezdil <114185853+mesutoezdil@users.noreply.github.com>
Co-authored-by: Drew Newberry <anewberry@nvidia.com>
Co-authored-by: Taylor Mutch <taylormutch@gmail.com>
Co-authored-by: Seth Jennings <sjenning@redhat.com>
Co-authored-by: Florent BENOIT <fbenoit@redhat.com>
Co-authored-by: Eric Curtin <eric.curtin@docker.com>
Co-authored-by: Derek Carr <decarr@redhat.com>
Co-authored-by: mjamiv <142179942+mjamiv@users.noreply.github.com>
Co-authored-by: John Myers <9696606+johntmyers@users.noreply.github.com>
Co-authored-by: Piotr Mlocek <pmlocek@nvidia.com>
Co-authored-by: Russell Bryant <russell.bryant@gmail.com>
Co-authored-by: Colin Walters <walters@verbum.org>
Co-authored-by: Adam Miller <admiller@redhat.com>
Co-authored-by: Taylor Mutch <tmutch@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Adel Zaalouk <azaalouk@redhat.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ann Marie Fred <afred@redhat.com>
Co-authored-by: krishicks <kris@krishicks.com>
Co-authored-by: Vegard Stikbakke <vegard.stikbakke@gmail.com>
Co-authored-by: krishicks <khicks@nvidia.com>
Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Co-authored-by: alangou <alangou@nvidia.com>
Co-authored-by: Mrunal Patel <mrunalp@gmail.com>
Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Co-authored-by: Calum Murray <cmurray@redhat.com>
Co-authored-by: Saurabh Agarwal <sauagarw@redhat.com>
Co-authored-by: Simon Scatton <44714756+SDAChess@users.noreply.github.com>
Co-authored-by: Naveen Malik <nmalik@redhat.com>
Co-authored-by: Patrick Riel <71560045+cheese-head@users.noreply.github.com>
Co-authored-by: Alexander Watson <zredlined@users.noreply.github.com>
Co-authored-by: Major Hayden <major@mhtx.net>
Co-authored-by: Kirit Thadaka <kirit.thadaka@gmail.com>
Co-authored-by: Jesse Jaggars <jhjaggars@gmail.com>
Co-authored-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Co-authored-by: shannonsands <shannon.sands.1979@gmail.com>
Co-authored-by: Philippe Martin <feloy1@gmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com> (cherry picked from commit 5102cb9) Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ork` subcrates. (#1650) * refactor(sandbox): extract run_networking from run_sandbox Lifts TLS state generation, network namespace setup, proxy startup, bypass monitor spawn, and SSH-side proxy URL / netns FD computation out of run_sandbox into a sibling async fn `run_networking` that returns a Networking struct. The identity cache moves with it (only consumed by the proxy). Entrypoint PID allocation moves just above the call site so it can be passed in. No behavior changes — same OCSF emits, same async order, same RAII lifetimes for the proxy and bypass-monitor handles, now held by the returned Networking value in run_sandbox's frame. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * refactor(sandbox): extract run_process and lift netns to run_sandbox Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH server, supervisor session, process spawn, OPA probe, policy poll loop, denial aggregator, wait/exit) into a sibling async fn `run_process`. Also moves network namespace creation out of `run_networking` into a new `create_netns_for_proxy` helper invoked from `run_sandbox`, so `run_networking` is purely the proxy component (OPA evaluation, TLS interception, credential injection, inference routing, gRPC control API). The netns is then borrowed into both `run_networking` and `run_process`. No behavior change. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com> * chore(workspace): scaffold openshell-supervisor-networking and openshell-supervisor-process crates Add empty placeholder crates that subsequent commits will populate as the sandbox decomposition proceeds. Both crates compile clean as part of the workspace and are picked up automatically by the existing `members = ["crates/*"]` glob. 
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift DenialEvent to openshell-core
The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side) and the bypass monitor (process-side), and crosses the run_networking -> run_process API boundary. Move it to openshell-core so the eventual supervisor-networking and supervisor-process crates can both reference it without depending on each other.
DenialAggregator and the channel/flush helpers stay in openshell-sandbox for now. A thin `pub use openshell_core::DenialEvent;` re-export from denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent` call site resolving without further edits.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift normalize_path to openshell-core
Move the lexical path-normalization helper from openshell-policy to openshell-core::paths so it can be reached from crates that sit below openshell-policy in the dependency graph. openshell-policy keeps its existing public API via a `pub use` re-export, so all current call sites (e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged.
This is a prerequisite for lifting openshell-sandbox/src/policy.rs into openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls normalize_path, and lifting it as-is would cycle through openshell-policy.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift SandboxPolicy and friends to openshell-core
Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy, ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode, LandlockCompatibility, plus their Proto* TryFrom/From impls) to openshell-core/src/policy.rs.
Both prospective supervisor leaves (networking and process) dispatch on SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it without depending on the other (or on the future orchestrator).
The From<ProtoFilesystemPolicy> impl now calls the in-crate openshell_core::paths::normalize_path lifted in the previous commit, which is what made this move cycle-free.
Update all crate::policy::* call sites in openshell-sandbox to openshell_core::policy::*.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move child_env from openshell-sandbox
child_env (proxy_env_vars, tls_env_vars) is process-side only, consumed by process.rs and ssh.rs. With the orchestrator staying in openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf crates, so process-only modules can land in openshell-supervisor-process directly.
Add openshell-supervisor-process as a path dependency of openshell-sandbox. Update process.rs and ssh.rs to import from openshell_supervisor_process::child_env.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move skills from openshell-sandbox
Move the static skills installer (and its embedded resource directory) out of openshell-sandbox into openshell-supervisor-process. The module is process-side only (invoked once during sandbox start to drop agent skill files into the workspace) and has no cross-leaf consumers.
Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-process.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move mechanistic_mapper from openshell-sandbox
Move the mechanistic mapper (HTTP method/path → operation classifier that derives policy proposals from connection summaries) out of openshell-sandbox into openshell-supervisor-networking. Single internal caller (run_policy_poll_loop in lib.rs) and only depends on openshell-core + tracing, so there is no cross-leaf entanglement.
First population of the openshell-supervisor-networking crate; adds openshell-core and tracing as dependencies.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift procfs to openshell-core
Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner resolution, file SHA256 hashing) from openshell-sandbox into openshell-core. The module is consumed cross-leaf (by bypass_monitor on the process side and by identity / proxy on the networking side), so it has to sit below both leaves.
Adds tracing, sha2, and hex as dependencies on openshell-core. Updates the three call sites in openshell-sandbox to import from openshell_core::procfs.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move identity from openshell-sandbox
Move BinaryIdentityCache (path → SHA256 cache used to identify the process behind an outbound connection) from openshell-sandbox into openshell-supervisor-networking. The cache is consumed only by the networking-side proxy and the orchestrator; with procfs already in openshell-core there are no remaining cross-leaf dependencies.
Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-networking. Adds a Default impl for BinaryIdentityCache to satisfy clippy::new_without_default now that the type is publicly exposed across crates.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move agent-proposals flag from openshell-sandbox
Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the test-only ProposalsFlagGuard out of openshell-sandbox into openshell-supervisor-process::proposals. The flag is read only by the process-side policy_local route handler and the orchestrator; lifting it to openshell-core would have made core carry sandbox-owned runtime state without buying anything.
The test-only ProposalsFlagGuard is still consumed from networking-side l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands).
Expose it via a new optional `test-helpers` feature on openshell-supervisor-process so test crates opt in explicitly without pulling tokio sync primitives into production builds.
openshell-sandbox keeps its existing crate-private path (`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports so call sites and tests are unchanged.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift secrets to openshell-core
Move crates/openshell-sandbox/src/secrets.rs to crates/openshell-core/src/secrets.rs so both supervisor leaves can reach SecretResolver and the placeholder helpers without depending on openshell-sandbox.
Add base64 to openshell-core deps (only stdlib + base64 are used). Promote previously pub(crate) constructors and methods on SecretResolver to pub since cross-crate callers (provider_credentials, proxy/L7 tests) now name them across the crate boundary.
Update import paths in proxy.rs, l7/{rest,relay,websocket}.rs, and provider_credentials.rs from crate::secrets to openshell_core::secrets.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift provider_credentials to openshell-core
Move crates/openshell-sandbox/src/provider_credentials.rs to crates/openshell-core/src/provider_credentials.rs. Both supervisor leaves now name ProviderCredentialState in their function signatures (run_networking takes &ProviderCredentialState, run_process takes ProviderCredentialState by value), and under Shape A leaves can't depend on openshell-sandbox, so the type must live in openshell-core.
The orchestrator (run_sandbox in openshell-sandbox) remains the only writer: it constructs ProviderCredentialState::from_environment and the policy poll loop calls install_environment on credential rotation. Both leaves stay pure readers via snapshot()/resolver().
Update import paths in proxy.rs, ssh.rs, and lib.rs from crate::provider_credentials to openshell_core::provider_credentials.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* style: rustfmt import ordering
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(ocsf): move SandboxContext singleton from openshell-sandbox
Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback + getter from openshell-sandbox/src/lib.rs into a new openshell-ocsf::ctx module. The type already lives in openshell-ocsf, so its singleton lives next to it.
Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The orchestrator (run_sandbox) now calls set_ctx during startup.
Sandbox keeps a `pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx;` re-export so the 138 existing crate::ocsf_ctx() call sites resolve unchanged. When the sandbox modules themselves migrate into the leaf crates, they'll import openshell_ocsf::ctx directly and the re-export goes away.
Under Shape A neither leaf can depend on openshell-sandbox; both already depend on openshell-ocsf to construct events, so this adds no new dep edge.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift grpc_client to openshell-core
Both prospective leaves (supervisor-networking and supervisor-process) need CachedOpenShellClient, AuthedChannel, and the connect/fetch helpers. Under Shape A the leaves cannot depend on openshell-sandbox, so the type has to live below them.
openshell-core already pulls in tonic and miette; this enables tonic's channel/tls features and adds tokio as a direct dep.
Updates all crate::grpc_client::* call sites in openshell-sandbox to openshell_core::grpc_client::*. No re-export shim: the call-site count was small enough to update directly.
See architecture/plans/sandbox-split-design-choices.md for the full rationale and trade-offs.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move denial_aggregator from openshell-sandbox
DenialAggregator and FlushableDenialSummary belong with the proxy and L7 layer that emit denials.
Moves the file into openshell-supervisor-networking; adds tokio as a regular dep there since DenialAggregator uses tokio::sync::mpsc.
Drops the `pub use openshell_core::DenialEvent` re-export inside the moved file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs, and lib.rs to import openshell_core::DenialEvent directly.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move log_push from openshell-sandbox
LogPushLayer is a process-side tracing layer that streams sandbox logs to the gateway via gRPC. Moves into openshell-supervisor-process; adds openshell-core, openshell-ocsf, tokio-stream, tracing, and tracing-subscriber as direct deps there.
Updates the only external call site (openshell-sandbox/src/main.rs) to import from openshell_supervisor_process::log_push.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move bypass_monitor from openshell-sandbox
bypass_monitor reads /dev/kmsg for nftables drop log lines and emits denial events. Pure process-side concern, called only from run_networking which spawns it on the netns. Moves into openshell-supervisor-process; all deps (openshell-core, openshell-ocsf, tokio, tracing) were already declared there.
Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx(); this is the first leaf-side caller to import the OCSF context singleton directly instead of going through openshell-sandbox's re-export.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move debug_rpc from openshell-sandbox
debug_rpc is the CLI subcommand handler that exercises authenticated gRPC calls (issue-token, refresh-token, get-config, etc.). Pure process-side concern, called only from openshell-sandbox/main.rs.
Adds base64, hex, serde_json, sha2, and tonic (with channel/tls features) as direct deps on openshell-supervisor-process. Updates the single call site in main.rs.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move supervisor_session from openshell-sandbox
supervisor_session opens a bidirectional gRPC stream that lets the gateway initiate shells inside the sandbox. Pure process-side concern, called only from run_process.
Adds uuid as a direct dep on openshell-supervisor-process. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() (same pattern as bypass_monitor).
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): lift managed_children tracker from openshell-sandbox
The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children (entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can distinguish them from incidental zombies. Pure process-side concern, moves to openshell_supervisor_process::managed_children with three public fns: register, unregister, is_managed.
Updates lib.rs reaper, process.rs, and ssh.rs to call through the new module path. Drops the now-unused HashSet import from lib.rs.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move sandbox hardening from openshell-sandbox
Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox, prepare/enforce, log_sandbox_readiness, top-level apply, and apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/ to crates/openshell-supervisor-process/src/sandbox/.
Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace and its final home is decided when run_networking and run_process are extracted.
Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs, process.rs, and ssh.rs to import sandbox from openshell_supervisor_process while keeping the netns import unchanged.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift proposals flag from openshell-supervisor-process
Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process to openshell-core so both eventual leaf crates can read it without depending on each other. The flag is process-wide singleton state initialised once during sandbox startup and read by both the policy.local route (networking-side) and the skills installer (process-side), the same shape as openshell_ocsf::ctx.
Move the test-helpers Cargo feature alongside it: openshell-core gains the feature, openshell-supervisor-process loses it, and openshell-sandbox's dev-dependency now enables openshell-core/test-helpers. Update the sandbox re-export shim to point at openshell_core::proposals.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(core): lift netns + nft_ruleset from openshell-sandbox
Move NetworkNamespace and the nft_ruleset bypass-rule generator from crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/. Both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace, so it must live somewhere both can depend on without violating the Shape A no-leaf-to-leaf rule.
Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx() calls, matching the pattern used in already-migrated process modules. Update super::nft_ruleset references inside netns to nft_ruleset since the module is now a sibling sub-module of netns/mod.rs.
Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and gate pub mod netns on target_os = "linux" since the implementation uses netlink, ip(8), and namespace fds.
Delete the now-empty sandbox/{mod.rs, linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and process.rs to point at openshell_core::netns.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move process.rs and ssh.rs from openshell-sandbox
Lift the entrypoint process spawn module and the embedded SSH server module into openshell-supervisor-process. openshell-sandbox now re-exports ProcessHandle/ProcessStatus and calls openshell_supervisor_process::ssh::run_ssh_server directly.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move proxy, l7, opa, policy_local from openshell-sandbox
Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local advisor API into openshell-supervisor-networking. Move accompanying data files (sandbox-policy.rego), test fixtures (testdata/), and integration tests (system_inference, websocket_upgrade). Sandbox lib.rs now references these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr is exposed as pub for the orchestrator call site.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(sandbox): hoist policy poll loop and denial aggregator into orchestrator
Move the symlink-resolver, policy poll loop, and denial-aggregator flush spawns out of run_process and into run_sandbox so run_process no longer needs OpaEngine, retained_proto, the local policy context, the sandbox name, the gateway endpoint for telemetry, the OCSF flag, or the denial receiver. These long-running orchestrator-owned tasks now live alongside the other sandbox-startup wiring, matching the design log decision in architecture/plans/sandbox-split-design-choices.md (Q5).
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move run_process from openshell-sandbox
Lift the workload supervision entry point (zombie reaper, SSH server spawn, supervisor session, entrypoint child spawn, exit-with-timeout) into its own module in openshell-supervisor-process.
The orchestrator in openshell-sandbox now calls openshell_supervisor_process::run::run_process directly. With this move run_process names only types from openshell-core, openshell-ocsf, openshell-supervisor-process itself, std, and tokio, with no openshell-supervisor-networking dependency.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move bypass_monitor from supervisor-process
Bypass detection is network-policy enforcement: it parses nftables LOG entries from /dev/kmsg and emits OCSF NetworkActivity / DetectionFinding events plus DenialEvents into the same channel the proxy feeds. Its lifetime is tied to the network namespace, not to the workload child. Moving it to openshell-supervisor-networking puts it next to the proxy and the denial aggregator that consume its output, and unblocks moving run_networking out of openshell-sandbox without a leaf-to-leaf dep.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move inference route helpers from openshell-sandbox
Move build_inference_context, partition_routes, bundle_to_resolved_routes, spawn_route_refresh, the InferenceRouteSource enum, and the route refresh interval helpers into a new openshell-supervisor-networking::inference_routes module along with their unit tests. The orchestrator now calls into the networking leaf for inference context construction; the leaf owns its own route bundle resolution end-to-end.
The new module is named inference_routes to avoid colliding with the existing l7::inference module, which handles request-time HTTP parsing and pattern matching rather than route bundle setup.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-networking): move run_networking from openshell-sandbox
Move the Networking handle struct, run_networking, and the Linux-only create_netns_for_proxy helper into a new openshell-supervisor-networking::run module.
The orchestrator in openshell-sandbox now invokes openshell_supervisor_networking::run::{create_netns_for_proxy, run_networking} and reads the Networking fields directly; the leaf owns the entire networking-stack startup path (CA generation, proxy task, bypass monitor, inference context, denial channel) end-to-end.
The Networking RAII handle fields (proxy, bypass_monitor) are now public without leading underscores so the public API satisfies clippy's pub_underscore_fields lint while still serving as drop guards held by the orchestrator's frame.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* fix(workspace): align Cargo deps and call sites for split crates
The recent module lifts left two Linux-only gaps that the macOS host workspace check skipped:
- openshell-core's netns module needs libc, tempfile, and nix on Linux, but only openshell-ocsf and uuid were carried over.
- openshell-supervisor-process's seccomp/landlock modules need landlock and seccompiler, which still lived on openshell-sandbox.
- openshell-sandbox's runtime_pid_limit branch referenced an unqualified process:: that pointed at the old in-crate module.
Move landlock/seccompiler to supervisor-process, add the missing core deps, qualify the call sites, and drop sandbox deps that no longer have runtime users (landlock, seccompiler, target-gated tempfile/uuid, the unix libc/rustix block).
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): rename openshell-supervisor-networking to openshell-supervisor-network
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): own denial-aggregator flush end-to-end
Move the denial-aggregator spawn and flush_proposals_to_gateway out of run_sandbox and into run_networking.
The networking leaf already owns every other input (proxy + bypass_monitor as producers, denial channel, mechanistic_mapper, denial_aggregator) and already opens its own gRPC connections (inference_routes, policy_local); the orchestrator was the only piece left straddling the boundary. Networking now drives the full path: producers -> channel -> aggregator -> flush -> gateway.
Drops denial_rx from Networking; adds sandbox_name to run_networking so SubmitPolicyAnalysis can resolve by sandbox name (falls back to ID when unset). Same shape as log_push in the process leaf.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): own symlink-resolution task
Move the OPA binary-symlink resolver out of run_sandbox and into run_networking. The task probes /proc/<entrypoint_pid>/root/ until the workload's mount namespace is accessible, then rebuilds the OPA engine with resolved binary paths so policy rules match canonical names instead of symlinks. Both inputs (Arc<OpaEngine>, retained_proto) are networking-leaf concerns and were already plumbed into run_networking; the entrypoint_pid Arc is read lazily after the process leaf populates it.
Adds retained_proto as a parameter and spawns the resolver early in run_networking so the probe loop starts before the proxy comes up. Same shape as the denial-flush move: networking owns its own background task end-to-end; the orchestrator stops hosting work that doesn't conceptually belong to it.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move seccomp install into run_process
The supervisor seccomp prelude is part of "set up the workload-side process tree", not part of orchestration. Move the call site from run_sandbox into the top of run_process and drop the now-unused re-export from openshell-sandbox::lib.
Timing is preserved: by the time the orchestrator calls run_process, run_networking has already returned, so netns + nftables setup is complete.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move check_runtime_pid_limit into run_process
The PID-limit precondition is process-side: it gates whether the workload child can be spawned at all. Move the call from run_sandbox into the top of run_process, alongside the seccomp prelude. Same shape as the seccomp move: the function already lives process-side, only the call site relocates.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move validate_sandbox_user to process crate
The sandbox-user check is a precondition for privilege-dropping the workload child; it has no relevance to networking. Move the function next to drop_privileges in openshell-supervisor-process::process and call it from the top of run_process.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move prepare_filesystem to process crate
Creating and chowning read_write directories is workload-side preparation, not orchestration. Move prepare_filesystem and its prepare_read_write_path helper (plus tests) into openshell-supervisor-process::process and call from run_process, alongside validate_sandbox_user.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-process): move startup skill install into run_process
The eager initial-settings fetch + agent skill install is process-side: the install materializes files the workload's filesystem sees. The orchestrator still owns the AGENT_PROPOSALS_ENABLED OnceLock init because the policy poll loop also reads it; only the early fetch and install hop into run_process.
Behavior unchanged. Best-effort: any RPC or install failure is logged but does not fail startup.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): own PolicyLocalContext construction
Move the PolicyLocalContext construction from run_sandbox into run_networking.
The orchestrator was building it solely to thread it into the networking leaf and to share it with the policy poll loop; now run_networking builds it from inputs it already takes (retained_proto, openshell_endpoint, sandbox_name|sandbox_id) and exposes it on the returned Networking struct.
The orchestrator's poll loop now grabs the Arc clone from networking.policy_local_ctx, so the orchestrator no longer imports openshell_supervisor_network::policy_local.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* feat(supervisor): add --mode flag to gate network/process leaves
Add a --mode flag (default "network,process") that selects which supervisor leaves run in the current process. Two new shapes are unlocked without splitting the binary:
    --mode=network           # network-only sidecar
    --mode=process           # process-only supervisor
    --mode=network,process   # combined (default; current behavior)
In network-only mode the orchestrator skips run_process and waits on SIGINT/SIGTERM before tearing down the proxy. The entrypoint PID stays at 0 for the lifetime of the process, which silently degrades the proxy's binary-identity TOFU and the bypass monitor's PID enrichment; this is correct in a split-pod topology where the workload's /proc lives in another pod.
In process-only mode run_networking is skipped entirely. SSH sessions get no proxy URL, no netns FD, and no CA paths, matching what a split-pod consumer would expect when network enforcement is delegated to a sidecar.
The policy poll loop continues to run unconditionally; its OPA-reload and policy.local hooks already gate on the resources only present when network is enabled, and the env-refresh / proposals-toggle hooks remain active in process mode.
Closes a step toward the RFC-0001 supervisor topology proposed in issue #1305 by drew.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* style(supervisor-process): rustfmt long debug! line
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): pull DenialEvent down from core
DenialEvent is only emitted and consumed inside openshell-supervisor-network (proxy, bypass monitor, denial aggregator). It never crossed the leaf boundary, so the earlier lift to openshell-core was speculative. Move it back into the network crate where its only callers live.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): pull procfs down from core
procfs was lifted to openshell-core under the assumption it would be shared cross-leaf, but on the current branch all three callers (bypass_monitor, identity, proxy) live in openshell-supervisor-network. No file in openshell-supervisor-process imports it. Move the module to the network crate and drop sha2/hex from openshell-core, which were pulled in only for procfs.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* style(supervisor-network): run cargo fmt
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* fix(supervisor-network): add libc dev-dependency for procfs tests
The procfs/bypass_monitor/proxy test modules use libc::{fork, exec, fcntl, kill, waitpid} but the dep wasn't declared in this crate's Cargo.toml. It was previously satisfied transitively when these modules lived in openshell-core; the move left the test target unable to resolve libc.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(sandbox): move denial aggregator to orchestrator
The denial aggregator and mechanistic mapper consume denial events produced by the proxy and (subsequently) the bypass monitor. With both supervisor leaves becoming pure producers of `DenialEvent`, the consumer-side aggregation belongs in the orchestrator, not in either leaf.
Move `denial_aggregator.rs` and `mechanistic_mapper.rs` from `openshell-supervisor-network` to `openshell-sandbox` (the orchestrator).
The orchestrator now owns the unbounded denial channel: it constructs `(tx, rx)`, hands `tx` to `run_networking` for the proxy to clone, drains `rx` via the aggregator task, and runs the gateway flush helper itself. `run_networking`'s signature gains a `denial_tx` parameter and loses its internal channel construction, aggregator spawn, and `flush_proposals_to_gateway` helper.
`DenialEvent` stays in `openshell-supervisor-network` for now; a follow-up commit will lift it to `openshell-core` alongside the bypass monitor relocation.
* refactor(supervisor-process): pull bypass monitor down from network
`bypass_monitor` is process-isolation machinery: it tails the kernel log via `dmesg --follow`, parses nftables LOG lines emitted from the workload's network namespace, resolves PIDs via `/proc`, and emits OCSF events plus optional `DenialEvent`s. None of this touches the proxy, OPA, TLS, or any other supervisor-network state; it only shared the denial channel because both feed the same aggregator.
Move `bypass_monitor.rs` from `openshell-supervisor-network` to `openshell-supervisor-process` (as `bypass_monitor/mod.rs`). Spawn it in `run_process` where the netns name and entrypoint PID are already in scope. The orchestrator hands an extra `bypass_denial_tx` clone of the denial channel sender to `run_process` for this purpose.
Lift `DenialEvent` from `openshell-supervisor-network` to `openshell-core`. Both supervisor leaves now produce it, so it needs a shared location that neither leaf depends on. This reverses an earlier commit that pulled the type into the network leaf when it was the only producer.
Copy the minimal subset of `/proc` parsers used by `bypass_monitor` into a private `bypass_monitor::procfs` submodule. The alternative, extracting a shared procfs crate, is a much larger refactor that this commit does not need; supervisor-network's `procfs.rs` continues to serve the proxy and identity cache.
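The channel ownership described in the two entries above (orchestrator constructs the channel, each leaf receives a sender clone, the orchestrator drains the receiver) can be sketched as follows. This is a minimal illustration, not the project's code: it swaps in `std::sync::mpsc` and threads for the tokio channel and tasks, the `DenialEvent` fields are hypothetical, and the function bodies are stand-ins that only borrow the names `run_networking`, `run_process`, and `run_sandbox` from the log.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for `openshell_core::DenialEvent`; the real fields
/// are not shown in this log.
#[derive(Debug, Clone, PartialEq)]
pub struct DenialEvent {
    pub source: &'static str,
    pub detail: String,
}

/// Stand-in for the network leaf: a pure producer (the proxy).
fn run_networking(denial_tx: mpsc::Sender<DenialEvent>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let _ = denial_tx.send(DenialEvent {
            source: "proxy",
            detail: "CONNECT example.com:443 denied".to_string(),
        });
    })
}

/// Stand-in for the process leaf: a pure producer (the bypass monitor).
fn run_process(bypass_denial_tx: mpsc::Sender<DenialEvent>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let _ = bypass_denial_tx.send(DenialEvent {
            source: "bypass_monitor",
            detail: "direct connect to 10.0.0.1:22 dropped".to_string(),
        });
    })
}

/// Orchestrator: owns the channel, hands out sender clones, drains the receiver.
pub fn run_sandbox() -> Vec<DenialEvent> {
    let (tx, rx) = mpsc::channel();
    let network_leaf = run_networking(tx.clone());
    let process_leaf = run_process(tx.clone());
    // Drop the orchestrator's own sender so the drain loop ends once both
    // leaves have finished and released their clones.
    drop(tx);

    // The aggregator side: consume every event until all senders are gone.
    let mut events: Vec<DenialEvent> = rx.iter().collect();
    network_leaf.join().unwrap();
    process_leaf.join().unwrap();

    // Sort for a deterministic result; arrival order depends on scheduling.
    events.sort_by_key(|e| e.source);
    events
}
```

Because neither leaf ever sees the receiver, the only type the leaves need to share is the event itself, which is the reason the log gives for placing `DenialEvent` in a crate below both.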
* refactor(supervisor-process): derive ssh netns fd inside run_process
The ssh_netns_fd was computed in run_networking purely to forward it through the Networking struct and back into run_process. supervisor-network never read it. Move the derivation to run_process where the NetworkNamespace handle is already in scope.
* refactor(supervisor-process): derive ssh proxy url inside run_process
The ssh_proxy_url was computed in run_networking purely to forward it through the Networking struct and back into run_process. supervisor-network never read it. Move the derivation to run_process where the NetworkNamespace handle and SandboxPolicy are already in scope.
After this commit the Networking struct no longer carries any SSH-shaped fields, and supervisor-network reads only host_ip from the netns (for the proxy bind address).
* refactor(supervisor-network): take proxy bind ip directly instead of netns
run_networking only ever read host_ip from the netns it was passed (the SSH plumbing reads moved to run_process in earlier commits). Replace the NetworkNamespace parameter with a plain Option<IpAddr> the orchestrator extracts. supervisor-network's run module no longer references the netns type for any consumer, only for create_netns_for_proxy (which still lives in this crate; relocates next).
* refactor(supervisor-process): move netns ownership out of core
Relocates the NetworkNamespace handle, nft ruleset builder, and create_netns_for_proxy constructor into openshell-supervisor-process. The orchestrator (openshell-sandbox) phantom-owns the RAII handle for the duration of run_sandbox; supervisor-network no longer references the type at all.
Drops uuid, libc, nix, openshell-ocsf, and tempfile from core's Linux target deps (all were exclusive to netns). tempfile becomes a Linux runtime dep on supervisor-process for nft ruleset application.
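The parameter narrowing in the last two entries (the network leaf receives only an `Option<IpAddr>`, while the orchestrator keeps the namespace handle alive as a drop guard) might look roughly like this. It is a sketch under assumptions: the `NetworkNamespace` struct here is a hypothetical stand-in, the `torn_down` flag exists only to make the drop observable, and the loopback fallback for `None` is an illustrative choice rather than the project's documented behavior.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the `NetworkNamespace` RAII handle.
pub struct NetworkNamespace {
    host_ip: IpAddr,
    torn_down: Arc<AtomicBool>,
}

impl NetworkNamespace {
    pub fn new(host_ip: IpAddr, torn_down: Arc<AtomicBool>) -> Self {
        Self { host_ip, torn_down }
    }

    pub fn host_ip(&self) -> IpAddr {
        self.host_ip
    }
}

impl Drop for NetworkNamespace {
    fn drop(&mut self) {
        // A real handle would tear down the namespace here; the sketch only
        // records that teardown ran.
        self.torn_down.store(true, Ordering::SeqCst);
    }
}

/// Network-leaf side: takes only the value it actually reads.
/// Assumption: fall back to loopback when no namespace exists.
pub fn proxy_bind_addr(bind_ip: Option<IpAddr>, port: u16) -> SocketAddr {
    SocketAddr::new(bind_ip.unwrap_or(IpAddr::V4(Ipv4Addr::LOCALHOST)), port)
}

/// Orchestrator side: extracts the IP, then keeps owning the handle until the
/// end of its own frame.
pub fn run_sandbox(netns: Option<NetworkNamespace>, port: u16) -> SocketAddr {
    let bind_ip = netns.as_ref().map(NetworkNamespace::host_ip);
    let addr = proxy_bind_addr(bind_ip, port);
    // `netns` is dropped when this function returns, after the leaf call.
    addr
}
```

Passing the plain value keeps the network leaf free of any dependency on the crate that defines the handle, which is what allows the type to move into the process crate.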
* chore(sandbox): prune leaf-only deps from orchestrator manifest
cargo-machete flagged 26 direct dependencies that were carried over from the pre-split monolith and are no longer used by the orchestrator itself: regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, anyhow, base64, bytes, flate2, glob, hex, hmac, nix, rand_core, rustls-pemfile, serde, serde_yml, sha1, sha2, thiserror, tokio-stream, uuid, webpki-roots.
These now live (transitively) in openshell-supervisor-network and openshell-supervisor-process where they are actually consumed.
* chore(deps): prune unused deps from supervisor crates
- Drop unused `url` from openshell-supervisor-network.
- Mark `prost` and `prost-types` as cargo-machete-ignored in openshell-core: they have no source-level `use`, but the tonic-generated proto code references them via `::prost::Message` etc.
- openshell-supervisor-process is already clean.
* fix(supervisor-network): wait for entrypoint PID before symlink probe
The OPA symlink-resolution task reads entrypoint_pid once at the top of the spawned closure. Because the spawn happens before run_process publishes the workload PID, the load returns 0, the probe path bakes in as /proc/0/root/, and the loop exhausts its retries against a path that does not exist on Linux. The reload never fires, so policies that whitelist symlinked binaries (e.g. /usr/bin/python3 → python3.11) get silent denials when the workload exec's the realpath.
Split the wait into two phases: 5s polling entrypoint_pid for a non-zero value, then the existing 5s window probing /proc/<pid>/root/. Distinct warn messages on each timeout so future debugging can tell "PID never published" apart from "container fs never appeared".
* fix(sandbox): restore GPU procfs baseline (#1522)
Signed-off-by: Evan Lezar <elezar@nvidia.com>
(cherry picked from commit 5102cb9)
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* fix(supervisor-process): use renamed tonic tls-native-roots feature
Upstream renamed the tonic `tls` feature to `tls-native-roots`. The supervisor-process Cargo.toml still referenced the old name, which broke the workspace build after merging upstream.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* refactor(supervisor-network): relocate token_grant and spiffe_endpoint
Upstream's SPIFFE-backed token grant feature landed in crates/openshell-sandbox/src/. After the supervisor split, the L7 enforcement code in supervisor-network calls into token_grant, which would require supervisor-network to depend back on sandbox. Move token_grant.rs and spiffe_endpoint.rs into supervisor-network where the only callers live, add the reqwest and spiffe deps to supervisor-network's Cargo.toml, and drop them from sandbox.
Also fix two stale `openshell_core::proto::` self-references in openshell-core (a pre-existing breakage that surfaced once the rest of the merge compiled).
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
* fix(supervisor-process): broaden Path import cfg to all unix targets
The `Path` import was gated on `cfg(any(test, target_os = "linux"))`, but `prepare_read_write_path` is gated on `cfg(unix)`, which is broader. On non-Linux unix the function still referenced `&std::path::Path` explicitly, so upstream's qualified path was load-bearing. After the supervisor split, lint runs on Linux where `Path` IS in scope, so `unused_qualifications` fires. Broaden the import cfg to match the function's cfg and use the bare `Path` name everywhere.
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
---------
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Co-authored-by: Evan Lezar <elezar@nvidia.com>
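The two-phase entrypoint PID wait described in the commit log above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the actual supervisor-network code: the function name `wait_for_entrypoint_root`, the `ProbeError` type, and the injected `root_exists` check are all hypothetical, and the real task presumably runs on an async runtime and logs warnings rather than returning an error enum.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error type: one variant per timeout, so the two failure
/// modes ("PID never published" vs "container fs never appeared") stay distinct.
#[derive(Debug, PartialEq)]
pub enum ProbeError {
    PidNeverPublished,
    RootNeverAppeared(u32),
}

/// Phase 1: poll the shared PID until it is non-zero.
/// Phase 2: poll until the root path for that PID exists.
/// `root_exists` is injected (an assumption of this sketch) so the logic
/// can be exercised without a real /proc.
pub fn wait_for_entrypoint_root(
    entrypoint_pid: &AtomicU32,
    phase_timeout: Duration,
    poll_interval: Duration,
    root_exists: impl Fn(u32) -> bool,
) -> Result<PathBuf, ProbeError> {
    // Phase 1: never build a probe path from PID 0.
    let deadline = Instant::now() + phase_timeout;
    let pid = loop {
        let pid = entrypoint_pid.load(Ordering::Acquire);
        if pid != 0 {
            break pid;
        }
        if Instant::now() >= deadline {
            return Err(ProbeError::PidNeverPublished);
        }
        sleep(poll_interval);
    };

    // Phase 2: the PID is known, so the probe path is now meaningful.
    let deadline = Instant::now() + phase_timeout;
    loop {
        if root_exists(pid) {
            return Ok(PathBuf::from(format!("/proc/{pid}/root")));
        }
        if Instant::now() >= deadline {
            return Err(ProbeError::RootNeverAppeared(pid));
        }
        sleep(poll_interval);
    }
}
```

A caller on Linux might pass `Duration::from_secs(5)` for `phase_timeout` and a closure such as `|pid| std::path::Path::new(&format!("/proc/{pid}/root")).exists()` for `root_exists`. Giving each phase its own deadline is what keeps the two timeout causes distinguishable.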
Summary
Restore CUDA GPU startup compatibility by promoting `/proc` from `filesystem_policy.read_only` to `filesystem_policy.read_write` when `/proc` is part of the active GPU runtime baseline.
This keeps the change intentionally narrow. The existing baseline enrichment already places `/proc` in the GPU read-write baseline because CUDA writes `/proc/<pid>/task/<tid>/comm` during initialization. The missing behavior was that an existing read-only `/proc` entry caused enrichment to skip the read-write baseline path. This PR restores that promotion and emits an informational log message when it happens.
Broader handling for user-supplied policy conflicts and explicit baseline
conflict controls is left to follow-up work such as #1629.
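As a rough illustration of the promotion described in this summary, the sketch below moves a baseline path out of `read_only` and into `read_write`. It is not the code from this PR: the `FilesystemPolicy` struct is a simplified stand-in, `enrich_gpu_baseline` is a hypothetical name, and `eprintln!` stands in for whatever logging facility the sandbox actually uses.

```rust
use std::path::{Path, PathBuf};

/// Simplified, hypothetical stand-in for the sandbox filesystem policy.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FilesystemPolicy {
    pub read_only: Vec<PathBuf>,
    pub read_write: Vec<PathBuf>,
}

/// Ensure every GPU read-write baseline path ends up in `read_write`.
/// A path that was listed as read-only is promoted instead of skipped.
/// Returns the paths that were promoted.
pub fn enrich_gpu_baseline(
    policy: &mut FilesystemPolicy,
    baseline_rw: &[&Path],
) -> Vec<PathBuf> {
    let mut promoted = Vec::new();
    for &path in baseline_rw {
        // Drop any conflicting read-only entry for this baseline path.
        let before = policy.read_only.len();
        policy.read_only.retain(|p| p.as_path() != path);
        let was_read_only = policy.read_only.len() != before;

        // Add the path to read_write unless it is already there.
        if !policy.read_write.iter().any(|p| p.as_path() == path) {
            policy.read_write.push(path.to_path_buf());
        }

        if was_read_only {
            // Informational message; the real log format is not shown in the PR text.
            eprintln!(
                "info: promoting {} from read_only to read_write for GPU runtime compatibility",
                path.display()
            );
            promoted.push(path.to_path_buf());
        }
    }
    promoted
}
```

Calling `enrich_gpu_baseline(&mut policy, &[Path::new("/proc")])` on a policy whose `read_only` list contains `/proc` leaves `/proc` only in `read_write`. Other read-only entries are untouched, which matches the narrow scope the summary describes.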
Related Issue
Fixes #1486
Related follow-up: #1629
Changes
Promote `/proc` from `read_only` to `read_write` when the GPU read-write baseline requires it.
Log an informational message when `/proc` is promoted for GPU runtime compatibility.
policy.
Testing
`mise exec -- cargo fmt --all`
`mise exec -- cargo test -p openshell-sandbox --lib baseline_tests -- --nocapture`
`mise run pre-commit` completed Helm lint, Rust format, Rust check, Rust clippy, markdown lint, and license checks; `python:proto` failed in the parallel run because `grpc_tools` was missing after `.venv` recreation.
`mise run python:proto`
Checklist