fix(sandbox): attribute supervisor-helper peers via pid-1 netns fallback by Kh4L · Pull Request #956 · NVIDIA/OpenShell

Kh4L · 2026-04-24T03:25:00Z

Summary

Fix peer attribution for policy-proxy CONNECTs that originate from supervisor helpers.

parse_proc_net_tcp currently searches only /proc/<entrypoint_pid>/net/tcp to identify the peer process behind a CONNECT. Supervisor helpers introduced in #953 share the supervisor's network namespace, not the workload's, so helper-originated connections never appear in the workload procfs view. The proxy logs "failed to resolve peer binary: No ESTABLISHED TCP connection found" and denies the request even when the host is allowlisted.

This PR adds a narrow fallback: check the workload's procfs view first, then fall back to /proc/1/net/tcp, where pid 1 is the sandbox supervisor.
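The lookup order can be sketched roughly as follows. This is an illustration only, not the code in procfs.rs: `candidate_tcp_tables` and `first_match` are hypothetical names, and the per-table lookup is passed in as a closure because the real `parse_proc_net_tcp` signature is not shown here.

```rust
use std::path::{Path, PathBuf};

/// Candidate procfs TCP tables in lookup order: the workload's view first,
/// then pid 1 (the sandbox supervisor). Hypothetical helper for illustration.
fn candidate_tcp_tables(entrypoint_pid: u32) -> Vec<PathBuf> {
    [entrypoint_pid, 1]
        .iter()
        .map(|pid| PathBuf::from(format!("/proc/{pid}/net/tcp")))
        .collect()
}

/// Try each table in order and return the first match.
/// `lookup` stands in for whatever resolves a peer inside a single table.
fn first_match<T>(
    tables: &[PathBuf],
    mut lookup: impl FnMut(&Path) -> Option<T>,
) -> Option<T> {
    tables.iter().find_map(|table| lookup(table.as_path()))
}
```

Because the workload table comes first in the list, a connection visible in both views resolves to the workload entry; the pid-1 table is consulted only after the workload lookup misses.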

This is stacked on #953 and should land after that PR. Until #953 lands, the diff against main will include the supervisor-helpers commit as well as this procfs fix.

Related Issue

Related to #681 and #684.

Depends on #953.

Why pid 1

The previous #684 proposal used /proc/net/tcp, which is broader and raised a port-collision concern because it searches the init-namespace global view.

This PR keeps the fallback inside the sandbox pod boundary:

| Approach | Search space | Port-collision risk |
| --- | --- | --- |
| Current behavior | `/proc/<workload_pid>/net/tcp` | None, but misses helpers and some redirected flows |
| #684 | `/proc/net/tcp` | Broader host/global view |
| This PR | `/proc/1/net/tcp` | Same pod namespace; workload checked first |

Inside a sandbox pod, pid 1 is the supervisor. Helpers share the supervisor network namespace by design. Workload peers still win because the workload procfs view is checked first, so existing workload behavior is unchanged unless the original lookup already missed.

Downstream identity checks in proxy::resolve_process_identity apply unchanged. Binary hash TOFU and ancestor walking still catch binary swaps or unexpected process lineage.

Changes

Extended crates/openshell-sandbox/src/procfs.rs to search [entrypoint_pid, 1].
Updated the peer-resolution error message to mention both lookup paths.
Added comments explaining the lookup order and security boundary.
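For readers unfamiliar with the procfs table format, here is a simplified sketch of how an ESTABLISHED entry might be located in a `/proc/<pid>/net/tcp` dump. It is not the `parse_proc_net_tcp` implementation: `find_established_inode` is a hypothetical name, and matching on the local port alone is a simplification made for brevity. The field positions and the `01` state code follow the kernel's documented table layout.

```rust
/// State code the kernel prints for an ESTABLISHED socket in /proc/net/tcp.
const TCP_ESTABLISHED: u8 = 0x01;

/// Parse the hex port out of an "ADDR:PORT" column such as "0100007F:1F90".
fn parse_port(addr: &str) -> Option<u16> {
    let (_ip, port) = addr.split_once(':')?;
    u16::from_str_radix(port, 16).ok()
}

/// Return the socket inode of the first ESTABLISHED row whose local port
/// equals `peer_port`. Hypothetical helper; matches on port only.
fn find_established_inode(table: &str, peer_port: u16) -> Option<u64> {
    // The first line is the column header, so skip it.
    table.lines().skip(1).find_map(|line| {
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() < 10 {
            return None;
        }
        // fields[1] = local address, fields[3] = state, fields[9] = inode.
        let state = u8::from_str_radix(fields[3], 16).ok()?;
        if state != TCP_ESTABLISHED || parse_port(fields[1])? != peer_port {
            return None;
        }
        fields[9].parse().ok()
    })
}
```

The inode is what ties a table row back to a process, since the owning process holds a file descriptor that points at `socket:[inode]`.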

Testing

mise run pre-commit
Existing unit tests still pass
Manual macOS Docker Desktop arm64 validation with supervisor helpers
WSL2 repro validation for #681 ([Bug] L7 egress proxy denies all CONNECT requests on Docker Desktop + WSL2 (amd64))

Manual validation confirmed helper-originated CONNECTs to api.telegram.org and openrouter.ai are attributed and allowed by policy instead of failing with "No ESTABLISHED TCP connection found".

Checklist

Follows Conventional Commits
Commits are signed off (DCO)

Declarative JSON config for daemons the supervisor should spawn alongside the workload with operator-declared ambient capabilities. Each listed helper is forked by the supervisor before the workload, has its requested caps raised into the inheritable and ambient sets via capset(2) and prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, ...) in the pre-exec child, and is then execve'd. After execve the ambient set becomes part of the helper's permitted and effective sets, so it keeps its capabilities even though the workload path still runs with PR_SET_NO_NEW_PRIVS=1.

Helpers run in the supervisor's context and do not inherit the per-workload seccomp filter, PR_SET_NO_NEW_PRIVS, Landlock, or the uid drop. The supervisor seccomp prelude added in NVIDIA#891 is applied after helpers spawn and only blocks mount/fs*/pivot_root/bpf/kexec/module/userfaultfd syscalls, so capset, prctl, clone, and execve remain available to the helper spawn path.

Validation rejects empty names, duplicate names, empty commands, non-absolute command[0], and unknown capability names. A cap outside the supervisor's permitted set (which the pod's bounding set clamps) fails at capset with EPERM and the supervisor exits before the workload starts. The SSH handshake secret is scrubbed from the helper's inherited environment, matching what the workload sees. Each helper start emits an OCSF AppLifecycle Start event with the helper name and pid.

This generalizes the hardcoded DNS-proxy-shaped pattern into a public, declarative API so operators can register audited daemons (capability brokers, privileged IPC bridges) without patching the supervisor binary.

Intentionally out of scope for this v0 landing (tracked for follow-ups): per-helper Landlock, workload rendezvous, restart policy, readiness fd, stdio routing into OCSF JSONL, per-helper cgroup limits, and runAsUser for dropping to non-root while retaining caps. All additive on the v0 schema.

Flag: --helpers-config <path> (env: OPENSHELL_HELPERS_CONFIG).
Architecture: architecture/supervisor-helpers.md.
User docs: docs/sandboxes/supervisor-helpers.mdx.

Signed-off-by: Serge Panev <[email protected]>
Signed-off-by: Serge Panev <[email protected]>
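The validation rules listed in this commit message could look roughly like the sketch below. Everything named here is an assumption made for illustration: the `HelperSpec` struct, its field names, the `validate` function, and the short `KNOWN_CAPS` list are hypothetical and do not reflect the actual v0 schema, which is described in docs/sandboxes/supervisor-helpers.mdx.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of one helper entry (not the real schema).
struct HelperSpec {
    name: String,
    command: Vec<String>,
    capabilities: Vec<String>,
}

/// Illustrative subset of Linux capability names; a real list would be longer.
const KNOWN_CAPS: &[&str] = &["CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_SYS_PTRACE"];

/// Apply the five rejection rules named in the commit message.
fn validate(helpers: &[HelperSpec]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for h in helpers {
        if h.name.is_empty() {
            return Err("empty helper name".to_string());
        }
        if !seen.insert(h.name.as_str()) {
            return Err(format!("duplicate helper name: {}", h.name));
        }
        let Some(program) = h.command.first() else {
            return Err(format!("helper {}: empty command", h.name));
        };
        if !program.starts_with('/') {
            return Err(format!("helper {}: command[0] must be absolute", h.name));
        }
        if let Some(cap) = h
            .capabilities
            .iter()
            .find(|c| !KNOWN_CAPS.contains(&c.as_str()))
        {
            return Err(format!("helper {}: unknown capability {cap}", h.name));
        }
    }
    Ok(())
}
```

Note that a capability can pass this name check and still fail later at capset with EPERM if it falls outside the supervisor's permitted set, as the commit message describes.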

copy-pr-bot · 2026-04-24T03:25:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Helpers share the supervisor's network namespace by design, but that netns was previously empty of iptables rules: anything spawned by a helper that ignored HTTPS_PROXY (notably Node's built-in https.request, which OpenClaw's web_fetch tool uses) could egress directly to any host the pod could reach, with no policy check and no audit trail.

Install a small OUTPUT-chain rule set in the supervisor netns before spawning the first helper:

1. ACCEPT -o lo (mediator UDS, etc.)
2. ACCEPT -m conntrack --ctstate ESTABLISHED,RELATED
3. ACCEPT -d <proxy>/32 -p tcp --dport 3128 (proxy-aware path)
4. ACCEPT -m owner --uid-owner 0 (supervisor + proxy upstream forwarding)
5. LOG --log-prefix "openshell:helper-bypass:" (bypass_monitor hook)
6. REJECT --reject-with icmp-port-unreachable (fast-fail)

A helper that drops privileges (gosu, runner.c setresuid, etc.) and then attempts a direct outbound TCP connection lands in the catch-all REJECT and gets ECONNREFUSED plus a kernel LOG entry that bypass_monitor parses into an OCSF DetectionFinding.

Trade-offs documented in the module-level comment block:

* UID 0 ACCEPT covers the supervisor's own gRPC to control plane, log push, and the proxy's upstream forwarding (which would otherwise loop). Helpers that haven't dropped privileges yet are also covered, a small bounded window between exec and uid switch.
* Proxy IP/port hard-coded to 10.200.0.1:3128 to match sandbox::linux::netns::{SUBNET_PREFIX, HOST_IP_SUFFIX}; helpers spawn before policy load so the runtime address isn't available yet. A v1 refactor that moves spawn after policy load would replace these constants with the live values.
* Degrades gracefully when iptables is missing (ConfigStateChange OCSF event with severity=Medium, helpers still spawn).

Verified end-to-end:

- iptables -L OUTPUT in supervisor netns shows the 6 rules with matching packet counters.
- `gosu gateway curl --noproxy '*' https://<ip>/` from inside the pod returns "Connection refused" (catch-all REJECT) where it previously connected directly.
- Proxy-aware paths (sclaw → httpx, gateway → HTTPS_PROXY) continue to flow through the proxy and get attributed by OPA as before.

Signed-off-by: Serge Panev <[email protected]>
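To make the rule ordering concrete, the sketch below builds the six OUTPUT-chain rules as `iptables` argument vectors. It only constructs the arguments and does not run `iptables`. `helper_output_rules` is a hypothetical name, and how the supervisor actually installs the rules is not shown in this commit message; the match and target options are the ones listed above, written in standard `iptables -A` form.

```rust
/// Build `iptables` argument vectors for the six rules, in install order.
/// Hypothetical helper: the real module may structure this differently.
fn helper_output_rules(proxy_ip: &str, proxy_port: u16) -> Vec<Vec<String>> {
    let dest = format!("{proxy_ip}/32");
    let port = proxy_port.to_string();
    let rules: Vec<Vec<&str>> = vec![
        // 1. Loopback traffic.
        vec!["-o", "lo", "-j", "ACCEPT"],
        // 2. Replies on connections that were already allowed.
        vec!["-m", "conntrack", "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        // 3. Proxy-aware path: traffic addressed to the policy proxy.
        vec!["-d", dest.as_str(), "-p", "tcp", "--dport", port.as_str(), "-j", "ACCEPT"],
        // 4. Root-owned sockets (supervisor and proxy upstream forwarding).
        vec!["-m", "owner", "--uid-owner", "0", "-j", "ACCEPT"],
        // 5. Log whatever is about to be rejected.
        vec!["-j", "LOG", "--log-prefix", "openshell:helper-bypass:"],
        // 6. Fast-fail everything else.
        vec!["-j", "REJECT", "--reject-with", "icmp-port-unreachable"],
    ];
    rules
        .into_iter()
        .map(|rule| {
            let mut args = vec!["-A".to_string(), "OUTPUT".to_string()];
            args.extend(rule.into_iter().map(String::from));
            args
        })
        .collect()
}
```

Order matters here: the LOG rule must precede the REJECT rule so that a bypass attempt is recorded before the packet is refused, and both must come after every ACCEPT rule.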

parse_proc_net_tcp only searches /proc/<entrypoint_pid>/net/tcp to identify the peer of a proxy CONNECT. Supervisor helpers (declared via --helpers-config, spawned pre-workload by helpers::spawn_helpers) share the supervisor's netns, not the workload's, so their connections never appear in the workload's procfs view. The policy proxy logs "failed to resolve peer binary: No ESTABLISHED TCP connection found" and denies every allowlisted host, breaking helpers that legitimately need to reach external services (capability brokers, inference gateways running alongside the sandboxed agent).

Add a fallback: try the workload's netns first, then fall back to pid 1's (the supervisor's). Workload-originated connections always win the race because the workload netns is checked first, and downstream identity checks (binary hash TOFU + ancestor walk) apply unchanged: a helper swapped for a different binary still trips integrity verification the same way a workload peer would.

This closes the peer-attribution gap that kept sclaw's mediator-spawned openclaw gateway helper from reaching api.telegram.org through the policy proxy.

Signed-off-by: Serge Panev <[email protected]>
Signed-off-by: Serge Panev <[email protected]>

This was referenced Apr 24, 2026

fix(sandbox): fall back to /proc/net/tcp for peer identity resolution… #684

Closed

[Bug] L7 egress proxy denies all CONNECT requests on Docker Desktop + WSL2 (amd64) #681

Open

Kh4L marked this pull request as ready for review April 24, 2026 03:28

Kh4L requested a review from a team as a code owner April 24, 2026 03:28

Kh4L added 2 commits April 25, 2026 19:41

Kh4L force-pushed the fix/helper-netns-peer-attribution branch from 0a139ca to 5b9e87e Compare April 26, 2026 02:41

Kh4L wants to merge 3 commits into NVIDIA:main from Kh4L:fix/helper-netns-peer-attribution
