Skip to content

fix(droid-control): kill full true-input process tree on close#30

Merged
factory-ain3sh merged 1 commit into
masterfrom
ainesh/cli-1171-tctl-close-leak
Jul 2, 2026
Merged

fix(droid-control): kill full true-input process tree on close#30
factory-ain3sh merged 1 commit into
masterfrom
ainesh/cli-1171-tctl-close-leak

Conversation

@factory-ain3sh

Copy link
Copy Markdown
Contributor

Description

tctl close on a recorded true-input session killed only the wf-recorder and deleted the session state dir, leaving the entire cage/ghostty/script/CLI process tree running orphaned; one dev host had accumulated 9 stale compositors and 48 leaked temp workspace dirs. Root cause: launch_true_input persisted session meta with CAGE_PID="" before starting cage, and for --record launches (the apps/cli true-input harness always records) start_true_input_recording's load_meta clobbered the in-memory PID back to empty, so cmd_close skipped the compositor kill entirely. Restoring the PID alone would not fix teardown either: script(1) moves its children into new sessions, so even a correct kill $CAGE_PID left ghostty/script/CLI behind.

This PR persists CAGE_PID immediately after the wayland socket wait (before the recording branch can reload the meta), launches cage under setsid, and replaces the bare kill with a full-stack teardown: descendant enumeration plus process-group kill with TERM->poll->KILL escalation, INT->poll->KILL for the recorder so recordings finalize, and a stray sweep on the session's run-*.sh argv that also closes pre-fix sessions whose metas carry an empty CAGE_PID.

Related Issue

Closes CLI-1171

Reviewer Guide

Read order: plugins/droid-control/bin/tctl only -- launch_true_input (PID persistence, setsid, socket-timeout cleanup) > terminate_true_input_stack + collect_tree_pids > terminate_session_strays + pid_is_self_or_ancestor > cmd_close.
Review depth: Standard -- signal-sending code with a self-kill guard deserves a careful read, but it is one file with no external contract.
Open for pushback: the fixed grace budgets (2s tree, 3s recorder, 1s per stray) mean close can block a few seconds worst case; chosen over a tunable to keep the CLI surface unchanged.

Risk & Impact

The new paths send signals, so mis-scoping would kill unrelated processes. Containment: the group kill targets only cage's own setsid group; enumeration walks descendants of the recorded cage PID; the stray sweep matches only the session's run-*.sh runner argv (not the bare session dir path, which an inspecting shell or editor can carry) and skips self/ancestors, so a close invoked from inside a session cannot kill its caller. close now blocks up to the escalation deadlines instead of returning instantly.

Verification

Behavior verified. verified @ adff9fa on a host with cage/ghostty/wtype/grim/wf-recorder: recorded launch -> type -> close leaves zero processes referencing the session, session and runtime dirs removed, mp4 finalized (ffprobe duration 1.83s); a session whose meta was hand-edited to the pre-fix CAGE_PID="" shape (5 live processes) sweeps clean on close; tuistory launch/snapshot/close unaffected. Standalone record start/stop and the self/ancestor guard (a close whose caller argv contains the runner path survives) were verified on the same host during fix development.
Regression coverage. None automated: the repo has no shell test harness (CI's check-skills.yml validates skill metadata only). The manual matrix above plus the Repro Recipe stand in.
Not tested. The KILL-escalation branch under a process that traps and ignores TERM -- genuinely contrived to set up; the branch only runs when the grace poll expires.
Standard validators. bash -n clean; shellcheck not installed on the verification host; no formatter/linter configured in this repo.

Repro Recipe

On master, a recorded session leaks its whole tree; on this branch the same steps leave nothing:

tctl launch bash --backend true-input -s leak-test --record /tmp/leak-test.mp4
tctl -s leak-test close
pgrep -fa 'tctl-sessions/leak-test'   # master: cage/ghostty/script/bash survive; this PR: empty
ffprobe -v error -show_entries format=duration -of csv=p=0 /tmp/leak-test.mp4  # this PR: valid duration (recorder got INT, container finalized)

…171)

launch_true_input saved meta with CAGE_PID="" before starting cage, and for
--record launches start_true_input_recording's load_meta clobbered the
in-memory CAGE_PID back to empty, so cmd_close skipped the compositor kill and
leaked the whole cage/ghostty/script/CLI tree (9 stale compositors on one host).

Persist CAGE_PID right after the wayland socket wait, before the recording
branch can reload the meta. Launch cage under setsid so it leads its own
process group, and kill the compositor before die on socket timeout. Teardown
now enumerates descendants and group-kills with TERM->poll->KILL escalation:
each half covers what the other misses (script(1) children escape the group
into new sessions; init-reparented members escape enumeration). The recorder
gets INT->poll->KILL so recordings finalize. A stray sweep on the session's
run-*.sh argv closes pre-fix sessions whose metas have empty CAGE_PID, guarded
by pid_is_self_or_ancestor so it never kills the caller. wf-recorder is
required up front when recording so a missing binary fails before cage starts.
@factory-ain3sh factory-ain3sh self-assigned this Jul 2, 2026
@factory-ain3sh factory-ain3sh added the bug Something isn't working label Jul 2, 2026
@factory-ain3sh factory-ain3sh merged commit e8801fa into master Jul 2, 2026
1 check passed
@factory-ain3sh factory-ain3sh deleted the ainesh/cli-1171-tctl-close-leak branch July 2, 2026 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants