fix(droid-control): kill full true-input process tree on close#30
Merged
Conversation
…171) launch_true_input saved meta with CAGE_PID="" before starting cage, and for --record launches start_true_input_recording's load_meta clobbered the in-memory CAGE_PID back to empty, so cmd_close skipped the compositor kill and leaked the whole cage/ghostty/script/CLI tree (9 stale compositors on one host). Persist CAGE_PID right after the wayland socket wait, before the recording branch can reload the meta. Launch cage under setsid so it leads its own process group, and kill the compositor before die on socket timeout. Teardown now enumerates descendants and group-kills with TERM->poll->KILL escalation: each half covers what the other misses (script(1) children escape the group into new sessions; init-reparented members escape enumeration). The recorder gets INT->poll->KILL so recordings finalize. A stray sweep on the session's run-*.sh argv closes pre-fix sessions whose metas have empty CAGE_PID, guarded by pid_is_self_or_ancestor so it never kills the caller. wf-recorder is required up front when recording so a missing binary fails before cage starts.
markattar-factory
approved these changes
Jul 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
tctl closeon a recorded true-input session killed only the wf-recorder and deleted the session state dir, leaving the entire cage/ghostty/script/CLI process tree running orphaned; one dev host had accumulated 9 stale compositors and 48 leaked temp workspace dirs. Root cause:launch_true_inputpersisted session meta withCAGE_PID=""before starting cage, and for--recordlaunches (the apps/cli true-input harness always records)start_true_input_recording'sload_metaclobbered the in-memory PID back to empty, socmd_closeskipped the compositor kill entirely. Restoring the PID alone would not fix teardown either:script(1)moves its children into new sessions, so even a correctkill $CAGE_PIDleft ghostty/script/CLI behind.This PR persists
CAGE_PIDimmediately after the wayland socket wait (before the recording branch can reload the meta), launches cage undersetsid, and replaces the bare kill with a full-stack teardown: descendant enumeration plus process-group kill with TERM->poll->KILL escalation, INT->poll->KILL for the recorder so recordings finalize, and a stray sweep on the session'srun-*.shargv that also closes pre-fix sessions whose metas carry an emptyCAGE_PID.Related Issue
Closes CLI-1171
Reviewer Guide
Read order:
plugins/droid-control/bin/tctlonly --launch_true_input(PID persistence, setsid, socket-timeout cleanup) >terminate_true_input_stack+collect_tree_pids>terminate_session_strays+pid_is_self_or_ancestor>cmd_close.Review depth: Standard -- signal-sending code with a self-kill guard deserves a careful read, but it is one file with no external contract.
Open for pushback: the fixed grace budgets (2s tree, 3s recorder, 1s per stray) mean
closecan block a few seconds worst case; chosen over a tunable to keep the CLI surface unchanged.Risk & Impact
The new paths send signals, so mis-scoping would kill unrelated processes. Containment: the group kill targets only cage's own setsid group; enumeration walks descendants of the recorded cage PID; the stray sweep matches only the session's
run-*.shrunner argv (not the bare session dir path, which an inspecting shell or editor can carry) and skips self/ancestors, so acloseinvoked from inside a session cannot kill its caller.closenow blocks up to the escalation deadlines instead of returning instantly.Verification
Behavior verified. verified @ adff9fa on a host with cage/ghostty/wtype/grim/wf-recorder: recorded launch -> type -> close leaves zero processes referencing the session, session and runtime dirs removed, mp4 finalized (ffprobe duration 1.83s); a session whose meta was hand-edited to the pre-fix
CAGE_PID=""shape (5 live processes) sweeps clean on close; tuistory launch/snapshot/close unaffected. Standalone record start/stop and the self/ancestor guard (a close whose caller argv contains the runner path survives) were verified on the same host during fix development.Regression coverage. None automated: the repo has no shell test harness (CI's
check-skills.ymlvalidates skill metadata only). The manual matrix above plus the Repro Recipe stand in.Not tested. The KILL-escalation branch under a process that traps and ignores TERM -- genuinely contrived to set up; the branch only runs when the grace poll expires.
Standard validators.
bash -nclean; shellcheck not installed on the verification host; no formatter/linter configured in this repo.Repro Recipe
On master, a recorded session leaks its whole tree; on this branch the same steps leave nothing: