From 3d54d679d340e8595aef102f106c49d0be96caec Mon Sep 17 00:00:00 2001
From: Vassiliy Yegorov
Date: Mon, 15 Jun 2026 15:05:21 +0700
Subject: [PATCH] docs: session persistence (resurrect + resume) design spec

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../2026-06-15-session-persistence-design.md  | 216 ++++++++++++++++++
 1 file changed, 216 insertions(+)
 create mode 100644 DOCS/superpowers/specs/2026-06-15-session-persistence-design.md

diff --git a/DOCS/superpowers/specs/2026-06-15-session-persistence-design.md b/DOCS/superpowers/specs/2026-06-15-session-persistence-design.md
new file mode 100644
index 0000000..bfa2413
--- /dev/null
+++ b/DOCS/superpowers/specs/2026-06-15-session-persistence-design.md
@@ -0,0 +1,216 @@
+# Session Persistence (resurrect + resume) — Design
+
+**Date:** 2026-06-15
+**Status:** Approved, ready for implementation plan
+
+## Goal
+
+Let a workspace survive both GUI loss and full power loss. Closing a tab or
+the whole GUI already keeps agents running (daemon owns PTYs, reattach via live
+grid snapshot — M1). This design adds the missing half: after the daemon itself
+dies (reboot, battery death, `kill -9`), the user can bring panels back —
+panels show their last on-screen state and offer a one-click **Resume** that
+restarts the agent with its session-continue flag (e.g. `claude --continue`).
+
+## Scope decisions (locked)
+
+- **Reboot behavior:** resurrect + resume. A live process cannot survive a
+  power-off — not even tmux does that. After a daemon restart we respawn the
+  panel from its persisted spec (cwd intact) and, for agents that support it,
+  relaunch with a resume flag so the conversation continues in a *new* process.
+- **Resurrect trigger:** manual, per-panel. After a daemon restart panels are
+  shown stopped with their last screen; nothing spawns until the user clicks.
+  This avoids surprise token burn from auto-launching many agents.
+- **Persisted scrollback:** visible screen only. We reuse the existing
+  `snapshot_ansi()` serializer (the same one that powers live reattach) and
+  write its output to disk. No scrollback history beyond the visible grid.
+- **Resume command source:** a `[resume]` table in `~/.spacesh/config.toml`
+  mapping a command basename to resume args, merged over built-in defaults.
+- **Snapshot cadence:** periodic + shutdown. A background task dumps changed
+  grids every N seconds (default 5), plus a full pass on graceful shutdown and
+  a final dump when an actor exits. This survives `kill -9` / battery (you lose
+  at most N seconds of the last screen).
+
+## What already exists (do not rebuild)
+
+- `state.json` (via `JsonStateStore` + debounced `Persister`) already persists
+  structure: groups, workspaces, layout, zoom, pinned, and per-surface
+  `SurfaceSpec` (`command`, `args`, `cwd`, `cols`, `rows`, `agent_label`,
+  `autostart`). On cold start `Registry::restore()` loads this; the `live`
+  actor map is empty, so every surface is "stopped" (spec present, no process).
+- `SurfaceView.running: bool` already tells the client a surface is stopped.
+- `spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot` serializes
+  the visible grid to an ANSI dump (`ansi`, `cols`, `rows`, `cursor_row`,
+  `cursor_col`). `Snapshot` currently derives `Serialize` only.
+- The surface actor already answers `SurfaceMsg::AttachSnapshot` by calling
+  `snapshot_ansi(&grid)`; the grid is the authoritative screen model.
+
+## Components
+
+### 1. Snapshot store — `crates/spaceshd/src/snapshot_store.rs` (new)
+
+Per-surface JSON file `~/.spacesh/snapshots/.json` holding the
+serialized visible-screen snapshot. Atomic write (temp file → `sync_all` →
+rename), mirroring `state_store::JsonStateStore`.
+
+```rust
+pub trait SnapshotStore: Send + Sync {
+    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
+    fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
+    fn remove(&self, sid: &SurfaceId);
+}
+```
+
+The store persists the core `spacesh_core::snapshot::Snapshot` directly
+(`ansi`, `cols`, `rows`, `cursor_row`, `cursor_col`) — `spaceshd` already
+depends on `spacesh-core`, so no separate daemon record type is introduced. A
+corrupt/missing file yields `None` (never an error that blocks resurrect).
+`remove` deletes the file and is called when a surface is closed or removed
+from the tree.
+
+### 2. On-demand snapshot from the actor — `crates/spaceshd/src/surface.rs`
+
+Add a message that returns the current snapshot without subscribing:
+
+```rust
+SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)
+```
+
+The actor tracks a `dirty` flag: set inside `flush` whenever bytes are fed into
+the grid (`grid.feed`), cleared when a `Snapshot` reply is produced. The bool
+lets the periodic dumper skip unchanged grids.
+
+On actor exit (after `pty.wait()`), the actor takes a final `snapshot_ansi`
+and forwards `(id, snapshot)` to the writer channel (a cloned
+`mpsc::UnboundedSender<(SurfaceId, Snapshot)>` passed into the actor), so the
+last screen of a finished process is persisted even between ticker ticks.
+
+### 3. Writer task + periodic ticker — `crates/spaceshd/src/server.rs` / `main.rs`
+
+- **Writer task:** the sole owner of `Arc<dyn SnapshotStore>`. Receives
+  `(SurfaceId, Snapshot)` on an unbounded channel and writes to disk. Keeps all
+  snapshot disk I/O off the actor/PTY hot path and serializes writes.
+- **Periodic ticker:** every `snapshot_interval_secs` (config, default 5) the
+  router iterates live surface handles, sends `SurfaceMsg::Snapshot`, awaits the
+  reply, and forwards to the writer channel only when `dirty` is true.
+- **Graceful shutdown:** before the daemon exits it does one final synchronous
+  pass over all live surfaces into the writer, then flushes the writer.
+
+### 4. Resume config — `crates/spaceshd/src/config.rs`
+
+```toml
+[resume]
+commands = { claude = ["--continue"], codex = ["resume"] }
+```
+
+```rust
+#[derive(Debug, Clone, Default, Deserialize, Serialize)]
+pub struct ResumeConfig {
+    #[serde(default)]
+    pub commands: std::collections::HashMap<String, Vec<String>>,
+}
+```
+
+Added to `Config` as `#[serde(default)] pub resume: ResumeConfig`. A method
+`resume_args(command: &str) -> Option<Vec<String>>` resolves by command
+basename: user map first, then a built-in default table
+(`claude → ["--continue"]`, `codex → ["resume"]`), then `None`. The default
+table is a `const`/static, not inline literals in branching logic.
+
+### 5. Protocol — `crates/spacesh-proto/src/message.rs`
+
+- `Cmd::StartSurface { surface_id: SurfaceId, resume: bool }` — start a stopped
+  surface. `resume = true` builds `command + resume_args(command)` (falling
+  back to the original args when no resume mapping exists); `resume = false`
+  builds the original `command + args`. cwd and geometry come from the spec.
+- `Cmd::GetSnapshot { surface_id: SurfaceId }` → response carries
+  `Option<SnapshotView>`.
+- `SnapshotView { ansi, cols, rows, cursor_row, cursor_col }` — a proto-level
+  mirror of core `Snapshot`, so `spacesh-proto` does not depend on
+  `spacesh-core`. The daemon converts core `Snapshot` into `SnapshotView` at
+  the protocol boundary.
+
+`spacesh-core::snapshot::Snapshot` gains `Deserialize` (alongside `Serialize`)
+so it can be loaded back from disk into `SnapshotRecord` conversions in tests
+and the store.
+
+### 6. Server handlers — `crates/spaceshd/src/server.rs`
+
+- `StartSurface`: look up the spec; if missing → error response. Build a
+  `SpawnSpec` with resume-or-plain args, the spec's cwd, and current geometry;
+  `spawn_surface_deferred(...)`; `registry.set_live(handle)`; broadcast
+  `workspace_changed` so all clients flip `running` to true.
+- `GetSnapshot`: read from the snapshot store, convert to `SnapshotView`,
+  return `Option<SnapshotView>`.
+- On surface close/remove: call `snapshot_store.remove(sid)` (via the writer or
+  a direct store handle) so stale files do not accumulate.
+
+### 7. App — `app/src` and `app/src-tauri`
+
+- `socketBridge.ts`: `startSurface(id, resume)`, `getSnapshot(id)`, and a
+  `SnapshotView` type.
+- `app/src-tauri/src/bridge.rs`: `start_surface` and `get_snapshot` invoke
+  handlers forwarding to the daemon, wired into the Tauri `invoke_handler` and
+  the JS bridge.
+- `LayoutEngine.tsx` / `TerminalView.tsx`: when a surface's `running === false`,
+  render a stopped overlay instead of a live terminal:
+  - fetch `getSnapshot(id)` and paint the ANSI into a read-only, dimmed
+    `xterm` instance for visual context;
+  - centered controls: **Resume** → `startSurface(id, true)`,
+    **Restart fresh** → `startSurface(id, false)`;
+  - on success the daemon's `workspace_changed` sets `running = true`, the
+    overlay unmounts, and the normal live `TerminalView` mounts.
+  - a small "stopped" indicator in the panel header.
+
+## Data flow
+
+```
+running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ .json
+running surface ──(on exit)─────────────────────────▶ writer task ──▶ .json
+daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ .json
+
+reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
+client ▶ GetSnapshot(sid) ▶ paint dimmed read-only screen + Resume/Restart
+user clicks Resume ▶ StartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
+                   ▶ workspace_changed(running=true) ▶ live TerminalView mounts
+```
+
+## Error handling
+
+- Missing/corrupt snapshot file → `GetSnapshot` returns `None`; the overlay
+  shows an empty dimmed panel with the Resume/Restart controls (still usable).
+- `StartSurface` on an unknown/already-running surface → error response; client
+  ignores or surfaces a toast. No duplicate actor: guard on
+  `registry.is_running(sid)`.
+- Resume command for an agent without a mapping → falls back to the original
+  spec args (plain restart), never fails the spawn.
+- Writer task failure to write one file is logged and dropped; it must not stall
+  the daemon or other surfaces.
+
+## Performance
+
+- A visible-screen snapshot is ≈ rows × cols bytes of ANSI; at a 5s cadence with
+  the `dirty` debounce, idle panels write nothing. All disk writes happen in the
+  single writer task, off the PTY/actor hot path, so the keypress→echo (<16 ms)
+  and output-batching budgets are untouched.
+
+## Testing
+
+- **snapshot_store:** save→load round-trip; atomic write; missing file → `None`;
+  corrupt file → `None`; `remove` deletes the file.
+- **config:** parse `[resume]` table; `resume_args` returns user override, then
+  built-in default, then `None`; missing section defaults cleanly.
+- **surface actor:** `SurfaceMsg::Snapshot` returns the current grid contents;
+  `dirty` is true after output and false immediately after a snapshot.
+- **server:** `StartSurface{resume:true}` builds `command + resume_args`;
+  `{resume:false}` builds `command + args`; `GetSnapshot` returns the saved
+  view; `is_running` guard prevents a second actor.
+- **registry:** starting a stopped surface re-populates the live map and the
+  view flips `running` to true.
+
+## Out of scope
+
+- Resuming the literal in-flight process across power loss (impossible).
+- Scrollback history beyond the visible screen.
+- Auto-resume on daemon start (manual trigger chosen).
+- Per-surface resume command stored in the spec/wizard (config map chosen).
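
The resume-argument resolution from section 4 and the fallback rule from section 5 could look roughly like the sketch below. It is not part of the patch and only illustrates the described lookup order (user map, then built-in defaults, then `None`). `DEFAULT_RESUME` and `start_args` are hypothetical names, and `ResumeConfig` is reduced here to a plain struct without the serde derives so the example needs only the standard library.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults named in the spec, kept as a static table
/// (the constant's name is an assumption for this sketch).
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

/// Simplified stand-in for the spec's `ResumeConfig` (no serde derives here).
#[derive(Debug, Clone, Default)]
pub struct ResumeConfig {
    pub commands: HashMap<String, Vec<String>>,
}

impl ResumeConfig {
    /// Resolve resume args by command basename:
    /// user map first, then the built-in table, then `None`.
    pub fn resume_args(&self, command: &str) -> Option<Vec<String>> {
        // "/usr/local/bin/claude" and "claude" both resolve to "claude".
        let base = Path::new(command).file_name()?.to_str()?;
        if let Some(args) = self.commands.get(base) {
            return Some(args.clone());
        }
        DEFAULT_RESUME
            .iter()
            .find(|(name, _)| *name == base)
            .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
    }
}

/// Hypothetical helper for the `StartSurface` handler: choose the argument
/// list, falling back to the original spec args when no mapping exists.
pub fn start_args(
    cfg: &ResumeConfig,
    command: &str,
    original: &[String],
    resume: bool,
) -> Vec<String> {
    if resume {
        if let Some(args) = cfg.resume_args(command) {
            return args;
        }
    }
    original.to_vec()
}
```

Keeping the fallback inside one helper means a resume request for an unmapped command degrades to a plain restart instead of failing the spawn, which matches the error-handling rule in the spec.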