# Session Persistence (resurrect + resume) — Design **Date:** 2026-06-15 **Status:** Approved, ready for implementation plan ## Goal Let a workspace survive both GUI loss and full power loss. Closing a tab or the whole GUI already keeps agents running (daemon owns PTYs, reattach via live grid snapshot — M1). This design adds the missing half: after the daemon itself dies (reboot, battery death, `kill -9`), the user can bring panels back — panels show their last on-screen state and offer a one-click **Resume** that restarts the agent with its session-continue flag (e.g. `claude --continue`). ## Scope decisions (locked) - **Reboot behavior:** resurrect + resume. A live process cannot survive a power-off — not even tmux does that. After a daemon restart we respawn the panel from its persisted spec (cwd intact) and, for agents that support it, relaunch with a resume flag so the conversation continues in a *new* process. - **Resurrect trigger:** manual, per-panel. After a daemon restart panels are shown stopped with their last screen; nothing spawns until the user clicks. This avoids surprise token burn from auto-launching many agents. - **Persisted scrollback:** visible screen only. We reuse the existing `snapshot_ansi()` serializer (the same one that powers live reattach) and write its output to disk. No scrollback history beyond the visible grid. - **Resume command source:** a `[resume]` table in `~/.spacesh/config.toml` mapping a command basename to resume args, merged over built-in defaults. - **Snapshot cadence:** periodic + shutdown. A background task dumps changed grids every N seconds (default 5), plus a full pass on graceful shutdown and a final dump when an actor exits. This survives `kill -9` / battery (you lose at most N seconds of the last screen). ## What already exists (do not rebuild) - `state.json` (via `JsonStateStore` + debounced `Persister`) already persists structure: groups, workspaces, layout, zoom, pinned, and per-surface `SurfaceSpec` (`command`, `args`, `cwd`, `cols`, `rows`, `agent_label`, `autostart`). On cold start `Registry::restore()` loads this; the `live` actor map is empty, so every surface is "stopped" (spec present, no process). - `SurfaceView.running: bool` already tells the client a surface is stopped. - `spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot` serializes the visible grid to an ANSI dump (`ansi`, `cols`, `rows`, `cursor_row`, `cursor_col`). `Snapshot` currently derives `Serialize` only. - The surface actor already answers `SurfaceMsg::AttachSnapshot` by calling `snapshot_ansi(&grid)`; the grid is the authoritative screen model. ## Components ### 1. Snapshot store — `crates/spaceshd/src/snapshot_store.rs` (new) Per-surface JSON file `~/.spacesh/snapshots/.json` holding the serialized visible-screen snapshot. Atomic write (temp file → `sync_all` → rename), mirroring `state_store::JsonStateStore`. ```rust pub trait SnapshotStore: Send + Sync { fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>; fn load(&self, sid: &SurfaceId) -> Option; fn remove(&self, sid: &SurfaceId); } ``` The store persists the core `spacesh_core::snapshot::Snapshot` directly (`ansi`, `cols`, `rows`, `cursor_row`, `cursor_col`) — `spaceshd` already depends on `spacesh-core`, so no separate daemon record type is introduced. A corrupt/missing file yields `None` (never an error that blocks resurrect). `remove` deletes the file and is called when a surface is closed or removed from the tree. ### 2. On-demand snapshot from the actor — `crates/spaceshd/src/surface.rs` Add a message that returns the current snapshot without subscribing: ```rust SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty) ``` The actor tracks a `dirty` flag: set inside `flush` whenever bytes are fed into the grid (`grid.feed`), cleared when a `Snapshot` reply is produced. The bool lets the periodic dumper skip unchanged grids. On actor exit (after `pty.wait()`), the actor takes a final `snapshot_ansi` and forwards `(id, snapshot)` to the writer channel (a cloned `mpsc::UnboundedSender<(SurfaceId, Snapshot)>` passed into the actor), so the last screen of a finished process is persisted even between ticker ticks. ### 3. Writer task + periodic ticker — `crates/spaceshd/src/server.rs` / `main.rs` - **Writer task:** the sole owner of `Arc`. Receives `(SurfaceId, Snapshot)` on an unbounded channel and writes to disk. Keeps all snapshot disk I/O off the actor/PTY hot path and serializes writes. - **Periodic ticker:** every `snapshot_interval_secs` (config, default 5) the router iterates live surface handles, sends `SurfaceMsg::Snapshot`, awaits the reply, and forwards to the writer channel only when `dirty` is true. - **Graceful shutdown:** before the daemon exits it does one final synchronous pass over all live surfaces into the writer, then flushes the writer. ### 4. Resume config — `crates/spaceshd/src/config.rs` ```toml [resume] commands = { claude = ["--continue"], codex = ["resume"] } ``` ```rust #[derive(Debug, Clone, Default, Deserialize, Serialize)] pub struct ResumeConfig { #[serde(default)] pub commands: std::collections::HashMap>, } ``` Added to `Config` as `#[serde(default)] pub resume: ResumeConfig`. A method `resume_args(command: &str) -> Option>` resolves by command basename: user map first, then a built-in default table (`claude → ["--continue"]`, `codex → ["resume"]`), then `None`. The default table is a `const`/static, not inline literals in branching logic. ### 5. Protocol — `crates/spacesh-proto/src/message.rs` The codebase already has `Cmd::RestartSurface { surface_id }` (starts a stopped surface from its spec, guarded by `is_running`) and an `Attach` response that already carries `{ snapshot, cols, rows, cursor_row, cursor_col, stopped }`. So no new command or wire type is needed beyond one field: - Extend `Cmd::RestartSurface` with `#[serde(default)] resume: bool`. `resume = true` builds `command + resume_args(command)` (falling back to the original args when no mapping exists); `resume = false` keeps the original `command + args` (today's behavior). The `#[serde(default)]` keeps old frames decoding to `resume = false`. - No `GetSnapshot`, no `StartSurface`, no `SnapshotView`: a stopped-panel `Attach` returns the **disk** snapshot (see §6) using the existing response shape. `spacesh-core::snapshot::Snapshot` gains `Deserialize` (alongside `Serialize`) so the store can load it back from disk. ### 6. Server handlers — `crates/spaceshd/src/server.rs` - `RestartSurface { surface_id, resume }`: unchanged flow (spec lookup, `spawn_from_spec`, `set_live`, `SurfaceRestarted` broadcast). When `resume`, spawn with a spec whose `args` are replaced by `config.resume_args(command)` (when present); otherwise spawn the original spec. - `Attach` for a **stopped** surface: instead of returning the empty `{ snapshot: "", stopped: true }`, load the disk snapshot via the snapshot store and return `{ snapshot: , cols, rows, cursor_row, cursor_col, stopped: true }`. Missing file → empty snapshot, still `stopped: true`. - Surface close/remove (`Close`, `CloseWorkspace`, `remove_surface` paths): send a remove to the snapshot writer so stale `.json` files do not accumulate. ### 7. App — `app/src` and `app/src-tauri` - `socketBridge.ts`: `restartSurface(id, resume = false)` gains the `resume` arg; `AttachResult` gains optional `cursor_row`/`cursor_col`/`stopped`. - `app/src-tauri/src/bridge.rs`: `restart_surface` forwards a `resume: bool` arg into `Cmd::RestartSurface`. - `LayoutEngine.tsx` stopped branch (`running[id] === false`): paint the disk snapshot into a dimmed, read-only `xterm` behind the controls, and offer two buttons — **Resume** → `restartSurface(id, true)` and **Restart fresh** → `restartSurface(id, false)`. On success the daemon's `workspace_changed` flips `running` to true, the overlay unmounts, and the live `TerminalView` mounts. ## Data flow ``` running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ .json running surface ──(on exit)─────────────────────────▶ writer task ──▶ .json daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ .json reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped client ▶ Attach(sid) [stopped] ▶ disk snapshot ▶ paint dimmed read-only screen user clicks Resume ▶ RestartSurface{resume:true} ▶ spawn(command + resume_args, cwd) ▶ SurfaceRestarted + running=true ▶ live TerminalView mounts ``` ## Error handling - Missing/corrupt snapshot file → stopped `Attach` returns an empty snapshot; the overlay shows an empty dimmed panel with the Resume/Restart controls. - `RestartSurface` on an already-running surface → no-op ok (existing `is_running` guard); unknown surface → `NOT_FOUND`. - Resume command for an agent without a mapping → falls back to the original spec args (plain restart), never fails the spawn. - Writer task failure to write one file is logged and dropped; it must not stall the daemon or other surfaces. ## Performance - A visible-screen snapshot is ≈ rows × cols bytes of ANSI; at a 5s cadence with the `dirty` debounce, idle panels write nothing. All disk writes happen in the single writer task, off the PTY/actor hot path, so the keypress→echo (<16 ms) and output-batching budgets are untouched. ## Testing - **snapshot_store:** save→load round-trip; atomic write; missing file → `None`; corrupt file → `None`; `remove` deletes the file. - **config:** parse `[resume]` table; `resume_args` returns user override, then built-in default, then `None`; missing section defaults cleanly. - **surface actor:** `SurfaceMsg::Snapshot` returns the current grid contents; `dirty` is true after output and false immediately after a snapshot. - **server:** `RestartSurface{resume:true}` spawns with `command + resume_args`; `{resume:false}` spawns with `command + args`; stopped `Attach` returns the saved disk snapshot; `is_running` guard prevents a second actor. - **registry:** starting a stopped surface re-populates the live map and the view flips `running` to true. ## Out of scope - Resuming the literal in-flight process across power loss (impossible). - Scrollback history beyond the visible screen. - Auto-resume on daemon start (manual trigger chosen). - Per-surface resume command stored in the spec/wizard (config map chosen).