Session Persistence (resurrect + resume) — Design

Date: 2026-06-15
Status: Approved, ready for implementation plan

Goal

Let a workspace survive both GUI loss and full power loss. Closing a tab or the whole GUI already keeps agents running (daemon owns PTYs, reattach via live grid snapshot — M1). This design adds the missing half: after the daemon itself dies (reboot, battery death, kill -9), the user can bring panels back — panels show their last on-screen state and offer a one-click Resume that restarts the agent with its session-continue flag (e.g. claude --continue).

Scope decisions (locked)

  • Reboot behavior: resurrect + resume. A live process cannot survive a power-off — not even tmux does that. After a daemon restart we respawn the panel from its persisted spec (cwd intact) and, for agents that support it, relaunch with a resume flag so the conversation continues in a new process.
  • Resurrect trigger: manual, per-panel. After a daemon restart panels are shown stopped with their last screen; nothing spawns until the user clicks. This avoids surprise token burn from auto-launching many agents.
  • Persisted scrollback: visible screen only. We reuse the existing snapshot_ansi() serializer (the same one that powers live reattach) and write its output to disk. No scrollback history beyond the visible grid.
  • Resume command source: a [resume] table in ~/.spacesh/config.toml mapping a command basename to resume args, merged over built-in defaults.
  • Snapshot cadence: periodic + shutdown. A background task dumps changed grids every N seconds (default 5), plus a full pass on graceful shutdown and a final dump when an actor exits. The periodic dump is what covers kill -9 and battery death: at most the last N seconds of screen changes are lost.

What already exists (do not rebuild)

  • state.json (via JsonStateStore + debounced Persister) already persists structure: groups, workspaces, layout, zoom, pinned, and per-surface SurfaceSpec (command, args, cwd, cols, rows, agent_label, autostart). On cold start Registry::restore() loads this; the live actor map is empty, so every surface is "stopped" (spec present, no process).
  • SurfaceView.running: bool already tells the client a surface is stopped.
  • spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot serializes the visible grid to an ANSI dump (ansi, cols, rows, cursor_row, cursor_col). Snapshot currently derives Serialize only.
  • The surface actor already answers SurfaceMsg::AttachSnapshot by calling snapshot_ansi(&grid); the grid is the authoritative screen model.

Components

1. Snapshot store — crates/spaceshd/src/snapshot_store.rs (new)

Per-surface JSON file ~/.spacesh/snapshots/<surface_id>.json holding the serialized visible-screen snapshot. Atomic write (temp file → sync_all → rename), mirroring state_store::JsonStateStore.

pub trait SnapshotStore: Send + Sync {
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
    fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
    fn remove(&self, sid: &SurfaceId);
}

The store persists the core spacesh_core::snapshot::Snapshot directly (ansi, cols, rows, cursor_row, cursor_col) — spaceshd already depends on spacesh-core, so no separate daemon record type is introduced. A corrupt/missing file yields None (never an error that blocks resurrect). remove deletes the file and is called when a surface is closed or removed from the tree.
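The atomic-write sequence (temp file → sync_all → rename) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the daemon's code: the names atomic_write, load_raw, remove_snapshot and snapshot_path are hypothetical, the payload is treated as opaque bytes, and the JSON encode/decode of Snapshot is left out, so only an I/O failure (not a parse failure) maps to None here.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: `<dir>/<surface_id>.json`.
pub fn snapshot_path(dir: &Path, surface_id: &str) -> PathBuf {
    dir.join(format!("{}.json", surface_id))
}

/// Write `bytes` to `path` so a reader sees either the old file or the
/// complete new one, never a half-written file.
pub fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // "abc.json" -> "abc.json.tmp", in the same directory so the
    // rename stays on one filesystem.
    let tmp = path.with_extension("json.tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(bytes)?;
        f.sync_all()?; // flush file data to disk before publishing it
    }
    fs::rename(&tmp, path)
}

/// Load the raw payload; a missing or unreadable file yields None
/// instead of an error, so resurrect is never blocked.
pub fn load_raw(path: &Path) -> Option<String> {
    fs::read_to_string(path).ok()
}

/// Best-effort delete, used when a surface is closed or removed.
pub fn remove_snapshot(path: &Path) {
    let _ = fs::remove_file(path);
}
```

A real SnapshotStore implementation would wrap these helpers and add the serde step; a decode error there would be mapped to None in the same way as a missing file.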

2. On-demand snapshot from the actor — crates/spaceshd/src/surface.rs

Add a message that returns the current snapshot without subscribing:

SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)

The actor tracks a dirty flag: set inside flush whenever bytes are fed into the grid (grid.feed), cleared when a Snapshot reply is produced. The bool lets the periodic dumper skip unchanged grids.
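The dirty-flag bookkeeping described above is small enough to show in isolation. The sketch below uses stand-in types (ToyGrid, ToySnapshot and ActorState are hypothetical and replace GridSurface, Snapshot and the real actor), and it models the message handler as a plain method rather than a oneshot reply.

```rust
/// Stand-in for the real grid: it only accumulates the bytes fed to it.
#[derive(Default)]
pub struct ToyGrid {
    contents: Vec<u8>,
}

/// Stand-in for the real Snapshot type.
#[derive(Debug, Clone, PartialEq)]
pub struct ToySnapshot {
    pub ansi: String,
}

#[derive(Default)]
pub struct ActorState {
    grid: ToyGrid,
    dirty: bool,
}

impl ActorState {
    /// Flush path: feeding bytes into the grid marks it dirty.
    pub fn flush(&mut self, bytes: &[u8]) {
        self.grid.contents.extend_from_slice(bytes);
        self.dirty = true;
    }

    /// Snapshot request: return (snapshot, dirty) and clear the flag,
    /// so the next request reports false unless new output arrived.
    pub fn snapshot(&mut self) -> (ToySnapshot, bool) {
        let snap = ToySnapshot {
            ansi: String::from_utf8_lossy(&self.grid.contents).into_owned(),
        };
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (snap, was_dirty)
    }
}
```

Clearing the flag at reply time means a snapshot that was handed out but never written (for example because the write failed) is not retried until new output arrives; under this design that is accepted as part of the "at most N seconds lost" budget.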

On actor exit (after pty.wait()), the actor takes a final snapshot_ansi and forwards (id, snapshot) to the writer channel (a cloned mpsc::UnboundedSender<(SurfaceId, Snapshot)> passed into the actor), so the last screen of a finished process is persisted even between ticker ticks.

3. Writer task + periodic ticker — crates/spaceshd/src/server.rs / main.rs

  • Writer task: the sole owner of Arc<dyn SnapshotStore>. Receives (SurfaceId, Snapshot) on an unbounded channel and writes to disk. Keeps all snapshot disk I/O off the actor/PTY hot path and serializes writes.
  • Periodic ticker: every snapshot_interval_secs (config, default 5) the router iterates live surface handles, sends SurfaceMsg::Snapshot, awaits the reply, and forwards to the writer channel only when dirty is true.
  • Graceful shutdown: before the daemon exits it does one final synchronous pass over all live surfaces into the writer, then flushes the writer.
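The writer/ticker split can be illustrated with std threads and channels standing in for the async runtime. Everything named here is an assumption made for the sketch: Store is a cut-down version of the store trait, MemStore is an in-memory stand-in, SurfaceId and Snapshot are aliased to String, and tick takes precomputed (id, snapshot, dirty) triples instead of sending SurfaceMsg::Snapshot to each actor.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

pub type SurfaceId = String; // stand-in for the real id type
pub type Snapshot = String; // stand-in for the real snapshot type

/// Cut-down store trait for the sketch.
pub trait Store: Send + Sync {
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> Result<(), String>;
}

/// In-memory stand-in for the on-disk store.
#[derive(Default)]
pub struct MemStore {
    pub saved: Mutex<HashMap<SurfaceId, Snapshot>>,
}

impl Store for MemStore {
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> Result<(), String> {
        self.saved.lock().unwrap().insert(sid.clone(), snap.clone());
        Ok(())
    }
}

/// The single writer: drains the channel until every sender is dropped.
/// A failed write is logged and dropped so one bad file cannot stall
/// the other surfaces.
pub fn spawn_writer(
    store: Arc<dyn Store>,
    rx: Receiver<(SurfaceId, Snapshot)>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        for (sid, snap) in rx {
            if let Err(e) = store.save(&sid, &snap) {
                eprintln!("snapshot write failed for {}: {}", sid, e);
            }
        }
    })
}

/// One ticker pass: forward only dirty surfaces, then mark them clean.
/// Returns how many snapshots were queued.
pub fn tick(
    surfaces: &mut [(SurfaceId, Snapshot, bool)],
    tx: &Sender<(SurfaceId, Snapshot)>,
) -> usize {
    let mut sent = 0;
    for (sid, snap, dirty) in surfaces.iter_mut() {
        if *dirty {
            if tx.send((sid.clone(), snap.clone())).is_ok() {
                sent += 1;
            }
            *dirty = false;
        }
    }
    sent
}
```

In this model, "flushing the writer" at shutdown corresponds to dropping every sender and joining the writer thread, which returns only after the queue has been drained.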

4. Resume config — crates/spaceshd/src/config.rs

[resume]
commands = { claude = ["--continue"], codex = ["resume"] }

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct ResumeConfig {
    #[serde(default)]
    pub commands: std::collections::HashMap<String, Vec<String>>,
}

Added to Config as #[serde(default)] pub resume: ResumeConfig. A method resume_args(command: &str) -> Option<Vec<String>> resolves by command basename: user map first, then a built-in default table (claude → ["--continue"], codex → ["resume"]), then None. The default table is a const/static, not inline literals in branching logic.
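The lookup order (user map, then built-in defaults, then None) might look like the following. This is a sketch, not the final implementation: the serde derives are omitted so it compiles with the standard library only, resume_args is shown as a method on ResumeConfig rather than on Config, and the constant name DEFAULT_RESUME is an assumption.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults named in this design, kept in one static table
/// instead of being scattered through branching logic.
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

#[derive(Debug, Clone, Default)]
pub struct ResumeConfig {
    pub commands: HashMap<String, Vec<String>>,
}

impl ResumeConfig {
    /// Resolve resume args by command basename:
    /// user map first, then the built-in table, then None.
    pub fn resume_args(&self, command: &str) -> Option<Vec<String>> {
        // "/usr/local/bin/claude" -> "claude"
        let base = Path::new(command).file_name()?.to_str()?;
        if let Some(args) = self.commands.get(base) {
            return Some(args.clone());
        }
        DEFAULT_RESUME
            .iter()
            .find(|(name, _)| *name == base)
            .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
    }
}
```

Because the user map is consulted first, an entry in ~/.spacesh/config.toml replaces the default for that command rather than being appended to it.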

5. Protocol — crates/spacesh-proto/src/message.rs

The codebase already has Cmd::RestartSurface { surface_id } (starts a stopped surface from its spec, guarded by is_running) and an Attach response that already carries { snapshot, cols, rows, cursor_row, cursor_col, stopped }. So no new command or wire type is needed beyond one field:

  • Extend Cmd::RestartSurface with #[serde(default)] resume: bool. resume = true builds command + resume_args(command) (falling back to the original args when no mapping exists); resume = false keeps the original command + args (today's behavior). The #[serde(default)] keeps old frames decoding to resume = false.
  • No GetSnapshot, no StartSurface, no SnapshotView: a stopped-panel Attach returns the disk snapshot (see §6) using the existing response shape.

spacesh-core::snapshot::Snapshot gains Deserialize (alongside Serialize) so the store can load it back from disk.

6. Server handlers — crates/spaceshd/src/server.rs

  • RestartSurface { surface_id, resume }: unchanged flow (spec lookup, spawn_from_spec, set_live, SurfaceRestarted broadcast). When resume, spawn with a spec whose args are replaced by config.resume_args(command) (when present); otherwise spawn the original spec.
  • Attach for a stopped surface: instead of returning the empty { snapshot: "", stopped: true }, load the disk snapshot via the snapshot store and return { snapshot: <ansi>, cols, rows, cursor_row, cursor_col, stopped: true }. Missing file → empty snapshot, still stopped: true.
  • Surface close/remove (Close, CloseWorkspace, remove_surface paths): send a remove to the snapshot writer so stale <sid>.json files do not accumulate.
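The two handler decisions above (which args to spawn with, and what a stopped Attach returns) reduce to two pure functions. The sketch below is illustrative only: Spec, Snapshot and AttachReply are simplified stand-ins for SurfaceSpec, the core Snapshot and the existing Attach response, the u16 field types are assumptions, and the function names spec_for_restart and attach_stopped are hypothetical.

```rust
/// Simplified stand-in for SurfaceSpec.
#[derive(Debug, Clone, PartialEq)]
pub struct Spec {
    pub command: String,
    pub args: Vec<String>,
    pub cwd: String,
}

/// Simplified stand-in for the persisted snapshot.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Snapshot {
    pub ansi: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
}

/// Simplified stand-in for the existing Attach response shape.
#[derive(Debug, Clone, PartialEq)]
pub struct AttachReply {
    pub snapshot: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
    pub stopped: bool,
}

/// Pick the spec to spawn: resume args replace the original args only
/// when resume was requested AND a mapping exists; otherwise the
/// original spec is used unchanged (a plain restart).
pub fn spec_for_restart(spec: &Spec, resume: bool, resume_args: Option<Vec<String>>) -> Spec {
    match (resume, resume_args) {
        (true, Some(args)) => Spec { args, ..spec.clone() },
        _ => spec.clone(),
    }
}

/// Build the Attach reply for a stopped surface: a missing or corrupt
/// snapshot file degrades to an empty screen, still marked stopped.
pub fn attach_stopped(on_disk: Option<Snapshot>) -> AttachReply {
    let s = on_disk.unwrap_or_default();
    AttachReply {
        snapshot: s.ansi,
        cols: s.cols,
        rows: s.rows,
        cursor_row: s.cursor_row,
        cursor_col: s.cursor_col,
        stopped: true,
    }
}
```

Keeping both decisions free of I/O makes the fallback rules (no mapping, missing file) directly unit-testable without spawning a process.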

7. App — app/src and app/src-tauri

  • socketBridge.ts: restartSurface(id, resume = false) gains the resume arg; AttachResult gains optional cursor_row/cursor_col/stopped.
  • app/src-tauri/src/bridge.rs: restart_surface forwards a resume: bool arg into Cmd::RestartSurface.
  • LayoutEngine.tsx stopped branch (running[id] === false): paint the disk snapshot into a dimmed, read-only xterm behind the controls, and offer two buttons: Resume, which calls restartSurface(id, true), and Restart fresh, which calls restartSurface(id, false). On success the daemon's workspace_changed flips running to true, the overlay unmounts, and the live TerminalView mounts.

Data flow

running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ <sid>.json
running surface ──(on exit)─────────────────────────▶ writer task ──▶ <sid>.json
daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ <sid>.json

reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
client ▶ Attach(sid) [stopped] ▶ disk snapshot ▶ paint dimmed read-only screen
user clicks Resume ▶ RestartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
                  ▶ SurfaceRestarted + running=true ▶ live TerminalView mounts

Error handling

  • Missing/corrupt snapshot file → stopped Attach returns an empty snapshot; the overlay shows an empty dimmed panel with the Resume/Restart controls.
  • RestartSurface on an already-running surface → no-op ok (existing is_running guard); unknown surface → NOT_FOUND.
  • Resume command for an agent without a mapping → falls back to the original spec args (plain restart), never fails the spawn.
  • A failed write of one snapshot file is logged and dropped by the writer task; it must not stall the daemon or other surfaces.

Performance

  • A visible-screen snapshot is on the order of rows × cols bytes of ANSI (about 2 KB for an 80 × 24 grid, more when cells carry color or style escapes); at a 5s cadence with the dirty check, idle panels write nothing. All disk writes happen in the single writer task, off the PTY/actor hot path, so the keypress→echo (<16 ms) and output-batching budgets are untouched.

Testing

  • snapshot_store: save→load round-trip; atomic write; missing file → None; corrupt file → None; remove deletes the file.
  • config: parse [resume] table; resume_args returns user override, then built-in default, then None; missing section defaults cleanly.
  • surface actor: SurfaceMsg::Snapshot returns the current grid contents; dirty is true after output and false immediately after a snapshot.
  • server: RestartSurface{resume:true} spawns with command + resume_args; {resume:false} spawns with command + args; stopped Attach returns the saved disk snapshot; is_running guard prevents a second actor.
  • registry: starting a stopped surface re-populates the live map and the view flips running to true.

Out of scope

  • Resuming the literal in-flight process across power loss (impossible).
  • Scrollback history beyond the visible screen.
  • Auto-resume on daemon start (manual trigger chosen).
  • Per-surface resume command stored in the spec/wizard (config map chosen).