
docs: sync session-persistence spec to leaner RestartSurface-based design

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-15 15:20:02 +07:00


Session Persistence (resurrect + resume) — Design

Date: 2026-06-15
Status: Approved, ready for implementation plan

Goal

Let a workspace survive both GUI loss and full power loss. Closing a tab or the whole GUI already keeps agents running (daemon owns PTYs, reattach via live grid snapshot — M1). This design adds the missing half: after the daemon itself dies (reboot, battery death, kill -9), the user can bring panels back — panels show their last on-screen state and offer a one-click Resume that restarts the agent with its session-continue flag (e.g. claude --continue).

Scope decisions (locked)

Reboot behavior: resurrect + resume. A live process cannot survive a power-off — not even tmux does that. After a daemon restart we respawn the panel from its persisted spec (cwd intact) and, for agents that support it, relaunch with a resume flag so the conversation continues in a new process.
Resurrect trigger: manual, per-panel. After a daemon restart panels are shown stopped with their last screen; nothing spawns until the user clicks. This avoids surprise token burn from auto-launching many agents.
Persisted scrollback: visible screen only. We reuse the existing snapshot_ansi() serializer (the same one that powers live reattach) and write its output to disk. No scrollback history beyond the visible grid.
Resume command source: a [resume] table in ~/.spacesh/config.toml mapping a command basename to resume args, merged over built-in defaults.
Snapshot cadence: periodic + shutdown. A background task dumps changed grids every N seconds (default 5), plus a full pass on graceful shutdown and a final dump when an actor exits. The periodic dump is what covers kill -9 and battery death: at most the last N seconds of screen changes are lost.

What already exists (do not rebuild)

state.json (via JsonStateStore + debounced Persister) already persists structure: groups, workspaces, layout, zoom, pinned, and per-surface SurfaceSpec (command, args, cwd, cols, rows, agent_label, autostart). On cold start Registry::restore() loads this; the live actor map is empty, so every surface is "stopped" (spec present, no process).
SurfaceView.running: bool already tells the client a surface is stopped.
spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot serializes the visible grid to an ANSI dump (ansi, cols, rows, cursor_row, cursor_col). Snapshot currently derives Serialize only.
The surface actor already answers SurfaceMsg::AttachSnapshot by calling snapshot_ansi(&grid); the grid is the authoritative screen model.

Components

1. Snapshot store — `crates/spaceshd/src/snapshot_store.rs` (new)

Per-surface JSON file ~/.spacesh/snapshots/<surface_id>.json holding the serialized visible-screen snapshot. Atomic write (temp file → sync_all → rename), mirroring state_store::JsonStateStore.

pub trait SnapshotStore: Send + Sync {
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
    fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
    fn remove(&self, sid: &SurfaceId);
}

The store persists the core spacesh_core::snapshot::Snapshot directly (ansi, cols, rows, cursor_row, cursor_col) — spaceshd already depends on spacesh-core, so no separate daemon record type is introduced. A corrupt/missing file yields None (never an error that blocks resurrect). remove deletes the file and is called when a surface is closed or removed from the tree.
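The atomic-write sequence described above (temp file → sync_all → rename) can be sketched with the standard library alone. This is a minimal illustration, not the JsonStateStore code: `snapshot_path`, `atomic_write`, `load_raw` and `remove_quiet` are hypothetical names, the surface id is a plain string, and the payload is raw bytes rather than a serialized Snapshot.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps a surface id to `<dir>/<surface_id>.json`.
fn snapshot_path(dir: &Path, surface_id: &str) -> PathBuf {
    dir.join(format!("{surface_id}.json"))
}

/// Writes `bytes` to `path` via a sibling temp file, so a reader never
/// observes a half-written snapshot: temp file -> sync_all -> rename.
fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(bytes)?;
        f.sync_all()?; // push the data to disk before the rename publishes it
    }
    fs::rename(&tmp, path)
}

/// A missing or unreadable file yields None instead of an error,
/// so a bad snapshot can never block resurrect.
fn load_raw(path: &Path) -> Option<String> {
    fs::read_to_string(path).ok()
}

/// Removal is best-effort; a file that is already gone is not an error.
fn remove_quiet(path: &Path) {
    let _ = fs::remove_file(path);
}
```

A real store would additionally deserialize the loaded text and map a parse failure to None, matching the corrupt-file rule above.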

2. On-demand snapshot from the actor — `crates/spaceshd/src/surface.rs`

Add a message that returns the current snapshot without subscribing:

SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)

The actor tracks a dirty flag: set inside flush whenever bytes are fed into the grid (grid.feed), cleared when a Snapshot reply is produced. The bool lets the periodic dumper skip unchanged grids.
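The dirty-flag behaviour can be illustrated in isolation. The sketch below is an assumption-laden stand-in: `ScreenState` is a hypothetical type, a String replaces the real grid, and `feed`/`snapshot` only mirror the flush path and the SurfaceMsg::Snapshot reply described above.

```rust
/// Hypothetical stand-in for the actor's screen state; the real actor
/// holds a GridSurface rather than a String.
struct ScreenState {
    contents: String,
    dirty: bool,
}

impl ScreenState {
    fn new() -> Self {
        Self { contents: String::new(), dirty: false }
    }

    /// Mirrors the flush path: any bytes fed into the grid mark it dirty.
    fn feed(&mut self, bytes: &str) {
        if bytes.is_empty() {
            return; // nothing reached the grid, so nothing changed
        }
        self.contents.push_str(bytes);
        self.dirty = true;
    }

    /// Mirrors the Snapshot reply: returns (snapshot, dirty) and clears
    /// the flag in the same step.
    fn snapshot(&mut self) -> (String, bool) {
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (self.contents.clone(), was_dirty)
    }
}
```

One consequence of clearing the flag on reply: if anything other than the periodic dumper sends this message, the dumper may see `dirty == false` for a grid that did change, so the message is best treated as reserved for the dumper.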

On actor exit (after pty.wait()), the actor takes a final snapshot_ansi and forwards (id, snapshot) to the writer channel (a cloned mpsc::UnboundedSender<(SurfaceId, Snapshot)> passed into the actor), so the last screen of a finished process is persisted even between ticker ticks.

3. Writer task + periodic ticker — `crates/spaceshd/src/server.rs` / `main.rs`

Writer task: the sole owner of Arc<dyn SnapshotStore>. Receives (SurfaceId, Snapshot) on an unbounded channel and writes to disk. Keeps all snapshot disk I/O off the actor/PTY hot path and serializes writes.
Periodic ticker: every snapshot_interval_secs (config, default 5) the router iterates live surface handles, sends SurfaceMsg::Snapshot, awaits the reply, and forwards to the writer channel only when dirty is true.
Graceful shutdown: before the daemon exits it does one final synchronous pass over all live surfaces into the writer, then flushes the writer.
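A rough sketch of the writer/ticker split follows, using std threads and channels as a stand-in for the async runtime. Everything here is illustrative: `WriterMsg`, `spawn_writer` and `tick` are hypothetical names, surface ids and snapshots are plain strings, and an in-memory map replaces the on-disk store. The `Remove` variant is an assumption added so that surface removal can travel on the same queue; the channel as specified carries only (SurfaceId, Snapshot).

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical message type for the single writer.
enum WriterMsg {
    Save { id: String, ansi: String },
    Remove { id: String },
}

/// Spawns the writer, the only owner of the store. The join handle yields
/// the final store contents once every sender has been dropped.
fn spawn_writer() -> (mpsc::Sender<WriterMsg>, thread::JoinHandle<HashMap<String, String>>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut store = HashMap::new();
        // The loop drains the queue and ends when all senders are gone,
        // which doubles as the "flush" step on graceful shutdown.
        for msg in rx {
            match msg {
                WriterMsg::Save { id, ansi } => {
                    store.insert(id, ansi);
                }
                WriterMsg::Remove { id } => {
                    store.remove(&id);
                }
            }
        }
        store
    });
    (tx, handle)
}

/// One ticker pass over (id, snapshot, dirty) triples: only dirty
/// surfaces are forwarded. Returns how many were forwarded.
fn tick(surfaces: &[(&str, &str, bool)], tx: &mpsc::Sender<WriterMsg>) -> usize {
    let mut forwarded = 0;
    for (id, ansi, dirty) in surfaces {
        if *dirty {
            let msg = WriterMsg::Save { id: id.to_string(), ansi: ansi.to_string() };
            // A send error means the writer is gone; skip rather than stall.
            if tx.send(msg).is_ok() {
                forwarded += 1;
            }
        }
    }
    forwarded
}
```

Because one thread applies messages in arrival order, a Save followed by a Remove for the same surface cannot leave a stale file behind.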

4. Resume config — `crates/spaceshd/src/config.rs`

[resume]
commands = { claude = ["--continue"], codex = ["resume"] }

#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct ResumeConfig {
    #[serde(default)]
    pub commands: std::collections::HashMap<String, Vec<String>>,
}

Added to Config as #[serde(default)] pub resume: ResumeConfig. A method resume_args(command: &str) -> Option<Vec<String>> resolves by command basename: user map first, then a built-in default table (claude → ["--continue"], codex → ["resume"]), then None. The default table is a const/static, not inline literals in branching logic.
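The resolution order can be sketched as a free function. This is only an illustration under assumptions: the spec describes a method on the config, whereas this version takes the user map as a parameter, and `DEFAULT_RESUME` is a hypothetical name for the built-in table.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults from the spec, kept as a static table rather than
/// as literals inside branching logic.
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

/// Resolves resume args by command basename:
/// user map first, then the built-in table, then None.
fn resume_args(user: &HashMap<String, Vec<String>>, command: &str) -> Option<Vec<String>> {
    // "/usr/local/bin/claude" and "claude" both resolve to "claude".
    let base = Path::new(command).file_name()?.to_str()?;
    if let Some(args) = user.get(base) {
        return Some(args.clone());
    }
    DEFAULT_RESUME
        .iter()
        .find(|(name, _)| *name == base)
        .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
}
```

With an empty user map, `resume_args(&user, "/usr/local/bin/claude")` resolves through the built-in table, and a command with no entry in either place returns None.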

5. Protocol — `crates/spacesh-proto/src/message.rs`

The codebase already has Cmd::RestartSurface { surface_id } (starts a stopped surface from its spec, guarded by is_running) and an Attach response that already carries { snapshot, cols, rows, cursor_row, cursor_col, stopped }. So no new command or wire type is needed beyond one field:

Extend Cmd::RestartSurface with #[serde(default)] resume: bool. resume = true builds command + resume_args(command) (falling back to the original args when no mapping exists); resume = false keeps the original command + args (today's behavior). The #[serde(default)] keeps old frames decoding to resume = false.
No GetSnapshot, no StartSurface, no SnapshotView: a stopped-panel Attach returns the disk snapshot (see §6) using the existing response shape.

spacesh_core::snapshot::Snapshot gains Deserialize (alongside Serialize) so the store can load it back from disk.

6. Server handlers — `crates/spaceshd/src/server.rs`

RestartSurface { surface_id, resume }: unchanged flow (spec lookup, spawn_from_spec, set_live, SurfaceRestarted broadcast). When resume, spawn with a spec whose args are replaced by config.resume_args(command) (when present); otherwise spawn the original spec.
Attach for a stopped surface: instead of returning the empty { snapshot: "", stopped: true }, load the disk snapshot via the snapshot store and return { snapshot: <ansi>, cols, rows, cursor_row, cursor_col, stopped: true }. Missing file → empty snapshot, still stopped: true.
Surface close/remove (Close, CloseWorkspace, remove_surface paths): send a remove to the snapshot writer so stale <sid>.json files do not accumulate.
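The two decisions in the handlers above, which args to spawn with and what a stopped Attach returns, reduce to small pure functions. The sketch below uses hypothetical names (`restart_args`, `stopped_attach_snapshot`) and plain strings; it is not the server code.

```rust
/// Picks the args for a restart. `mapped` is the result of the resume
/// lookup for this command; the original args are used whenever resume
/// is off or no mapping exists, so the spawn itself never fails here.
fn restart_args(original: &[String], resume: bool, mapped: Option<Vec<String>>) -> Vec<String> {
    match (resume, mapped) {
        (true, Some(args)) => args, // replaces the original args
        _ => original.to_vec(),     // plain restart
    }
}

/// Snapshot text for a stopped Attach: the disk snapshot when one was
/// loaded, otherwise an empty string (the panel is still reported stopped).
fn stopped_attach_snapshot(from_disk: Option<String>) -> String {
    from_disk.unwrap_or_default()
}
```

As written in the handler description, the resume args replace the spec's args rather than being appended, so any flags in the original spec are not carried into a resumed process.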

7. App — `app/src` and `app/src-tauri`

socketBridge.ts: restartSurface(id, resume = false) gains the resume arg; AttachResult gains optional cursor_row/cursor_col/stopped.
app/src-tauri/src/bridge.rs: restart_surface forwards a resume: bool arg into Cmd::RestartSurface.
LayoutEngine.tsx stopped branch (running[id] === false): paint the disk snapshot into a dimmed, read-only xterm behind the controls, and offer two buttons — Resume → restartSurface(id, true) and Restart fresh → restartSurface(id, false). On success the daemon's workspace_changed flips running to true, the overlay unmounts, and the live TerminalView mounts.

Data flow

running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ <sid>.json
running surface ──(on exit)─────────────────────────▶ writer task ──▶ <sid>.json
daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ <sid>.json

reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
client ▶ Attach(sid) [stopped] ▶ disk snapshot ▶ paint dimmed read-only screen
user clicks Resume ▶ RestartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
                  ▶ SurfaceRestarted + running=true ▶ live TerminalView mounts

Error handling

Missing/corrupt snapshot file → stopped Attach returns an empty snapshot; the overlay shows an empty dimmed panel with the Resume/Restart controls.
RestartSurface on an already-running surface → a no-op that returns ok (existing is_running guard); unknown surface → NOT_FOUND.
Resume command for an agent without a mapping → falls back to the original spec args (plain restart), never fails the spawn.
Writer task failure to write one file is logged and dropped; it must not stall the daemon or other surfaces.

Performance

A visible-screen snapshot is on the order of rows × cols bytes of ANSI (more when cells carry color or attribute escape sequences); at a 5 s cadence with the dirty check, idle panels write nothing. All disk writes happen in the single writer task, off the PTY/actor hot path, so the keypress→echo (<16 ms) and output-batching budgets are untouched.
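For a feel of the numbers, the estimate can be turned into a small calculation. The helpers below are hypothetical and deliberately crude: they assume one byte per cell, ignore escape sequences and JSON overhead, and so give a floor rather than a measurement.

```rust
/// Floor estimate for one snapshot: one byte per visible cell.
fn min_snapshot_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols
}

/// Floor estimate of bytes written per second if every panel is dirty
/// on every tick. Returns None for a zero interval.
fn bytes_per_sec_all_dirty(panels: u64, rows: u64, cols: u64, interval_secs: u64) -> Option<u64> {
    (panels * min_snapshot_bytes(rows, cols)).checked_div(interval_secs)
}
```

Under these assumptions an 80 × 24 panel is 1,920 bytes per snapshot, and ten 120 × 40 panels that all change on every 5 s tick come to 9,600 bytes per second.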

Testing

snapshot_store: save→load round-trip; atomic write; missing file → None; corrupt file → None; remove deletes the file.
config: parse [resume] table; resume_args returns user override, then built-in default, then None; missing section defaults cleanly.
surface actor: SurfaceMsg::Snapshot returns the current grid contents; dirty is true after output and false immediately after a snapshot.
server: RestartSurface{resume:true} spawns with command + resume_args; {resume:false} spawns with command + args; stopped Attach returns the saved disk snapshot; is_running guard prevents a second actor.
registry: starting a stopped surface re-populates the live map and the view flips running to true.

Out of scope

Resuming the literal in-flight process across power loss (impossible).
Scrollback history beyond the visible screen.
Auto-resume on daemon start (manual trigger chosen).
Per-surface resume command stored in the spec/wizard (config map chosen).
