Session Persistence (resurrect + resume) — Design
Date: 2026-06-15
Status: Approved, ready for implementation plan
Goal
Let a workspace survive both GUI loss and full power loss. Closing a tab or
the whole GUI already keeps agents running (daemon owns PTYs, reattach via live
grid snapshot — M1). This design adds the missing half: after the daemon itself
dies (reboot, battery death, kill -9), the user can bring panels back —
panels show their last on-screen state and offer a one-click Resume that
restarts the agent with its session-continue flag (e.g. claude --continue).
Scope decisions (locked)
- Reboot behavior: resurrect + resume. A live process cannot survive a power-off — not even tmux does that. After a daemon restart we respawn the panel from its persisted spec (cwd intact) and, for agents that support it, relaunch with a resume flag so the conversation continues in a new process.
- Resurrect trigger: manual, per-panel. After a daemon restart panels are shown stopped with their last screen; nothing spawns until the user clicks. This avoids surprise token burn from auto-launching many agents.
- Persisted scrollback: visible screen only. We reuse the existing snapshot_ansi() serializer (the same one that powers live reattach) and write its output to disk. No scrollback history beyond the visible grid.
- Resume command source: a [resume] table in ~/.spacesh/config.toml mapping a command basename to resume args, merged over built-in defaults.
- Snapshot cadence: periodic + shutdown. A background task dumps changed grids every N seconds (default 5), plus a full pass on graceful shutdown and a final dump when an actor exits. This survives kill -9 / battery death (you lose at most N seconds of the last screen).
What already exists (do not rebuild)
- state.json (via JsonStateStore + debounced Persister) already persists structure: groups, workspaces, layout, zoom, pinned, and per-surface SurfaceSpec (command, args, cwd, cols, rows, agent_label, autostart). On cold start Registry::restore() loads this; the live actor map is empty, so every surface is "stopped" (spec present, no process). SurfaceView.running: bool already tells the client a surface is stopped.
- spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot serializes the visible grid to an ANSI dump (ansi, cols, rows, cursor_row, cursor_col). Snapshot currently derives Serialize only.
- The surface actor already answers SurfaceMsg::AttachSnapshot by calling snapshot_ansi(&grid); the grid is the authoritative screen model.
Components
1. Snapshot store — crates/spaceshd/src/snapshot_store.rs (new)
Per-surface JSON file ~/.spacesh/snapshots/<surface_id>.json holding the
serialized visible-screen snapshot. Atomic write (temp file → sync_all →
rename), mirroring state_store::JsonStateStore.
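pub trait SnapshotStore: Send + Sync {
    /// Atomically write the snapshot for `sid` to its per-surface file.
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
    /// Load the snapshot for `sid`; a missing or corrupt file yields `None`.
    fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
    /// Delete the snapshot file for `sid` (surface closed or removed).
    fn remove(&self, sid: &SurfaceId);
}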
The store persists the core spacesh_core::snapshot::Snapshot directly
(ansi, cols, rows, cursor_row, cursor_col) — spaceshd already
depends on spacesh-core, so no separate daemon record type is introduced. A
corrupt/missing file yields None (never an error that blocks resurrect).
remove deletes the file and is called when a surface is closed or removed
from the tree.
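The temp file → sync_all → rename sequence described above can be sketched with the standard library alone. This is a minimal illustration, not the daemon's code: the function names atomic_write and load_bytes and the ".json.tmp" suffix are assumptions made for the example, and it works on raw bytes rather than a serialized Snapshot.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `path` so that a reader sees either the previous file or
/// the complete new one, never a partially written file.
pub fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Assumed naming: "<sid>.json" becomes "<sid>.json.tmp" in the same
    // directory, so the rename stays on one filesystem.
    let tmp = path.with_extension("json.tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Force data and metadata to disk before the rename publishes the file.
    file.sync_all()?;
    fs::rename(&tmp, path)
}

/// Read a snapshot file back. Any I/O error, including a missing file, maps
/// to `None`, so a bad file never blocks resurrect.
pub fn load_bytes(path: &Path) -> Option<Vec<u8>> {
    fs::read(path).ok()
}
```

Because the rename is the only step that touches the final path, a crash before it leaves at most a stray temp file and the previous snapshot intact.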
2. On-demand snapshot from the actor — crates/spaceshd/src/surface.rs
Add a message that returns the current snapshot without subscribing:
SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)
The actor tracks a dirty flag: set inside flush whenever bytes are fed into
the grid (grid.feed), cleared when a Snapshot reply is produced. The bool
lets the periodic dumper skip unchanged grids.
On actor exit (after pty.wait()), the actor takes a final snapshot_ansi
and forwards (id, snapshot) to the writer channel (a cloned
mpsc::UnboundedSender<(SurfaceId, Snapshot)> passed into the actor), so the
last screen of a finished process is persisted even between ticker ticks.
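The dirty-flag contract (set on feed, cleared when a snapshot reply is produced) can be illustrated in isolation. The sketch below is an assumption-laden stand-in: GridState is a hypothetical type, and a plain String takes the place of the real grid and of snapshot_ansi output.

```rust
/// Hypothetical stand-in for the actor's grid plus its dirty flag.
pub struct GridState {
    screen: String,
    dirty: bool,
}

impl GridState {
    pub fn new() -> Self {
        GridState { screen: String::new(), dirty: false }
    }

    /// Called from the flush path: feeding bytes marks the grid dirty.
    pub fn feed(&mut self, bytes: &str) {
        self.screen.push_str(bytes);
        self.dirty = true;
    }

    /// Answer a snapshot request: return (snapshot, dirty) and clear the
    /// flag, so the next request reports false until more output arrives.
    pub fn snapshot(&mut self) -> (String, bool) {
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (self.screen.clone(), was_dirty)
    }
}
```

Reading and clearing the flag in one step keeps the periodic dumper from writing the same unchanged screen twice.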
3. Writer task + periodic ticker — crates/spaceshd/src/server.rs / main.rs
- Writer task: the sole owner of Arc<dyn SnapshotStore>. Receives (SurfaceId, Snapshot) on an unbounded channel and writes to disk. Keeps all snapshot disk I/O off the actor/PTY hot path and serializes writes.
- Periodic ticker: every snapshot_interval_secs (config, default 5) the router iterates live surface handles, sends SurfaceMsg::Snapshot, awaits the reply, and forwards to the writer channel only when dirty is true.
- Graceful shutdown: before the daemon exits it does one final synchronous pass over all live surfaces into the writer, then flushes the writer.
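The single-writer pattern and the dirty-only forwarding can be sketched as follows. To stay self-contained, this swaps in std threads and std::sync::mpsc for the daemon's async runtime, and an in-memory map for the on-disk store. WriterMsg is a hypothetical enum: the design types the channel as (SurfaceId, Snapshot), and an enum is just one possible way to also carry the remove sent on surface close.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical message type for the writer channel.
pub enum WriterMsg {
    Save(String, String), // (surface id, ANSI dump)
    Remove(String),       // surface id
}

/// In-memory stand-in for the on-disk snapshot store.
pub type MemStore = Arc<Mutex<HashMap<String, String>>>;

/// Spawn the single writer. It drains the channel in order and exits once
/// every sender has been dropped.
pub fn spawn_writer(store: MemStore) -> (Sender<WriterMsg>, JoinHandle<()>) {
    let (tx, rx) = channel::<WriterMsg>();
    let handle = thread::spawn(move || {
        for msg in rx {
            let mut files = store.lock().unwrap();
            match msg {
                WriterMsg::Save(id, ansi) => {
                    files.insert(id, ansi);
                }
                WriterMsg::Remove(id) => {
                    files.remove(&id);
                }
            }
        }
    });
    (tx, handle)
}

/// One ticker pass over snapshot replies given as (surface id, ansi, dirty).
/// Only dirty surfaces are forwarded; returns how many were sent.
pub fn ticker_pass(replies: &[(&str, &str, bool)], tx: &Sender<WriterMsg>) -> usize {
    let mut sent = 0;
    for (id, ansi, dirty) in replies {
        if *dirty {
            // A send error means the writer is gone; skip rather than stall.
            if tx.send(WriterMsg::Save(id.to_string(), ansi.to_string())).is_ok() {
                sent += 1;
            }
        }
    }
    sent
}
```

In this sketch, dropping the last sender and joining the handle plays the role of "flush the writer" at graceful shutdown: join returns only after every queued message has been applied.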
4. Resume config — crates/spaceshd/src/config.rs
[resume]
commands = { claude = ["--continue"], codex = ["resume"] }

#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct ResumeConfig {
    #[serde(default)]
    pub commands: std::collections::HashMap<String, Vec<String>>,
}
Added to Config as #[serde(default)] pub resume: ResumeConfig. A method
resume_args(command: &str) -> Option<Vec<String>> resolves by command
basename: user map first, then a built-in default table
(claude → ["--continue"], codex → ["resume"]), then None. The default
table is a const/static, not inline literals in branching logic.
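The lookup order (user map, then built-in defaults, then None) might look like the sketch below. It is a std-only approximation: the serde derives are omitted, the method lives on ResumeConfig rather than Config, and DEFAULT_RESUME is an assumed name for the const table.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults, kept in a const table rather than in branching logic.
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

#[derive(Debug, Clone, Default)]
pub struct ResumeConfig {
    pub commands: HashMap<String, Vec<String>>,
}

impl ResumeConfig {
    /// Resolve resume args by command basename: user map first, then the
    /// built-in table, then None.
    pub fn resume_args(&self, command: &str) -> Option<Vec<String>> {
        // "/usr/local/bin/claude" and "claude" resolve identically.
        let base = Path::new(command).file_name()?.to_str()?;
        if let Some(args) = self.commands.get(base) {
            return Some(args.clone());
        }
        DEFAULT_RESUME
            .iter()
            .find(|(name, _)| *name == base)
            .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
    }
}
```

A user entry replaces the built-in args for that basename outright; the two lists are not concatenated.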
5. Protocol — crates/spacesh-proto/src/message.rs
The codebase already has Cmd::RestartSurface { surface_id } (starts a stopped
surface from its spec, guarded by is_running) and an Attach response that
already carries { snapshot, cols, rows, cursor_row, cursor_col, stopped }.
So no new command or wire type is needed beyond one field:
- Extend Cmd::RestartSurface with #[serde(default)] resume: bool. resume = true builds command + resume_args(command) (falling back to the original args when no mapping exists); resume = false keeps the original command + args (today's behavior). The #[serde(default)] keeps old frames decoding to resume = false.
- No GetSnapshot, no StartSurface, no SnapshotView: a stopped-panel Attach returns the disk snapshot (see §6) using the existing response shape.
spacesh_core::snapshot::Snapshot gains Deserialize (alongside Serialize)
so the store can load it back from disk.
6. Server handlers — crates/spaceshd/src/server.rs
- RestartSurface { surface_id, resume }: unchanged flow (spec lookup, spawn_from_spec, set_live, SurfaceRestarted broadcast). When resume, spawn with a spec whose args are replaced by config.resume_args(command) (when present); otherwise spawn the original spec.
- Attach for a stopped surface: instead of returning the empty { snapshot: "", stopped: true }, load the disk snapshot via the snapshot store and return { snapshot: <ansi>, cols, rows, cursor_row, cursor_col, stopped: true }. Missing file → empty snapshot, still stopped: true.
- Surface close/remove (Close, CloseWorkspace, remove_surface paths): send a remove to the snapshot writer so stale <sid>.json files do not accumulate.
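The arg selection for RestartSurface can be sketched as a pure function. The SurfaceSpec here is a trimmed stand-in with only the fields the example needs, and spec_for_restart is a hypothetical helper name; the real handler would pass the result on to spawn_from_spec.

```rust
/// Minimal stand-in for the persisted SurfaceSpec (only the fields used here).
#[derive(Debug, Clone, PartialEq)]
pub struct SurfaceSpec {
    pub command: String,
    pub args: Vec<String>,
    pub cwd: String,
}

/// Build the spec to spawn for a restart request.
/// `resume_args` is the result of the config lookup for `spec.command`.
pub fn spec_for_restart(
    spec: &SurfaceSpec,
    resume: bool,
    resume_args: Option<Vec<String>>,
) -> SurfaceSpec {
    let mut out = spec.clone();
    if resume {
        // No mapping: keep the original args, so a resume request degrades
        // to a plain restart instead of failing the spawn.
        if let Some(args) = resume_args {
            out.args = args;
        }
    }
    out
}
```

Only args change on resume; command and cwd are carried over untouched, which is what keeps the resumed agent in its original working directory.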
7. App — app/src and app/src-tauri
- socketBridge.ts: restartSurface(id, resume = false) gains the resume arg; AttachResult gains optional cursor_row / cursor_col / stopped.
- app/src-tauri/src/bridge.rs: restart_surface forwards a resume: bool arg into Cmd::RestartSurface.
- LayoutEngine.tsx stopped branch (running[id] === false): paint the disk snapshot into a dimmed, read-only xterm behind the controls, and offer two buttons: Resume → restartSurface(id, true) and Restart fresh → restartSurface(id, false). On success the daemon's workspace_changed flips running to true, the overlay unmounts, and the live TerminalView mounts.
Data flow
running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ <sid>.json
running surface ──(on exit)─────────────────────────▶ writer task ──▶ <sid>.json
daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ <sid>.json
reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
client ▶ Attach(sid) [stopped] ▶ disk snapshot ▶ paint dimmed read-only screen
user clicks Resume ▶ RestartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
▶ SurfaceRestarted + running=true ▶ live TerminalView mounts
Error handling
- Missing/corrupt snapshot file → stopped Attach returns an empty snapshot; the overlay shows an empty dimmed panel with the Resume/Restart controls.
- RestartSurface on an already-running surface → no-op ok (existing is_running guard); unknown surface → NOT_FOUND.
- Resume command for an agent without a mapping → falls back to the original spec args (plain restart), never fails the spawn.
- Writer task failure to write one file is logged and dropped; it must not stall the daemon or other surfaces.
Performance
- A visible-screen snapshot is ≈ rows × cols bytes of ANSI; at a 5s cadence with the dirty flag, idle panels write nothing. All disk writes happen in the single writer task, off the PTY/actor hot path, so the keypress→echo (<16 ms) and output-batching budgets are untouched.
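As a rough worked example of the rows × cols estimate, the sketch below computes a worst-case write rate. The panel count and grid size in the comments are illustrative values, not measurements, and real dumps will be somewhat larger because escape sequences and multi-byte characters add bytes beyond one per cell.

```rust
/// Approximate snapshot size using the rows × cols estimate.
pub const fn approx_snapshot_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols
}

/// Upper bound on bytes written per second when every one of `panels` is
/// dirty on every tick. An interval of 0 is treated as 1 second.
pub fn worst_case_bytes_per_sec(rows: u64, cols: u64, panels: u64, interval_secs: u64) -> u64 {
    approx_snapshot_bytes(rows, cols) * panels / interval_secs.max(1)
}

// A 24 × 80 grid is about 1920 bytes per snapshot; 10 such panels that are
// all dirty on every 5 s tick come to about 3840 bytes per second.
```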
Testing
- snapshot_store: save→load round-trip; atomic write; missing file → None; corrupt file → None; remove deletes the file.
- config: parse [resume] table; resume_args returns user override, then built-in default, then None; missing section defaults cleanly.
- surface actor: SurfaceMsg::Snapshot returns the current grid contents; dirty is true after output and false immediately after a snapshot.
- server: RestartSurface{resume:true} spawns with command + resume_args; {resume:false} spawns with command + args; stopped Attach returns the saved disk snapshot; is_running guard prevents a second actor.
- registry: starting a stopped surface re-populates the live map and the view flips running to true.
Out of scope
- Resuming the literal in-flight process across power loss (impossible).
- Scrollback history beyond the visible screen.
- Auto-resume on daemon start (manual trigger chosen).
- Per-surface resume command stored in the spec/wizard (config map chosen).