docs: session persistence (resurrect + resume) design spec
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,216 @@
# Session Persistence (resurrect + resume) — Design

**Date:** 2026-06-15
**Status:** Approved, ready for implementation plan

## Goal

Let a workspace survive both GUI loss and full power loss. Closing a tab or
the whole GUI already keeps agents running (the daemon owns the PTYs, and
clients reattach via a live grid snapshot; this is M1). This design adds the
missing half: after the daemon itself dies (reboot, battery death, `kill -9`),
the user can bring panels back. Each panel shows its last persisted on-screen
state and offers a one-click **Resume** that restarts the agent with its
session-continue flag (e.g. `claude --continue`).

## Scope decisions (locked)

- **Reboot behavior:** resurrect + resume. A live process cannot survive a
  power-off; not even tmux does that. After a daemon restart we respawn the
  panel from its persisted spec (cwd intact) and, for agents that support it,
  relaunch with a resume flag so the conversation continues in a *new* process.
- **Resurrect trigger:** manual, per-panel. After a daemon restart, panels are
  shown stopped with their last screen; nothing spawns until the user clicks.
  This avoids surprise token burn from auto-launching many agents.
- **Persisted scrollback:** visible screen only. We reuse the existing
  `snapshot_ansi()` serializer (the same one that powers live reattach) and
  write its output to disk. No scrollback history beyond the visible grid.
- **Resume command source:** a `[resume]` table in `~/.spacesh/config.toml`
  mapping a command basename to resume args, merged over built-in defaults.
- **Snapshot cadence:** periodic + shutdown. A background task dumps changed
  grids every N seconds (default 5), plus a full pass on graceful shutdown and
  a final dump when an actor exits. After an abrupt death (`kill -9`, battery
  loss) the on-disk snapshot is at most about N seconds behind the last screen.

## What already exists (do not rebuild)

- `state.json` (via `JsonStateStore` + debounced `Persister`) already persists
  structure: groups, workspaces, layout, zoom, pinned, and per-surface
  `SurfaceSpec` (`command`, `args`, `cwd`, `cols`, `rows`, `agent_label`,
  `autostart`). On cold start `Registry::restore()` loads this; the `live`
  actor map is empty, so every surface is "stopped" (spec present, no process).
- `SurfaceView.running: bool` already tells the client a surface is stopped.
- `spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot` serializes
  the visible grid to an ANSI dump (`ansi`, `cols`, `rows`, `cursor_row`,
  `cursor_col`). `Snapshot` currently derives `Serialize` only.
- The surface actor already answers `SurfaceMsg::AttachSnapshot` by calling
  `snapshot_ansi(&grid)`; the grid is the authoritative screen model.

## Components

### 1. Snapshot store — `crates/spaceshd/src/snapshot_store.rs` (new)

Per-surface JSON file `~/.spacesh/snapshots/<surface_id>.json` holding the
serialized visible-screen snapshot. Atomic write (temp file → `sync_all` →
rename), mirroring `state_store::JsonStateStore`.

```rust
pub trait SnapshotStore: Send + Sync {
    /// Atomically persist the snapshot for `sid`.
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
    /// Returns `None` when the file is missing or corrupt.
    fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
    /// Delete the snapshot file for a closed or removed surface.
    fn remove(&self, sid: &SurfaceId);
}
```

The store persists the core `spacesh_core::snapshot::Snapshot` directly
(`ansi`, `cols`, `rows`, `cursor_row`, `cursor_col`); `spaceshd` already
depends on `spacesh-core`, so no separate daemon record type is introduced. A
corrupt or missing file yields `None` (never an error that blocks resurrect).
`remove` deletes the file and is called when a surface is closed or removed
from the tree.
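
The atomic-write sequence described above can be sketched with the standard library alone. This is a minimal illustration, not the daemon's implementation: `snapshot_path`, `atomic_write`, `load_bytes`, and `remove_file_quiet` are hypothetical helper names, and the payload is passed as raw bytes so the JSON encoding of `Snapshot` stays out of the picture.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: path of the snapshot file for one surface id.
pub fn snapshot_path(dir: &Path, surface_id: &str) -> PathBuf {
    dir.join(format!("{surface_id}.json"))
}

/// Write `bytes` to `path` via temp file, `sync_all`, then rename.
/// The temp file lives next to the target so the rename stays on one filesystem.
pub fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // push file contents to disk before the rename
    fs::rename(&tmp, path)
}

/// Load the raw payload; any I/O error (including a missing file) maps to `None`.
pub fn load_bytes(path: &Path) -> Option<Vec<u8>> {
    fs::read(path).ok()
}

/// Best-effort delete; a missing file is not treated as an error.
pub fn remove_file_quiet(path: &Path) {
    let _ = fs::remove_file(path);
}
```

A real `load` would additionally map a JSON parse failure to `None`, which is what keeps a corrupt file from blocking resurrect.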

### 2. On-demand snapshot from the actor — `crates/spaceshd/src/surface.rs`

Add a message that returns the current snapshot without subscribing:

```rust
SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)
```

The actor tracks a `dirty` flag: set inside `flush` whenever bytes are fed into
the grid (`grid.feed`), cleared when a `Snapshot` reply is produced. The bool
in the reply is the flag's value just before it is cleared, which lets the
periodic dumper skip unchanged grids.

On actor exit (after `pty.wait()`), the actor takes a final `snapshot_ansi`
and forwards `(id, snapshot)` to the writer channel (a cloned
`mpsc::UnboundedSender<(SurfaceId, Snapshot)>` passed into the actor), so the
last screen of a finished process is persisted even between ticker ticks.
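
The `dirty` semantics can be illustrated in isolation. The sketch below is an assumption-laden stand-in: `ActorState` is a hypothetical type, a `String` plays the role of the grid, and the method names are illustrative rather than taken from `surface.rs`.

```rust
/// Simplified stand-in for the surface actor's state (hypothetical type).
pub struct ActorState {
    screen: String, // stand-in for the grid
    dirty: bool,
}

impl ActorState {
    pub fn new() -> Self {
        Self { screen: String::new(), dirty: false }
    }

    /// Mirrors `flush`: feeding bytes into the grid marks it dirty.
    pub fn feed(&mut self, bytes: &str) {
        self.screen.push_str(bytes);
        self.dirty = true;
    }

    /// Build the `(snapshot, dirty)` reply and clear the flag in one step.
    pub fn snapshot(&mut self) -> (String, bool) {
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (self.screen.clone(), was_dirty)
    }
}
```

Reading and clearing the flag in a single step means output that arrives after the reply marks the grid dirty again for the next tick.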

### 3. Writer task + periodic ticker — `crates/spaceshd/src/server.rs` / `main.rs`

- **Writer task:** the sole writer to the `Arc<dyn SnapshotStore>` (handlers
  may still read from it). Receives `(SurfaceId, Snapshot)` on an unbounded
  channel and writes to disk. Keeps all snapshot disk I/O off the actor/PTY hot
  path and serializes writes.
- **Periodic ticker:** every `snapshot_interval_secs` (config, default 5) the
  router iterates live surface handles, sends `SurfaceMsg::Snapshot`, awaits the
  reply, and forwards to the writer channel only when `dirty` is true.
- **Graceful shutdown:** before the daemon exits it does one final synchronous
  pass over all live surfaces into the writer, then flushes the writer.
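
The ticker-to-writer flow above can be sketched with `std::sync::mpsc` standing in for the daemon's async channels. Everything here is illustrative: `LiveSurface`, `tick`, `run_writer`, and `tick_and_drain` are hypothetical names, `String` replaces both `SurfaceId` and `Snapshot`, and an in-memory map replaces the on-disk store.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

type SurfaceId = String; // stand-in for the daemon's id type
type Snapshot = String; // stand-in for the core `Snapshot`

/// Stand-in for a live surface that answers with `(snapshot, dirty)`.
pub struct LiveSurface {
    pub id: SurfaceId,
    pub screen: String,
    pub dirty: bool,
}

impl LiveSurface {
    fn snapshot(&mut self) -> (Snapshot, bool) {
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (self.screen.clone(), was_dirty)
    }
}

/// One ticker pass: forward only dirty surfaces to the writer channel.
pub fn tick(surfaces: &mut [LiveSurface], tx: &Sender<(SurfaceId, Snapshot)>) {
    for s in surfaces.iter_mut() {
        let (snap, dirty) = s.snapshot();
        if dirty {
            // A send error means the writer is gone; drop the snapshot rather than stall.
            let _ = tx.send((s.id.clone(), snap));
        }
    }
}

/// Writer loop: drain the channel into the store until every sender is dropped.
pub fn run_writer(rx: Receiver<(SurfaceId, Snapshot)>, store: &mut HashMap<SurfaceId, Snapshot>) {
    for (sid, snap) in rx {
        store.insert(sid, snap); // the real writer would do an atomic file write here
    }
}

/// Wire one tick to the writer and return what reached the store.
pub fn tick_and_drain(surfaces: &mut [LiveSurface]) -> HashMap<SurfaceId, Snapshot> {
    let (tx, rx) = channel();
    tick(surfaces, &tx);
    drop(tx); // closing the channel lets the writer loop finish
    let mut store = HashMap::new();
    run_writer(rx, &mut store);
    store
}
```

Dropping the sender and draining the receiver is also one way to model the shutdown flush: the writer finishes only after every queued snapshot has been handled.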

### 4. Resume config — `crates/spaceshd/src/config.rs`

```toml
[resume]
commands = { claude = ["--continue"], codex = ["resume"] }
```

```rust
#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct ResumeConfig {
    #[serde(default)]
    pub commands: std::collections::HashMap<String, Vec<String>>,
}
```

Added to `Config` as `#[serde(default)] pub resume: ResumeConfig`. A method
`resume_args(command: &str) -> Option<Vec<String>>` resolves by command
basename: user map first, then a built-in default table
(`claude → ["--continue"]`, `codex → ["resume"]`), then `None`. The default
table is a `const`/static, not inline literals in branching logic.
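
The resolution order can be sketched as a free function. This is an illustration under stated assumptions: the spec describes a method on the config, whereas this version takes the user map as a parameter to stay self-contained, and `DEFAULT_RESUME` is a hypothetical name for the built-in table.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults from the spec, kept as a const table (name is hypothetical).
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

/// Resolve resume args by command basename: user map first, then defaults, then `None`.
pub fn resume_args(user: &HashMap<String, Vec<String>>, command: &str) -> Option<Vec<String>> {
    // "/usr/local/bin/claude" and "claude" both resolve to the basename "claude".
    let base = Path::new(command).file_name()?.to_str()?;
    if let Some(args) = user.get(base) {
        return Some(args.clone());
    }
    DEFAULT_RESUME
        .iter()
        .find(|(name, _)| *name == base)
        .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
}
```

With an empty user map, `resume_args(&user, "/usr/local/bin/claude")` yields `Some(vec!["--continue"])` and an unmapped command such as `bash` yields `None`, which the caller treats as a plain restart.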

### 5. Protocol — `crates/spacesh-proto/src/message.rs`

- `Cmd::StartSurface { surface_id: SurfaceId, resume: bool }`: start a stopped
  surface. `resume = true` builds `command + resume_args(command)` (falling
  back to the original args when no resume mapping exists); `resume = false`
  builds the original `command + args`. cwd and geometry come from the spec.
- `Cmd::GetSnapshot { surface_id: SurfaceId }` → response carries
  `Option<SnapshotView>`.
- `SnapshotView { ansi, cols, rows, cursor_row, cursor_col }`: a proto-level
  mirror of core `Snapshot`, so `spacesh-proto` does not depend on
  `spacesh-core`. The daemon converts core `Snapshot` into `SnapshotView` at
  the protocol boundary.

`spacesh_core::snapshot::Snapshot` gains `Deserialize` (alongside `Serialize`)
so it can be loaded back from disk by the store and in tests.
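
The boundary conversion might look like the sketch below. The two structs are local stand-ins for the real types, the field types (`String`, `u16`) are assumptions, and `to_view` is a hypothetical name. A free function is used because the daemon crate owns neither type, so Rust's orphan rule would reject an `impl From<Snapshot> for SnapshotView` written there.

```rust
/// Stand-in for the core snapshot; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub ansi: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
}

/// Stand-in for the proto-level mirror with the same five fields.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotView {
    pub ansi: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
}

/// Daemon-side conversion at the protocol boundary (hypothetical name).
pub fn to_view(snap: Snapshot) -> SnapshotView {
    SnapshotView {
        ansi: snap.ansi,
        cols: snap.cols,
        rows: snap.rows,
        cursor_row: snap.cursor_row,
        cursor_col: snap.cursor_col,
    }
}
```

A `GetSnapshot` handler could then return `store.load(&sid).map(to_view)`, which keeps the `Option` shape described above.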

### 6. Server handlers — `crates/spaceshd/src/server.rs`

- `StartSurface`: look up the spec; if it is missing, or the surface is already
  running (`registry.is_running(sid)`), return an error response. Otherwise
  build a `SpawnSpec` with resume-or-plain args, the spec's cwd, and current
  geometry; call `spawn_surface_deferred(...)`, then
  `registry.set_live(handle)`; and broadcast `workspace_changed` so all
  clients flip `running` to true.
- `GetSnapshot`: read from the snapshot store, convert to `SnapshotView`,
  return `Option<SnapshotView>`.
- On surface close/remove: call `snapshot_store.remove(sid)` (via the writer or
  a direct store handle) so stale files do not accumulate.
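
The `StartSurface` decision logic (unknown surface, already running, resume with fallback) can be sketched without any spawning. The `Registry`, `SurfaceSpec`, and `StartError` types below are reduced, hypothetical stand-ins, and the resolved resume args are passed in rather than looked up from config.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum StartError {
    UnknownSurface,
    AlreadyRunning,
}

/// Reduced stand-in for the persisted spec (only the fields used here).
pub struct SurfaceSpec {
    pub command: String,
    pub args: Vec<String>,
}

#[derive(Default)]
pub struct Registry {
    pub specs: HashMap<String, SurfaceSpec>,
    pub live: HashSet<String>,
}

impl Registry {
    /// Validate the request, build the argv, and mark the surface live.
    /// `resume_args` is the already-resolved mapping for the spec's command.
    pub fn start_surface(
        &mut self,
        sid: &str,
        resume: bool,
        resume_args: Option<Vec<String>>,
    ) -> Result<Vec<String>, StartError> {
        let spec = self.specs.get(sid).ok_or(StartError::UnknownSurface)?;
        if self.live.contains(sid) {
            return Err(StartError::AlreadyRunning); // guard against a duplicate actor
        }
        let args = match (resume, resume_args) {
            (true, Some(mapped)) => mapped,
            _ => spec.args.clone(), // plain restart, or resume without a mapping
        };
        let mut argv = vec![spec.command.clone()];
        argv.extend(args);
        self.live.insert(sid.to_string());
        Ok(argv)
    }
}
```

Returning the argv keeps the sketch testable; the real handler would hand the equivalent `SpawnSpec` to `spawn_surface_deferred(...)` and broadcast `workspace_changed` afterwards.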

### 7. App — `app/src` and `app/src-tauri`

- `socketBridge.ts`: `startSurface(id, resume)`, `getSnapshot(id)`, and a
  `SnapshotView` type.
- `app/src-tauri/src/bridge.rs`: `start_surface` and `get_snapshot` invoke
  handlers forwarding to the daemon, wired into the Tauri `invoke_handler` and
  the JS bridge.
- `LayoutEngine.tsx` / `TerminalView.tsx`: when a surface's `running === false`,
  render a stopped overlay instead of a live terminal:
  - fetch `getSnapshot(id)` and paint the ANSI into a read-only, dimmed
    `xterm` instance for visual context;
  - centered controls: **Resume** → `startSurface(id, true)`,
    **Restart fresh** → `startSurface(id, false)`;
  - on success the daemon's `workspace_changed` sets `running = true`, the
    overlay unmounts, and the normal live `TerminalView` mounts.
  - a small "stopped" indicator in the panel header.

## Data flow

```
running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ <sid>.json
running surface ──(on exit)─────────────────────────▶ writer task ──▶ <sid>.json
daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ <sid>.json

reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
client ▶ GetSnapshot(sid) ▶ paint dimmed read-only screen + Resume/Restart
user clicks Resume ▶ StartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
                   ▶ workspace_changed(running=true) ▶ live TerminalView mounts
```

## Error handling

- Missing/corrupt snapshot file → `GetSnapshot` returns `None`; the overlay
  shows an empty dimmed panel with the Resume/Restart controls (still usable).
- `StartSurface` on an unknown/already-running surface → error response; the
  client ignores it or surfaces a toast. No duplicate actor: guard on
  `registry.is_running(sid)`.
- Resume command for an agent without a mapping → falls back to the original
  spec args (plain restart), never fails the spawn.
- A writer-task failure to write one file is logged and that snapshot is
  dropped; it must not stall the daemon or other surfaces.

## Performance

- A visible-screen snapshot is on the order of rows × cols bytes of ANSI, plus
  escape-sequence overhead for colors and attributes; for example, a
  200 × 50 grid is about 10 KB of cell text. At a 5 s cadence with the `dirty`
  check, idle panels write nothing. All disk writes happen in the single writer
  task, off the PTY/actor hot path, so the keypress→echo (<16 ms) and
  output-batching budgets are untouched.

## Testing

- **snapshot_store:** save→load round-trip; atomic write; missing file → `None`;
  corrupt file → `None`; `remove` deletes the file.
- **config:** parse `[resume]` table; `resume_args` returns user override, then
  built-in default, then `None`; missing section defaults cleanly.
- **surface actor:** `SurfaceMsg::Snapshot` returns the current grid contents;
  `dirty` is true after output and false immediately after a snapshot.
- **server:** `StartSurface{resume:true}` builds `command + resume_args`;
  `{resume:false}` builds `command + args`; `GetSnapshot` returns the saved
  view; `is_running` guard prevents a second actor.
- **registry:** starting a stopped surface re-populates the live map and the
  view flips `running` to true.

## Out of scope

- Resuming the literal in-flight process across power loss (impossible).
- Scrollback history beyond the visible screen.
- Auto-resume on daemon start (manual trigger chosen).
- Per-surface resume command stored in the spec/wizard (config map chosen).