# Session Persistence (resurrect + resume) — Design
**Date:** 2026-06-15
**Status:** Approved, ready for implementation plan
## Goal
Let a workspace survive both GUI loss and full power loss. Closing a tab or
the whole GUI already keeps agents running (daemon owns PTYs, reattach via live
grid snapshot — M1). This design adds the missing half: after the daemon itself
dies (reboot, battery death, `kill -9`), the user can bring panels back —
panels show their last on-screen state and offer a one-click **Resume** that
restarts the agent with its session-continue flag (e.g. `claude --continue`).
## Scope decisions (locked)
- **Reboot behavior:** resurrect + resume. A live process cannot survive a
power-off — not even tmux does that. After a daemon restart we respawn the
panel from its persisted spec (cwd intact) and, for agents that support it,
relaunch with a resume flag so the conversation continues in a *new* process.
- **Resurrect trigger:** manual, per-panel. After a daemon restart panels are
shown stopped with their last screen; nothing spawns until the user clicks.
This avoids surprise token burn from auto-launching many agents.
- **Persisted scrollback:** visible screen only. We reuse the existing
`snapshot_ansi()` serializer (the same one that powers live reattach) and
write its output to disk. No scrollback history beyond the visible grid.
- **Resume command source:** a `[resume]` table in `~/.spacesh/config.toml`
mapping a command basename to resume args, merged over built-in defaults.
- **Snapshot cadence:** periodic + shutdown. A background task dumps changed
grids every N seconds (default 5), plus a full pass on graceful shutdown and
a final dump when an actor exits. This survives `kill -9` and battery death:
at most the last N seconds of screen changes are lost.
## What already exists (do not rebuild)
- `state.json` (via `JsonStateStore` + debounced `Persister`) already persists
structure: groups, workspaces, layout, zoom, pinned, and per-surface
`SurfaceSpec` (`command`, `args`, `cwd`, `cols`, `rows`, `agent_label`,
`autostart`). On cold start `Registry::restore()` loads this; the `live`
actor map is empty, so every surface is "stopped" (spec present, no process).
- `SurfaceView.running: bool` already tells the client a surface is stopped.
- `spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot` serializes
the visible grid to an ANSI dump (`ansi`, `cols`, `rows`, `cursor_row`,
`cursor_col`). `Snapshot` currently derives `Serialize` only.
- The surface actor already answers `SurfaceMsg::AttachSnapshot` by calling
`snapshot_ansi(&grid)`; the grid is the authoritative screen model.
## Components
### 1. Snapshot store — `crates/spaceshd/src/snapshot_store.rs` (new)
Per-surface JSON file `~/.spacesh/snapshots/<surface_id>.json` holding the
serialized visible-screen snapshot. Atomic write (temp file → `sync_all` →
rename), mirroring `state_store::JsonStateStore`.
```rust
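pub trait SnapshotStore: Send + Sync {
    /// Atomically write the snapshot for `sid`.
    fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
    /// Load the snapshot for `sid`; a missing or corrupt file yields `None`.
    fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
    /// Delete the snapshot file when the surface is closed or removed.
    fn remove(&self, sid: &SurfaceId);
}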
```
The store persists the core `spacesh_core::snapshot::Snapshot` directly
(`ansi`, `cols`, `rows`, `cursor_row`, `cursor_col`) — `spaceshd` already
depends on `spacesh-core`, so no separate daemon record type is introduced. A
corrupt/missing file yields `None` (never an error that blocks resurrect).
`remove` deletes the file and is called when a surface is closed or removed
from the tree.
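The atomic-write and "never an error" rules above can be sketched with the standard library alone. This is a minimal illustration, not the store itself: `snapshot_path`, `atomic_write`, `load_raw`, and `remove_file_quietly` are hypothetical helper names, surface ids are plain strings here, and JSON encoding/decoding is left out (a failed decode would map to `None` the same way a failed read does).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: `<dir>/<surface_id>.json`.
fn snapshot_path(dir: &Path, surface_id: &str) -> PathBuf {
    dir.join(format!("{}.json", surface_id))
}

/// Write `bytes` to `path` atomically: temp file, `sync_all`, then rename.
fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // "abc.json" becomes "abc.json.tmp" in the same directory, so the
    // rename below stays on one filesystem.
    let tmp = path.with_extension("json.tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // data reaches disk before the new name is visible
    fs::rename(&tmp, path) // replaces any previous snapshot in one step
}

/// Read the raw snapshot bytes; any I/O failure (including a missing
/// file) becomes `None` instead of an error.
fn load_raw(path: &Path) -> Option<Vec<u8>> {
    fs::read(path).ok()
}

/// Delete the snapshot file; a file that is already gone is not an error.
fn remove_file_quietly(path: &Path) {
    let _ = fs::remove_file(path);
}
```

A reader that crashes mid-write leaves at worst a stray `.tmp` file next to an intact previous snapshot, which is the property the periodic dumper relies on.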
### 2. On-demand snapshot from the actor — `crates/spaceshd/src/surface.rs`
Add a message that returns the current snapshot without subscribing:
```rust
SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)
```
The actor tracks a `dirty` flag: set inside `flush` whenever bytes are fed into
the grid (`grid.feed`), cleared when a `Snapshot` reply is produced. The bool
lets the periodic dumper skip unchanged grids.
On actor exit (after `pty.wait()`), the actor takes a final `snapshot_ansi`
and forwards `(id, snapshot)` to the writer channel (a cloned
`mpsc::UnboundedSender<(SurfaceId, Snapshot)>` passed into the actor), so the
last screen of a finished process is persisted even between ticker ticks.
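The `dirty` bookkeeping described above is small enough to sketch in isolation. The code below is an assumption-laden stand-in: `ScreenState` is a hypothetical type that replaces the real grid with a `String`, `feed` plays the role of `flush` calling `grid.feed`, and `take_snapshot` plays the role of answering `SurfaceMsg::Snapshot`.

```rust
/// Stand-in for the actor's grid plus its dirty flag (illustrative only).
struct ScreenState {
    contents: String,
    dirty: bool,
}

impl ScreenState {
    fn new() -> Self {
        ScreenState { contents: String::new(), dirty: false }
    }

    /// Mirrors `flush`: feeding output into the grid marks the screen dirty.
    fn feed(&mut self, output: &str) {
        if output.is_empty() {
            return; // nothing was fed, so nothing changed
        }
        self.contents.push_str(output);
        self.dirty = true;
    }

    /// Mirrors the `SurfaceMsg::Snapshot` reply: return `(snapshot, dirty)`
    /// and clear the flag so the next tick can skip an unchanged grid.
    fn take_snapshot(&mut self) -> (String, bool) {
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (self.contents.clone(), was_dirty)
    }
}
```

Clearing the flag at reply time (rather than at write time) keeps the actor unaware of disk state; the trade-off is that a snapshot dropped by a failed write is not retried until the screen changes again.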
### 3. Writer task + periodic ticker — `crates/spaceshd/src/server.rs` / `main.rs`
- **Writer task:** the sole owner of `Arc<dyn SnapshotStore>`. Receives
`(SurfaceId, Snapshot)` on an unbounded channel and writes to disk. Keeps all
snapshot disk I/O off the actor/PTY hot path and serializes writes.
- **Periodic ticker:** every `snapshot_interval_secs` (config, default 5) the
router iterates live surface handles, sends `SurfaceMsg::Snapshot`, awaits the
reply, and forwards to the writer channel only when `dirty` is true.
- **Graceful shutdown:** before the daemon exits it does one final synchronous
pass over all live surfaces into the writer, then flushes the writer.
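The writer/ticker split can be sketched as follows. To stay self-contained this uses `std::sync::mpsc` and a thread where the daemon would presumably use an async task and an unbounded async channel; `spawn_writer` and `tick` are hypothetical names, `SurfaceId` and `Snapshot` are string stand-ins, and the `save` closure stands in for `SnapshotStore::save`.

```rust
use std::sync::mpsc;
use std::thread;

type SurfaceId = String; // stand-in for the real id type
type Snapshot = String; // stand-in for the core snapshot

/// Spawn the writer: the only place that calls `save`. Returns the sender
/// and a join handle that yields the number of successful writes.
fn spawn_writer<F>(
    mut save: F,
) -> (mpsc::Sender<(SurfaceId, Snapshot)>, thread::JoinHandle<usize>)
where
    F: FnMut(&SurfaceId, &Snapshot) -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<(SurfaceId, Snapshot)>();
    let handle = thread::spawn(move || {
        let mut written = 0;
        // The loop ends once every sender is dropped; queued items are
        // drained first, which is what "flush the writer" relies on.
        for (sid, snap) in rx {
            match save(&sid, &snap) {
                Ok(()) => written += 1,
                // One failed file is logged and dropped, never fatal.
                Err(e) => eprintln!("snapshot write failed for {}: {}", sid, e),
            }
        }
        written
    });
    (tx, handle)
}

/// One ticker pass: forward only surfaces whose snapshot reply was dirty.
fn tick(
    live: &[(SurfaceId, Snapshot, bool)],
    tx: &mpsc::Sender<(SurfaceId, Snapshot)>,
) {
    for (sid, snap, dirty) in live {
        if *dirty {
            let _ = tx.send((sid.clone(), snap.clone()));
        }
    }
}
```

Graceful shutdown in this sketch is one last `tick` over all live surfaces, then dropping the sender and joining the handle, so every queued snapshot is written before the process exits.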
### 4. Resume config — `crates/spaceshd/src/config.rs`
```toml
[resume]
commands = { claude = ["--continue"], codex = ["resume"] }
```
```rust
#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct ResumeConfig {
    /// Command basename to resume args; user entries override built-in defaults.
    #[serde(default)]
    pub commands: std::collections::HashMap<String, Vec<String>>,
}
```
Added to `Config` as `#[serde(default)] pub resume: ResumeConfig`. A method
`resume_args(command: &str) -> Option<Vec<String>>` resolves by command
basename: user map first, then a built-in default table
(`claude → ["--continue"]`, `codex → ["resume"]`), then `None`. The default
table is a `const`/static, not inline literals in branching logic.
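The lookup order (user map, built-in table, `None`) might look like the sketch below. It drops the serde derives so it compiles with the standard library only, and it assumes that "basename" means the final path component as returned by `Path::file_name`; `DEFAULT_RESUME` is a hypothetical name for the static table.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults from this design, kept as a static table rather than
/// literals scattered through branching logic.
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

#[derive(Debug, Clone, Default)]
pub struct ResumeConfig {
    pub commands: HashMap<String, Vec<String>>,
}

impl ResumeConfig {
    /// Resolve resume args by command basename:
    /// user map first, then the built-in table, then `None`.
    pub fn resume_args(&self, command: &str) -> Option<Vec<String>> {
        // "/usr/local/bin/claude" and "claude" both resolve as "claude".
        let base = Path::new(command).file_name()?.to_str()?;
        if let Some(args) = self.commands.get(base) {
            return Some(args.clone());
        }
        DEFAULT_RESUME
            .iter()
            .find(|(name, _)| *name == base)
            .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
    }
}
```

Because the user map is consulted first, a user entry for `claude` fully replaces the built-in args for that command instead of appending to them.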
### 5. Protocol — `crates/spacesh-proto/src/message.rs`
- `Cmd::StartSurface { surface_id: SurfaceId, resume: bool }` — start a stopped
surface. `resume = true` builds `command + resume_args(command)` (falling
back to the original args when no resume mapping exists); `resume = false`
builds the original `command + args`. cwd and geometry come from the spec.
- `Cmd::GetSnapshot { surface_id: SurfaceId }` → response carries
`Option<SnapshotView>`.
- `SnapshotView { ansi, cols, rows, cursor_row, cursor_col }` — a proto-level
mirror of core `Snapshot`, so `spacesh-proto` does not depend on
`spacesh-core`. The daemon converts core `Snapshot` into `SnapshotView` at
the protocol boundary.
`spacesh_core::snapshot::Snapshot` gains `Deserialize` (alongside `Serialize`)
so the store and tests can load it back from disk directly, with no
intermediate record type.
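The boundary conversion from core `Snapshot` to `SnapshotView` is a field-for-field copy. In the sketch below both structs are local stand-ins and the field types (`String`, `u16`) are assumptions; `to_view` is a hypothetical function name.

```rust
/// Stand-in for `spacesh_core::snapshot::Snapshot`; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub ansi: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
}

/// Stand-in for the proto-level mirror in `spacesh-proto`.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotView {
    pub ansi: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
}

/// Convert at the protocol boundary, moving the ANSI buffer without a copy.
pub fn to_view(snap: Snapshot) -> SnapshotView {
    SnapshotView {
        ansi: snap.ansi,
        cols: snap.cols,
        rows: snap.rows,
        cursor_row: snap.cursor_row,
        cursor_col: snap.cursor_col,
    }
}
```

A free function is used instead of `impl From<Snapshot> for SnapshotView` because, with the two types living in different crates and the conversion living in the daemon, Rust's orphan rule would reject a `From` impl written there.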
### 6. Server handlers — `crates/spaceshd/src/server.rs`
- `StartSurface`: look up the spec; if missing → error response. Build a
`SpawnSpec` with resume-or-plain args, the spec's cwd, and current geometry;
`spawn_surface_deferred(...)`; `registry.set_live(handle)`; broadcast
`workspace_changed` so all clients flip `running` to true.
- `GetSnapshot`: read from the snapshot store, convert to `SnapshotView`,
return `Option`.
- On surface close/remove: call `snapshot_store.remove(sid)` (via the writer or
a direct store handle) so stale files do not accumulate.
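The `StartSurface` decision logic (unknown surface, already-running guard, resume-or-plain args with fallback) can be sketched without any spawning. Everything here is a simplified stand-in: `Registry` holds only what the example needs, `StartError` is a hypothetical error type, the `resume_args` closure stands in for the config lookup, and the function returns the argv it would spawn instead of creating an actor.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq)]
pub struct SurfaceSpec {
    pub command: String,
    pub args: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum StartError {
    UnknownSurface,
    AlreadyRunning,
}

/// Hypothetical slice of registry state: persisted specs plus live ids.
pub struct Registry {
    pub specs: HashMap<String, SurfaceSpec>,
    pub live: HashSet<String>,
}

impl Registry {
    pub fn is_running(&self, sid: &str) -> bool {
        self.live.contains(sid)
    }

    /// Decide what to spawn for a stopped surface and mark it live.
    pub fn start_surface<F>(
        &mut self,
        sid: &str,
        resume: bool,
        resume_args: F,
    ) -> Result<Vec<String>, StartError>
    where
        F: Fn(&str) -> Option<Vec<String>>,
    {
        let spec = self.specs.get(sid).ok_or(StartError::UnknownSurface)?;
        if self.is_running(sid) {
            return Err(StartError::AlreadyRunning); // never a second actor
        }
        let args = if resume {
            // No mapping for this command: fall back to a plain restart.
            resume_args(&spec.command).unwrap_or_else(|| spec.args.clone())
        } else {
            spec.args.clone()
        };
        let mut argv = vec![spec.command.clone()];
        argv.extend(args);
        self.live.insert(sid.to_string());
        Ok(argv)
    }
}
```

In the daemon the returned argv would feed the `SpawnSpec` together with the spec's cwd and geometry, and marking the surface live corresponds to `registry.set_live(handle)` followed by the `workspace_changed` broadcast.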
### 7. App — `app/src` and `app/src-tauri`
- `socketBridge.ts`: `startSurface(id, resume)`, `getSnapshot(id)`, and a
`SnapshotView` type.
- `app/src-tauri/src/bridge.rs`: `start_surface` and `get_snapshot` invoke
handlers forwarding to the daemon, wired into the Tauri `invoke_handler` and
the JS bridge.
- `LayoutEngine.tsx` / `TerminalView.tsx`: when a surface's `running === false`,
render a stopped overlay instead of a live terminal:
- fetch `getSnapshot(id)` and paint the ANSI into a read-only, dimmed
`xterm` instance for visual context;
- centered controls: **Resume** → `startSurface(id, true)`,
**Restart fresh** → `startSurface(id, false)`;
- on success the daemon's `workspace_changed` sets `running = true`, the
overlay unmounts, and the normal live `TerminalView` mounts.
- a small "stopped" indicator in the panel header.
## Data flow
```
running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ <sid>.json
running surface ──(on exit)─────────────────────────▶ writer task ──▶ <sid>.json
daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ <sid>.json
reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
client ▶ GetSnapshot(sid) ▶ paint dimmed read-only screen + Resume/Restart
user clicks Resume ▶ StartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
▶ workspace_changed(running=true) ▶ live TerminalView mounts
```
## Error handling
- Missing/corrupt snapshot file → `GetSnapshot` returns `None`; the overlay
shows an empty dimmed panel with the Resume/Restart controls (still usable).
- `StartSurface` on an unknown/already-running surface → error response; client
ignores or surfaces a toast. No duplicate actor: guard on
`registry.is_running(sid)`.
- Resume command for an agent without a mapping → falls back to the original
spec args (plain restart), never fails the spawn.
- Writer task failure to write one file is logged and dropped; it must not stall
the daemon or other surfaces.
## Performance
- A visible-screen snapshot is on the order of rows × cols bytes of ANSI (about 2 KB for an 80 × 24 grid, before escape sequences); at a 5s cadence with
the `dirty` debounce, idle panels write nothing. All disk writes happen in the
single writer task, off the PTY/actor hot path, so the keypress→echo (<16 ms)
and output-batching budgets are untouched.
## Testing
- **snapshot_store:** save→load round-trip; atomic write; missing file → `None`;
corrupt file → `None`; `remove` deletes the file.
- **config:** parse `[resume]` table; `resume_args` returns user override, then
built-in default, then `None`; missing section defaults cleanly.
- **surface actor:** `SurfaceMsg::Snapshot` returns the current grid contents;
`dirty` is true after output and false immediately after a snapshot.
- **server:** `StartSurface{resume:true}` builds `command + resume_args`;
`{resume:false}` builds `command + args`; `GetSnapshot` returns the saved
view; `is_running` guard prevents a second actor.
- **registry:** starting a stopped surface re-populates the live map and the
view flips `running` to true.
## Out of scope
- Resuming the literal in-flight process across power loss (impossible).
- Scrollback history beyond the visible screen.
- Auto-resume on daemon start (manual trigger chosen).
- Per-surface resume command stored in the spec/wizard (config map chosen).