# Session Persistence (resurrect + resume) — Design
**Date:** 2026-06-15
**Status:** Approved, ready for implementation plan
## Goal
Let a workspace survive both GUI loss and full power loss. Closing a tab or
the whole GUI already keeps agents running (daemon owns PTYs, reattach via live
grid snapshot — M1). This design adds the missing half: after the daemon itself
dies (reboot, battery death, `kill -9`), the user can bring panels back —
panels show their last on-screen state and offer a one-click **Resume** that
restarts the agent with its session-continue flag (e.g. `claude --continue`).
## Scope decisions (locked)
- **Reboot behavior:** resurrect + resume. A live process cannot survive a
power-off — not even tmux does that. After a daemon restart we respawn the
panel from its persisted spec (cwd intact) and, for agents that support it,
relaunch with a resume flag so the conversation continues in a *new* process.
- **Resurrect trigger:** manual, per-panel. After a daemon restart panels are
shown stopped with their last screen; nothing spawns until the user clicks.
This avoids surprise token burn from auto-launching many agents.
- **Persisted scrollback:** visible screen only. We reuse the existing
`snapshot_ansi()` serializer (the same one that powers live reattach) and
write its output to disk. No scrollback history beyond the visible grid.
- **Resume command source:** a `[resume]` table in `~/.spacesh/config.toml`
mapping a command basename to resume args, merged over built-in defaults.
- **Snapshot cadence:** periodic + shutdown. A background task dumps changed
grids every N seconds (default 5), plus a full pass on graceful shutdown and
a final dump when an actor exits. After a `kill -9` or battery death, at most
the last N seconds of screen changes are lost.
## What already exists (do not rebuild)
- `state.json` (via `JsonStateStore` + debounced `Persister`) already persists
structure: groups, workspaces, layout, zoom, pinned, and per-surface
`SurfaceSpec` (`command`, `args`, `cwd`, `cols`, `rows`, `agent_label`,
`autostart`). On cold start `Registry::restore()` loads this; the `live`
actor map is empty, so every surface is "stopped" (spec present, no process).
- `SurfaceView.running: bool` already tells the client a surface is stopped.
- `spacesh_core::snapshot::snapshot_ansi(&GridSurface) -> Snapshot` serializes
the visible grid to an ANSI dump (`ansi`, `cols`, `rows`, `cursor_row`,
`cursor_col`). `Snapshot` currently derives `Serialize` only.
- The surface actor already answers `SurfaceMsg::AttachSnapshot` by calling
`snapshot_ansi(&grid)`; the grid is the authoritative screen model.
## Components
### 1. Snapshot store — `crates/spaceshd/src/snapshot_store.rs` (new)
Per-surface JSON file `~/.spacesh/snapshots/<surface_id>.json` holding the
serialized visible-screen snapshot. Atomic write (temp file → `sync_all` →
rename), mirroring `state_store::JsonStateStore`.
```rust
pub trait SnapshotStore: Send + Sync {
fn save(&self, sid: &SurfaceId, snap: &Snapshot) -> anyhow::Result<()>;
fn load(&self, sid: &SurfaceId) -> Option<Snapshot>;
fn remove(&self, sid: &SurfaceId);
}
```
The store persists the core `spacesh_core::snapshot::Snapshot` directly
(`ansi`, `cols`, `rows`, `cursor_row`, `cursor_col`) — `spaceshd` already
depends on `spacesh-core`, so no separate daemon record type is introduced. A
corrupt/missing file yields `None` (never an error that blocks resurrect).
`remove` deletes the file and is called when a surface is closed or removed
from the tree.
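
The atomic write described above can be sketched with the standard library alone. This is a minimal illustration, not the `JsonStateStore` code: `write_atomic` and `read_opt` are hypothetical helper names, the payload is raw bytes rather than serialized JSON, and the temp-file naming is an assumption.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: replace `path` with `bytes` so that a reader sees
/// either the old file or the new one, never a partial write.
pub fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Keep the temp file in the same directory so the rename does not
    // cross filesystems (assumed naming: `<name>.json.tmp`).
    let tmp: PathBuf = path.with_extension("json.tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Flush file contents and metadata to disk before publishing the file.
    file.sync_all()?;
    // Rename over the destination; on POSIX filesystems this replaces the
    // target in a single step.
    fs::rename(&tmp, path)
}

/// Hypothetical helper: a missing or unreadable file maps to `None`,
/// matching the "never an error that blocks resurrect" rule.
pub fn read_opt(path: &Path) -> Option<Vec<u8>> {
    fs::read(path).ok()
}
```

A real `load` would additionally decode the bytes and return `None` when decoding fails, which covers the corrupt-file case.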
### 2. On-demand snapshot from the actor — `crates/spaceshd/src/surface.rs`
Add a message that returns the current snapshot without subscribing:
```rust
SurfaceMsg::Snapshot { reply: oneshot::Sender<(Snapshot, bool)> } // (snapshot, dirty)
```
The actor tracks a `dirty` flag: set inside `flush` whenever bytes are fed into
the grid (`grid.feed`), cleared when a `Snapshot` reply is produced. The bool
lets the periodic dumper skip unchanged grids.
On actor exit (after `pty.wait()`), the actor takes a final `snapshot_ansi`
and forwards `(id, snapshot)` to the writer channel (a cloned
`mpsc::UnboundedSender<(SurfaceId, Snapshot)>` passed into the actor), so the
last screen of a finished process is persisted even between ticker ticks.
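
The `dirty` bookkeeping is small enough to show in isolation. The sketch below uses a hypothetical `ScreenState` in place of the actor and a plain byte buffer in place of `GridSurface`; only the set-on-feed, clear-on-snapshot behavior is taken from this design.

```rust
/// Hypothetical stand-in for the actor state; the real actor owns a
/// `GridSurface` and produces its snapshot with `snapshot_ansi`.
pub struct ScreenState {
    screen: Vec<u8>,
    dirty: bool,
}

impl ScreenState {
    pub fn new() -> Self {
        Self { screen: Vec::new(), dirty: false }
    }

    /// Called from the flush path: any bytes fed to the grid mark it dirty.
    pub fn feed(&mut self, bytes: &[u8]) {
        self.screen.extend_from_slice(bytes);
        self.dirty = true;
    }

    /// Mirrors the `SurfaceMsg::Snapshot` reply: returns `(snapshot, dirty)`
    /// and clears the flag so the next unchanged tick can be skipped.
    pub fn snapshot(&mut self) -> (Vec<u8>, bool) {
        let was_dirty = std::mem::replace(&mut self.dirty, false);
        (self.screen.clone(), was_dirty)
    }
}
```

Because the flag is cleared in the same step that produces the reply, output arriving after the reply sets it again and is picked up on the next tick.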
### 3. Writer task + periodic ticker — `crates/spaceshd/src/server.rs` / `main.rs`
- **Writer task:** the sole owner of `Arc<dyn SnapshotStore>`. Receives
`(SurfaceId, Snapshot)` on an unbounded channel and writes to disk. Keeps all
snapshot disk I/O off the actor/PTY hot path and serializes writes.
- **Periodic ticker:** every `snapshot_interval_secs` (config, default 5) the
router iterates live surface handles, sends `SurfaceMsg::Snapshot`, awaits the
reply, and forwards to the writer channel only when `dirty` is true.
- **Graceful shutdown:** before the daemon exits it does one final synchronous
pass over all live surfaces into the writer, then flushes the writer.
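
The single-writer pattern above can be illustrated with std threads and channels. This is a sketch under stated assumptions: the daemon itself is described with async actors and `oneshot` replies, whereas this example swaps in `std::sync::mpsc` and a thread, uses an in-memory map instead of the disk store, and treats `SurfaceId` and `Snapshot` as hypothetical aliases.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// Hypothetical aliases standing in for the real daemon types.
pub type SurfaceId = String;
pub type Snapshot = Vec<u8>;

/// Spawn the sole writer. It owns the "store" (a map here, files in the
/// daemon) and drains the unbounded channel until every sender is dropped.
pub fn spawn_writer() -> (
    Sender<(SurfaceId, Snapshot)>,
    JoinHandle<HashMap<SurfaceId, Snapshot>>,
) {
    let (tx, rx) = channel::<(SurfaceId, Snapshot)>();
    let handle = thread::spawn(move || {
        let mut store = HashMap::new();
        for (sid, snap) in rx {
            // Writes are serialized because only this thread touches the store.
            store.insert(sid, snap);
        }
        store
    });
    (tx, handle)
}

/// One ticker pass over `(id, snapshot, dirty)` replies: forward only the
/// dirty ones. Returns how many snapshots were forwarded.
pub fn tick(
    replies: &[(SurfaceId, Snapshot, bool)],
    tx: &Sender<(SurfaceId, Snapshot)>,
) -> usize {
    let mut forwarded = 0;
    for (sid, snap, dirty) in replies {
        if *dirty && tx.send((sid.clone(), snap.clone())).is_ok() {
            forwarded += 1;
        }
    }
    forwarded
}
```

Dropping the last sender and joining the handle plays the role of "flush the writer" on graceful shutdown: the loop ends only after every queued snapshot has been handled.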
### 4. Resume config — `crates/spaceshd/src/config.rs`
```toml
[resume]
commands = { claude = ["--continue"], codex = ["resume"] }
```
```rust
#[derive(Debug, Clone, Default, Deserialize, Serialize)]
pub struct ResumeConfig {
#[serde(default)]
pub commands: std::collections::HashMap<String, Vec<String>>,
}
```
Added to `Config` as `#[serde(default)] pub resume: ResumeConfig`. A method
`resume_args(command: &str) -> Option<Vec<String>>` resolves by command
basename: user map first, then a built-in default table
(`claude → ["--continue"]`, `codex → ["resume"]`), then `None`. The default
table is a `const`/static, not inline literals in branching logic.
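
A minimal sketch of the lookup order, written as a free function over the user map rather than a method on `Config` (that shape, and the `DEFAULT_RESUME` name, are assumptions for illustration):

```rust
use std::collections::HashMap;
use std::path::Path;

/// Built-in defaults named in this design, kept in one static table.
const DEFAULT_RESUME: &[(&str, &[&str])] = &[
    ("claude", &["--continue"]),
    ("codex", &["resume"]),
];

/// Resolve resume args by command basename: user map first, then the
/// built-in table, then `None`.
pub fn resume_args(
    user: &HashMap<String, Vec<String>>,
    command: &str,
) -> Option<Vec<String>> {
    // "/usr/local/bin/claude" and "claude" both resolve to "claude".
    let base = Path::new(command).file_name()?.to_str()?;
    if let Some(args) = user.get(base) {
        return Some(args.clone());
    }
    DEFAULT_RESUME
        .iter()
        .find(|(name, _)| *name == base)
        .map(|(_, args)| args.iter().map(|a| a.to_string()).collect())
}
```

With an empty user map, `resume_args(&user, "/usr/local/bin/claude")` yields `Some(["--continue"])` and an unmapped command such as `bash` yields `None`.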
### 5. Protocol — `crates/spacesh-proto/src/message.rs`
The codebase already has `Cmd::RestartSurface { surface_id }` (starts a stopped
surface from its spec, guarded by `is_running`) and an `Attach` response that
already carries `{ snapshot, cols, rows, cursor_row, cursor_col, stopped }`.
So no new command or wire type is needed beyond one field:
- Extend `Cmd::RestartSurface` with `#[serde(default)] resume: bool`.
`resume = true` builds `command + resume_args(command)` (falling back to the
original args when no mapping exists); `resume = false` keeps the original
`command + args` (today's behavior). The `#[serde(default)]` attribute keeps
old frames decoding to `resume = false`.
- No `GetSnapshot`, no `StartSurface`, no `SnapshotView`: a stopped-panel
`Attach` returns the **disk** snapshot (see §6) using the existing response
shape.
`spacesh-core::snapshot::Snapshot` gains `Deserialize` (alongside `Serialize`)
so the store can load it back from disk.
### 6. Server handlers — `crates/spaceshd/src/server.rs`
- `RestartSurface { surface_id, resume }`: unchanged flow (spec lookup,
`spawn_from_spec`, `set_live`, `SurfaceRestarted` broadcast). When `resume`,
spawn with a spec whose `args` are replaced by `config.resume_args(command)`
(when present); otherwise spawn the original spec.
- `Attach` for a **stopped** surface: instead of returning the empty
`{ snapshot: "", stopped: true }`, load the disk snapshot via the snapshot
store and return `{ snapshot: <ansi>, cols, rows, cursor_row, cursor_col,
stopped: true }`. Missing file → empty snapshot, still `stopped: true`.
- Surface close/remove (`Close`, `CloseWorkspace`, `remove_surface` paths):
send a remove to the snapshot writer so stale `<sid>.json` files do not
accumulate.
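
The argument substitution in the `RestartSurface` handler might look like the following sketch. `Spec` is a hypothetical trimmed-down version of `SurfaceSpec`, and `spec_for_restart` is an illustrative name; the resolved resume args are passed in so the example stays independent of the config type.

```rust
/// Hypothetical subset of `SurfaceSpec`; the real spec carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Spec {
    pub command: String,
    pub args: Vec<String>,
    pub cwd: String,
}

/// Build the spec to spawn from. Only `args` may change: `command` and
/// `cwd` always come from the persisted spec.
pub fn spec_for_restart(
    spec: &Spec,
    resume: bool,
    resume_args: Option<Vec<String>>,
) -> Spec {
    let mut out = spec.clone();
    if resume {
        // No mapping for this command: keep the original args, so the
        // request degrades to a plain restart instead of failing.
        if let Some(args) = resume_args {
            out.args = args;
        }
    }
    out
}
```

Returning a fresh spec rather than mutating the stored one keeps the persisted `args` intact, so a later **Restart fresh** still launches the original command line.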
### 7. App — `app/src` and `app/src-tauri`
- `socketBridge.ts`: `restartSurface(id, resume = false)` gains the `resume`
arg; `AttachResult` gains optional `cursor_row`/`cursor_col`/`stopped`.
- `app/src-tauri/src/bridge.rs`: `restart_surface` forwards a `resume: bool`
arg into `Cmd::RestartSurface`.
- `LayoutEngine.tsx` stopped branch (`running[id] === false`): paint the disk
snapshot into a dimmed, read-only `xterm` behind the controls, and offer two
buttons — **Resume** → `restartSurface(id, true)` and **Restart fresh** →
`restartSurface(id, false)`. On success the daemon's `workspace_changed`
flips `running` to true, the overlay unmounts, and the live `TerminalView`
mounts.
## Data flow
```
running surface ──(every 5s, if dirty)──▶ ticker ──▶ writer task ──▶ <sid>.json
running surface ──(on exit)─────────────────────────▶ writer task ──▶ <sid>.json
daemon shutdown ──(final pass over live)────────────▶ writer task ──▶ <sid>.json
reboot ▶ daemon cold start ▶ Registry::restore(state.json) ▶ all surfaces stopped
client ▶ Attach(sid) [stopped] ▶ disk snapshot ▶ paint dimmed read-only screen
user clicks Resume ▶ RestartSurface{resume:true} ▶ spawn(command + resume_args, cwd)
▶ SurfaceRestarted + running=true ▶ live TerminalView mounts
```
## Error handling
- Missing/corrupt snapshot file → stopped `Attach` returns an empty snapshot;
the overlay shows an empty dimmed panel with the Resume/Restart controls.
- `RestartSurface` on an already-running surface → no-op ok (existing
`is_running` guard); unknown surface → `NOT_FOUND`.
- Resume command for an agent without a mapping → falls back to the original
spec args (plain restart), never fails the spawn.
- Writer task failure to write one file is logged and dropped; it must not stall
the daemon or other surfaces.
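
The first rule above, where a missing or corrupt snapshot still produces a valid stopped `Attach` reply, can be sketched as follows. The struct definitions are hypothetical mirrors of the field lists in this document (the `u16` types are assumed), and falling back to the spec's `cols`/`rows` is an assumption, since the design only requires an empty snapshot with `stopped: true`.

```rust
/// Hypothetical mirror of the core `Snapshot` fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub ansi: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
}

/// Hypothetical mirror of the existing `Attach` response shape.
#[derive(Debug, PartialEq)]
pub struct AttachReply {
    pub snapshot: String,
    pub cols: u16,
    pub rows: u16,
    pub cursor_row: u16,
    pub cursor_col: u16,
    pub stopped: bool,
}

/// Build the reply for a stopped surface from whatever the store returned.
/// `loaded` is `None` for a missing or corrupt file.
pub fn stopped_attach(loaded: Option<Snapshot>, spec_cols: u16, spec_rows: u16) -> AttachReply {
    match loaded {
        Some(s) => AttachReply {
            snapshot: s.ansi,
            cols: s.cols,
            rows: s.rows,
            cursor_row: s.cursor_row,
            cursor_col: s.cursor_col,
            stopped: true,
        },
        // Assumed fallback: empty screen sized from the persisted spec.
        None => AttachReply {
            snapshot: String::new(),
            cols: spec_cols,
            rows: spec_rows,
            cursor_row: 0,
            cursor_col: 0,
            stopped: true,
        },
    }
}
```

Both branches set `stopped: true`, so the client can show the Resume/Restart controls whether or not a saved screen exists.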
## Performance
- A visible-screen snapshot is on the order of rows × cols bytes of ANSI (more
when the screen carries heavy styling or multi-byte characters). At a 5s
cadence with the `dirty` debounce, idle panels write nothing. All disk writes
happen in the single writer task, off the PTY/actor hot path, so the
keypress→echo (<16 ms) and output-batching budgets are untouched.
## Testing
- **snapshot_store:** save→load round-trip; atomic write; missing file → `None`;
corrupt file → `None`; `remove` deletes the file.
- **config:** parse `[resume]` table; `resume_args` returns user override, then
built-in default, then `None`; missing section defaults cleanly.
- **surface actor:** `SurfaceMsg::Snapshot` returns the current grid contents;
`dirty` is true after output and false immediately after a snapshot.
- **server:** `RestartSurface{resume:true}` spawns with `command + resume_args`;
`{resume:false}` spawns with `command + args`; stopped `Attach` returns the
saved disk snapshot; `is_running` guard prevents a second actor.
- **registry:** starting a stopped surface re-populates the live map and the
view flips `running` to true.
## Out of scope
- Resuming the literal in-flight process across power loss (impossible).
- Scrollback history beyond the visible screen.
- Auto-resume on daemon start (manual trigger chosen).
- Per-surface resume command stored in the spec/wizard (config map chosen).