docs: spec for scheduled (recurring) task runs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-03 12:40:26 +07:00
parent 1810497c2a
commit e8f29064fb
1 changed files with 207 additions and 0 deletions
@@ -0,0 +1,207 @@
+# Scheduled (recurring) task runs — design
+
+**Date:** 2026-07-03
+**Status:** awaiting user review
+
+## Context
+
+Migrations are run manually today (the operator clicks "Run migration"). For
+ongoing source→destination sync, the operator wants a task to run itself on a
+recurring interval (e.g. every 1h or 6h) without babysitting it. Requirements
+from the user:
+
+- Per-task recurrence interval; the task runs automatically.
+- Never start a run while the previous one is still running; measure the interval
+  from the **completion** of the last run (not its start).
+- Show when the next run is due, in the **browser's local time**.
+- A separate modal showing a log of runs with their status and totals.
+- Deleting a task must clean up all this data along with the task.
+
+Decisions confirmed with the user:
+
+- **Run-log detail:** per-run totals only (reuse `runs`); no per-account snapshot.
+- **Enable gate:** a schedule can be enabled only when **all accounts test OK**.
+- **Breaker:** if a *scheduled* run finishes with errors, disable the schedule and
+  mark the task **broken** (red icon in the tasks list and the task detail).
+  Manual runs never trip the breaker.
+- **First run:** one interval after enabling (`schedule_anchor + interval`).
+- **Intervals:** presets 1 / 3 / 6 / 12 / 24 h, plus Off.
+
+### Current architecture (as-is)
+
+- `runs` table already has `started_at`, `finished_at`, `status`, and totals
+  (`migrations/0001_init.up.sql:35-44`). `CreateRun`/`FinishRun` in
+  `internal/store/runs.go`. No list-by-task query yet.
+- `Orchestrator.Run(ctx, taskID)` gates on `gateOK` (all accounts test ok) and
+  `TryMarkTaskRunning` (atomic single-run), creates a run, and launches `runAll`
+  asynchronously; `runAll` sets the task status and calls `FinishRun`
+  (`internal/orchestrator/orchestrator.go`).
+- `ON DELETE CASCADE` from `tasks` already removes `runs`, `accounts`, and
+  `migrated_messages` when a task is deleted.
+- `main.go` wires store → orchestrator → hub → server and calls
+  `ResetRunningOnStartup` to clear phantom "running" after a restart. No scheduler.
+- Task DTO / TS `Task` interface: `internal/httpapi/tasks.go`, `web/src/api.ts`.
+
+## Decision: in-process polling scheduler
+
+A single background goroutine (a `Scheduler`) started from `main.go`, ticking
+every **30 s**. Each tick queries the DB for schedulable tasks and triggers the
+due ones through the existing `Orchestrator.Run`. The DB is the source of truth,
+so the scheduler is stateless and restart-safe (paired with the existing
+`ResetRunningOnStartup`). Chosen over per-task timers (restart-fragile, in-memory
+state) and external cron (breaks the single-binary model, needs auth wiring).
+
+30 s poll precision is ample for hour-scale intervals.
+
+## Data model
+
+**Migration `0004_task_scheduling`:**
+
+```sql
+-- up
+ALTER TABLE tasks ADD COLUMN schedule_interval_seconds INT NOT NULL DEFAULT 0;
+ALTER TABLE tasks ADD COLUMN schedule_anchor TIMESTAMPTZ;
+ALTER TABLE tasks ADD COLUMN broken BOOLEAN NOT NULL DEFAULT false;
+ALTER TABLE runs  ADD COLUMN trigger TEXT NOT NULL DEFAULT 'manual';
+-- down
+ALTER TABLE runs  DROP COLUMN trigger;
+ALTER TABLE tasks DROP COLUMN broken;
+ALTER TABLE tasks DROP COLUMN schedule_anchor;
+ALTER TABLE tasks DROP COLUMN schedule_interval_seconds;
+```
+
+- `schedule_interval_seconds` — 0 = Off. Presets map to seconds (3600, 10800,
+  21600, 43200, 86400).
+- `schedule_anchor` — set to `now()` when a schedule is enabled; the baseline for
+  the first run when no finished run exists yet.
+- `broken` — the schedule breaker tripped; task needs operator attention.
+- `runs.trigger` — `'manual' | 'scheduled'`; drives the breaker and enriches the
+  run log.
+
+`store.Task` gains `ScheduleIntervalSeconds int64`, `ScheduleAnchor *time.Time`,
+`Broken bool`. `store.Run` gains `Trigger string`, plus `StartedAt time.Time` and
+`FinishedAt *time.Time` (needed for the log + due computation; not currently
+scanned).
+
+### Cascade / deletion
+
+No new cleanup code. `runs` cascade-delete on task delete already; the schedule
+columns live on `tasks` and vanish with the row. The in-memory scheduler polls
+the DB, so a deleted task simply stops appearing. (Confirmed against
+`migrations/0001_init.up.sql` FKs.)
+
+## Store additions
+
+- `SetTaskSchedule(ctx, id, intervalSeconds int64) error` — sets interval, sets
+  `schedule_anchor = now()` when enabling (interval>0) / leaves anchor when
+  disabling, and clears `broken` on any call (any explicit schedule change
+  un-breaks the task; enabling is additionally gated on all-accounts-OK in the
+  HTTP layer, disabling is always allowed).
+- `SetTaskBroken(ctx, id) error` — `broken = true, schedule_interval_seconds = 0`
+  in one UPDATE (the breaker).
+- `CreateRun(ctx, taskID, trigger string)` — extend the existing signature to
+  record the trigger.
+- `ListRunsByTask(ctx, taskID) ([]Run, error)` — newest first, for the modal;
+  scans `id, started_at, finished_at, status, totals, trigger`.
+- `LastFinishedRunAt(ctx, taskID) (*time.Time, error)` — most recent run with a
+  non-null `finished_at`, for due computation. (Or fold into `ListSchedulable`.)
+- `ListSchedulableTasks(ctx) ([]ScheduledTask, error)` — tasks with
+  `schedule_interval_seconds > 0 AND NOT broken AND status <> 'running'`, joined
+  with their last finished run's `finished_at`, so the scheduler decides in one
+  query per tick. Returns the fields `dueAt` needs.
+
+`GetTask`/`ListTasks` SELECT + Scan extend for the three new task columns.
+
+## Scheduler (`internal/scheduler/`)
+
+New package, one focused responsibility.
+
+- `type Scheduler struct { store *store.Store; orch *orchestrator.Orchestrator }`
+- `func (s *Scheduler) Start(ctx context.Context)` — `time.NewTicker(30s)` loop;
+  on each tick calls `s.tick(ctx)`; stops on `ctx.Done()`.
+- `func (s *Scheduler) tick(ctx)` — `ListSchedulableTasks`, then for each due task
+  calls `s.orch.Run(ctx, taskID)` with `trigger = "scheduled"`. `Run` errors
+  (`ErrAlreadyRunning`, `ErrNotTested`) are logged, not fatal.
+- **Pure, unit-tested decision function:**
+  `func dueAt(interval time.Duration, anchor time.Time, lastFinished *time.Time) time.Time`
+  → `lastFinished + interval` if a finished run exists, else `anchor + interval`.
+  A task is due when `now >= dueAt(...)`. Kept pure so the tick logic is testable
+  without a DB or clock injection at the call site.
+
+`main.go` starts it: `go scheduler.New(st, orch).Start(context.Background())`.
+
+### Trigger threading + breaker
+
+`Orchestrator.Run` gains a `trigger string` parameter (manual callers pass
+`"manual"`; the scheduler passes `"scheduled"`). `Run` forwards it to
+`CreateRun` and into `runAll`. In `runAll`, after the final status is computed:
+
+```
+if trigger == "scheduled" && totErr > 0 {
+    _ = o.store.SetTaskBroken(ctx, task.ID)
+    o.hub.Publish(Event{Type: "task_broken", TaskID: task.ID, ...})
+}
+```
+
+The HTTP `handleRun` passes `"manual"`. This keeps the breaker precise: only
+scheduled runs trip it.
+
+## HTTP API
+
+- **`PUT /api/tasks/{id}/schedule`** — body `{interval_seconds: int}`. Enabling
+  (interval>0) first checks `gateOK` (all accounts test OK) via
+  `ListAccountsByTask`; returns **409** with a clear message if not. Then
+  `SetTaskSchedule`. Disabling (0) always allowed.
+- **`GET /api/tasks/{id}/runs`** — `ListRunsByTask` → run-log rows.
+- **Task DTO** (`tasks.go`) gains `schedule_interval_seconds`, `broken`, and a
+  computed `next_run_at *string` (RFC3339, UTC; null when off / running / broken).
+  `next_run_at` is computed server-side from the same `dueAt` logic so the client
+  only formats it. `ListTasks` DTO also includes `broken` (for the red icon in the
+  list).
+
+## Frontend
+
+- **`api.ts`**: `Task` gains `schedule_interval_seconds?`, `broken?`,
+  `next_run_at?`. New `setTaskSchedule(taskId, intervalSeconds)` and
+  `listRuns(taskId)` (+ a `Run` interface).
+- **`TaskDetail.tsx`**: in the run-control panel, an interval `<select>`
+  (Off/1h/3h/6h/12h/24h) bound to `setTaskSchedule`; a "Next run: <local time>"
+  line formatting `next_run_at` with `toLocaleString()` (shows "running…" when a
+  run is active, nothing when off); a red **broken** badge when `broken`. A
+  **Runs** button opens a new modal.
+- **`RunLogModal.tsx`** (new): lists `listRuns(taskId)` rows — started/finished in
+  browser-local time, `trigger`, a `StatusBadge`, and copied/skipped/errors.
+  Reuses the existing `Modal` + `StatusBadge` + `.tbl` patterns.
+- **`Tasks.tsx`**: red icon/indicator on rows where `broken`.
+- WS: on `task_broken`/`run_done`, `TaskDetail` reloads (existing reload wiring);
+  `Tasks` list refreshes.
+
+## Error handling
+
+- Enable-while-not-tested → 409, surfaced in the existing `role="alert"` banner.
+- Scheduler `Run` errors are logged and skipped; a broken task is skipped every
+  tick until the operator re-enables (which clears `broken`).
+- Restart: `ResetRunningOnStartup` clears phantom running; the scheduler recomputes
+  due times from persisted `finished_at`/`anchor`.
+
+## Testing
+
+- **store**: `SetTaskSchedule` (sets interval+anchor, clears broken),
+  `SetTaskBroken` (interval→0, broken→true), `CreateRun` records trigger,
+  `ListRunsByTask` ordering + fields, `GetTask` returns new columns.
+- **scheduler (pure)**: `dueAt` / due decision — not-yet-due, due-from-finished,
+  first-run-from-anchor, running skipped, broken skipped.
+- **orchestrator**: scheduled run with errors trips `SetTaskBroken`; manual run
+  with errors does not.
+- **cascade**: deleting a task removes its runs (extend existing cascade test).
+- **E2E** (prod, per project rules): enable a schedule on a tested task → "Next
+  run" shows correct local time; (with a short test interval) the run auto-starts;
+  the Runs modal shows the entry with `scheduled` trigger; break an account → the
+  next scheduled run disables the schedule and marks the task broken (red icon in
+  list + detail).
+
+## Out of scope (YAGNI)
+
+- Arbitrary cron expressions; catch-up/backfill of missed runs (run once when due);
+  notifications; per-account breakdown in the run log (totals chosen); configurable
+  poll interval.