docs: spec for scheduled (recurring) task runs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,207 @@
|
||||
# Scheduled (recurring) task runs — design
|
||||
|
||||
**Date:** 2026-07-03
|
||||
**Status:** awaiting user review
|
||||
|
||||
## Context
|
||||
|
||||
Migrations are run manually today (the operator clicks "Run migration"). For
|
||||
ongoing source→destination sync, the operator wants a task to run itself on a
|
||||
recurring interval (e.g. every 1h or 6h) without babysitting it. Requirements
|
||||
from the user:
|
||||
|
||||
- Per-task recurrence interval; the task runs automatically.
|
||||
- Never start a run while the previous one is still running; measure the interval
|
||||
from the **completion** of the last run (not its start).
|
||||
- Show when the next run is due, in the **browser's local time**.
|
||||
- A separate modal showing a log of runs with their status and totals.
|
||||
- Deleting a task must clean up all this data along with the task.
|
||||
|
||||
Decisions confirmed with the user:
|
||||
|
||||
- **Run-log detail:** per-run totals only (reuse `runs`); no per-account snapshot.
|
||||
- **Enable gate:** a schedule can be enabled only when **all accounts test OK**.
|
||||
- **Breaker:** if a *scheduled* run finishes with errors, disable the schedule and
|
||||
mark the task **broken** (red icon in the tasks list and the task detail).
|
||||
Manual runs never trip the breaker.
|
||||
- **First run:** one interval after enabling (`schedule_anchor + interval`).
|
||||
- **Intervals:** presets 1 / 3 / 6 / 12 / 24 h, plus Off.
|
||||
|
||||
### Current architecture (as-is)
|
||||
|
||||
- `runs` table already has `started_at`, `finished_at`, `status`, and totals
|
||||
(`migrations/0001_init.up.sql:35-44`). `CreateRun`/`FinishRun` in
|
||||
`internal/store/runs.go`. No list-by-task query yet.
|
||||
- `Orchestrator.Run(ctx, taskID)` gates on `gateOK` (all accounts test ok) and
|
||||
`TryMarkTaskRunning` (atomic single-run), creates a run, and launches `runAll`
|
||||
asynchronously; `runAll` sets the task status and calls `FinishRun`
|
||||
(`internal/orchestrator/orchestrator.go`).
|
||||
- `ON DELETE CASCADE` from `tasks` already removes `runs`, `accounts`, and
|
||||
`migrated_messages` when a task is deleted.
|
||||
- `main.go` wires store → orchestrator → hub → server and calls
|
||||
`ResetRunningOnStartup` to clear phantom "running" after a restart. No scheduler.
|
||||
- Task DTO / TS `Task` interface: `internal/httpapi/tasks.go`, `web/src/api.ts`.
|
||||
|
||||
## Decision: in-process polling scheduler
|
||||
|
||||
A single background goroutine (a `Scheduler`) started from `main.go`, ticking
|
||||
every **30 s**. Each tick queries the DB for schedulable tasks and triggers the
|
||||
due ones through the existing `Orchestrator.Run`. The DB is the source of truth,
|
||||
so the scheduler is stateless and restart-safe (paired with the existing
|
||||
`ResetRunningOnStartup`). Chosen over per-task timers (restart-fragile, in-memory
|
||||
state) and external cron (breaks the single-binary model, needs auth wiring).
|
||||
|
||||
30 s poll precision is ample for hour-scale intervals.
|
||||
|
||||
## Data model
|
||||
|
||||
**Migration `0004_task_scheduling`:**
|
||||
|
||||
```sql
|
||||
-- up
|
||||
ALTER TABLE tasks ADD COLUMN schedule_interval_seconds INT NOT NULL DEFAULT 0;
|
||||
ALTER TABLE tasks ADD COLUMN schedule_anchor TIMESTAMPTZ;
|
||||
ALTER TABLE tasks ADD COLUMN broken BOOLEAN NOT NULL DEFAULT false;
|
||||
ALTER TABLE runs ADD COLUMN trigger TEXT NOT NULL DEFAULT 'manual';
|
||||
-- down
|
||||
ALTER TABLE runs DROP COLUMN trigger;
|
||||
ALTER TABLE tasks DROP COLUMN broken;
|
||||
ALTER TABLE tasks DROP COLUMN schedule_anchor;
|
||||
ALTER TABLE tasks DROP COLUMN schedule_interval_seconds;
|
||||
```
|
||||
|
||||
- `schedule_interval_seconds` — 0 = Off. Presets map to seconds (3600, 10800,
|
||||
21600, 43200, 86400).
|
||||
- `schedule_anchor` — set to `now()` when a schedule is enabled; the baseline for
|
||||
the first run when no finished run exists yet.
|
||||
- `broken` — the schedule breaker tripped; task needs operator attention.
|
||||
- `runs.trigger` — `'manual' | 'scheduled'`; drives the breaker and enriches the
|
||||
run log.
|
||||
|
||||
`store.Task` gains `ScheduleIntervalSeconds int64`, `ScheduleAnchor *time.Time`,
|
||||
`Broken bool`. `store.Run` gains `Trigger string`, plus `StartedAt time.Time` and
|
||||
`FinishedAt *time.Time` (needed for the log + due computation; not currently
|
||||
scanned).
|
||||
|
||||
### Cascade / deletion
|
||||
|
||||
No new cleanup code. `runs` cascade-delete on task delete already; the schedule
|
||||
columns live on `tasks` and vanish with the row. The in-memory scheduler polls
|
||||
the DB, so a deleted task simply stops appearing. (Confirmed against
|
||||
`migrations/0001_init.up.sql` FKs.)
|
||||
|
||||
## Store additions
|
||||
|
||||
- `SetTaskSchedule(ctx, id, intervalSeconds int64) error` — sets interval, sets
|
||||
`schedule_anchor = now()` when enabling (interval>0) / leaves anchor when
|
||||
disabling, and clears `broken` on any call (any explicit schedule change
|
||||
un-breaks the task; enabling is additionally gated on all-accounts-OK in the
|
||||
HTTP layer, disabling is always allowed).
|
||||
- `SetTaskBroken(ctx, id) error` — `broken = true, schedule_interval_seconds = 0`
|
||||
in one UPDATE (the breaker).
|
||||
- `CreateRun(ctx, taskID, trigger string)` — extend the existing signature to
|
||||
record the trigger.
|
||||
- `ListRunsByTask(ctx, taskID) ([]Run, error)` — newest first, for the modal;
|
||||
scans `id, started_at, finished_at, status, totals, trigger`.
|
||||
- `LastFinishedRunAt(ctx, taskID) (*time.Time, error)` — most recent run with a
|
||||
non-null `finished_at`, for due computation. (Or fold into `ListSchedulable`.)
|
||||
- `ListSchedulableTasks(ctx) ([]ScheduledTask, error)` — tasks with
|
||||
`schedule_interval_seconds > 0 AND NOT broken AND status <> 'running'`, joined
|
||||
with their last finished run's `finished_at`, so the scheduler decides in one
|
||||
query per tick. Returns the fields `dueAt` needs.
|
||||
|
||||
`GetTask`/`ListTasks` SELECT + Scan extend for the three new task columns.
|
||||
|
||||
## Scheduler (`internal/scheduler/`)
|
||||
|
||||
New package, one focused responsibility.
|
||||
|
||||
- `type Scheduler struct { store *store.Store; orch *orchestrator.Orchestrator }`
|
||||
- `func (s *Scheduler) Start(ctx context.Context)` — `time.NewTicker(30s)` loop;
|
||||
on each tick calls `s.tick(ctx)`; stops on `ctx.Done()`.
|
||||
- `func (s *Scheduler) tick(ctx)` — `ListSchedulableTasks`, then for each due task
|
||||
calls `s.orch.Run(ctx, taskID)` with `trigger = "scheduled"`. `Run` errors
|
||||
(`ErrAlreadyRunning`, `ErrNotTested`) are logged, not fatal.
|
||||
- **Pure, unit-tested decision function:**
|
||||
`func dueAt(interval time.Duration, anchor time.Time, lastFinished *time.Time) time.Time`
|
||||
→ `lastFinished + interval` if a finished run exists, else `anchor + interval`.
|
||||
A task is due when `now >= dueAt(...)`. Kept pure so the tick logic is testable
|
||||
without a DB or clock injection at the call site.
|
||||
|
||||
`main.go` starts it: `go scheduler.New(st, orch).Start(context.Background())`.
|
||||
|
||||
### Trigger threading + breaker
|
||||
|
||||
`Orchestrator.Run` gains a `trigger string` parameter (manual callers pass
|
||||
`"manual"`; the scheduler passes `"scheduled"`). `Run` forwards it to
|
||||
`CreateRun` and into `runAll`. In `runAll`, after the final status is computed:
|
||||
|
||||
```
|
||||
if trigger == "scheduled" && totErr > 0 {
|
||||
_ = o.store.SetTaskBroken(ctx, task.ID)
|
||||
o.hub.Publish(Event{Type: "task_broken", TaskID: task.ID, ...})
|
||||
}
|
||||
```
|
||||
|
||||
The HTTP `handleRun` passes `"manual"`. This keeps the breaker precise: only
|
||||
scheduled runs trip it.
|
||||
|
||||
## HTTP API
|
||||
|
||||
- **`PUT /api/tasks/{id}/schedule`** — body `{interval_seconds: int}`. Enabling
|
||||
(interval>0) first checks `gateOK` (all accounts test OK) via
|
||||
`ListAccountsByTask`; returns **409** with a clear message if not. Then
|
||||
`SetTaskSchedule`. Disabling (0) always allowed.
|
||||
- **`GET /api/tasks/{id}/runs`** — `ListRunsByTask` → run-log rows.
|
||||
- **Task DTO** (`tasks.go`) gains `schedule_interval_seconds`, `broken`, and a
|
||||
computed `next_run_at *string` (RFC3339, UTC; null when off / running / broken).
|
||||
`next_run_at` is computed server-side from the same `dueAt` logic so the client
|
||||
only formats it. `ListTasks` DTO also includes `broken` (for the red icon in the
|
||||
list).
|
||||
|
||||
## Frontend
|
||||
|
||||
- **`api.ts`**: `Task` gains `schedule_interval_seconds?`, `broken?`,
|
||||
`next_run_at?`. New `setTaskSchedule(taskId, intervalSeconds)` and
|
||||
`listRuns(taskId)` (+ a `Run` interface).
|
||||
- **`TaskDetail.tsx`**: in the run-control panel, an interval `<select>`
|
||||
(Off/1h/3h/6h/12h/24h) bound to `setTaskSchedule`; a "Next run: <local time>"
|
||||
line formatting `next_run_at` with `toLocaleString()` (shows "running…" when a
|
||||
run is active, nothing when off); a red **broken** badge when `broken`. A
|
||||
**Runs** button opens a new modal.
|
||||
- **`RunLogModal.tsx`** (new): lists `listRuns(taskId)` rows — started/finished in
|
||||
browser-local time, `trigger`, a `StatusBadge`, and copied/skipped/errors.
|
||||
Reuses the existing `Modal` + `StatusBadge` + `.tbl` patterns.
|
||||
- **`Tasks.tsx`**: red icon/indicator on rows where `broken`.
|
||||
- WS: on `task_broken`/`run_done`, `TaskDetail` reloads (existing reload wiring);
|
||||
`Tasks` list refreshes.
|
||||
|
||||
## Error handling
|
||||
|
||||
- Enable-while-not-tested → 409, surfaced in the existing `role="alert"` banner.
|
||||
- Scheduler `Run` errors are logged and skipped; a broken task is skipped every
|
||||
tick until the operator re-enables (which clears `broken`).
|
||||
- Restart: `ResetRunningOnStartup` clears phantom running; the scheduler recomputes
|
||||
due times from persisted `finished_at`/`anchor`.
|
||||
|
||||
## Testing
|
||||
|
||||
- **store**: `SetTaskSchedule` (sets interval+anchor, clears broken),
|
||||
`SetTaskBroken` (interval→0, broken→true), `CreateRun` records trigger,
|
||||
`ListRunsByTask` ordering + fields, `GetTask` returns new columns.
|
||||
- **scheduler (pure)**: `dueAt` / due decision — not-yet-due, due-from-finished,
|
||||
first-run-from-anchor, running skipped, broken skipped.
|
||||
- **orchestrator**: scheduled run with errors trips `SetTaskBroken`; manual run
|
||||
with errors does not.
|
||||
- **cascade**: deleting a task removes its runs (extend existing cascade test).
|
||||
- **E2E** (prod, per project rules): enable a schedule on a tested task → "Next
|
||||
run" shows correct local time; (with a short test interval) the run auto-starts;
|
||||
the Runs modal shows the entry with `scheduled` trigger; break an account → the
|
||||
next scheduled run disables the schedule and marks the task broken (red icon in
|
||||
list + detail).
|
||||
|
||||
## Out of scope (YAGNI)
|
||||
|
||||
- Arbitrary cron expressions; catch-up/backfill of missed runs (run once when due);
|
||||
notifications; per-account breakdown in the run log (totals chosen); configurable
|
||||
poll interval.
|
||||
Reference in New Issue
Block a user