e8f29064fb
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
208 lines
9.9 KiB
Markdown
208 lines
9.9 KiB
Markdown
# Scheduled (recurring) task runs — design
|
|
|
|
**Date:** 2026-07-03
|
|
**Status:** awaiting user review
|
|
|
|
## Context
|
|
|
|
Migrations are run manually today (the operator clicks "Run migration"). For
|
|
ongoing source→destination sync, the operator wants a task to run itself on a
|
|
recurring interval (e.g. every 1h or 6h) without babysitting it. Requirements
|
|
from the user:
|
|
|
|
- Per-task recurrence interval; the task runs automatically.
|
|
- Never start a run while the previous one is still running; measure the interval
|
|
from the **completion** of the last run (not its start).
|
|
- Show when the next run is due, in the **browser's local time**.
|
|
- A separate modal showing a log of runs with their status and totals.
|
|
- Deleting a task must clean up all this data along with the task.
|
|
|
|
Decisions confirmed with the user:
|
|
|
|
- **Run-log detail:** per-run totals only (reuse `runs`); no per-account snapshot.
|
|
- **Enable gate:** a schedule can be enabled only when **all accounts test OK**.
|
|
- **Breaker:** if a *scheduled* run finishes with errors, disable the schedule and
|
|
mark the task **broken** (red icon in the tasks list and the task detail).
|
|
Manual runs never trip the breaker.
|
|
- **First run:** one interval after enabling (`schedule_anchor + interval`).
|
|
- **Intervals:** presets 1 / 3 / 6 / 12 / 24 h, plus Off.
|
|
|
|
### Current architecture (as-is)
|
|
|
|
- `runs` table already has `started_at`, `finished_at`, `status`, and totals
|
|
(`migrations/0001_init.up.sql:35-44`). `CreateRun`/`FinishRun` in
|
|
`internal/store/runs.go`. No list-by-task query yet.
|
|
- `Orchestrator.Run(ctx, taskID)` gates on `gateOK` (all accounts test ok) and
|
|
`TryMarkTaskRunning` (atomic single-run), creates a run, and launches `runAll`
|
|
asynchronously; `runAll` sets the task status and calls `FinishRun`
|
|
(`internal/orchestrator/orchestrator.go`).
|
|
- `ON DELETE CASCADE` from `tasks` already removes `runs`, `accounts`, and
|
|
`migrated_messages` when a task is deleted.
|
|
- `main.go` wires store → orchestrator → hub → server and calls
|
|
`ResetRunningOnStartup` to clear phantom "running" after a restart. No scheduler.
|
|
- Task DTO / TS `Task` interface: `internal/httpapi/tasks.go`, `web/src/api.ts`.
|
|
|
|
## Decision: in-process polling scheduler
|
|
|
|
A single background goroutine (a `Scheduler`) started from `main.go`, ticking
|
|
every **30 s**. Each tick queries the DB for schedulable tasks and triggers the
|
|
due ones through the existing `Orchestrator.Run`. The DB is the source of truth,
|
|
so the scheduler is stateless and restart-safe (paired with the existing
|
|
`ResetRunningOnStartup`). Chosen over per-task timers (restart-fragile, in-memory
|
|
state) and external cron (breaks the single-binary model, needs auth wiring).
|
|
|
|
30 s poll precision is ample for hour-scale intervals.
|
|
|
|
## Data model
|
|
|
|
**Migration `0004_task_scheduling`:**
|
|
|
|
```sql
|
|
-- up
|
|
ALTER TABLE tasks ADD COLUMN schedule_interval_seconds INT NOT NULL DEFAULT 0;
|
|
ALTER TABLE tasks ADD COLUMN schedule_anchor TIMESTAMPTZ;
|
|
ALTER TABLE tasks ADD COLUMN broken BOOLEAN NOT NULL DEFAULT false;
|
|
ALTER TABLE runs ADD COLUMN trigger TEXT NOT NULL DEFAULT 'manual';
|
|
-- down
|
|
ALTER TABLE runs DROP COLUMN trigger;
|
|
ALTER TABLE tasks DROP COLUMN broken;
|
|
ALTER TABLE tasks DROP COLUMN schedule_anchor;
|
|
ALTER TABLE tasks DROP COLUMN schedule_interval_seconds;
|
|
```
|
|
|
|
- `schedule_interval_seconds` — 0 = Off. Presets map to seconds (3600, 10800,
|
|
21600, 43200, 86400).
|
|
- `schedule_anchor` — set to `now()` when a schedule is enabled; the baseline for
|
|
the first run when no finished run exists yet.
|
|
- `broken` — the schedule breaker tripped; task needs operator attention.
|
|
- `runs.trigger` — `'manual' | 'scheduled'`; drives the breaker and enriches the
|
|
run log.
|
|
|
|
`store.Task` gains `ScheduleIntervalSeconds int64`, `ScheduleAnchor *time.Time`,
|
|
`Broken bool`. `store.Run` gains `Trigger string`, plus `StartedAt time.Time` and
|
|
`FinishedAt *time.Time` (needed for the log + due computation; not currently
|
|
scanned).
|
|
|
|
### Cascade / deletion
|
|
|
|
No new cleanup code. `runs` cascade-delete on task delete already; the schedule
|
|
columns live on `tasks` and vanish with the row. The in-memory scheduler polls
|
|
the DB, so a deleted task simply stops appearing. (Confirmed against
|
|
`migrations/0001_init.up.sql` FKs.)
|
|
|
|
## Store additions
|
|
|
|
- `SetTaskSchedule(ctx, id, intervalSeconds int64) error` — sets interval, sets
|
|
`schedule_anchor = now()` when enabling (interval>0) / leaves anchor when
|
|
disabling, and clears `broken` on any call (any explicit schedule change
|
|
un-breaks the task; enabling is additionally gated on all-accounts-OK in the
|
|
HTTP layer, disabling is always allowed).
|
|
- `SetTaskBroken(ctx, id) error` — `broken = true, schedule_interval_seconds = 0`
|
|
in one UPDATE (the breaker).
|
|
- `CreateRun(ctx, taskID, trigger string)` — extend the existing signature to
|
|
record the trigger.
|
|
- `ListRunsByTask(ctx, taskID) ([]Run, error)` — newest first, for the modal;
|
|
scans `id, started_at, finished_at, status, totals, trigger`.
|
|
- `LastFinishedRunAt(ctx, taskID) (*time.Time, error)` — most recent run with a
|
|
non-null `finished_at`, for due computation. (Or fold into `ListSchedulable`.)
|
|
- `ListSchedulableTasks(ctx) ([]ScheduledTask, error)` — tasks with
|
|
`schedule_interval_seconds > 0 AND NOT broken AND status <> 'running'`, joined
|
|
with their last finished run's `finished_at`, so the scheduler decides in one
|
|
query per tick. Returns the fields `dueAt` needs.
|
|
|
|
`GetTask`/`ListTasks` SELECT + Scan extend for the three new task columns.
|
|
|
|
## Scheduler (`internal/scheduler/`)
|
|
|
|
New package, one focused responsibility.
|
|
|
|
- `type Scheduler struct { store *store.Store; orch *orchestrator.Orchestrator }`
|
|
- `func (s *Scheduler) Start(ctx context.Context)` — `time.NewTicker(30s)` loop;
|
|
on each tick calls `s.tick(ctx)`; stops on `ctx.Done()`.
|
|
- `func (s *Scheduler) tick(ctx)` — `ListSchedulableTasks`, then for each due task
|
|
calls `s.orch.Run(ctx, taskID)` with `trigger = "scheduled"`. `Run` errors
|
|
(`ErrAlreadyRunning`, `ErrNotTested`) are logged, not fatal.
|
|
- **Pure, unit-tested decision function:**
|
|
`func dueAt(interval time.Duration, anchor time.Time, lastFinished *time.Time) time.Time`
|
|
→ `lastFinished + interval` if a finished run exists, else `anchor + interval`.
|
|
A task is due when `now >= dueAt(...)`. Kept pure so the tick logic is testable
|
|
without a DB or clock injection at the call site.
|
|
|
|
`main.go` starts it: `go scheduler.New(st, orch).Start(context.Background())`.
|
|
|
|
### Trigger threading + breaker
|
|
|
|
`Orchestrator.Run` gains a `trigger string` parameter (manual callers pass
|
|
`"manual"`; the scheduler passes `"scheduled"`). `Run` forwards it to
|
|
`CreateRun` and into `runAll`. In `runAll`, after the final status is computed:
|
|
|
|
```
|
|
if trigger == "scheduled" && totErr > 0 {
|
|
_ = o.store.SetTaskBroken(ctx, task.ID)
|
|
o.hub.Publish(Event{Type: "task_broken", TaskID: task.ID, ...})
|
|
}
|
|
```
|
|
|
|
The HTTP `handleRun` passes `"manual"`. This keeps the breaker precise: only
|
|
scheduled runs trip it.
|
|
|
|
## HTTP API
|
|
|
|
- **`PUT /api/tasks/{id}/schedule`** — body `{interval_seconds: int}`. Enabling
|
|
(interval>0) first checks `gateOK` (all accounts test OK) via
|
|
`ListAccountsByTask`; returns **409** with a clear message if not. Then
|
|
`SetTaskSchedule`. Disabling (0) always allowed.
|
|
- **`GET /api/tasks/{id}/runs`** — `ListRunsByTask` → run-log rows.
|
|
- **Task DTO** (`tasks.go`) gains `schedule_interval_seconds`, `broken`, and a
|
|
computed `next_run_at *string` (RFC3339, UTC; null when off / running / broken).
|
|
`next_run_at` is computed server-side from the same `dueAt` logic so the client
|
|
only formats it. `ListTasks` DTO also includes `broken` (for the red icon in the
|
|
list).
|
|
|
|
## Frontend
|
|
|
|
- **`api.ts`**: `Task` gains `schedule_interval_seconds?`, `broken?`,
|
|
`next_run_at?`. New `setTaskSchedule(taskId, intervalSeconds)` and
|
|
`listRuns(taskId)` (+ a `Run` interface).
|
|
- **`TaskDetail.tsx`**: in the run-control panel, an interval `<select>`
|
|
(Off/1h/3h/6h/12h/24h) bound to `setTaskSchedule`; a "Next run: <local time>"
|
|
line formatting `next_run_at` with `toLocaleString()` (shows "running…" when a
|
|
run is active, nothing when off); a red **broken** badge when `broken`. A
|
|
**Runs** button opens a new modal.
|
|
- **`RunLogModal.tsx`** (new): lists `listRuns(taskId)` rows — started/finished in
|
|
browser-local time, `trigger`, a `StatusBadge`, and copied/skipped/errors.
|
|
Reuses the existing `Modal` + `StatusBadge` + `.tbl` patterns.
|
|
- **`Tasks.tsx`**: red icon/indicator on rows where `broken`.
|
|
- WS: on `task_broken`/`run_done`, `TaskDetail` reloads (existing reload wiring);
|
|
`Tasks` list refreshes.
|
|
|
|
## Error handling
|
|
|
|
- Enable-while-not-tested → 409, surfaced in the existing `role="alert"` banner.
|
|
- Scheduler `Run` errors are logged and skipped; a broken task is skipped every
|
|
tick until the operator re-enables (which clears `broken`).
|
|
- Restart: `ResetRunningOnStartup` clears phantom running; the scheduler recomputes
|
|
due times from persisted `finished_at`/`anchor`.
|
|
|
|
## Testing
|
|
|
|
- **store**: `SetTaskSchedule` (sets interval+anchor, clears broken),
|
|
`SetTaskBroken` (interval→0, broken→true), `CreateRun` records trigger,
|
|
`ListRunsByTask` ordering + fields, `GetTask` returns new columns.
|
|
- **scheduler (pure)**: `dueAt` / due decision — not-yet-due, due-from-finished,
|
|
first-run-from-anchor, running skipped, broken skipped.
|
|
- **orchestrator**: scheduled run with errors trips `SetTaskBroken`; manual run
|
|
with errors does not.
|
|
- **cascade**: deleting a task removes its runs (extend existing cascade test).
|
|
- **E2E** (prod, per project rules): enable a schedule on a tested task → "Next
|
|
run" shows correct local time; (with a short test interval) the run auto-starts;
|
|
the Runs modal shows the entry with `scheduled` trigger; break an account → the
|
|
next scheduled run disables the schedule and marks the task broken (red icon in
|
|
list + detail).
|
|
|
|
## Out of scope (YAGNI)
|
|
|
|
- Arbitrary cron expressions; catch-up/backfill of missed runs (run once when due);
|
|
notifications; per-account breakdown in the run log (totals chosen); configurable
|
|
poll interval.
|