# Phase 21 — Pilot Operational Observability

**Branch:** `dine-phase-11`
**Status:** complete in three sub-commits.
**Goal:** make a real restaurant pilot supportable — visibility into runtime failures, slow paths, queue health, and operational events — without adding heavy infrastructure or external SaaS.

---

## Sub-phases

| Sub-phase | Commit | What it ships |
|---|---|---|
| 21.1 | `01948d9` | Queue backlog wired to dashboard |
| 21.2 | `e4fbcf5` | Runtime events log + slow-request middleware + diagnostics endpoint + FE panel |
| 21.3 | `551c5d7` | Structured operational events from cashier + recovery flows |

---

## What's tracked

### Backend signals (new in Phase 21)

| Signal | Source | Visible in |
|---|---|---|
| `system_health.queue_backlog` | `jobs` + `failed_jobs` row counts | Phase-9 dashboard, Phase-10 alert ladder |
| `dine_runtime_events` (class=`api_failure`) | reportable callback in `bootstrap/app.php` for `Throwable` 5xx + `ApiException >=500` | Pilot Diagnostics panel |
| `dine_runtime_events` (class=`slow_request`) | `DineSlowRequestRecorder` middleware on api group, threshold 2 000 ms | Pilot Diagnostics "Slowest requests" + recent timeline |
| `dine_runtime_events` (class=`event`, name=`order.paid`) | `DineOrderService::transitionStatus(paid)` after journal post | Pilot Diagnostics timeline |
| `dine_runtime_events` (class=`event`, name=`order.refunded`) | `DineOrderService::transitionStatus(cancelled)` when oldStatus=paid | Pilot Diagnostics timeline |
| `dine_runtime_events` (class=`event`, name=`order.cancelled`) | `DineOrderService::transitionStatus(cancelled)` when oldStatus≠paid | Pilot Diagnostics timeline |
| `dine_runtime_events` (class=`event`, name=`sync.recovered`) | `DineSyncFailuresController::store` when `retried_ok=true` | Pilot Diagnostics timeline |
| `dine_runtime_events` (class=`event`, name=`printer.failed`) | `DinePrinterJobsController::store` on `status=failed` | Pilot Diagnostics timeline |
| `dine_runtime_events` (class=`event`, name=`printer.recovered`) | `DinePrinterJobsController::store` on `status=succeeded` after a recent failure | Pilot Diagnostics timeline |
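
The order and printer rows above boil down to two small decision rules. The following is a minimal sketch of those rules in plain PHP, not the code in `DineOrderService` or `DinePrinterJobsController`. The function names, the timestamp arguments, and the assumption that "recent failure" means "within the printer-recovered window" are all hypothetical.

```php
<?php
/**
 * Hypothetical helper: maps an order status transition to the event name
 * listed in the table, or null when no event is emitted.
 */
function orderEventName(string $oldStatus, string $newStatus): ?string
{
    if ($newStatus === 'paid') {
        return 'order.paid';
    }
    if ($newStatus === 'cancelled') {
        // Cancelling a paid order counts as a refund; any other cancel is plain.
        return $oldStatus === 'paid' ? 'order.refunded' : 'order.cancelled';
    }
    return null;
}

/**
 * Hypothetical helper: maps a printer job report to an event name.
 * $lastFailureAt is the Unix timestamp of the latest failure, or null if none.
 * The 1800-second default mirrors the 30-minute printer-recovered window.
 */
function printerEventName(
    string $status,
    ?int $lastFailureAt,
    int $now,
    int $windowSeconds = 1800
): ?string {
    if ($status === 'failed') {
        return 'printer.failed';
    }
    $recentFailure = $lastFailureAt !== null
        && ($now - $lastFailureAt) <= $windowSeconds;
    if ($status === 'succeeded' && $recentFailure) {
        return 'printer.recovered';
    }
    return null; // a success with no recent failure is not an event
}
```

A success that follows a failure older than the window returns `null`, so routine print jobs do not appear in the timeline.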

### Endpoints

| Method | URL | Purpose |
|---|---|---|
| `GET` | `/api/v1/dine/runtime-diagnostics?branch_id={id\|all}` | Today's KPI strip (`events_today`, `api_failures_today`, `slow_requests_today`, `errors_today`, `queue_pending`, `queue_failed`) + slowest-5 requests + recent-40 timeline. |

All previous endpoints unchanged. Total dine routes: **76** (was 75 before Phase 21).
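
To make the "slowest-5" and "recent-40" parts of the response concrete, here is a rough sketch of the selection logic over an in-memory list of rows. It is not the controller's query code. The function name, the `duration_ms` key, and the `slowest` / `timeline` result keys are assumptions chosen for illustration; `class` and `occurred_at` follow the names used elsewhere in this document.

```php
<?php
/**
 * Hypothetical helper: picks the slowest N slow requests and the most recent
 * M events. Each row is an array with 'class', 'duration_ms' and
 * 'occurred_at' (Unix timestamp); 'duration_ms' is an assumed key name.
 */
function summarizeEvents(array $rows, int $slowestLimit = 5, int $timelineLimit = 40): array
{
    // Only slow_request rows compete for the "slowest" list.
    $slow = array_values(array_filter(
        $rows,
        fn (array $r): bool => $r['class'] === 'slow_request'
    ));
    usort($slow, fn (array $a, array $b): int => $b['duration_ms'] <=> $a['duration_ms']);

    // The timeline contains every class, newest first.
    $timeline = $rows;
    usort($timeline, fn (array $a, array $b): int => $b['occurred_at'] <=> $a['occurred_at']);

    return [
        'slowest'  => array_slice($slow, 0, $slowestLimit),
        'timeline' => array_slice($timeline, 0, $timelineLimit),
    ];
}
```

In the real endpoint the same ordering and limits would presumably be expressed as indexed `ORDER BY ... LIMIT` queries rather than in-memory sorts.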

---

## What's NOT tracked (deliberate)

- **Per-user activity**. The recorder never stores `user_id` for HTTP failures or slow requests. Operators investigating an issue can correlate via the existing audit log — observability is not an audit replacement.
- **Request bodies / response bodies**. The `context` JSON column has a 256-byte hard cap; producers ship summaries, not payloads.
- **PII**. `ip_address` for devices is captured server-side from the request; nothing else PII-shaped enters runtime events.
- **Every request**. Slow-request middleware writes a row only when duration ≥ 2 000 ms. Fast requests cost nothing beyond two `microtime()` calls and a comparison.
- **Every error**. The reportable callback excludes `ValidationException`, 401, 403, 404, 429, and `ApiException` with status < 500. Routine rejections would flood the table.
- **Queue worker telemetry**. We count the `jobs` + `failed_jobs` table rows but don't track per-worker stats. Adding that requires the Laravel Horizon stack — out of scope.
- **External APM** (Datadog / New Relic / Sentry). The phase ships everything in-database so a tenant's data never leaves their infrastructure.
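
Two of the exclusions above, the error filter and the context size cap, can be sketched as small pure functions. This is an illustration under stated assumptions, not the recorder's actual code: both function names are hypothetical, and dropping trailing keys is only one possible way to honor a 256-byte cap.

```php
<?php
/**
 * Hypothetical predicate: should an exception that resolves to this HTTP
 * status be written to the runtime events log?
 */
function shouldRecordFailure(int $httpStatus, bool $isValidationError = false): bool
{
    if ($isValidationError) {
        return false; // validation errors are routine rejections
    }
    // 401, 403, 404, 429 and every other status below 500 are skipped.
    return $httpStatus >= 500;
}

/**
 * Hypothetical helper: keeps the JSON-encoded context within the byte cap by
 * dropping trailing keys until it fits (an assumed strategy).
 */
function capContext(array $context, int $maxBytes = 256): array
{
    $fits = static function (array $c) use ($maxBytes): bool {
        $json = json_encode($c);
        return $json !== false && strlen($json) <= $maxBytes;
    };
    while ($context !== [] && !$fits($context)) {
        array_pop($context); // discard the last key and try again
    }
    return $context;
}
```

Producers that already ship short summaries never hit the loop; the cap only acts as a safety net against an accidental payload.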

---

## Frontend visibility

`/app/dine/ops` now renders **eleven stacked panels**:

1. Operational alerts strip (Phase 10) — including `queue-backlog` alert from Phase 21.1
2. Live operations (Phase 8/9)
3. Kitchen health (Phase 8/9)
4. Cashier risk (Phase 8/9)
5. Inventory signals (Phase 8/9 + Phase 18 variant-aware low-stock)
6. Sales reality (Phase 8/9)
7. System health (Phase 8/9 + Phase 11 device counts + Phase 12 sync_failures + Phase 13 failed_printer_jobs + **Phase 21.1 queue_backlog**)
8. Devices (Phase 11/17)
9. Sync failures (Phase 12)
10. Printer jobs (Phase 13)
11. **Pilot diagnostics (Phase 21.2)** — new

The Pilot diagnostics panel is the operator's "is anything operationally wrong right now?" view. It shows a KPI strip, the five slowest requests, and a timeline of the 40 most recent events. It polls on the same 30-second cadence as the other panels and pauses in the background.

---

## Operational thresholds

| Threshold | Value | Source of truth |
|---|---|---|
| Slow-request | 2 000 ms (notice), 5 000 ms (warning) | `DineRuntimeEvent::SLOW_REQUEST_THRESHOLD_MS` |
| Queue-backlog alert | `>0` warning, `≥50` danger | `DineOperationalAlertsService` (Phase 10) |
| Printer-recovered window | last 30 minutes | `DinePrinterJobsController::store` |
| Diagnostics polling | 30 s, paused in background | `useDineRuntimeDiagnostics` |
| Diagnostics row cap | 200 today / 40 timeline | `DineRuntimeDiagnosticsController::index` |
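
The slow-request and queue-backlog rows translate into simple level checks. The sketch below assumes the 5 000 ms boundary is inclusive and that the backlog arrives as a single nullable integer; the constant and function names are hypothetical and do not mirror `DineRuntimeEvent` or `DineOperationalAlertsService`.

```php
<?php
// Values copied from the thresholds table; the names are illustrative only.
const SLOW_NOTICE_MS  = 2000;
const SLOW_WARNING_MS = 5000;
const QUEUE_DANGER_AT = 50;

/** Returns 'warning', 'notice', or null when the request is not slow. */
function slowRequestSeverity(int $durationMs): ?string
{
    if ($durationMs >= SLOW_WARNING_MS) {
        return 'warning';
    }
    if ($durationMs >= SLOW_NOTICE_MS) {
        return 'notice';
    }
    return null;
}

/**
 * Returns 'danger', 'warning', or null when no alert should fire.
 * A null backlog (signal not wired) is treated as "no alert".
 */
function queueBacklogLevel(?int $backlog): ?string
{
    if ($backlog === null || $backlog <= 0) {
        return null;
    }
    return $backlog >= QUEUE_DANGER_AT ? 'danger' : 'warning';
}
```

Treating `null` as "no alert" matches the null-safe behavior the alert wiring relies on when the backlog signal is absent.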

---

## Retention assumptions

`dine_runtime_events` has no automated pruning. A tenant averaging 100 events/day accumulates about 36 500 rows/year, comfortably within indexed-read territory. Recommended follow-up if volume grows well beyond that:

```bash
# Daily prune (cron) — keep one month of runtime events
php artisan tinker --execute='\App\Models\DineRuntimeEvent::query()->where("occurred_at", "<", now()->subMonth())->delete();'
```

The same pattern works for `dine_sync_failures` and `dine_printer_jobs` if they grow large. None are mission-critical to the cashier flow — pruning is safe.

---

## Performance considerations

- **Slow-request middleware**: one `microtime(true)` at request start + one at response. <1 ms total overhead. Writes happen only when threshold crossed.
- **Reportable callback**: only fires on `Throwable` reaching the framework's report stack. Routine 4xx exceptions are filtered before the recorder is invoked — no row written.
- **Diagnostics endpoint**: 4 small COUNT queries + 2 indexed SELECTs (slowest-5, recent-40). No joins. Backed by three composite indexes on `dine_runtime_events`.
- **Queue backlog**: 2 cheap COUNTs (`jobs`, `failed_jobs`). Both tables are indexed by Laravel.
- **Order/sync/printer event emitters**: one Eloquent insert per qualifying event. Wrapped in try/catch so a recorder failure never bubbles into the cashier flow.
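
The timing and failure-isolation points above can be illustrated with a framework-free wrapper. This is a sketch of the pattern only, not the `DineSlowRequestRecorder` middleware: `handleTimed` and its callable arguments are hypothetical stand-ins for the request handler and the recorder.

```php
<?php
/**
 * Hypothetical wrapper: times $handler, calls $record with the elapsed
 * milliseconds only when the threshold is crossed, and never lets a recorder
 * failure reach the caller.
 */
function handleTimed(callable $handler, callable $record, int $thresholdMs = 2000)
{
    $start = microtime(true);
    $response = $handler();
    $elapsedMs = (int) round((microtime(true) - $start) * 1000);

    if ($elapsedMs >= $thresholdMs) {
        try {
            $record($elapsedMs);
        } catch (\Throwable $e) {
            // Observability must not break the request: swallow the failure.
        }
    }

    return $response;
}
```

Below the threshold the only work done is the two clock reads and one comparison, which is where the sub-millisecond overhead figure comes from.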

---

## Pilot readiness assessment

**Pilot-ready.** The Dine module now ships:

- Six operational dashboard sections + four read-only operational panels (devices, sync failures, printer jobs, pilot diagnostics).
- Phase-10 alerts engine fed by every system-health signal it consumes except one: `offline_devices`, `sync_failures_today`, `failed_printer_jobs_today`, and `queue_backlog` are live, while `failed_payments_today` remains `null` (deferred to the Pay module's per-attempt log).
- End-to-end cash-sale → balanced journal → branch-tagged inventory consume verification (`DineCashSaleAccountingPostingTest`, 53 assertions).
- Symmetric inventory reversal for void/refund (Phase 1 hardening + Phase 16 invariant test).
- Idempotent journal posting (Phase 3) — re-posting the same order produces one journal entry.
- Tenant + branch isolation enforced everywhere (`BelongsToTenant` + `applyBranchFilter`).
- Per-screen device tracking (Phase 14) so a tablet running both POS and KDS shows up as two distinct rows.
- Pilot diagnostics surface so platform staff supporting a pilot can answer "what just failed?" in one glance.

**What's still deferred:**
- Per-payment-attempt log (`failed_payments_today`) — needs the Pay module to be completed first.
- Async accounting posting — reviewed in Phase 19, deemed out of scope until the cashier flow gains an external dependency that justifies it.
- Pre-existing `CAST(... AS INTEGER) DESC` SQLite-isms in `SignupController` — flagged in Phase Production-fix; recommended one-line patch (`INTEGER` → `SIGNED`).

---

## Rollback notes

Every Phase-21 sub-commit reverts cleanly:

- `git revert 551c5d7` — undoes the four event emitters. No tables or endpoints changed.
- `git revert e4fbcf5` — undoes the runtime events log, middleware, recorder service, diagnostics endpoint, and FE panel. The migration's `down()` drops `dine_runtime_events`; `git revert` does not run it, so roll that migration back before deploying the revert. Frontend panel becomes an inert import.
- `git revert 01948d9` — `system_health.queue_backlog` reverts to `null`. Phase-10 alert wiring is null-safe (skips on null), so the queue-backlog alert simply stops firing.

No journal effects. No inventory effects. No cashier-flow code paths altered. All three sub-commits revert independently.
