# Playwright Grid – Improvement Tasks Checklist

Generated: 2025-09-11 11:30 local time

The following is an ordered, actionable checklist covering architectural and code-level improvements across Hub, Worker, Dashboard, HubClient, tests, and containerization. Check items as they are completed.

[X] Establish a shared Domain model package for LabelKey parsing/validation to ensure consistent rules across Hub, Worker, Dashboard, and tests.
[X] Introduce a central Label Matching strategy service with unit tests (exact → trailing fallback → prefix expansion → optional wildcards) and pluggable settings.
[X] Add input validation and normalization for label keys (trim, case policy, segment count min/max, forbidden characters) with clear 4xx errors.
[X] Define API versioning (e.g., /api/v1) for Hub endpoints and reserve room for breaking changes.
[X] Introduce ProblemDetails-based error responses in Hub for consistent 4xx/5xx payloads.
[X] Add OpenAPI/Swagger to Hub with minimal surface (security header documented) and examples for borrow/return.
[X] Implement request correlation (Correlation-Id header or generated) propagated as runId/browserId across Hub, Worker, and Dashboard logs.
[X] Add distributed tracing via OpenTelemetry (traces, metrics, logs) with configurable exporters (OTLP/Prometheus).
[X] Expand Prometheus metrics: borrow latency histogram, borrow outcomes (success/timeout/denied), pool utilization per label, queue length, node heartbeats.
[X] Introduce a capacity queue in Hub for pending borrows with timeout and fairness (per-label and per-run caps) to reduce thundering herd.
[X] Implement node heartbeat/liveness tracker with configurable timeout; evict stale nodes and reclaim/expire orphaned sessions.
[X] Add borrow TTL and auto-return on timeout; persist session state to Redis to survive Hub restarts.
[X] Harden Redis usage: resilience (timeouts, retries with jitter, circuit breaker), connection settings, and health checks integrated into readiness.
[ ] Support secret rotation: accept multiple HUB_RUNNER_SECRET/HUB_NODE_SECRET values (comma-separated) and log deprecation windows.
[ ] Redact secrets and PII in logs; ensure headers and sensitive values never appear in structured logs.
[ ] Add rate limiting (per IP and per runner id) on Hub borrow/return to protect from abuse; return 429 with Retry-After.
[ ] Add optional IP allowlist or token-based auth (e.g., PAT via header) for Hub API alongside shared secrets.
[X] Implement graceful shutdown: Hub stops accepting new borrows; Worker drains sessions and returns cleanly on SIGTERM.
[X] Enforce maximum WebSocket message size and idle timeouts in Worker; send periodic pings and close dead connections.
[X] Add backpressure controls in Worker WS proxy (bounded channels, drop policy, and metrics for drops).
[X] Strengthen Worker sidecar management: sidecar health endpoint, restart/backoff strategy, and clear error surfacing to Hub.
[X] Make PLAYWRIGHT_VERSION reporting authoritative: validate against sidecar; surface mismatch in Dashboard and metrics.
[X] Improve WorkerOptions.FromEnvironment() with strong typing, defaults, range checks, and detailed validation errors.
[X] Replace ad-hoc HttpClient usage in HubClient with IHttpClientFactory and resilience (timeouts, retries, transient error policy).
[X] Add CancellationToken overloads to HubClient methods (BorrowAsync, ReturnAsync, SendApiLogAsync).
[X] Introduce domain-specific exceptions in HubClient (CapacityUnavailableException, AuthenticationException, ProtocolException).
[X] Batch and rate-limit HubClient log sending; add async buffering to minimize impact on runner.
[X] Add optional log redaction in PlaywrightEventForwarder (query param scrub, headers whitelist) and sampling controls.
[X] Ensure nullability annotations are correct across Hub, Worker, HubClient; enable nullable warnings as errors on CI.
[X] Audit async paths to avoid sync-over-async; ensure proper ConfigureAwait usage in library code where applicable.
[X] Standardize structured logging (Serilog or built-in ILogger scopes) with runId/browserId scope enrichment.
[X] Provide configurable log levels and per-component overrides via environment.
[X] Add graceful error pages and dashboard error boundaries for SignalR disconnections with auto-retry/backoff.
[X] Implement virtualization/pagination for Dashboard results and command logs to prevent UI slowdowns on large runs.
[x] Add filtering/search on Dashboard (by App, Browser, Env, Region, Status, runId) and deep links.
[ ] Introduce authentication for Dashboard (OIDC/OAuth2) with role-based access (viewer/admin); secure SignalR hub accordingly.
[X] Add retention policies for run results and logs (TTL in Redis; optional durable store adapter e.g., PostgreSQL/SQLite).
[X] Add API to export run details (JSON/NDJSON) for external archiving.
[x] Provide Helm chart/Kubernetes manifests with sensible defaults, probes, and resource limits.
[ ] Harden Docker images: run as non-root user, drop capabilities, read-only filesystem with writable temp for Playwright.
[ ] Slim Docker images further: prune caches (npm, dotnet), use multi-stage builds for Node assets, consolidate OS packages, consider a distroless base.
[X] Add multi-arch builds (linux/amd64, linux/arm64) for Hub/Worker via buildx.
[X] Add image vulnerability scanning (Trivy/GHCR) and SBOM generation during CI.
[X] Introduce GitHub Actions CI: build, unit tests, integration tests (with Testcontainers), publish artifacts, and optional Docker image publish.
[X] Add workflow caching for dotnet restore, npm playwright installs, and docker layers to speed up CI.
[X] Expand unit tests for label matching, options parsing (POOL_CONFIG), and secret handling edge cases.
[x] Add integration tests for: secret mismatch (401), capacity exhaustion (503), borrow queue timeout, and node eviction scenarios.
[x] Add flaky-test mitigations: deterministic time helpers, extended health timeouts via env, and richer test diagnostics.
[x] Provide smoke test for Dashboard SignalR stream (connect, receive events, disconnect) without browsers.
[ ] Add load/pressure test harness (NUnit category) with configurable CONCURRENCY/ITERATIONS and asserts on latency percentiles.
[ ] Add architecture diagrams (C4 model: Context, Container, Component) and sequence diagram for borrow/return.
[X] Create CONTRIBUTING.md (coding standards, commit messages, branching, PR checklist).
[X] Establish versioning and release notes; tag releases and publish Agenix.PlaywrightGrid.HubClient to NuGet.
[X] Add compatibility matrix documenting supported Playwright versions and Docker base image tags.
[x] Add configuration to toggle wildcards separately from trailing fallback/prefix expansion per-environment.
[x] Add per-label concurrency caps and fair sharing to prevent one label from starving others.
[x] Provide metrics-driven autoscaling hints (HPA annotations) based on borrow queue length and CPU for Workers.
[ ] Ensure graceful recovery scenarios: Hub restart does not break in-flight WebSocket sessions; document impact and mitigation.
[x] Separate health and readiness endpoints; ensure /health checks critical dependencies and /ready reflects capacity.
[X] Add startup diagnostics dump (effective config, labels registered per node) visible in logs and Dashboard.
[x] Implement audit logging for node registration, secret changes, and admin actions.
[x] Add command-line tooling or scripts to validate POOL_CONFIG and compute effective capacity before boot.
[x] Provide local dev convenience: .env support across Hub/Worker and docs on docker compose overrides.
[x] Improve error messages in Dashboard UI to point to remediation steps (e.g., capacity missing, secret mismatch, WS unreachable).
[x] Add browser-specific tuning options (Chromium args, Firefox prefs, WebKit flags) with validation and documentation.
[x] Enforce API request size limits and reasonable timeouts in Hub; document limits.
[x] Add support for custom labels (e.g., Channel, Headless) with controlled cardinality to avoid metrics explosion.
[X] Refactor Dashboard Results pages to use server-driven paging and streaming for command logs.
[x] Ensure all public APIs and DTOs have XML docs and nullable annotations; generate API docs from XML.
[ ] Introduce coding analyzers (StyleCop/IDisposable analyzers) and fix high-signal warnings.
[x] Add guardrails for Redis key naming to avoid collisions; centralize key patterns with tests.
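
The Redis key guardrail item above could look roughly like the following. This is a minimal, language-neutral sketch in Python rather than the project's actual code: the `pwgrid` prefix, the key layouts, and all function names are assumptions made for illustration. The idea is that every key is built by one module, and each variable segment is validated so a value containing the `:` separator cannot collide with another key family.

```python
import re

# Hypothetical namespace prefix and key layout; the real Hub may use different names.
PREFIX = "pwgrid"
_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


def _segment(value: str) -> str:
    """Reject empty segments and the ':' separator so keys cannot collide."""
    if not _SEGMENT.fullmatch(value):
        raise ValueError(f"invalid key segment: {value!r}")
    return value


def session_key(browser_id: str) -> str:
    """Key holding the state of one borrowed session."""
    return f"{PREFIX}:session:{_segment(browser_id)}"


def run_key(run_id: str) -> str:
    """Key holding the results of one run."""
    return f"{PREFIX}:run:{_segment(run_id)}"


def node_heartbeat_key(node_id: str) -> str:
    """Key holding the last heartbeat of one Worker node."""
    return f"{PREFIX}:node:{_segment(node_id)}:heartbeat"
```

Because all patterns live in one place, a unit test can assert that no two builders produce the same key for distinct inputs.
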
[ ] Integrate X11 virtual display (Xvfb) in Worker Docker image: install xvfb, xauth, fonts, and required deps; verify image size impact.
[ ] Add WORKER_XVFB_ENABLED env flag (default: true in containers) and DISPLAY management (e.g., :99) in Worker startup.
[ ] Implement Worker sidecar/process supervisor to launch Xvfb on boot when enabled; ensure restarts/backoff and logs are captured.
[ ] Provide option to use xvfb-run wrapper vs. dedicated Xvfb process; document pros/cons and choose default.
[ ] Wire Playwright headful mode support via DISPLAY with environment propagation to browser processes; document headless/headful matrix.
[ ] Add health/readiness checks for Xvfb (e.g., xdpyinfo sanity) and expose a metric (worker_xvfb_up) and logs for diagnostics.
[ ] Ensure graceful shutdown: stop accepting new sessions, close browsers, then terminate Xvfb cleanly.
[ ] Harden security: run Xvfb as non-root, restrict access control (xauth cookie), avoid TCP listeners.
[ ] Extend WorkerOptions.FromEnvironment() to parse XVFB-related envs (enabled, display number, screen size, dpi); add unit tests.
[ ] Add integration test path that borrows a session with headful=true and validates navigation succeeds under Xvfb.
[ ] Update worker/Dockerfile and docker-compose.yml with XVFB packages and env examples; include minimal fonts set and note locales.
[ ] Update docs: README.md (usage), docs/Compatibility-Matrix.md (headful notes), and dashboard guidance for troubleshooting Xvfb.
[ ] Add troubleshooting playbook: common errors (cannot open display, fonts missing), with steps and env toggles to disable/enable Xvfb.
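
The Xvfb option parsing described above (enabled flag, display number, screen size, dpi) might be sketched as follows. This is an illustrative Python sketch, not the planned `WorkerOptions.FromEnvironment()` implementation. Only `WORKER_XVFB_ENABLED` and the `:99` display come from this checklist; `WORKER_XVFB_DISPLAY`, `WORKER_XVFB_SCREEN`, `WORKER_XVFB_DPI`, the default values, and the accepted ranges are assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class XvfbOptions:
    enabled: bool
    display: int
    width: int
    height: int
    depth: int
    dpi: int

    def command(self) -> List[str]:
        """Build an Xvfb command line; '-nolisten tcp' avoids TCP listeners."""
        return [
            "Xvfb", f":{self.display}",
            "-screen", "0", f"{self.width}x{self.height}x{self.depth}",
            "-dpi", str(self.dpi),
            "-nolisten", "tcp",
        ]


def _int_in_range(env: Mapping[str, str], name: str, default: int, lo: int, hi: int) -> int:
    """Read an integer setting, falling back to the default when unset or blank."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if not lo <= value <= hi:
        raise ValueError(f"{name} must be between {lo} and {hi}, got {value}")
    return value


def parse_xvfb_options(env: Mapping[str, str]) -> XvfbOptions:
    # Default of "true" mirrors the "default: true in containers" note in the checklist.
    enabled_raw = env.get("WORKER_XVFB_ENABLED", "").strip().lower() or "true"
    if enabled_raw not in {"true", "false", "1", "0"}:
        raise ValueError(f"WORKER_XVFB_ENABLED must be true/false, got {enabled_raw!r}")

    # Hypothetical WIDTHxHEIGHTxDEPTH format, matching Xvfb's -screen argument.
    screen = env.get("WORKER_XVFB_SCREEN", "").strip().lower() or "1920x1080x24"
    match = re.fullmatch(r"(\d+)x(\d+)x(\d+)", screen)
    if match is None:
        raise ValueError(f"WORKER_XVFB_SCREEN must look like 1920x1080x24, got {screen!r}")
    width, height, depth = (int(part) for part in match.groups())

    return XvfbOptions(
        enabled=enabled_raw in {"true", "1"},
        display=_int_in_range(env, "WORKER_XVFB_DISPLAY", 99, 0, 999),
        width=width,
        height=height,
        depth=depth,
        dpi=_int_in_range(env, "WORKER_XVFB_DPI", 96, 48, 600),
    )
```

Passing `os.environ` to `parse_xvfb_options` would give the effective settings at Worker startup. Taking a plain mapping keeps the parser easy to unit test.
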
[ ] Enforce HTTPS by default in docker-compose via reverse proxy (Traefik/Nginx), enable HSTS, secure cookies, and strong TLS settings; document local dev exceptions.
[ ] Tighten CORS and add CSRF protection where applicable (Dashboard/API forms), with explicit allowed origins and methods.
[ ] Introduce JWT/HMAC request signing for Hub API (time-limited tokens minted by Hub) as an alternative to shared secrets; provide migration guidance.
[ ] Support secrets from files via *_FILE env convention and optional integration with external secret stores (AWS Secrets Manager/Azure Key Vault/GCP Secret Manager).
[ ] Add automated security checks: CodeQL workflow, Dependency/Container update automation (Dependabot/Renovate) with review rules.
[ ] Enable horizontal scaling for Hub: move borrow queue to Redis Streams with consumer groups; implement idempotency and deduplication.
[ ] Implement distributed leadership for sweeper jobs using Redis (SETNX + TTL) to coordinate multiple Hub instances safely.
[ ] Quarantine flapping or failing Worker nodes (cooldown period) and surface quarantine state in Dashboard and metrics.
[ ] Add Redis connection options for Sentinel/Cluster and TLS; document configuration and failover behavior.
[ ] Provide idempotency keys for Borrow/Return endpoints to handle client retries without duplicate sessions.
[ ] Enforce per-Worker max concurrent WebSocket connections (configurable); expose saturation metrics and headroom.
[x] Monitor disk/inode usage in Worker; auto-clean old browser caches/traces; emit alerts when thresholds are breached.
[x] Add WS per-message compression toggle with thresholds to balance CPU vs bandwidth; document defaults.
[x] Implement safe sidecar upgrade flow (graceful drain + restart) coordinated with Hub to avoid session drops.
[ ] Improve Dashboard accessibility (WCAG 2.1 AA): keyboard navigation, landmarks, focus management, color contrast, ARIA labels.
[ ] Gate Dashboard features by role (admin/viewer) based on OIDC group/claim mapping; hide admin endpoints from non-admins. (extends the Dashboard authentication task above)
[ ] Allow exporting run artifacts (HAR/trace/logs) and provide deep links to Playwright trace viewer; bulk download.
[ ] Add HubClient DI extensions (AddHubClient) with options; support proxies/custom headers; expose retry/jitter tuning knobs.
[ ] Implement idempotency support in HubClient for borrow/return (Idempotency-Key header) and transparent retry handling.
[ ] Package a CLI (dotnet tool) to interact with the grid: login, list-labels, borrow/return, tail logs, diagnose; publish to NuGet.
[ ] Add Prometheus exemplars and trace linkage for borrow latency histograms; propagate runId/traceId via W3C baggage.
[ ] Define and codify SLOs with alerting rules (borrow success rate, p95 latency, node heartbeat gap); ship Prometheus/Grafana alerts.
[ ] Expand testing with property-based and fuzz tests for label parsing/matching and Hub request validation.
[ ] Add chaos tests: Redis outage, Hub/Worker restarts, network partitions, and clock skew; assert recovery within SLO.
[ ] Create a nightly soak test pipeline to run long-duration borrow/return cycles and report regressions.
[ ] Benchmark critical paths (label matching, Redis operations) with BenchmarkDotNet; track regressions in CI.
[ ] Profile Hub/Worker memory/CPU under load; reduce allocations and capture flamegraphs for hot paths.
[ ] Add developer tooling: devcontainer setup, Makefile targets, pre-commit hooks (dotnet format, analyzers) and consistent .editorconfig.
[x] Optimize Dockerfiles with BuildKit cache mounts and better layer ordering; document cache strategy.
[ ] Author a security threat model (STRIDE) and hardening guide; include SRE runbooks and incident response procedures.
[ ] Automate diagram generation and publishing (Mermaid/PlantUML) as part of mkdocs; integrate with the architecture diagrams task above.
[ ] Validate IPv6 and proxy support end-to-end; document reverse proxy patterns and limitations.
[ ] Provide reverse-proxy examples (Traefik/Nginx) with sticky sessions for WS and TLS termination; include compose overrides.
[ ] Introduce multi-tenancy: namespaced labels and quotas/rate limits per tenant; surface tenant in metrics and logs.
[ ] Enable hot-reload for config via IOptionsMonitor where safe (log levels, borrow strategy flags) without restarts.
[ ] Adopt FeatureManagement for feature flags with environment/tenant targeting; wire to the existing label matching strategy toggles (wildcards, trailing fallback, prefix expansion).
[ ] Ensure audit logs are tamper-evident and optionally export to external SIEM (OTLP/syslog); add retention controls.
[ ] Sign container images (cosign) and publish provenance/SBOM attestations (SLSA level targets) in CI.
[ ] Add localization (i18n) to Dashboard with language switcher; ensure date/number formatting respects locale.
[ ] Document and optionally support GPU acceleration (NVIDIA/Intel VA-API) for headful runs; provide example images and detection.
[ ] Define Redis memory and eviction policies; emit alarms when approaching limits and document tuning guidance.
[ ] Add pagination/filtering/count endpoints for admin APIs (nodes, sessions, runs) to aid tooling and Dashboard.
[x] Implement a durable store adapter (e.g., PostgreSQL) with schema migrations for long-term run/log retention; make pluggable.
[ ] Minimize telemetry PII; add sampling/redaction policies across traces/logs/metrics with config-driven controls.
[ ] Add scheduled synthetic monitors (GitHub Actions cron) to hit /ready and perform a basic borrow against a local grid.
[x] Introduce RunName as a first-class, optional human-friendly identifier alongside RunId across the platform (Hub, Worker, Dashboard, HubClient).
[x] Define validation rules for RunName: trim input, max length 128, allow letters/numbers/space/._- only; reject control chars; document case policy.
[x] Domain model: add RunName (string?) to shared DTOs/entities (Run, BorrowRequest/Response, RunSummary) with XML docs and nullability annotations.
[x] Hub API: accept RunName in Borrow request payloads (and propagate in response); expose in Run results and SignalR events; update OpenAPI with examples.
[x] Backward compatibility: keep RunName optional; default display to RunId when RunName is null/empty; do not break existing clients.
[x] Storage: persist RunName alongside RunId in Redis (keys/values); verify schema/read paths; ensure sweeper and TTL logic include RunName where relevant.
[x] Hub logging/metrics: include RunName in structured logs as a field (not a metric label) to avoid high-cardinality metrics; add redaction if enabled.
[x] Worker: carry RunName in WS proxy scopes and forward in event/log messages to Hub; include in sidecar run context if applicable.
[x] HubClient: add optional runName parameter to BorrowAsync and related methods; update overloads and XML docs; maintain existing signatures.
[x] Dashboard UI: display RunName prominently in Results and Run detail pages; fall back to RunId if missing; add filter/search by RunName; include in deep links.
[x] Dashboard API/adapters: extend view models and queries to surface RunName; ensure server-driven paging/sorting can sort by RunName.
[x] Tests – unit: add validation tests for RunName parsing; update DTO serialization tests; cover HubClient overload behavior and null/empty handling.
[x] Tests – integration: borrow a session with RunName set and assert it appears in results, SignalR stream, and Dashboard filtering.
[x] Documentation: update README, API docs (Swagger snippets), and dashboard guidance with examples using RunName; add examples in docs/cli.md if relevant.
[x] Security/PII: clarify that RunName may contain descriptive text; recommend avoiding sensitive data; ensure redaction feature covers RunName if policy set.
[x] Non-goals: do not add RunName to metric labels or Redis keys; keep it as data only to avoid cardinality/compat issues; document rationale.
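
The RunName validation and fallback rules listed above (trim, max length 128, letters/numbers/space/._- only, display RunId when RunName is null or empty) can be sketched as below. This is an illustrative Python sketch, not the shipped validator. The function names are hypothetical, and restricting "letters/numbers" to ASCII is an assumption, since the checklist does not say whether Unicode letters are allowed.

```python
import re
from typing import Optional

MAX_RUN_NAME_LENGTH = 128

# Assumption: ASCII letters and digits only. Control characters, including
# interior tabs and newlines, are rejected because they are not in this set.
_ALLOWED = re.compile(r"[A-Za-z0-9 ._-]+")


def normalize_run_name(raw: Optional[str]) -> Optional[str]:
    """Return the trimmed RunName, None when absent, or raise on invalid input."""
    if raw is None:
        return None
    name = raw.strip()
    if not name:
        return None  # an empty or whitespace-only name counts as "not provided"
    if len(name) > MAX_RUN_NAME_LENGTH:
        raise ValueError(f"RunName exceeds {MAX_RUN_NAME_LENGTH} characters")
    if not _ALLOWED.fullmatch(name):
        raise ValueError("RunName may contain only letters, numbers, space, '.', '_' and '-'")
    return name


def display_name(run_id: str, run_name: Optional[str]) -> str:
    """Prefer the human-friendly RunName and fall back to RunId when it is missing."""
    return normalize_run_name(run_name) or run_id
```

For example, `display_name("run-123", "  Nightly smoke  ")` yields `Nightly smoke`, while `display_name("run-123", None)` yields `run-123`, so existing clients that never send a RunName see no change.
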
[ ] AI and LLM Enhancements (optional, privacy‑first, disabled by default)
[ ] Dashboard: "Explain this run" – generate an LLM summary of a run (errors, likely root cause, next steps) from redacted command logs and timings; expose a button on Results/Run pages; include copy-to-issue. Guard with AI_ENABLE=1.
[ ] Failure triage auto‑classification – categorize failures (capacity, auth/secret, WS/connectivity, site under test, timeouts, browser crash) via prompt with few‑shot examples; store category in run metadata and make it filterable in Dashboard.
[ ] Natural language search for results – convert free‑text queries (e.g., "Chromium UAT runs failing with timeouts yesterday") into structured filters (App/Browser/Env/Status/Time); provide offline synonym map fallback when AI is disabled.
[ ] RAG assistant for docs/config – in‑Dashboard helper that answers "How do I …?" using a local index of project docs (README, tasks, compatibility matrix) and current effective config; prefer local retrieval; only call LLM when explicitly enabled.
[ ] Capacity planning recommender – analyze historical label utilization and borrow queue metrics to suggest POOL_CONFIG adjustments and forecast demand by label; present as a report; use classical stats first; optionally add an LLM summary.
[ ] Anomaly detection and incident notes – detect spikes in borrow latency, WS disconnects, node heartbeats via simple z‑score/EWMA; open an incident card in Dashboard with an optional "AI incident summary" containing probable causes and suggested checks.
[ ] Flakiness detector – cluster intermittent failures across runs/labels; surface flaky labels/tests with confidence; add a weekly dashboard report.
[ ] Auto‑remediation hints – when common misconfigurations are detected (PUBLIC_WS_HOST mismatch, secret mismatch, no capacity for label), show contextual remediation steps; optionally generate an issue template with prefilled diagnostics.
[ ] AI provider plumbing – provider‑agnostic abstraction (OpenAI, Azure OpenAI, Ollama/local) behind an interface; env: AI_PROVIDER, AI_API_KEY/ENDPOINT, AI_MODEL, AI_ENABLE; strict timeouts, retries, and rate limits.
[ ] Safety and cost controls – redact all secrets/PII before prompt; token accounting and per‑day budgets; allow per‑feature enablement; never log prompt/response content, only metadata; document data handling.
[ ] Unit tests for prompt builders/mappers and NL→filter translation; add record/replay fixtures for CI (canned responses) so tests run without network/API keys.
[ ] Telemetry for AI features – Prometheus counters/histograms for usage and latency; exemplar linkage to runId; dashboards for success/error rates; no high‑cardinality content.
[ ] Security review – ensure no secrets (headers, query params) can leak into prompts; honor existing redaction settings; add a global kill switch (AI_ENABLE=0).
[ ] Documentation – add docs/ai.md detailing enabling providers, example prompts, privacy expectations, and local dev via Ollama; link from README and Dashboard help.
[ ] Non‑goals/guardrails – never make core flows depend on AI; AI features must fail‑safe, degrade gracefully, and offer offline paths (e.g., deterministic synonym search, stats‑only reports).
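
The "simple z-score/EWMA" spike detection mentioned in the anomaly detection item could be sketched as follows, with no AI dependency. This is a minimal Python sketch under assumed parameters: the class name, `alpha`, `threshold`, and `warmup` values are hypothetical and would need tuning against real borrow latency data.

```python
import math
from typing import Optional


class EwmaSpikeDetector:
    """Flags samples that deviate strongly from an exponentially weighted baseline."""

    def __init__(self, alpha: float = 0.2, threshold: float = 3.0, warmup: int = 5) -> None:
        self.alpha = alpha          # weight given to the newest sample
        self.threshold = threshold  # z-score above which a sample is a spike
        self.warmup = warmup        # samples to observe before flagging anything
        self.mean: Optional[float] = None
        self.var = 0.0
        self.count = 0

    def observe(self, x: float) -> bool:
        """Score x against the current baseline, then fold it into the baseline."""
        is_spike = False
        if self.mean is None:
            self.mean = x
        else:
            std = math.sqrt(self.var)
            if self.count >= self.warmup and std > 0:
                is_spike = abs(x - self.mean) / std > self.threshold
            # Standard incremental EWMA mean/variance update.
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.count += 1
        return is_spike
```

Feeding borrow latencies in milliseconds one at a time, a steady series around 100 ms returns `False` for each sample, and a sudden 500 ms sample returns `True`. Note that a flagged spike is still folded into the baseline here. Skipping or capping that update is a design choice left open in this sketch.
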