
Self-heal Policy v0

Scope

Defines bounded local recovery rules for Probe, runtime, and channel failures, and the conditions under which recovery escalates to Gateway or manual intervention.

Terminology

  • Self-heal: bounded local recovery that does not change the global system plan.
  • Budget: retry limit within a defined time window.
  • Escalation: handoff to Gateway decision path.

Normative Spec

Allowed Local Actions

Probe MAY execute only:
  • restart
  • reconnect
  • remount
  • rebind
  • refresh_local_endpoint

Forbidden Local Actions

Probe MUST NOT:
  • change host placement
  • alter stable route (workspace_id + agent_id)
  • mutate desired state authority
  • bypass platform-level approval
  • silently terminate/archive workloads

Thresholds (v0)

  • Max 3 retries for same fault class within 10 minutes.
  • Backoff sequence: 30s -> 90s -> 210s.
  • Exceeding budget MUST escalate to Gateway.
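
One way to track this budget is sketched below. It is illustrative only: the RetryBudget class is hypothetical, and it assumes a sliding 10-minute window keyed by fault class, with the backoff step chosen by the number of retries already recorded in that window. The spec does not state whether the window is sliding or fixed.

```python
from collections import defaultdict, deque
from typing import Optional

MAX_RETRIES = 3                  # v0: max retries per fault class
WINDOW_SECONDS = 10 * 60         # v0: 10-minute window
BACKOFF_SECONDS = (30, 90, 210)  # v0: backoff sequence


class RetryBudget:
    """Per-fault-class retry budget (assumption: sliding window)."""

    def __init__(self) -> None:
        # fault_class -> timestamps (seconds) of retries inside the window
        self._attempts = defaultdict(deque)

    def next_backoff(self, fault_class: str, now: float) -> Optional[int]:
        """Record a retry and return its backoff in seconds.

        Returns None when the budget is exhausted; the caller must then
        escalate to Gateway instead of retrying.
        """
        attempts = self._attempts[fault_class]
        # Forget retries that have aged out of the window.
        while attempts and now - attempts[0] >= WINDOW_SECONDS:
            attempts.popleft()
        if len(attempts) >= MAX_RETRIES:
            return None
        backoff = BACKOFF_SECONDS[len(attempts)]
        attempts.append(now)
        return backoff
```

Passing the current time in as `now` keeps the sketch deterministic; a caller would typically supply a monotonic clock reading.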

Phase Semantics

  • During healing: phase enters recovering.
  • On success: return to normal running phase.
  • On repeated failure: mark degraded/failed and escalate.
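
The phase rules above can be read as a small transition function. This is a sketch under assumptions: the phase identifiers other than recovering are illustrative names, and because the spec does not say when degraded applies versus failed, the terminal phase is left as a caller-supplied parameter.

```python
from enum import Enum
from typing import Tuple


class Phase(str, Enum):
    RUNNING = "running"        # assumed identifier for the normal running phase
    RECOVERING = "recovering"
    DEGRADED = "degraded"
    FAILED = "failed"


def on_heal_started() -> Phase:
    """During healing the phase enters recovering."""
    return Phase.RECOVERING


def on_heal_result(
    succeeded: bool,
    budget_exhausted: bool,
    terminal: Phase = Phase.DEGRADED,
) -> Tuple[Phase, bool]:
    """Return (next_phase, escalate) after one recovery attempt."""
    if succeeded:
        return Phase.RUNNING, False
    if budget_exhausted:
        # Repeated failure: mark degraded/failed and escalate.
        return terminal, True
    # Failed, but retries remain: stay in recovering.
    return Phase.RECOVERING, False
```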

ssh_remote Downgrade

  • ide_primary failures MAY downgrade to terminal_fallback.
  • Downgrade MUST include explicit reason code.
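
A downgrade record that cannot be built without a reason code might look like the sketch below. The Downgrade type, the downgrade_ssh_remote function, and the reason code used in the usage note are hypothetical; only the ide_primary and terminal_fallback mode names come from this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downgrade:
    from_mode: str
    to_mode: str
    reason_code: str


def downgrade_ssh_remote(current_mode: str, reason_code: str) -> Downgrade:
    """Build an ide_primary -> terminal_fallback downgrade record.

    Raises ValueError if the current mode is not ide_primary or if the
    reason code is empty, so a downgrade can never be recorded silently.
    """
    if current_mode != "ide_primary":
        raise ValueError("only ide_primary may downgrade to terminal_fallback")
    if not reason_code.strip():
        raise ValueError("downgrade requires an explicit reason code")
    return Downgrade("ide_primary", "terminal_fallback", reason_code)
```

For example, `downgrade_ssh_remote("ide_primary", "ide_attach_timeout")` returns a record whose `to_mode` is `terminal_fallback`; the reason code shown is a made-up value.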

Examples

{
  "action": "reconnect",
  "fault_class": "dialog_target_disconnected",
  "retry_count": 2,
  "backoff_seconds": 90,
  "phase": "recovering"
}

{
  "action": "restart",
  "fault_class": "ssh_remote_attach_failed",
  "retry_count": 4,
  "result": "escalated_to_gateway"
}

Error & Recovery

Self-heal MUST emit typed events:
  • RecoveryAttempted
  • RecoverySucceeded or RecoveryFailed
Each event MUST carry at least action, retry_count, and reason; error_code is optional.
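
The event types and fields above could be modeled as follows. This is a sketch, not a defined wire format: the shared RecoveryEvent base class, the to_payload helper, and the "type" key are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RecoveryEvent:
    """Fields required on every self-heal event; error_code is optional."""
    action: str
    retry_count: int
    reason: str
    error_code: Optional[str] = None


@dataclass(frozen=True)
class RecoveryAttempted(RecoveryEvent):
    pass


@dataclass(frozen=True)
class RecoverySucceeded(RecoveryEvent):
    pass


@dataclass(frozen=True)
class RecoveryFailed(RecoveryEvent):
    pass


def to_payload(event: RecoveryEvent) -> Dict[str, Any]:
    """Serialize an event to a plain dict, tagging it with its type name."""
    return {"type": type(event).__name__, **asdict(event)}
```

Because action, retry_count, and reason have no defaults, constructing any event without them fails immediately rather than emitting an incomplete record.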

Security & Audit

  • Self-heal actions are metadata-audited.
  • DevBox internal coding-tool commands remain outside AIOS secondary approval path.
  • Platform-level risky actions still require platform policy/approval.

Status

Current status: draft

Source discussions:
  • docs/maintainers/conversation-records/agent-platform/2026-03-24-self-heal-policy-v0-draft.md
  • docs/maintainers/conversation-records/agent-platform/2026-03-24-p1-freeze-result-accepted.md

Changelog

  • 2026-03-24: Initial skeleton created.
  • 2026-03-24: Added bounded self-heal actions, thresholds, and escalation rules.
