Self-heal Policy v0
Scope
Defines bounded local recovery rules for Probe/runtime/channel failures and escalation conditions to Gateway/manual intervention.
Terminology
- Self-heal: local bounded recovery without changing global system plan.
- Budget: retry limit within a defined time window.
- Escalation: handoff to Gateway decision path.
Normative Spec
Allowed Local Actions
Probe MAY execute only:
- restart
- reconnect
- remount
- rebind
- refresh_local_endpoint
Forbidden Local Actions
Probe MUST NOT:
- change host placement
- alter stable route (workspace_id + agent_id)
- mutate desired state authority
- bypass platform-level approval
- silently terminate/archive workloads
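One way to enforce the allowed and forbidden lists above is an explicit allow-list, so that any action name outside the five permitted ones is rejected by default. The following is a minimal sketch under that assumption; LocalAction and parse_local_action are hypothetical names, not part of this spec.

```python
from enum import Enum


class LocalAction(Enum):
    """The five local actions the policy permits a Probe to execute."""
    RESTART = "restart"
    RECONNECT = "reconnect"
    REMOUNT = "remount"
    REBIND = "rebind"
    REFRESH_LOCAL_ENDPOINT = "refresh_local_endpoint"


def parse_local_action(name: str) -> LocalAction:
    """Return the allowed action for `name`, or raise if it is not on the allow-list.

    Hypothetical helper: anything not listed above (for example a host
    placement change) is treated as forbidden and must not run locally.
    """
    try:
        return LocalAction(name)
    except ValueError:
        raise PermissionError(f"{name!r} is not an allowed local action") from None
```

Rejecting by default means a newly introduced action stays forbidden until it is added to the enum on purpose.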
Thresholds (v0)
- Max 3 retries for same fault class within 10 minutes.
- Backoff sequence: 30s -> 90s -> 210s.
- Exceeding budget MUST escalate to Gateway.
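The thresholds above can be sketched as a per-fault-class retry budget. The three backoff delays add up to 330 seconds, so a full retry sequence fits inside the 10-minute window. This sketch rests on two assumptions the policy does not state: the window is sliding rather than fixed, and the backoff step is chosen by the number of retries still inside the window. RetryBudget and next_delay are hypothetical names.

```python
import time
from collections import defaultdict, deque

MAX_RETRIES = 3                 # max retries per fault class (from the policy)
WINDOW_SECONDS = 10 * 60        # 10-minute budget window (from the policy)
BACKOFF_SECONDS = (30, 90, 210)  # backoff sequence (from the policy)


class RetryBudget:
    """Tracks retries per fault class. Assumes a sliding window (not in the spec)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._attempts = defaultdict(deque)  # fault_class -> retry timestamps

    def next_delay(self, fault_class: str):
        """Return the backoff delay in seconds for the next retry.

        Returns None when the budget is exhausted, which means the caller
        must escalate to Gateway instead of retrying locally.
        """
        now = self._clock()
        attempts = self._attempts[fault_class]
        # Drop retries that have aged out of the window.
        while attempts and now - attempts[0] >= WINDOW_SECONDS:
            attempts.popleft()
        if len(attempts) >= MAX_RETRIES:
            return None
        delay = BACKOFF_SECONDS[len(attempts)]
        attempts.append(now)
        return delay
```

The clock is injectable so the budget can be exercised in tests without waiting for real time to pass.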
Phase Semantics
- During healing: phase enters recovering.
- On success: return to normal running phase.
- On repeated failure: mark degraded/failed and escalate.
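The phase rules above read as a small state machine. The sketch below is one possible encoding; the outcome strings and the still_serving flag are hypothetical, and the policy does not say how to choose between degraded and failed, so that choice is left to the caller here.

```python
from enum import Enum
from typing import Tuple


class Phase(Enum):
    RUNNING = "running"
    RECOVERING = "recovering"
    DEGRADED = "degraded"
    FAILED = "failed"


def next_phase(current: Phase, outcome: str, still_serving: bool = False) -> Tuple[Phase, bool]:
    """Return (new_phase, escalate) for a self-heal outcome.

    Outcome names are illustrative assumptions, not defined by the spec:
    "fault_detected", "recovery_succeeded", "budget_exhausted".
    """
    if current is Phase.RUNNING and outcome == "fault_detected":
        return Phase.RECOVERING, False
    if current is Phase.RECOVERING and outcome == "recovery_succeeded":
        return Phase.RUNNING, False
    if current is Phase.RECOVERING and outcome == "budget_exhausted":
        # Assumption: degraded if the workload still partially serves, else failed.
        return (Phase.DEGRADED if still_serving else Phase.FAILED), True
    raise ValueError(f"no transition from {current.value} on {outcome!r}")
```

Returning the escalate flag together with the new phase keeps the "mark and escalate" rule in a single step, so a caller cannot mark a workload degraded or failed without also seeing that escalation is required.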
ssh_remote Downgrade
- ide_primary failures MAY downgrade to terminal_fallback.
- Downgrade MUST include explicit reason code.
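The reason-code requirement can be enforced at construction time, so that a downgrade without a reason cannot be recorded at all. This is a minimal sketch; ChannelDowngrade, downgrade_ide_primary, and the example reason code are hypothetical and not defined by this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelDowngrade:
    """Record of an ssh_remote mode downgrade (hypothetical shape)."""
    from_mode: str
    to_mode: str
    reason_code: str

    def __post_init__(self):
        # The policy requires an explicit reason code, so reject blank ones.
        if not self.reason_code.strip():
            raise ValueError("downgrade requires an explicit reason code")


def downgrade_ide_primary(reason_code: str) -> ChannelDowngrade:
    """Build the ide_primary -> terminal_fallback downgrade record."""
    return ChannelDowngrade("ide_primary", "terminal_fallback", reason_code)
```

For example, downgrade_ide_primary("IDE_HANDSHAKE_TIMEOUT") yields a record, while downgrade_ide_primary("") raises ValueError.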
Examples
Error & Recovery
Self-heal MUST emit typed events:
- RecoveryAttempted
- RecoverySucceeded or RecoveryFailed
Each event carries action, retry_count, reason, and optional error_code.
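The event types and fields named above could be modeled as follows. This is a sketch only: the field types, the shared base class, and the to_record helper are assumptions, since the policy names the events and fields but does not define a schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class RecoveryEvent:
    """Fields shared by all self-heal events (types are assumed)."""
    action: str                       # e.g. "reconnect"
    retry_count: int                  # retries used so far for this fault class
    reason: str                       # why recovery was attempted
    error_code: Optional[str] = None  # optional, per the policy


class RecoveryAttempted(RecoveryEvent):
    """Emitted when a local recovery action starts."""


class RecoverySucceeded(RecoveryEvent):
    """Emitted when the recovery action restores the workload."""


class RecoveryFailed(RecoveryEvent):
    """Emitted when the recovery action does not restore the workload."""


def to_record(event: RecoveryEvent) -> dict:
    """Flatten an event into a dict tagged with its type name (hypothetical helper)."""
    return {"type": type(event).__name__, **asdict(event)}
```

Keeping one class per event name lets consumers dispatch on the type rather than parsing a status string.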
Security & Audit
- Self-heal actions are metadata-audited.
- DevBox internal coding-tool commands remain outside AIOS secondary approval path.
- Platform-level risky actions still require platform policy/approval.
Status
Current status: draft
Source discussions:
- docs/maintainers/conversation-records/agent-platform/2026-03-24-self-heal-policy-v0-draft.md
- docs/maintainers/conversation-records/agent-platform/2026-03-24-p1-freeze-result-accepted.md
Changelog
- 2026-03-24: Initial skeleton created.
- 2026-03-24: Added bounded self-heal actions, thresholds, and escalation rules.