Overview
The agent state machine is authoritative: it lives in the agent and determines which actions are allowed at any given time. The collector tracks a shadow copy of this state via heartbeats, which it uses for monitoring and multi-collector coordination.
Key Principles:
- All state transitions must be explicit and validated
- Invalid transitions are rejected (not silently ignored)
- Each state defines what operations are permitted
- State persists across restarts (stored in BoltDB)
- State is reported to the collector in every heartbeat
State Diagram
    +-----------+
    |  STOPPED  |  Initial state / final state
    +-----+-----+
          | Start()
          v
    +-----------+
    | STARTING  |  Loading config, initializing storage
    +-----+-----+
          | initialized
          v
    +-----------+     not enrolled     +-----------+
    |   READY   |<-------------------->| ENROLLING |
    | (offline) |       enrolled       +-----------+
    +-----+-----+
          | Connect()
          v
    +-----------+   connection lost    +-------------+
    |CONNECTING |<-------------------->| DISCONNECTED|
    +-----+-----+     reconnecting     +------+------+
          | connected                         | reconnect
          v                                   | timeout
    +-----------+                             v
    |   READY   |<---------------------- (back to CONNECTING)
    | (online)  |
    +-----+-----+
          | command received
          v
    +-----------+
    | EXECUTING |  Running a command
    +-----+-----+
          | command complete
          v
    +-----------+
    |   READY   |
    | (online)  |
    +-----+-----+
          | drain notice / shutdown signal
          v
    +-----------+
    | DRAINING  |  Finishing current work, refusing new commands
    +-----+-----+
          | drained
          v
    +-----------+
    |  STOPPED  |
    +-----------+
State Definitions
| State | Description | Allowed Operations |
|---|---|---|
| STOPPED | Agent not running | Start |
| STARTING | Initializing | None (internal) |
| ENROLLING | Requesting enrollment | Cancel |
| READY | Idle, waiting for work | Connect, ExecuteCommand, Drain, Stop |
| CONNECTING | Establishing connection | Cancel, Stop |
| DISCONNECTED | Connection lost, will retry | Reconnect, Stop |
| EXECUTING | Running a command | (command-specific), Cancel |
| DRAINING | Graceful shutdown in progress | None (wait for completion) |
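The table above can be expressed as a lookup that an operation handler consults before acting. This is a minimal sketch, not the agent's actual code: the `Op` type, the `allowedOps` map, and the `Permits` helper are hypothetical names introduced for illustration, and states are assumed to be plain strings.

```go
package main

// State and Op are modeled as strings for this sketch (an assumption).
type State string
type Op string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	EXECUTING    State = "EXECUTING"
	DRAINING     State = "DRAINING"
)

// allowedOps mirrors the State Definitions table. States mapped to nil
// accept no external operations. Command-specific operations during
// EXECUTING are not modeled here.
var allowedOps = map[State][]Op{
	STOPPED:      {"Start"},
	STARTING:     nil,
	ENROLLING:    {"Cancel"},
	READY:        {"Connect", "ExecuteCommand", "Drain", "Stop"},
	CONNECTING:   {"Cancel", "Stop"},
	DISCONNECTED: {"Reconnect", "Stop"},
	EXECUTING:    {"Cancel"},
	DRAINING:     nil,
}

// Permits reports whether op may be requested while the agent is in s.
func Permits(s State, op Op) bool {
	for _, allowed := range allowedOps[s] {
		if allowed == op {
			return true
		}
	}
	return false
}
```

For example, `Permits(READY, "Drain")` is true, while `Permits(DRAINING, "Stop")` is false because a draining agent only waits for completion.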
State Timeouts
States have timeouts to prevent the agent from being stuck:
| State | Default Timeout | On Timeout |
|---|---|---|
| STARTING | 30s | STOPPED (fatal) |
| ENROLLING | 5min | READY (retry later) |
| CONNECTING | 30s | DISCONNECTED |
| EXECUTING | per-command | READY (report timeout) |
| DRAINING | 60s | STOPPED (force kill) |
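As an illustration, the defaults above could be held in a table that yields both the deadline and the fallback state. The sketch below is hypothetical (`timeoutRule`, `stateTimeouts`, `TimeoutAt`, and `OnTimeout` are not names from the design). It works in Unix seconds and returns 0 for "no timeout", which is an assumption chosen to match the `state_timeout_at` heartbeat field. EXECUTING is left out because its timeout is per-command.

```go
package main

import "time"

// State is modeled as a string for this sketch (an assumption).
type State string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	DRAINING     State = "DRAINING"
)

// timeoutRule pairs a default timeout with the state entered when it fires.
type timeoutRule struct {
	After time.Duration
	Next  State
}

// stateTimeouts mirrors the State Timeouts table (EXECUTING excluded).
var stateTimeouts = map[State]timeoutRule{
	STARTING:   {After: 30 * time.Second, Next: STOPPED},
	ENROLLING:  {After: 5 * time.Minute, Next: READY},
	CONNECTING: {After: 30 * time.Second, Next: DISCONNECTED},
	DRAINING:   {After: 60 * time.Second, Next: STOPPED},
}

// TimeoutAt returns the Unix time at which a state entered at sinceUnix
// expires, or 0 if the state has no fixed timeout.
func TimeoutAt(s State, sinceUnix int64) int64 {
	rule, ok := stateTimeouts[s]
	if !ok {
		return 0
	}
	return sinceUnix + int64(rule.After/time.Second)
}

// OnTimeout returns the state to enter when s times out; ok is false
// if s has no fixed timeout.
func OnTimeout(s State) (next State, ok bool) {
	rule, ok := stateTimeouts[s]
	return rule.Next, ok
}
```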
Command Timeout Classes
Commands specify a timeout class, not a fixed duration:
| Class | Duration | Examples |
|---|---|---|
| instant | 5s | ping, status |
| quick | 15s | set_log_level, config |
| medium | 30s | exec, manifest apply |
| human | none | manifest approval (cancellable) |
| system | none | reboot, shutdown (fire-and-forget) |
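One way to apply a timeout class is to resolve it to a duration and derive a `context.Context` from it. The sketch below assumes classes arrive as plain strings and treats a zero duration as "no deadline". `timeoutFor` and `commandContext` are hypothetical helpers, not part of the documented design.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// classTimeouts mirrors the table above; zero means no deadline.
var classTimeouts = map[string]time.Duration{
	"instant": 5 * time.Second,
	"quick":   15 * time.Second,
	"medium":  30 * time.Second,
	"human":   0,
	"system":  0,
}

// timeoutFor resolves a class name to its duration. Unknown classes
// are rejected rather than given a guessed default.
func timeoutFor(class string) (time.Duration, error) {
	d, ok := classTimeouts[class]
	if !ok {
		return 0, fmt.Errorf("unknown timeout class %q", class)
	}
	return d, nil
}

// commandContext returns a context for running a command of the given
// class. Classes without a deadline still get a cancel function so the
// caller can abort them explicitly.
func commandContext(parent context.Context, class string) (context.Context, context.CancelFunc, error) {
	d, err := timeoutFor(class)
	if err != nil {
		return nil, nil, err
	}
	if d == 0 {
		ctx, cancel := context.WithCancel(parent)
		return ctx, cancel, nil
	}
	ctx, cancel := context.WithTimeout(parent, d)
	return ctx, cancel, nil
}
```

Resolving the class at execution time, rather than storing a duration in the command, keeps the durations adjustable in one place.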
Valid Transitions
var validTransitions = map[State][]State{
	STOPPED:      {STARTING},
	STARTING:     {READY, STOPPED}, // success or fatal error
	ENROLLING:    {READY, STOPPED}, // enrolled or rejected
	READY:        {CONNECTING, ENROLLING, EXECUTING, DRAINING, STOPPED},
	CONNECTING:   {READY, DISCONNECTED, STOPPED},
	DISCONNECTED: {CONNECTING, STOPPED},
	EXECUTING:    {READY, DRAINING, STOPPED}, // complete, drain, or error
	DRAINING:     {STOPPED},
}
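A minimal sketch of how this table could be enforced so that invalid transitions are rejected with an error instead of being silently ignored. The `Machine` type and its `Transition` method are hypothetical names, and `State` is assumed to be a string. The table is repeated so the example compiles on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// State is modeled as a string for this sketch (an assumption).
type State string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	EXECUTING    State = "EXECUTING"
	DRAINING     State = "DRAINING"
)

var validTransitions = map[State][]State{
	STOPPED:      {STARTING},
	STARTING:     {READY, STOPPED},
	ENROLLING:    {READY, STOPPED},
	READY:        {CONNECTING, ENROLLING, EXECUTING, DRAINING, STOPPED},
	CONNECTING:   {READY, DISCONNECTED, STOPPED},
	DISCONNECTED: {CONNECTING, STOPPED},
	EXECUTING:    {READY, DRAINING, STOPPED},
	DRAINING:     {STOPPED},
}

// Machine is a hypothetical wrapper that guards the current state.
type Machine struct {
	mu    sync.Mutex
	state State
}

// Transition moves to next only if the table allows it. On rejection
// it returns an error and leaves the current state untouched.
func (m *Machine) Transition(next State) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, allowed := range validTransitions[m.state] {
		if allowed == next {
			m.state = next
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", m.state, next)
}
```

Starting from `STOPPED`, `Transition(STARTING)` succeeds, while a following `Transition(EXECUTING)` returns an error and the machine stays in `STARTING`.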
Connection Status vs State
These are separate concepts:
| Concept | Tracked By | Values | Purpose |
|---|---|---|---|
| State | Agent (authoritative) | READY, EXECUTING, DRAINING, etc. | What agent is doing |
| Connection Status | Collector | ONLINE, OFFLINE, UNRESPONSIVE | Can collector reach agent? |
The collector derives connection status from:
- ONLINE = received heartbeat within timeout
- OFFLINE = no connection
- UNRESPONSIVE = connected but no recent heartbeat
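These three rules reduce to a small pure function on the collector side. The sketch below is an assumption-laden illustration: `deriveStatus` and its parameters are hypothetical, times are Unix seconds to match the heartbeat's `timestamp` field, and the heartbeat timeout value is left to the caller because the design does not specify it.

```go
package main

// ConnStatus is the collector-side view of reachability.
type ConnStatus string

const (
	ONLINE       ConnStatus = "ONLINE"
	OFFLINE      ConnStatus = "OFFLINE"
	UNRESPONSIVE ConnStatus = "UNRESPONSIVE"
)

// deriveStatus applies the rules above. connected says whether the
// collector currently holds a connection to the agent; lastHeartbeatUnix
// and nowUnix are Unix seconds; timeoutSec is an assumed collector setting.
func deriveStatus(connected bool, lastHeartbeatUnix, nowUnix, timeoutSec int64) ConnStatus {
	switch {
	case !connected:
		return OFFLINE
	case nowUnix-lastHeartbeatUnix <= timeoutSec:
		return ONLINE
	default:
		return UNRESPONSIVE
	}
}
```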
State Reporting via Heartbeat
Every heartbeat includes current state and timeout info:
message Heartbeat {
  string state = 1;                 // READY, EXECUTING, etc.
  string state_detail = 2;          // e.g., "executing: info.inventory.hardware"
  int64 state_since = 3;            // Unix timestamp of last state change
  int64 state_timeout_at = 4;       // When current state will timeout (0 = no timeout)
  string manifest_id = 5;
  repeated string capabilities = 6;
  int64 timestamp = 7;
}
Command Execution Rules
Commands can only be accepted in specific states:
Execution Flow:
1. Receive command from collector
2. Validate state == READY
3. Transition to EXECUTING
4. Execute command (may take time)
5. On completion: transition to READY, send response
6. On error: transition to READY, send error response
7. On drain during execution: finish command, then drain
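The flow above can be sketched as a single handler. This is an illustrative, single-goroutine simplification under stated assumptions: `Agent`, `drainPending`, `ErrNotReady`, and `HandleCommand` are hypothetical names, locking is omitted, and sending the response is left to the caller via the returned error.

```go
package main

import (
	"errors"
	"fmt"
)

// State is modeled as a string for this sketch (an assumption).
type State string

const (
	READY     State = "READY"
	EXECUTING State = "EXECUTING"
	DRAINING  State = "DRAINING"
)

// ErrNotReady is returned when a command arrives outside READY.
var ErrNotReady = errors.New("agent is not READY")

// Agent is a hypothetical stand-in holding only what the flow needs.
type Agent struct {
	state        State
	drainPending bool // set when a drain notice arrives mid-command
}

// HandleCommand validates the state, enters EXECUTING, runs the command,
// and then returns to READY, or moves to DRAINING if a drain was
// requested while the command was running.
func (a *Agent) HandleCommand(run func() error) error {
	if a.state != READY {
		return fmt.Errorf("%w: current state is %s", ErrNotReady, a.state)
	}
	a.state = EXECUTING
	err := run() // success and failure both leave EXECUTING afterwards
	if a.drainPending {
		a.state = DRAINING
	} else {
		a.state = READY
	}
	return err
}
```

Rejecting the command with an explicit error, rather than queueing it, is what produces the "Agent is executing a command" style of feedback.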
State Persistence
State is persisted to BoltDB to survive restarts:
type PersistedState struct {
	State          State     // Current state
	LastTransition time.Time // When state changed
	PendingCommand *Command  // If EXECUTING, what command
	DrainReason    string    // If DRAINING, why
}
On restart:
- Load persisted state
- If the state was EXECUTING → check if command was atomic, resume or mark failed
- If the state was DRAINING → continue draining
- Otherwise → transition to READY
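The restart rules can be sketched as a pure decision function. The design does not spell out how atomicity maps to "resume" versus "mark failed", so the sketch takes that decision as a caller-supplied predicate. `Recovery`, `planRecovery`, and `canResume` are hypothetical names, and returning to READY after marking a command failed is an assumption.

```go
package main

// State is modeled as a string for this sketch (an assumption).
type State string

const (
	READY     State = "READY"
	EXECUTING State = "EXECUTING"
	DRAINING  State = "DRAINING"
)

// Command and PersistedState are trimmed to the fields recovery needs.
type Command struct {
	Name string
}

type PersistedState struct {
	State          State
	PendingCommand *Command
}

// Recovery describes what the agent should do after loading state.
type Recovery struct {
	Next       State // state to continue in
	Resume     bool  // re-run or continue the pending command
	MarkFailed bool  // report the pending command as failed
}

// planRecovery applies the restart rules. canResume encapsulates the
// atomicity check, which the design leaves unspecified.
func planRecovery(p PersistedState, canResume func(*Command) bool) Recovery {
	switch p.State {
	case EXECUTING:
		if p.PendingCommand != nil && canResume != nil && canResume(p.PendingCommand) {
			return Recovery{Next: EXECUTING, Resume: true}
		}
		return Recovery{Next: READY, MarkFailed: true}
	case DRAINING:
		return Recovery{Next: DRAINING}
	default:
		return Recovery{Next: READY}
	}
}
```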
Example Scenario
Timeline:
---------------------------------------------------------------------------
T0: Agent in READY state (online)
    +-- Admin A: "Run inventory scan" -> ACCEPTED
    +-- Agent -> EXECUTING state
T1: Executing... (30% complete)
    +-- Admin B: "Restart agent" -> REJECTED
    |   +-- Reason: "Agent is executing a command"
    +-- Collector shows: "Cannot restart during execution"
T2: Executing... (100% complete)
    +-- Agent -> READY state
T3: Agent in READY state
    +-- Admin B: "Restart agent" -> ACCEPTED
---------------------------------------------------------------------------
Result: No interrupted commands. Clear feedback to Admin B.