State Machine | EyeLog Docs

Overview

The agent state machine is authoritative: it lives in the agent and determines which actions are allowed at any given time. The collector keeps a shadow copy of this state, updated via heartbeats, for monitoring and multi-collector coordination.

Key Principles:

All state transitions must be explicit and validated

Invalid transitions are rejected (not silently ignored)

Each state defines what operations are permitted

State persists across restarts (stored in BoltDB)

State is reported to the collector in every heartbeat

State Diagram

+-----------+
|  STOPPED  |  Initial state / final state
+-----+-----+
      | Start()
      v
+-----------+
| STARTING  |  Loading config, initializing storage
+-----+-----+
      | initialized
      v
+-----------+     not enrolled      +-----------+
|   READY   |<-------------------->| ENROLLING |
| (offline) |     enrolled          +-----------+
+-----+-----+
      | Connect()
      v
+-----------+     connection lost   +-------------+
|CONNECTING |<-------------------->| DISCONNECTED|
+-----+-----+     reconnecting      +------+------+
      | connected                          | reconnect
      v                                    | timeout
+-----------+                              v
|   READY   |<---------------------- (back to CONNECTING)
| (online)  |
+-----+-----+
      | command received
      v
+-----------+
| EXECUTING |  Running a command
+-----+-----+
      | command complete
      v
+-----------+
|   READY   |
| (online)  |
+-----+-----+
      | drain notice / shutdown signal
      v
+-----------+
| DRAINING  |  Finishing current work, refusing new commands
+-----+-----+
      | drained
      v
+-----------+
|  STOPPED  |
+-----------+

State Definitions

State	Description	Allowed Operations
`STOPPED`	Agent not running	Start
`STARTING`	Initializing	None (internal)
`ENROLLING`	Requesting enrollment	Cancel
`READY`	Idle, waiting for work	Connect, ExecuteCommand, Drain, Stop
`CONNECTING`	Establishing connection	Cancel, Stop
`DISCONNECTED`	Connection lost, will retry	Reconnect, Stop
`EXECUTING`	Running a command	(command-specific), Cancel
`DRAINING`	Graceful shutdown in progress	None (wait for completion)

State Timeouts

The following states have default timeouts so that the agent cannot get stuck in them:

State	Default Timeout	On Timeout
`STARTING`	30s	STOPPED (fatal)
`ENROLLING`	5min	READY (retry later)
`CONNECTING`	30s	DISCONNECTED
`EXECUTING`	per-command	READY (report timeout)
`DRAINING`	60s	STOPPED (force kill)

Command Timeout Classes

Commands specify a timeout class, not a fixed duration:

Class	Duration	Examples
`instant`	5s	ping, status
`quick`	15s	set_log_level, config
`medium`	30s	exec, manifest apply
`human`	none	manifest approval (cancellable)
`system`	none	reboot, shutdown (fire-and-forget)

Valid Transitions

var validTransitions = map[State][]State{
    STOPPED:      {STARTING},
    STARTING:     {READY, STOPPED},           // success or fatal error
    ENROLLING:    {READY, STOPPED},           // enrolled or rejected
    READY:        {CONNECTING, ENROLLING, EXECUTING, DRAINING, STOPPED},
    CONNECTING:   {READY, DISCONNECTED, STOPPED},
    DISCONNECTED: {CONNECTING, STOPPED},
    EXECUTING:    {READY, DRAINING, STOPPED}, // complete or command error (READY), drain, or stop
    DRAINING:     {STOPPED},
}
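
A transition function built on this map might look like the sketch below. It repeats the map so the example is self-contained. The `Machine` type and its methods are hypothetical, and states are assumed to be strings. Invalid transitions return an error and leave the state unchanged, matching the principle that they are rejected rather than silently ignored.

```go
package main

import (
	"fmt"
	"sync"
)

type State string

const (
	STOPPED      State = "STOPPED"
	STARTING     State = "STARTING"
	ENROLLING    State = "ENROLLING"
	READY        State = "READY"
	CONNECTING   State = "CONNECTING"
	DISCONNECTED State = "DISCONNECTED"
	EXECUTING    State = "EXECUTING"
	DRAINING     State = "DRAINING"
)

var validTransitions = map[State][]State{
	STOPPED:      {STARTING},
	STARTING:     {READY, STOPPED},
	ENROLLING:    {READY, STOPPED},
	READY:        {CONNECTING, ENROLLING, EXECUTING, DRAINING, STOPPED},
	CONNECTING:   {READY, DISCONNECTED, STOPPED},
	DISCONNECTED: {CONNECTING, STOPPED},
	EXECUTING:    {READY, DRAINING, STOPPED},
	DRAINING:     {STOPPED},
}

// Machine (hypothetical) guards the current state with a mutex.
type Machine struct {
	mu      sync.Mutex
	current State
}

// NewMachine starts in STOPPED, the initial state in the diagram.
func NewMachine() *Machine { return &Machine{current: STOPPED} }

// Transition moves to the target state only if the map allows it.
func (m *Machine) Transition(to State) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, allowed := range validTransitions[m.current] {
		if allowed == to {
			m.current = to
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", m.current, to)
}

// Current returns the state at the time of the call.
func (m *Machine) Current() State {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.current
}
```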

Connection Status vs State

These are separate concepts:

Concept	Tracked By	Values	Purpose
State	Agent (authoritative)	READY, EXECUTING, DRAINING, etc.	What agent is doing
Connection Status	Collector	ONLINE, OFFLINE, UNRESPONSIVE	Can collector reach agent?

The collector derives connection status from:

ONLINE = received heartbeat within timeout
OFFLINE = no connection
UNRESPONSIVE = connected but no recent heartbeat

State Reporting via Heartbeat

Every heartbeat includes the current state and its timeout information:

message Heartbeat {
    string state = 1;              // READY, EXECUTING, etc.
    string state_detail = 2;       // e.g., "executing: info.inventory.hardware"
    int64 state_since = 3;         // Unix timestamp of last state change
    int64 state_timeout_at = 4;    // When the current state times out (0 = no timeout)
    string manifest_id = 5;
    repeated string capabilities = 6;
    int64 timestamp = 7;
}

Command Execution Rules

Commands are accepted only while the agent is in the READY state; in any other state they are rejected.

Execution Flow:

1. Receive command from collector
2. Validate state == READY
3. Transition to EXECUTING
4. Execute command (may take time)
5. On completion: transition to READY, send response
6. On error: transition to READY, send error response
7. On drain during execution: finish command, then drain

State Persistence

State is persisted to BoltDB to survive restarts:

type PersistedState struct {
    State           State     // Current state
    LastTransition  time.Time // When state changed
    PendingCommand  *Command  // If EXECUTING, what command
    DrainReason     string    // If DRAINING, why
}

On restart:

Load persisted state
If the state was EXECUTING → check whether the command was atomic, then either resume it or mark it failed
If the state was DRAINING → continue draining
Otherwise → transition to READY
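
The restart rules can be sketched as a pure decision function. Because the text does not say how atomicity maps to "resume" versus "mark failed", this sketch takes that decision as a caller-supplied predicate. `RecoveryAction`, `recoveryAction`, and `canResume` are hypothetical names, and the structs are trimmed-down stand-ins for the real ones.

```go
package main

type State string

const (
	READY      State = "READY"
	CONNECTING State = "CONNECTING"
	EXECUTING  State = "EXECUTING"
	DRAINING   State = "DRAINING"
)

// Command and PersistedState are simplified stand-ins for the persisted types.
type Command struct{ Name string }

type PersistedState struct {
	State          State
	PendingCommand *Command
}

type RecoveryAction string

const (
	ResumeCommand     RecoveryAction = "resume-command"
	MarkCommandFailed RecoveryAction = "mark-command-failed"
	ContinueDraining  RecoveryAction = "continue-draining"
	GoReady           RecoveryAction = "go-ready"
)

// recoveryAction picks what to do after loading the persisted state.
// canResume encapsulates the atomicity check, which is not specified here.
func recoveryAction(p PersistedState, canResume func(*Command) bool) RecoveryAction {
	switch p.State {
	case EXECUTING:
		if p.PendingCommand == nil {
			return GoReady // assumption: nothing recorded, so nothing to recover
		}
		if canResume(p.PendingCommand) {
			return ResumeCommand
		}
		return MarkCommandFailed
	case DRAINING:
		return ContinueDraining
	default:
		return GoReady
	}
}
```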

Example Scenario

Timeline:
---------------------------------------------------------------------------

T0: Agent in READY state (online)
    +-- Admin A: "Run inventory scan" -> ACCEPTED
    +-- Agent -> EXECUTING state

T1: Executing... (30% complete)
    +-- Admin B: "Restart agent" -> REJECTED
    |   +-- Reason: "Agent is executing a command"
    +-- Collector shows: "Cannot restart during execution"

T2: Executing... (100% complete)
    +-- Agent -> READY state

T3: Agent in READY state
    +-- Admin B: "Restart agent" -> ACCEPTED

---------------------------------------------------------------------------
Result: No interrupted commands. Clear feedback to Admin B.
