Health & Discovery¶

Heartbeat system, health checks, and topology updates

Overview¶

MCP Mesh tracks agent health and keeps routing current through three mechanisms:

Heartbeats - Regular pings from each agent to the registry
Health checks - Custom health functions that report agent status
Topology updates - Automatic rerouting when an agent fails

Heartbeat System¶

How It Works¶

sequenceDiagram
    participant A as Agent
    participant R as Registry

    loop Every 30s (default)
        A->>R: POST /heartbeat
        R->>R: Update TTL
        R->>A: 200 OK
    end

    Note over R: If no heartbeat for 90s...
    R->>R: Mark agent unhealthy
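
The TTL bookkeeping in the diagram can be sketched in a few lines of Python. This is only an illustration of the idea, not the registry's actual implementation: `HeartbeatTable`, its methods, and the `"unknown"` state are hypothetical names introduced here, and the 90s window is taken from the diagram.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional

AGENT_TIMEOUT_S = 90.0  # the 90s window from the diagram


@dataclass
class HeartbeatTable:
    """Hypothetical registry-side record of the last heartbeat per agent."""

    timeout_s: float = AGENT_TIMEOUT_S
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, agent: str, now: Optional[float] = None) -> None:
        # Called when a heartbeat arrives; refreshes the agent's TTL.
        self.last_seen[agent] = time.monotonic() if now is None else now

    def status(self, agent: str, now: Optional[float] = None) -> str:
        # Unhealthy once no heartbeat has arrived within timeout_s.
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(agent)
        if seen is None:
            return "unknown"  # hypothetical state for never-seen agents
        return "healthy" if now - seen <= self.timeout_s else "unhealthy"


table = HeartbeatTable()
table.record("my-agent", now=0.0)
print(table.status("my-agent", now=60.0))   # healthy
print(table.status("my-agent", now=120.0))  # unhealthy
```

Passing `now` explicitly keeps the sketch deterministic; a real registry would rely on its own clock.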

Configuration¶

# Python
@mesh.agent(
    name="my-agent",
    health_interval=30,      # Heartbeat every 30s
    auto_run_interval=10,    # Keep-alive every 10s
)
class MyAgent:
    pass

// TypeScript
const agent = mesh(server, {
  name: "my-agent",
  heartbeatInterval: 30, // Heartbeat every 30s
});

Health Checks¶

Custom Health Function (Python)¶

async def my_health_check():
    """Custom health check.

    `db` and `memory_usage` are placeholders for application-defined helpers.
    """
    # Check database connection
    if not db.is_connected():
        return {"status": "unhealthy", "reason": "db disconnected"}

    # Check memory usage (threshold assumed to be a percentage)
    if memory_usage() > 90:
        return {"status": "degraded", "reason": "high memory"}

    return {"status": "healthy"}

@mesh.agent(
    name="my-agent",
    health_check=my_health_check,
    health_check_ttl=30,  # Cache health for 30s
)
class MyAgent:
    pass
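
To illustrate what caching a health result for a TTL means, here is a minimal sketch of a wrapper that re-runs an async check at most once per `ttl_s` seconds. It is not how MCP Mesh implements `health_check_ttl`; `cached_health` and `probe` are hypothetical names used only for this example.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional


def cached_health(check: Callable[[], Awaitable[Dict[str, Any]]], ttl_s: float):
    """Wrap an async health check so it runs at most once per ttl_s seconds."""
    cached: Optional[Dict[str, Any]] = None
    expires_at = 0.0

    async def wrapper() -> Dict[str, Any]:
        nonlocal cached, expires_at
        now = time.monotonic()
        if cached is None or now >= expires_at:
            cached = await check()       # cache miss: run the real check
            expires_at = now + ttl_s
        return cached

    return wrapper


calls = {"n": 0}  # counts how often the underlying check really runs


async def probe() -> Dict[str, Any]:
    calls["n"] += 1
    return {"status": "healthy"}


async def main() -> int:
    health = cached_health(probe, ttl_s=30)
    await health()
    await health()  # served from cache; probe is not called again
    return calls["n"]


print(asyncio.run(main()))  # 1
```

A longer TTL reduces load on the dependencies being probed, at the cost of reporting a stale status for up to `ttl_s` seconds.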

Health States¶

State	Description
`healthy`	Agent is fully operational
`degraded`	Agent works but with issues
`unhealthy`	Agent cannot serve requests

Discovery¶

Capability Discovery¶

Agents discover each other by capability:

# Provider registers capability
@mesh.tool(capability="user_service")
def get_user(): pass

# Consumer discovers by capability
@mesh.tool(dependencies=["user_service"])
def my_function(user_service=None): pass

Tag-Based Discovery¶

Filter by tags when multiple providers exist:

@mesh.tool(dependencies=[{
    "capability": "llm",
    "tags": ["claude", "+opus"]
}])
def my_function(llm=None): pass
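
One way to picture tag filtering is the sketch below. It assumes that a plain tag must be present on a provider and that a `+` prefix marks a tag as preferred but optional; that reading of `"+opus"` is an assumption made for illustration, and `pick_provider` and the provider names are hypothetical.

```python
from typing import Dict, List, Optional


def pick_provider(wanted: List[str], providers: Dict[str, List[str]]) -> Optional[str]:
    """Return the provider that best matches the wanted tags, or None."""
    required = {t for t in wanted if not t.startswith("+")}
    preferred = {t[1:] for t in wanted if t.startswith("+")}

    best_name, best_score = None, -1
    for name, tags in providers.items():
        tagset = set(tags)
        if not required <= tagset:
            continue  # missing a required tag: not a candidate
        score = len(preferred & tagset)  # more preferred tags ranks higher
        if score > best_score:
            best_name, best_score = name, score
    return best_name


providers = {
    "claude-sonnet": ["claude", "sonnet"],
    "claude-opus": ["claude", "opus"],
    "gpt": ["openai"],
}
print(pick_provider(["claude", "+opus"], providers))  # claude-opus
```

Under these assumptions, a consumer still resolves to some `claude` provider when no `opus` provider is registered.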

Version-Based Discovery¶

Require specific versions:

@mesh.tool(dependencies=[{
    "capability": "api",
    "version": ">=2.0.0"
}])
def my_function(api=None): pass
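
A constraint such as `>=2.0.0` can be evaluated by comparing version numbers as integer tuples. The sketch below shows that idea for plain `MAJOR.MINOR.PATCH` versions and a single comparison operator; it is not the registry's matcher, and `satisfies` and `parse_version` are hypothetical helpers. Pre-release suffixes and compound ranges are out of scope here.

```python
import re
from typing import Tuple

_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse "MAJOR.MINOR.PATCH" into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def satisfies(version: str, constraint: str) -> bool:
    """Check a version against one constraint such as ">=2.0.0"."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(\d+\.\d+\.\d+)\s*", constraint)
    if match is None:
        raise ValueError(f"unsupported constraint: {constraint!r}")
    op, target = match.groups()
    # Tuples compare element by element, so (2, 10, 0) > (2, 9, 0).
    return _OPS[op](parse_version(version), parse_version(target))


print(satisfies("2.1.0", ">=2.0.0"))  # True
print(satisfies("1.9.9", ">=2.0.0"))  # False
```

Comparing tuples rather than strings matters: as strings, `"2.10.0"` would sort before `"2.9.0"`.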

Topology Updates¶

Agent Joins¶

When a new agent registers:

1. The registry stores the agent's info.
2. Dependent agents are notified.
3. Proxies are updated with the new routes.

Agent Leaves¶

When an agent disconnects:

1. The registry detects the heartbeat timeout.
2. The agent is marked unhealthy.
3. Traffic is rerouted to healthy instances.
4. Dependent agents are notified.

Automatic Failover¶

graph LR
    subgraph "Before Failure"
        A1[Consumer] --> B1[Provider A - healthy]
        A1 -.-> B2[Provider B - healthy]
    end

    subgraph "After Failure"
        A2[Consumer] --> B3[Provider B - healthy]
        X[Provider A - unhealthy]
    end
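
The before/after picture in the diagram amounts to choosing a provider from the ones currently marked healthy. A minimal sketch, assuming a simple "first healthy provider wins" policy (the actual selection policy is not specified here, and `route` is a hypothetical helper):

```python
from typing import Dict, Optional


def route(providers: Dict[str, str]) -> Optional[str]:
    """Return the first provider whose state is "healthy", in insertion order."""
    for name, state in providers.items():
        if state == "healthy":
            return name
    return None  # no healthy provider: the caller needs a fallback


providers = {"provider-a": "healthy", "provider-b": "healthy"}
print(route(providers))  # provider-a

providers["provider-a"] = "unhealthy"  # heartbeat timeout detected
print(route(providers))  # provider-b
```

When `route` returns `None`, the consumer is in the situation where its injected dependency is unavailable and fallback logic applies.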

Monitoring¶

Registry Endpoints¶

# All agents
curl http://localhost:8000/agents

# Specific agent
curl http://localhost:8000/agents/my-agent

# Registry health
curl http://localhost:8000/health

CLI Commands¶

# List agents with status
meshctl list

# Detailed status
meshctl status

# Watch for changes
meshctl status --watch

Configuration¶

Environment Variables¶

# Heartbeat interval (seconds)
export MCP_MESH_HEALTH_INTERVAL=30

# Health check TTL (seconds)
export MCP_MESH_HEALTH_CHECK_TTL=30

# Agent timeout (mark unhealthy after)
export MCP_MESH_AGENT_TIMEOUT=90
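
A deployment script may want to sanity-check these values before starting agents. The sketch below reads the three variables and rejects a timeout that is not longer than the heartbeat interval. MCP Mesh reads the variables itself; `HealthSettings` and `load_settings` are hypothetical, the fallback values simply mirror the example values above, and the validation rule is an assumption of this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HealthSettings:
    health_interval_s: int
    health_check_ttl_s: int
    agent_timeout_s: int


def load_settings(env: Mapping[str, str] = os.environ) -> HealthSettings:
    """Read health-related settings, falling back to the example values."""
    settings = HealthSettings(
        health_interval_s=int(env.get("MCP_MESH_HEALTH_INTERVAL", "30")),
        health_check_ttl_s=int(env.get("MCP_MESH_HEALTH_CHECK_TTL", "30")),
        agent_timeout_s=int(env.get("MCP_MESH_AGENT_TIMEOUT", "90")),
    )
    # Assumed sanity rule: a timeout no longer than one heartbeat interval
    # would mark agents unhealthy between two normal heartbeats.
    if settings.agent_timeout_s <= settings.health_interval_s:
        raise ValueError("agent timeout must exceed the heartbeat interval")
    return settings


print(load_settings({"MCP_MESH_HEALTH_INTERVAL": "15", "MCP_MESH_AGENT_TIMEOUT": "45"}))
```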

Best Practices¶

1. Implement Health Checks¶

async def health():
    # Check all critical dependencies
    checks = {
        "database": await check_db(),
        "cache": await check_cache(),
        "memory": check_memory(),
    }

    if all(c["ok"] for c in checks.values()):
        return {"status": "healthy", "checks": checks}
    return {"status": "degraded", "checks": checks}

2. Handle Failures Gracefully¶

@mesh.tool(dependencies=["optional_service"])
async def my_function(optional_service=None):
    if optional_service is None:
        # Fallback logic
        return "Fallback response"
    return await optional_service()

3. Use Appropriate Intervals¶

Use Case	Heartbeat Interval	Agent Timeout (TTL)
Development	30s	90s
Production	15s	45s
High Availability	10s	30s
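
Every row in the table keeps the same ratio between the two values, which the short calculation below makes explicit. It reads the TTL column as the window after which an agent is marked unhealthy, and also shows the heartbeat traffic each profile implies per agent; `summarize` is a hypothetical helper.

```python
from typing import Tuple

# (heartbeat interval in seconds, TTL in seconds), copied from the table
profiles = {
    "Development": (30, 90),
    "Production": (15, 45),
    "High Availability": (10, 30),
}


def summarize(heartbeat_s: int, ttl_s: int) -> Tuple[float, int]:
    """Return (TTL as a multiple of the interval, heartbeats per hour)."""
    return ttl_s / heartbeat_s, 3600 // heartbeat_s


for name, (heartbeat_s, ttl_s) in profiles.items():
    ratio, per_hour = summarize(heartbeat_s, ttl_s)
    print(f"{name}: TTL = {ratio:.0f}x interval, {per_hour} heartbeats/hour per agent")
```

Shorter intervals detect failures sooner (30s instead of 90s in the worst case) but triple the heartbeat traffic to the registry.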

Troubleshooting¶

Agent Shows Unhealthy¶

# Run the agent with debug logging and inspect its logs
meshctl start my_agent.py --log-level debug

# Check the agent's entry in the registry
curl http://localhost:8000/agents/my-agent

Discovery Not Working¶

# Verify registration
curl http://localhost:8000/agents | jq '.agents[] | {name, capabilities}'

# Check namespace
curl http://localhost:8000/agents | jq '.agents[] | {name, namespace}'
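
The same namespace check can be done in Python once the response has been saved or fetched. The sketch below groups agent names by namespace; it assumes only the `agents`, `name`, and `namespace` fields used in the `jq` filters above, the sample payload is made up for illustration, and `agents_by_namespace` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Dict, List


def agents_by_namespace(payload: dict) -> Dict[str, List[str]]:
    """Group agent names by namespace so mismatches are easy to spot."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for agent in payload.get("agents", []):
        grouped[agent.get("namespace", "<none>")].append(agent["name"])
    return dict(grouped)


# Illustrative payload only; real responses may carry more fields.
sample = json.loads("""
{"agents": [
  {"name": "user-agent", "namespace": "default"},
  {"name": "billing-agent", "namespace": "staging"}
]}
""")
print(agents_by_namespace(sample))
# {'default': ['user-agent'], 'staging': ['billing-agent']}
```

A consumer and provider that land in different groups are a likely place to start looking when a dependency never resolves.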
