Distributed Tracing

Real-time trace correlation and analysis for MCP Mesh using Redis Streams

Overview

MCP Mesh implements a high-performance distributed tracing system built on Redis Streams that provides end-to-end visibility into request flows across multiple agents. Unlike traditional OpenTelemetry setups, this system is specifically designed for MCP's JSON-RPC protocol with automatic context propagation and real-time correlation.

Architecture Components

1. Python Agent Tracing (Publishers)

Python agents automatically publish trace events to Redis Streams when decorated with @mesh.tool():

@app.tool()
@mesh.tool(depends_on=["data-processor"])
async def generate_report(title: str) -> str:
    # Automatic trace context creation and propagation
    # publishes span_start -> calls dependency -> publishes span_end
    processor = await mesh.get_agent("data-processor")
    return await processor.process_data({"title": title})

Event Types Published:

  • span_start: Operation begins
  • span_end: Operation completes successfully
  • error: Operation fails with error details

2. Redis Streams (Transport Layer)

  • Stream Name: mesh:trace
  • Consumer Group: mcp-mesh-registry-processors

Events are published asynchronously without blocking agent operations:

# View recent trace events
redis-cli XREVRANGE mesh:trace + - COUNT 10

# Monitor stream length
redis-cli XLEN mesh:trace

3. Go Registry (Consumer & Correlator)

The registry consumes events and correlates them into complete traces:

  • Consumer: Reads from Redis Streams with automatic failover
  • Correlator: Builds complete traces from individual span events
  • Exporters: Output traces in multiple formats (console, JSON, stats)

Configuration

Environment Variables

Variable                              Default  Description
MCP_MESH_DISTRIBUTED_TRACING_ENABLED  false    Enable tracing system
TRACE_EXPORTER_TYPE                   console  Export format
TRACE_PRETTY_OUTPUT                   true     Pretty console output
TRACE_ENABLE_STATS                    true     Collect statistics
TRACE_JSON_OUTPUT_DIR                 /tmp     JSON export directory
TRACE_BATCH_SIZE                      100      Consumer batch size
TRACE_TIMEOUT                         5m       Trace completion timeout
MCP_MESH_TRACE_RETENTION              24h      Redis stream retention window (0 disables trimming)

Enable Tracing

# Enable in registry
export MCP_MESH_DISTRIBUTED_TRACING_ENABLED=true
export TRACE_EXPORTER_TYPE=console
export TRACE_PRETTY_OUTPUT=true

meshctl start --registry-only

Python agents automatically detect when tracing is enabled and begin publishing events.

Trace Data Model

TraceEvent Structure

{
  "trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "span_id": "x1y2z3w4-a5b6-c789-def0-123456789abc",
  "parent_span": "parent-span-id-if-exists",
  "agent_name": "weather-service",
  "agent_id": "weather-123",
  "ip_address": "192.168.1.100",
  "event_type": "span_start|span_end|error",
  "operation": "tool:get_weather",
  "timestamp": 1640995200.123,
  "duration_ms": 150,
  "success": true,
  "error_message": null,
  "capability": "get_weather",
  "target_agent": "data-processor",
  "runtime": "python-3.11"
}
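
On the publisher side, an event like the one above could be modeled roughly as follows. This is a minimal sketch, not the MCP Mesh implementation: the `TraceEvent` class and its `to_stream_fields` helper are hypothetical names, only a subset of the fields is shown, and the flattening to strings assumes the usual Redis Streams constraint that an entry is a flat set of field/value pairs.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TraceEvent:
    """Hypothetical in-memory mirror of the TraceEvent JSON shown above."""

    trace_id: str
    span_id: str
    agent_name: str
    event_type: str  # "span_start", "span_end" or "error"
    operation: str
    timestamp: float
    parent_span: Optional[str] = None
    duration_ms: Optional[int] = None
    success: Optional[bool] = None
    error_message: Optional[str] = None

    def to_stream_fields(self) -> dict:
        """Flatten to string fields, dropping values that are not set."""
        fields = {}
        for key, value in asdict(self).items():
            if value is None:
                continue
            # Booleans become "true"/"false"; everything else is str()-ed.
            fields[key] = json.dumps(value) if isinstance(value, bool) else str(value)
        return fields


start = TraceEvent(
    trace_id=str(uuid.uuid4()),
    span_id=str(uuid.uuid4()),
    agent_name="weather-service",
    event_type="span_start",
    operation="tool:get_weather",
    timestamp=time.time(),
)
print(start.to_stream_fields())
```

Fields that only make sense on span_end or error events (duration_ms, success, error_message) are simply omitted from a span_start entry in this sketch.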

CompletedTrace Structure

{
  "trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "spans": [
    {
      "span_id": "x1y2z3w4-a5b6-c789-def0-123456789abc",
      "agent_name": "weather-service",
      "operation": "tool:get_weather",
      "start_time": "2024-01-01T10:00:00Z",
      "end_time": "2024-01-01T10:00:00.150Z",
      "duration_ms": 150,
      "success": true
    }
  ],
  "start_time": "2024-01-01T10:00:00Z",
  "end_time": "2024-01-01T10:00:00.300Z",
  "duration": "300ms",
  "success": true,
  "span_count": 3,
  "agent_count": 2,
  "agents": ["weather-service", "data-processor"]
}
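
The trace-level summary fields can be derived from the span list. The sketch below shows one plausible way to do that. The `summarize` and `parse_ts` helpers are illustrative names, and the result reports the duration as integer milliseconds rather than the "300ms" string used in the structure above.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp that ends in 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(trace_id: str, spans: list) -> dict:
    """Derive trace-level fields from a non-empty list of completed spans."""
    start = min(parse_ts(s["start_time"]) for s in spans)
    end = max(parse_ts(s["end_time"]) for s in spans)
    agents = sorted({s["agent_name"] for s in spans})
    return {
        "trace_id": trace_id,
        # Integer division by a 1 ms timedelta avoids float rounding.
        "duration_ms": (end - start) // timedelta(milliseconds=1),
        "success": all(s["success"] for s in spans),
        "span_count": len(spans),
        "agent_count": len(agents),
        "agents": agents,
    }


example_spans = [
    {"agent_name": "weather-service", "success": True,
     "start_time": "2024-01-01T10:00:00Z", "end_time": "2024-01-01T10:00:00.150Z"},
    {"agent_name": "data-processor", "success": True,
     "start_time": "2024-01-01T10:00:00.150Z", "end_time": "2024-01-01T10:00:00.300Z"},
]
print(summarize("a1b2c3d4", example_spans))
```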

Trace Correlation Logic

1. Event Collection

Events are grouped into traces by trace_id, and the start and end events of each span are paired by span_id:

span_start[trace_id=ABC, span_id=123] + span_end[trace_id=ABC, span_id=123] = Complete Span
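
The pairing rule above can be sketched in a few lines of Python. This illustrates the idea only and is not the Go correlator: `pair_spans` is a hypothetical helper, and treating an error event as closing its span (unsuccessfully) is an assumption.

```python
def pair_spans(events):
    """Pair start and end events keyed by (trace_id, span_id).

    Returns (completed_spans, still_open), where still_open maps keys to
    span_start events that have not been closed yet.
    """
    open_spans = {}
    completed = []
    for event in events:
        key = (event["trace_id"], event["span_id"])
        if event["event_type"] == "span_start":
            open_spans[key] = event
        elif event["event_type"] in ("span_end", "error"):
            start = open_spans.pop(key, None)
            if start is None:
                continue  # end without a start: ignored in this sketch
            completed.append({
                "trace_id": key[0],
                "span_id": key[1],
                "start": start["timestamp"],
                "end": event["timestamp"],
                "success": event["event_type"] == "span_end",
            })
    return completed, open_spans


events = [
    {"trace_id": "ABC", "span_id": "123", "event_type": "span_start", "timestamp": 1.0},
    {"trace_id": "ABC", "span_id": "456", "event_type": "span_start", "timestamp": 1.1},
    {"trace_id": "ABC", "span_id": "123", "event_type": "span_end", "timestamp": 1.5},
]
done, pending = pair_spans(events)
print(len(done), "complete,", len(pending), "still open")
```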

2. Completion Detection

Traces are considered complete when:

  • All spans have both start and end events
  • No new events have arrived for 5 seconds
  • The trace contains at least one span
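
Read as a conjunction, these conditions amount to a small predicate. The sketch below assumes all three must hold at once. The `is_complete` function and the shape of the `trace` dictionary are hypothetical.

```python
import time
from typing import Optional

QUIET_PERIOD_S = 5.0  # "no new events for 5 seconds"


def is_complete(trace: dict, now: Optional[float] = None) -> bool:
    """Return True if the trace satisfies all three completion conditions."""
    now = time.time() if now is None else now
    spans = trace["spans"]
    if not spans:
        return False  # must contain at least one span
    if any(span.get("end") is None for span in spans.values()):
        return False  # every span needs both a start and an end
    return now - trace["last_event_at"] >= QUIET_PERIOD_S


trace = {
    "last_event_at": 100.0,
    "spans": {"s1": {"start": 99.0, "end": 99.5}},
}
print(is_complete(trace, now=105.0))
```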

3. Export Triggers

Traces are exported at three points:

  • Immediately: When completion is detected during event processing
  • Cleanup: Every minute, completed traces are found and exported
  • Expiry: After 5 minutes of inactivity (incomplete traces)
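
The cleanup and expiry triggers can be pictured as one periodic sweep over the active traces. This is a sketch under assumptions: `sweep`, the `complete` flag and the `last_event_at` field are hypothetical names, and the 300-second constant simply mirrors the TRACE_TIMEOUT default of 5m.

```python
TRACE_TIMEOUT_S = 300.0  # mirrors the TRACE_TIMEOUT default (5m)


def sweep(active: dict, now: float):
    """Split active traces into (to_export, remaining).

    to_export is a list of (trace_id, reason) pairs, where reason is
    "cleanup" for completed traces and "expiry" for inactive ones.
    """
    to_export, remaining = [], {}
    for trace_id, trace in active.items():
        if trace["complete"]:
            to_export.append((trace_id, "cleanup"))
        elif now - trace["last_event_at"] >= TRACE_TIMEOUT_S:
            to_export.append((trace_id, "expiry"))
        else:
            remaining[trace_id] = trace
    return to_export, remaining


active = {
    "t1": {"complete": True, "last_event_at": 990.0},
    "t2": {"complete": False, "last_event_at": 600.0},
    "t3": {"complete": False, "last_event_at": 995.0},
}
exports, still_active = sweep(active, now=1000.0)
print(exports, list(still_active))
```

In this reading, a scheduler would call the sweep once a minute, while the immediate trigger would run inline as events are processed.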

Export Formats

Console Exporter

Real-time trace visualization in terminal:

export TRACE_EXPORTER_TYPE=console
export TRACE_PRETTY_OUTPUT=true

Output Example:

🔗 TRACE a1b2c3d4 (285ms) - SUCCESS (3 spans across 2 agents)
  📍 Agent: weather-service
    ✅ tool:get_weather [get_weather] (150ms)
  📍 Agent: data-processor
    ✅ tool:process_data [process_data] (100ms)
    ✅ tool:validate_result [validate_result] (35ms)

JSON Exporter

Structured export for external systems:

export TRACE_EXPORTER_TYPE=json
export TRACE_JSON_OUTPUT_DIR=/var/log/traces

Output Files:

  • /var/log/traces/trace-{trace_id}.json
  • One file per completed trace
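
Because each trace lands in its own file, downstream tooling can pick the files up with a simple glob. The sketch below assumes each file holds a single JSON object shaped like the CompletedTrace structure. `load_traces` and `failed_trace_ids` are illustrative helpers, not part of MCP Mesh.

```python
import json
from pathlib import Path


def load_traces(directory) -> list:
    """Load every trace-*.json file in the directory, sorted by file name."""
    traces = []
    for path in sorted(Path(directory).glob("trace-*.json")):
        with path.open(encoding="utf-8") as handle:
            traces.append(json.load(handle))
    return traces


def failed_trace_ids(traces: list) -> list:
    """Return the IDs of traces whose top-level success flag is not true."""
    return [t["trace_id"] for t in traces if not t.get("success", False)]


if __name__ == "__main__":
    exported = load_traces("/var/log/traces")
    print(f"{len(exported)} traces, failed: {failed_trace_ids(exported)}")
```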

Statistics Exporter

Aggregate metrics collection:

export TRACE_EXPORTER_TYPE=multi  # Enables all exporters
export TRACE_ENABLE_STATS=true

Query API

Trace Status

GET /trace/status

Returns tracing configuration and runtime statistics:

{
  "enabled": true,
  "consumer": {
    "stream_name": "mesh:trace",
    "consumer_group": "mcp-mesh-registry-processors",
    "status": "running"
  },
  "correlator": {
    "active_traces": 5,
    "total_spans": 12,
    "oldest_trace_age": "45s"
  },
  "exporter": {
    "type": "console",
    "exported_traces": 147
  }
}

List Recent Traces

GET /trace/list?limit=20&offset=0

Returns a paginated list of completed traces, newest first.

Get Specific Trace

GET /trace/{trace_id}

Retrieve complete trace details by ID.

Search Traces

GET /trace/search?agent_name=weather&success=true&min_duration_ms=100

Search Parameters:

Parameter        Type     Description
parent_span_id   string   Filter by parent span
agent_name       string   Filter by agent name
operation        string   Filter by operation (partial match)
success          boolean  Filter by success status
start_time       RFC3339  Filter by start time (after)
end_time         RFC3339  Filter by end time (before)
min_duration_ms  integer  Minimum duration filter
max_duration_ms  integer  Maximum duration filter
limit            integer  Result limit (max 100)
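
When calling the search endpoint from a script, it helps to build the query string programmatically so that timestamps are URL-encoded and booleans are lowercase. The `build_search_url` helper below is a hypothetical sketch. It clamps `limit` to 100 on the client side, since the text above does not say whether the server clamps or rejects larger values.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # registry address used in the examples here
MAX_LIMIT = 100


def build_search_url(base_url: str = BASE_URL, **filters) -> str:
    """Build a /trace/search URL from keyword filters, skipping None values."""
    params = {}
    for name, value in filters.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    if "limit" in params:
        params["limit"] = min(int(params["limit"]), MAX_LIMIT)
    return f"{base_url}/trace/search?{urlencode(params)}"


print(build_search_url(agent_name="weather", success=True, min_duration_ms=100))
```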

Trace Statistics

GET /trace/stats

Returns aggregate statistics:

{
  "total_traces": 1250,
  "success_traces": 1189,
  "failed_traces": 61,
  "success_rate": 95.12,
  "avg_duration_ms": 234.5,
  "avg_spans_per_trace": 2.8,
  "agents_involved": ["weather", "data-processor", "report-gen"],
  "top_operations": [
    {"operation": "tool:get_weather", "count": 456},
    {"operation": "tool:process_data", "count": 389}
  ]
}

Performance Analysis Examples

Find Slow Operations

# Operations taking longer than 1 second
curl "http://localhost:8000/trace/search?min_duration_ms=1000&limit=10" | jq '.traces[] | {trace_id, duration, agents}'

Debug Failed Operations

# Get recent failures with details
curl "http://localhost:8000/trace/search?success=false&limit=5" | jq '.traces[] | {trace_id, agents, spans: [.spans[] | select(.success == false)]}'

Agent Performance Analysis

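# Average span duration (ms) for a specific agent
# (trace-level "duration" is a string such as "300ms", so average the numeric span field instead)
curl -s "http://localhost:8000/trace/search?agent_name=weather-service&limit=50" | jq '[.traces[].spans[] | select(.agent_name == "weather-service") | .duration_ms] | add / length'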

Time-based Analysis

# Get traces from last hour (GNU date; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
HOUR_AGO=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
curl "http://localhost:8000/trace/search?start_time=$HOUR_AGO&limit=100" | jq '.traces | length'
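
The `date` invocation differs between platforms, so a scripted client may prefer to compute the RFC 3339 cutoff itself. The `rfc3339_ago` helper below is an illustrative sketch that produces the same `YYYY-MM-DDTHH:MM:SSZ` form as the shell example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rfc3339_ago(hours: float, now: Optional[datetime] = None) -> str:
    """Return the UTC timestamp `hours` before `now`, formatted as RFC 3339."""
    now = now or datetime.now(timezone.utc)
    moment = (now - timedelta(hours=hours)).astimezone(timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Value for the start_time search parameter
print(rfc3339_ago(1))
```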

Advanced Features

Context Propagation

Trace context automatically flows between agents:

# Parent agent
@mesh.tool(depends_on=["child-agent"])
async def parent_operation():
    # trace_id and span_id automatically propagated
    child = await mesh.get_agent("child-agent")
    return await child.child_operation()

# Child agent
@mesh.tool()
async def child_operation():
    # Inherits trace context from parent
    # New span created with parent span ID
    pass

Error Correlation

Failed operations are automatically correlated:

@mesh.tool()
async def failing_operation():
    try:
        # operation logic
        pass
    except Exception as e:
        # Error event automatically published with trace context
        raise  # Re-raise to maintain error handling

Multi-Agent Traces

Complex workflows spanning multiple agents are automatically traced:

User Request → Agent A → Agent B → Agent C
      ↓            ↓         ↓         ↓
   trace_id    same_id   same_id   same_id
   span_1      span_2    span_3    span_4
              parent=1  parent=2  parent=3
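
Given the spans of one trace, the chain in the diagram can be rebuilt by following the parent links. The sketch below assumes each span carries `span_id` and `parent_span` (absent or null for the root), as in the TraceEvent structure. `call_chain` is a hypothetical helper, and spans whose parent is missing from the input are left out.

```python
def call_chain(spans: list) -> list:
    """Return (depth, span_id) pairs in root-first, depth-first order."""
    children = {}
    for span in spans:
        children.setdefault(span.get("parent_span"), []).append(span)

    ordered = []
    # Seed the stack with root spans; reversed() preserves input order on pop.
    stack = [(span, 0) for span in reversed(children.get(None, []))]
    while stack:
        span, depth = stack.pop()
        ordered.append((depth, span["span_id"]))
        for child in reversed(children.get(span["span_id"], [])):
            stack.append((child, depth + 1))
    return ordered


spans = [
    {"span_id": "span_3", "parent_span": "span_2"},
    {"span_id": "span_1", "parent_span": None},
    {"span_id": "span_4", "parent_span": "span_3"},
    {"span_id": "span_2", "parent_span": "span_1"},
]
for depth, span_id in call_chain(spans):
    print("  " * depth + span_id)
```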

Storage and Retention

In-Memory Storage

  • Active Traces: Stored until completion or 5-minute timeout
  • Completed Traces: Last 1000 traces kept for querying, and pruned once older than MCP_MESH_TRACE_RETENTION
  • Automatic Cleanup: Oldest 20% removed when limit exceeded
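
The batch eviction described above can be sketched as a bounded, insertion-ordered store. This is an illustration rather than the registry's code: `CompletedTraceStore` is a hypothetical class, and it interprets "oldest 20%" as 20% of the configured limit.

```python
from collections import OrderedDict


class CompletedTraceStore:
    """Keeps at most max_traces entries, evicting the oldest in batches."""

    def __init__(self, max_traces: int = 1000, evict_fraction: float = 0.2):
        self.max_traces = max_traces
        self.evict_fraction = evict_fraction
        self._traces = OrderedDict()

    def add(self, trace_id: str, trace: dict) -> None:
        self._traces[trace_id] = trace
        self._traces.move_to_end(trace_id)  # re-adding counts as newest
        if len(self._traces) > self.max_traces:
            batch = max(1, int(self.max_traces * self.evict_fraction))
            for _ in range(batch):
                self._traces.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._traces)

    def __contains__(self, trace_id: str) -> bool:
        return trace_id in self._traces


store = CompletedTraceStore(max_traces=10)
for i in range(11):
    store.add(f"trace-{i}", {"trace_id": f"trace-{i}"})
print(len(store))
```

Evicting in batches rather than one entry at a time means the cleanup runs only occasionally once the store is full.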

Redis Stream Retention

The registry automatically trims the mesh:trace stream: entries older than MCP_MESH_TRACE_RETENTION (default 24h) are removed when the registry connects to Redis and periodically while it runs. Set 0 to disable trimming entirely.

# Keep span events in Redis for 48 hours instead of the default 24
export MCP_MESH_TRACE_RETENTION=48h

The stream is a transport buffer, not a system of record. The registry consumes events and forwards them to your telemetry backend, so long-term queryable trace history is governed by Tempo's retention settings (configure those in Tempo, not here). The stream window only needs to cover how far the registry can fall behind, for example during an outage. On reconnect, entries older than the window are trimmed approximately, in batches (Redis macro-node granularity), so a few entries just past the cutoff may briefly survive. Note that the dashboard consumer group (mcp-mesh-ui-dashboard) reads the same stream, so if it lags, entries it has not yet read are trimmed once they fall outside the same window.
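
Time-based trimming works because auto-generated Redis stream IDs begin with a millisecond Unix timestamp, so a retention window translates directly into a minimum ID. The sketch below shows that arithmetic only. `parse_retention_ms` and `trim_min_id` are hypothetical helpers, only single-unit durations such as 24h or 90m are handled, and how the registry actually issues the trim is not shown here.

```python
import re
import time
from typing import Optional

_UNITS_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def parse_retention_ms(value: str) -> int:
    """Parse a simple duration such as '24h', '90m' or '0' into milliseconds."""
    value = value.strip()
    if value == "0":
        return 0
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]


def trim_min_id(retention: str, now_ms: Optional[int] = None) -> Optional[str]:
    """Return the oldest stream ID to keep, or None if trimming is disabled."""
    window_ms = parse_retention_ms(retention)
    if window_ms == 0:
        return None
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return f"{now_ms - window_ms}-0"


# The result could be passed to: XTRIM mesh:trace MINID ~ <id>  (Redis 6.2+)
print(trim_min_id("48h"))
```

The `~` in the XTRIM form shown in the comment requests approximate trimming, which matches the batch behaviour described above.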

As a last-resort safeguard for extreme cases (registry down for many hours combined with high span volume), set a Redis maxmemory limit so that the stream cannot grow without bound before the registry comes back to trim it.

Troubleshooting

No Traces Appearing

Check tracing status:

curl http://localhost:8000/trace/status | jq .enabled

Verify Redis stream:

redis-cli XLEN mesh:trace
redis-cli XINFO GROUPS mesh:trace

Check agent connectivity:

# Python agents should log tracing status on startup
# Look for: "Tracing enabled, publishing to redis://..."

Incomplete Traces

Check for orphaned events:

redis-cli XREVRANGE mesh:trace + - COUNT 20

Monitor correlator status:

curl http://localhost:8000/trace/status | jq .correlator

Performance Issues

Monitor consumer lag:

redis-cli XINFO GROUPS mesh:trace
# Look for the "lag" field in the group info (reported by Redis 7.0+)

Check memory usage:

curl http://localhost:8000/trace/status | jq .correlator
# Monitor the active_traces count (reported by /trace/status, not /trace/stats)

Integration Examples

Prometheus Metrics

#!/bin/bash
# Export trace metrics to Prometheus

STATS=$(curl -s http://localhost:8000/trace/stats)
# Quote "$STATS" so the shell does not word-split or glob the JSON
SUCCESS_RATE=$(echo "$STATS" | jq .success_rate)
AVG_DURATION=$(echo "$STATS" | jq .avg_duration_ms)

echo "mcp_mesh_trace_success_rate $SUCCESS_RATE" | curl -X POST --data-binary @- http://pushgateway:9091/metrics/job/mcp-mesh
echo "mcp_mesh_trace_avg_duration_ms $AVG_DURATION" | curl -X POST --data-binary @- http://pushgateway:9091/metrics/job/mcp-mesh
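
The same two gauges can also be rendered as a single text payload instead of two separate pushes. The sketch below only formats the text, using the metric names from the script above. `to_prometheus` is a hypothetical helper, and sending the result to a Pushgateway is left to the caller.

```python
def to_prometheus(stats: dict, prefix: str = "mcp_mesh_trace") -> str:
    """Render selected /trace/stats fields in Prometheus text format."""
    gauges = {
        "success_rate": stats["success_rate"],
        "avg_duration_ms": stats["avg_duration_ms"],
    }
    lines = []
    for name, value in gauges.items():
        lines.append(f"# TYPE {prefix}_{name} gauge")
        lines.append(f"{prefix}_{name} {value}")
    # The text format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


sample_stats = {"success_rate": 95.12, "avg_duration_ms": 234.5}
print(to_prometheus(sample_stats), end="")
```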

External APM Integration

#!/bin/bash
# Send traces to external APM (e.g., Datadog, New Relic)
# Endpoint and payload are illustrative; adapt them to your vendor's trace intake API

curl -s "http://localhost:8000/trace/list?limit=100" | \
  jq -c '.traces[]' | \
  while IFS= read -r trace; do  # -r keeps backslashes in the JSON intact
    curl -X POST "https://api.datadoghq.com/api/v1/traces" \
      -H "DD-API-KEY: $DD_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$trace"
  done

Log Correlation

#!/bin/bash
# Correlate traces with application logs

# Extract trace IDs and search logs
curl -s "http://localhost:8000/trace/search?success=false&limit=10" | \
  jq -r '.traces[].trace_id' | \
  while IFS= read -r trace_id; do
    echo "=== Logs for trace $trace_id ==="
    grep "$trace_id" /var/log/mcp-mesh/*.log
  done

Best Practices

1. Monitoring

  • Set up alerts on trace export failures
  • Monitor trace completion rates
  • Track trace duration trends
  • Alert on error rate spikes

2. Performance

  • Use the multi exporter for comprehensive observability
  • Configure appropriate Redis retention policies
  • Monitor correlator memory usage
  • Tune batch sizes for high throughput

3. Debugging

  • Use the search API for targeted investigation
  • Correlate traces with application logs
  • Monitor Redis stream health
  • Check agent trace context propagation

4. Production Deployment

  • Configure JSON export for trace persistence
  • Set up external metrics collection
  • Implement trace sampling for high-volume systems
  • Monitor registry resource usage

Performance Characteristics

  • Throughput: 10,000+ spans/second sustained
  • Latency: <1ms trace event publishing (async)
  • Memory: ~1MB per 1000 completed traces
  • Storage: Configurable retention in Redis and memory
  • Correlation: Real-time span correlation and export
  • Availability: Registry failure doesn't impact agents

Next Steps

The distributed tracing system provides comprehensive observability out of the box. Consider extending with:

  1. Custom Exporters: Implement organization-specific backends
  2. Trace Sampling: Add intelligent sampling for high-volume scenarios
  3. SLA Monitoring: Extract SLA metrics from trace data
  4. Automated Alerting: Set up proactive monitoring based on trace patterns

💡 Tip: Use the trace search API with time windows to identify performance trends and system bottlenecks

📊 Performance: Monitor trace statistics regularly to ensure optimal system performance

🔗 Integration: Export traces to your existing observability stack using JSON exporter or custom exporters