
Copilot

AI-powered operational assistant that orchestrates reasoning, tool execution, and evidence gathering.

How it Works

OpsOrch Copilot is not just a chatbot. It is a reasoning engine that invokes Tools exposed by the MCP server.

User Question → Planning Loop (LLM) → Tool Calls (e.g., query_incidents, query_metrics) → Comprehensive Answer
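
Concretely, each tool invocation travels over MCP as a JSON-RPC tools/call request. The sketch below follows the standard MCP request shape; the query_metrics arguments are hypothetical, chosen only for illustration.

json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "query_metrics",
    "arguments": { "service": "checkout", "window": "15m" }
  }
}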

Reasoning Engine

The Copilot uses an iterative Plan-Act-Observe loop to solve complex operational problems.

1. Planning: The LLM analyzes your question ("Why is checkout slow?") and breaks it down into required data steps.

2. Tool Execution: It invokes read-only tools via MCP (e.g., query_metrics, query_logs) to gather evidence.

3. Synthesis & Iteration: It observes the results. If the data is insufficient (e.g., no logs found), it iterates with a new plan to expand the time window or check a different service. A minimal sketch of this loop follows below.
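
The sketch below illustrates the Plan-Act-Observe shape of the loop. The plan and invoke callables stand in for the LLM planner and the MCP tool dispatcher; they are assumptions for illustration, not the actual engine.

python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    kind: str                        # "tool" or "answer"
    text: str = ""                   # final answer text, if kind == "answer"
    tool: str = ""                   # tool name, if kind == "tool"
    arguments: dict = field(default_factory=dict)

def investigate(
    question: str,
    plan: Callable[[str, list], Step],      # hypothetical LLM planner
    invoke: Callable[[str, dict], Any],     # hypothetical MCP tool call
    max_iterations: int = 5,
) -> str:
    evidence: list[dict] = []
    for _ in range(max_iterations):
        step = plan(question, evidence)     # Plan: pick the next step
        if step.kind == "answer":
            return step.text                # enough evidence gathered
        result = invoke(step.tool, step.arguments)           # Act
        evidence.append({"tool": step.tool, "result": result})  # Observe
    return "Investigation incomplete: iteration budget exhausted."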

Capabilities

  • Contextual Analysis: Retrieve and summarize incidents with full context (metrics, logs, tickets).
  • Correlation: Connect spikes in metrics to recent deployments or log error bursts.
  • Investigation: Run multi-step investigations to find root causes across systems.
  • Runbook Discovery: Proactively suggest orchestration plans related to incidents or services.
  • Answer with Evidence: Provide citations and deep links to source data in the Console.

Deep Links & Runbook Actions

Copilot responses include structured references that power Console deep links and action cards. Runbook suggestions link directly to orchestration plans so operators can launch runs without hunting.

json
{
  "actions": [
    { "type": "orchestration_plan", "id": "db-failover", "name": "DB Failover", "reason": "Applies to the current outage." }
  ],
  "references": {
    "incidents": ["inc-404"],
    "services": ["payments-api"],
    "orchestrationPlans": ["db-failover"]
  }
}
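
A Console client can expand the references block into deep links. The sketch below is an assumption about how that might look; the /console/... URL layout is hypothetical, not the documented OpsOrch route scheme.

python
def deep_links(references: dict[str, list[str]], base_url: str) -> list[str]:
    # Hypothetical mapping from reference kinds to Console routes.
    routes = {
        "incidents": "incidents",
        "services": "services",
        "orchestrationPlans": "orchestration/plans",
    }
    return [
        f"{base_url}/console/{routes[kind]}/{item_id}"
        for kind, ids in references.items()
        for item_id in ids
    ]

# e.g. deep_links({"incidents": ["inc-404"]}, "https://opsorch.example.com")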

Safety & Resilience

Read-Only by Default

The Copilot is designed to be safe. It prioritizes "Read" operations. Any "Write" operation (e.g., restarting a pod) requires explicit user confirmation via a Human-in-the-Loop flow.
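
A minimal sketch of that gate, assuming tools are classified as read-only or write; the restart_pod name is hypothetical, echoing the pod-restart example above.

python
READ_ONLY_TOOLS = {"query_incidents", "query_metrics", "query_logs"}

def gate_tool_call(name: str, confirmed: bool = False) -> None:
    """Raise unless the call is read-only or explicitly confirmed."""
    if name in READ_ONLY_TOOLS:
        return  # read operations always pass
    if not confirmed:
        # Write operations (e.g., restart_pod) pause here until the
        # operator approves the action card in the Console.
        raise PermissionError(f"'{name}' writes state; confirmation required.")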

Resilience Patterns

The engine handles API failures gracefully with the following patterns; a combined sketch follows the list.

  • Exponential Backoff: Retries failed provider calls automatically with increasing delays.
  • Window Expansion: Automatically widens time ranges if metrics come back empty.
  • Circuit Breaking: Temporarily stops calling unhealthy providers to prevent cascading latency.
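
The sketch below combines all three patterns in one place. The query_fn signature and the specific thresholds are assumptions for illustration, not the engine's actual values.

python
import time

class ResilientClient:
    """Sketch of backoff, window expansion, and circuit breaking."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def query(self, query_fn, window_minutes: int = 15, retries: int = 3):
        # Circuit Breaking: refuse to call a provider that keeps failing.
        if self.consecutive_failures >= self.failure_threshold:
            raise RuntimeError("Circuit open: provider marked unhealthy.")
        for attempt in range(retries):
            try:
                result = query_fn(window_minutes)
            except ConnectionError:
                self.consecutive_failures += 1
                time.sleep(2 ** attempt)  # Exponential Backoff: 1s, 2s, 4s
                continue
            self.consecutive_failures = 0
            if result:
                return result
            window_minutes *= 4  # Window Expansion: widen and retry
        return None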

Deployment

To enable Copilot in your self-hosted instance, you must provide an LLM API key.

Supported Models

  • OpenAI GPT-4o / GPT-4 Turbo
  • Anthropic Claude 3.5 Sonnet
  • Google Gemini 3.0 Flash
  • AWS Bedrock (Claude / Titan)

Configuration

bash
# In your opsorch-copilot env or secrets:
LLM_PROVIDER="openai" # or "anthropic", "gemini", "bedrock"
OPENAI_API_KEY="sk-..." 

# For Gemini:
# LLM_PROVIDER="gemini"
# GEMINI_API_KEY="your-api-key"
# GEMINI_MODEL="gemini-3-flash-preview" # optional, this is the default

# Optional: Specialized Model Selection
LLM_MODEL_PLANNER="gpt-4o"
LLM_MODEL_FAST="gpt-3.5-turbo"
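
For context, the specialized-model variables above let a deployment route heavyweight planning and cheap fast-path calls to different models. The sketch below shows one way that routing could read those variables; the task names and defaults are assumptions.

python
import os

PLANNER_MODEL = os.environ.get("LLM_MODEL_PLANNER", "gpt-4o")
FAST_MODEL = os.environ.get("LLM_MODEL_FAST", "gpt-3.5-turbo")

def model_for(task: str) -> str:
    # Route planning to the stronger model; everything else
    # (summaries, tool-result parsing) can use the fast model.
    return PLANNER_MODEL if task == "planning" else FAST_MODEL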

Example Flow

A typical investigation flow looks like this:

User: "Why is the payment service returning 500s?"

Step 1: Planning
Strategy: check service health, recent logs, and active incidents.

Step 2: Tool Execution
invoke("query_incidents", {"service": "payment"})
invoke("query_logs", {"query": "service:payment status:500"})

Step 3: Synthesis
Found a correlation: the 500s started 5 minutes ago, coinciding with Deployment #123.

Copilot Answer: "The payment-service is experiencing a 15% error rate starting at 14:30 UTC. This correlates with Deployment #123, which finished 2 minutes earlier. There is an active P1 incident, INC-404."