
Copilot

AI-powered operational assistant that orchestrates reasoning, tool execution, and evidence gathering.

How it Works

OpsOrch Copilot is not just a chatbot. It is a reasoning engine that invokes Tools exposed by the MCP server.

User Question → Planning Loop (LLM) → Tool Calls (e.g., query_incidents, query_metrics) → Comprehensive Answer
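
Concretely, each tool invocation travels over MCP as a JSON-RPC tools/call request. The sketch below follows the standard MCP request shape; the query_metrics arguments are hypothetical, chosen only for illustration.

json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "query_metrics",
    "arguments": { "service": "checkout", "window": "15m" }
  }
}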

Reasoning Engine

The Copilot uses an iterative Plan-Act-Observe loop to solve complex operational problems.

1. Planning: The LLM analyzes your question ("Why is checkout slow?") and breaks it down into required data steps.

2. Tool Execution: It invokes read-only tools via MCP (e.g., query_metrics, query_logs) to gather evidence.

3. Synthesis & Iteration: It observes the results. If the data is insufficient (e.g., no logs found), it iterates with a new plan to expand the time window or check a different service. A minimal sketch of this loop follows below.
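
The sketch below illustrates the Plan-Act-Observe shape of the loop. The plan and invoke callables stand in for the LLM planner and the MCP tool dispatcher; they are assumptions for illustration, not the actual engine.

python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    kind: str                        # "tool" or "answer"
    text: str = ""                   # final answer text, if kind == "answer"
    tool: str = ""                   # tool name, if kind == "tool"
    arguments: dict = field(default_factory=dict)

def investigate(
    question: str,
    plan: Callable[[str, list], Step],      # hypothetical LLM planner
    invoke: Callable[[str, dict], Any],     # hypothetical MCP tool call
    max_iterations: int = 5,
) -> str:
    evidence: list[dict] = []
    for _ in range(max_iterations):
        step = plan(question, evidence)     # Plan: pick the next step
        if step.kind == "answer":
            return step.text                # enough evidence gathered
        result = invoke(step.tool, step.arguments)           # Act
        evidence.append({"tool": step.tool, "result": result})  # Observe
    return "Investigation incomplete: iteration budget exhausted."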

Capabilities

  • Contextual Analysis: Retrieve and summarize incidents with full context (metrics, logs, tickets).
  • Correlation: Connect spikes in metrics to recent deployments or log error bursts.
  • Investigation: Run multi-step investigations to find root causes across systems.
  • Runbook Discovery: Proactively suggest orchestration plans related to incidents or services.
  • Answer with Evidence: Provide citations and deep links to source data in the Console.

Deep Links & Runbook Actions

Copilot responses include structured references that power Console deep links and action cards. Runbook suggestions link directly to orchestration plans so operators can launch runs without hunting.

json
{
  "actions": [
    { "type": "orchestration_plan", "id": "db-failover", "name": "DB Failover", "reason": "Applies to the current outage." }
  ],
  "references": {
    "incidents": ["inc-404"],
    "services": ["payments-api"],
    "orchestrationPlans": ["db-failover"]
  }
}
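
A Console client can expand the references block into deep links. The sketch below is an assumption about how that might look; the /console/... URL layout is hypothetical, not the documented OpsOrch route scheme.

python
def deep_links(references: dict[str, list[str]], base_url: str) -> list[str]:
    # Hypothetical mapping from reference kinds to Console routes.
    routes = {
        "incidents": "incidents",
        "services": "services",
        "orchestrationPlans": "orchestration/plans",
    }
    return [
        f"{base_url}/console/{routes[kind]}/{item_id}"
        for kind, ids in references.items()
        for item_id in ids
    ]

# e.g. deep_links({"incidents": ["inc-404"]}, "https://opsorch.example.com")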

Safety & Resilience

Read-Only by Default

The Copilot is designed to be safe. It prioritizes "Read" operations. Any "Write" operation (e.g., restarting a pod) requires explicit user confirmation via a Human-in-the-Loop flow.
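
A minimal sketch of that gate, assuming tools are classified as read-only or write; the restart_pod name is hypothetical, echoing the pod-restart example above.

python
READ_ONLY_TOOLS = {"query_incidents", "query_metrics", "query_logs"}

def gate_tool_call(name: str, confirmed: bool = False) -> None:
    """Raise unless the call is read-only or explicitly confirmed."""
    if name in READ_ONLY_TOOLS:
        return  # read operations always pass
    if not confirmed:
        # Write operations (e.g., restart_pod) pause here until the
        # operator approves the action card in the Console.
        raise PermissionError(f"'{name}' writes state; confirmation required.")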

Resilience Patterns

The engine handles API failures gracefully with the following patterns; a combined sketch follows the list.

  • Exponential Backoff: Retries failed provider calls automatically with increasing delays.
  • Window Expansion: Automatically widens time ranges if metrics come back empty.
  • Circuit Breaking: Temporarily stops calling unhealthy providers to prevent cascading latency.
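
The sketch below combines all three patterns in one place. The query_fn signature and the specific thresholds are assumptions for illustration, not the engine's actual values.

python
import time

class ResilientClient:
    """Sketch of backoff, window expansion, and circuit breaking."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def query(self, query_fn, window_minutes: int = 15, retries: int = 3):
        # Circuit Breaking: refuse to call a provider that keeps failing.
        if self.consecutive_failures >= self.failure_threshold:
            raise RuntimeError("Circuit open: provider marked unhealthy.")
        for attempt in range(retries):
            try:
                result = query_fn(window_minutes)
            except ConnectionError:
                self.consecutive_failures += 1
                time.sleep(2 ** attempt)  # Exponential Backoff: 1s, 2s, 4s
                continue
            self.consecutive_failures = 0
            if result:
                return result
            window_minutes *= 4  # Window Expansion: widen and retry
        return None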

Deployment

To enable Copilot in your self-hosted instance, you must provide an LLM API key.

Supported Models

  • OpenAI GPT-4o / GPT-4 Turbo
  • Anthropic Claude 3.5 Sonnet
  • Google Gemini 3.0 Flash
  • AWS Bedrock (Claude / Titan)

Configuration

bash
# In your opsorch-copilot env or secrets:
LLM_PROVIDER="openai" # or "anthropic", "gemini", "bedrock"
OPENAI_API_KEY="sk-..." 

# For Gemini:
# LLM_PROVIDER="gemini"
# GEMINI_API_KEY="your-api-key"
# GEMINI_MODEL="gemini-3-flash-preview" # optional, this is the default

# Optional: Specialized Model Selection
LLM_MODEL_PLANNER="gpt-4o"
LLM_MODEL_FAST="gpt-3.5-turbo"
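
For context, the specialized-model variables above let a deployment route heavyweight planning and cheap fast-path calls to different models. The sketch below shows one way that routing could read those variables; the task names and defaults are assumptions.

python
import os

PLANNER_MODEL = os.environ.get("LLM_MODEL_PLANNER", "gpt-4o")
FAST_MODEL = os.environ.get("LLM_MODEL_FAST", "gpt-3.5-turbo")

def model_for(task: str) -> str:
    # Route planning to the stronger model; everything else
    # (summaries, tool-result parsing) can use the fast model.
    return PLANNER_MODEL if task == "planning" else FAST_MODEL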

Example Flow

A typical investigation flow looks like this:

User: "Why is the payment service returning 500s?"

Step 1: Planning
Strategy: check service health, recent logs, and active incidents.

Step 2: Tool Execution
invoke("query_incidents", {"service": "payment"})
invoke("query_logs", {"query": "service:payment status:500"})

Step 3: Synthesis
Found a correlation: the 500s started 5 minutes ago, coinciding with Deployment #123.

Copilot Answer: "The payment-service is experiencing a 15% error rate starting at 14:30 UTC. This correlates with Deployment #123, which finished 2 minutes earlier. There is an active P1 incident, INC-404."