# Olympus Architecture

Olympus is structured around a small set of focused components. Each one has a single responsibility — routing, compression, provider dispatch, or UI — and communicates through well-defined interfaces. The diagrams below show how a query moves through the system from keystroke to response.

## System components

The names follow Greek mythology. Zeus is the master orchestrator; Hermes is the router; the shell is the TUI. Provider plugins sit outside the core and communicate through a common interface.

```mermaid
flowchart TD
    subgraph TUI["Shell — BubbleTea TUI"]
        Input["Keyboard input"]
        VP["Viewport\nGlamour markdown renderer"]
        Cmds["Slash commands\n/fix /review /security /govern"]
    end
    subgraph Core["Core — Zeus"]
        Zeus["Zeus\nOrchestrator"]
        Hermes["Hermes\nProvider router"]
        Compress["Compression engine\nOllama summarisation"]
        History["Conversation history\nmessage slice"]
        Stats["Argus\nToken tracker / cost metrics"]
    end
    subgraph Providers["Provider layer"]
        Ollama["Ollama\nlocal · free"]
        ClaudePro["Claude Pro\nOAuth · subscription"]
        Copilot["GitHub Copilot\nsubscription"]
        ClaudeAPI["Claude API\nper-token"]
        Plugins["Plugin providers\nGroq · Mistral · Azure"]
    end
    Input --> Zeus
    Cmds --> Zeus
    Zeus --> History
    Zeus --> Compress
    Zeus --> Hermes
    Zeus --> Stats
    Hermes --> Ollama
    Hermes --> ClaudePro
    Hermes --> Copilot
    Hermes --> ClaudeAPI
    Hermes --> Plugins
    Ollama -->|response| Zeus
    ClaudePro -->|response| Zeus
    ClaudeAPI -->|response| Zeus
    Zeus -->|token stream| VP
    style TUI fill:#1e2430,color:#D8DEE9,stroke:#4C566A
    style Core fill:#1a2535,color:#D8DEE9,stroke:#5E81AC
    style Providers fill:#1a2e20,color:#D8DEE9,stroke:#A3BE8C
    style Ollama fill:#2E3440,color:#A3BE8C,stroke:#A3BE8C
    style ClaudePro fill:#2E3440,color:#88C0D0,stroke:#88C0D0
    style Copilot fill:#2E3440,color:#81A1C1,stroke:#81A1C1
    style ClaudeAPI fill:#3a2020,color:#BF616A,stroke:#BF616A
    style Plugins fill:#2E3440,color:#EBCB8B,stroke:#EBCB8B
```

## 4-level provider waterfall

With the default routing threshold, every query enters at Ollama. If Ollama is running, it handles the request for free in under 200 ms. If it is unavailable, the query falls through to the cloud providers, subscription tiers first and pay-per-token last. This mirrors the routing strategy described in RouteLLM (LMSYS / UC Berkeley, 2024).

```mermaid
flowchart TD
    Q([User query]) --> Score
    Score["Complexity scorer\nheuristic · pure Go · 0ms"]
    Score -->|score within local threshold| O
    Score -->|score exceeds threshold| CP
    O["1 · Ollama local\nllama3 / mistral / phi3"]
    O -->|available| OR([Response · free · less than 200ms])
    O -->|unavailable| CP
    CP["2 · Claude Pro OAuth\nsubscription · no per-token cost"]
    CP -->|available| CPR([Response · subscription])
    CP -->|unavailable| GH
    GH["3 · GitHub Copilot\nsubscription · no per-token cost"]
    GH -->|available| GHR([Response · subscription])
    GH -->|unavailable| CA
    CA["4 · Claude API\npay-per-token · last resort"]
    CA -->|available| CAR([Response · cost warning shown])
    CA -->|unavailable| ERR([Error + diagnosis])
    style Score fill:#252d3a,color:#EBCB8B,stroke:#EBCB8B
    style O fill:#2E3440,color:#A3BE8C,stroke:#A3BE8C
    style OR fill:#1a2e20,color:#A3BE8C,stroke:#A3BE8C
    style CP fill:#2E3440,color:#88C0D0,stroke:#88C0D0
    style CPR fill:#1a2535,color:#88C0D0,stroke:#88C0D0
    style GH fill:#2E3440,color:#81A1C1,stroke:#81A1C1
    style GHR fill:#1a2535,color:#81A1C1,stroke:#81A1C1
    style CA fill:#3a2020,color:#BF616A,stroke:#BF616A
    style CAR fill:#3a2020,color:#EBCB8B,stroke:#EBCB8B
    style ERR fill:#3a2020,color:#BF616A,stroke:#BF616A
    style Q fill:#252d3a,color:#D8DEE9,stroke:#4C566A
```

The `local_threshold` config key controls how aggressive the local bias is. At 1.0 (default) every query goes to Ollama when available. Lower values let higher-complexity queries skip straight to cloud.
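Under simplified, assumed types (the real Olympus structs and function names differ), the scorer-plus-waterfall decision sketches out roughly as:

```go
package main

import "fmt"

// provider is an illustrative stand-in for the real Olympus provider type.
type provider struct {
	name      string
	available bool
}

// route applies the two decisions described above: the complexity score is
// compared against local_threshold to decide whether the local tier is tried
// at all, then the waterfall falls through the chain in priority order to
// the first provider that reports itself available.
func route(score, localThreshold float64, chain []provider) (string, error) {
	start := 0
	if score > localThreshold {
		start = 1 // high-complexity query: skip the local tier
	}
	for _, p := range chain[start:] {
		if p.available {
			return p.name, nil
		}
	}
	return "", fmt.Errorf("no provider available")
}

func main() {
	chain := []provider{
		{"ollama", false},    // 1 · local, free — down in this example
		{"claude-pro", true}, // 2 · subscription
		{"copilot", true},    // 3 · subscription
		{"claude-api", true}, // 4 · pay-per-token, last resort
	}
	// Default threshold 1.0: Ollama is tried first, but it is unavailable
	// here, so the query falls through to Claude Pro.
	name, _ := route(0.4, 1.0, chain)
	fmt.Println(name) // claude-pro
}
```

With the threshold lowered, a high-scoring query would start at index 1 and never touch Ollama, which is exactly the "skip straight to cloud" behaviour described above.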

## Context compression before cloud calls

Without compression, every cloud API call re-sends the entire conversation history. By turn 20, each call carries all twenty turns of history even though only the most recent few are usually relevant. Olympus intercepts this with a local Ollama summarisation step that runs before any cloud call.

This technique is described in detail in LLMLingua-2 (Microsoft Research, 2024) and in Anthropic's own long context management guide.

```mermaid
flowchart LR
    subgraph Session["Active session — 20 turns"]
        T1["Turns 1 to 16\n~6400 tokens\nverbose back-and-forth"]
        T2["Turns 17 to 20\n~1600 tokens\nrecent context"]
    end
    subgraph Compress["Local compression — Ollama — free"]
        OL["Ollama summarises\nturns 1 to 16\n~1 second · zero cost"]
        SUM["Dense summary\n~800 tokens\ndecisions · code · errors"]
    end
    subgraph Payload["Cloud API payload"]
        PAY["Summary + turns 17 to 20\n~2400 tokens\n70% smaller"]
    end
    T1 -->|compress_after_turns: 10| OL
    OL --> SUM
    T2 --> PAY
    SUM --> PAY
    PAY -->|sent to Claude or Copilot| Cloud([Cloud response])
    style Session fill:#1e2430,color:#D8DEE9,stroke:#4C566A
    style Compress fill:#1a2e20,color:#D8DEE9,stroke:#A3BE8C
    style Payload fill:#1a2535,color:#D8DEE9,stroke:#5E81AC
    style OL fill:#2E3440,color:#A3BE8C,stroke:#A3BE8C
    style SUM fill:#2E3440,color:#88C0D0,stroke:#88C0D0
    style PAY fill:#2E3440,color:#81A1C1,stroke:#81A1C1
    style Cloud fill:#1a2535,color:#88C0D0,stroke:#88C0D0
```
| Scenario | Without compression | With compression | Reduction |
| --- | --- | --- | --- |
| 10-turn session | ~4,000 tokens/call | ~1,200 tokens/call | 70% |
| 20-turn session | ~8,000 tokens/call | ~1,200 tokens/call | 85% |
| 40-turn session | ~16,000 tokens/call | ~1,400 tokens/call | 91% |

Compression is automatic. Configure the trigger point:

```yaml
routing:
  compress_after_turns: 10   # summarise history every 10 turns (default)
```
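The trigger logic amounts to: once the history reaches `compress_after_turns` messages, replace everything except the most recent turns with a single summary message. A minimal sketch, with a stand-in for the Ollama summarisation call and an assumed keep-last-4 window:

```go
package main

import "fmt"

type message struct{ role, content string }

// compressIfNeeded applies the compression trigger: short histories pass
// through untouched; longer ones have all but the last `keep` turns folded
// into one summary message. summarise stands in for the local Ollama call
// (its real signature in Olympus may differ).
func compressIfNeeded(history []message, compressAfterTurns, keep int,
	summarise func([]message) string) []message {
	if len(history) < compressAfterTurns {
		return history
	}
	old, recent := history[:len(history)-keep], history[len(history)-keep:]
	summary := message{role: "system", content: summarise(old)}
	return append([]message{summary}, recent...)
}

func main() {
	var history []message
	for i := 1; i <= 20; i++ {
		history = append(history, message{"user", fmt.Sprintf("turn %d", i)})
	}
	out := compressIfNeeded(history, 10, 4, func(old []message) string {
		return fmt.Sprintf("summary of %d turns", len(old))
	})
	fmt.Println(len(out)) // 5: one summary message + the 4 most recent turns
}
```

Because the summariser runs locally, the only tokens billed by the cloud provider are the summary plus the recent window, which is where the reductions in the table above come from.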

## End-to-end query lifecycle

From keypress to rendered response, a query passes through five stages. The critical insight is that Ollama participates in two stages — as the primary answer provider and as the compression engine for cloud calls.

```mermaid
sequenceDiagram
    actor User
    participant Shell as Shell TUI
    participant Zeus as Zeus orchestrator
    participant Hermes as Hermes router
    participant Ollama as Ollama local
    participant Cloud as Cloud provider
    User->>Shell: types query
    Shell->>Zeus: send query
    Zeus->>Zeus: append to history
    alt Ollama available
        Zeus->>Hermes: route query with history
        Hermes->>Ollama: generate messages
        Ollama-->>Shell: stream tokens
        Shell-->>User: render markdown
    else Ollama unavailable
        Zeus->>Ollama: compress history turns 1 to N-4
        Ollama-->>Zeus: summary string
        Zeus->>Hermes: route query with compressed history
        Hermes->>Cloud: generate compressed messages
        Cloud-->>Shell: stream tokens
        Shell-->>User: render markdown with cost warning
    end
```

## Provider interface and plugin system

Every provider — built-in or plugin — implements the same Go interface. This means adding a new provider requires zero changes to the core routing logic.

```mermaid
classDiagram
    class Provider {
        <<interface>>
        +Name() string
        +IsAvailable() bool
        +MaxContext() int
        +CostPerToken() float64
        +Capabilities() Capability list
        +Generate(ctx, req) Response
    }
    class OllamaProvider {
        host string
        model string
    }
    class ClaudeProProvider {
        oauthToken string
    }
    class CopilotProvider {
        githubPAT string
    }
    class ClaudeAPIProvider {
        apiKey string
    }
    class OpenAICompatProvider {
        name string
        baseURL string
        apiKey string
        model string
    }
    Provider <|.. OllamaProvider
    Provider <|.. ClaudeProProvider
    Provider <|.. CopilotProvider
    Provider <|.. ClaudeAPIProvider
    Provider <|.. OpenAICompatProvider
```

`OpenAICompatProvider` covers Groq, Mistral, Azure OpenAI, Together AI, and any OpenAI-compatible API.

Plugin providers can be added two ways:

```bash
# CLI — writes to ~/.olympus/config.yaml automatically
olympus providers add groq \
  --key gsk_... \
  --model llama-3.3-70b-versatile \
  --base-url https://api.groq.com/openai/v1
```

```yaml
# ~/.olympus/config.yaml
plugins:
  groq:
    api_key: "gsk_..."
    model: "llama-3.3-70b-versatile"
    base_url: "https://api.groq.com/openai/v1"
```

For providers requiring custom auth or streaming, implement `providers.Provider` directly and register via `init()`. See the plugin guide in the README.
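The registration pattern referred to here is the standard Go init-time registry. The names below (`Register`, the factory signature) are hypothetical sketches, not the actual Olympus API:

```go
package main

import "fmt"

// registry maps provider names to factories. In Olympus this would live in
// the providers package; it is inlined here so the sketch is self-contained.
var registry = map[string]func() string{}

// Register is called by provider packages at init time.
func Register(name string, factory func() string) {
	registry[name] = factory
}

// A plugin provider package registers itself in init(), so a blank import
// of the package is enough to make the provider known to the router — no
// changes to the core are needed.
func init() {
	Register("groq", func() string { return "groq · OpenAI-compatible" })
}

func main() {
	fmt.Println(registry["groq"]())
}
```

In a real plugin, the factory would return a value satisfying the `Provider` interface rather than a string; the registration mechanics are the same.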

## Shell rendering pipeline

The TUI is built with Bubble Tea (Elm architecture) and renders AI responses as full markdown using Glamour. Streaming and final rendering are handled differently to avoid partial-markdown artefacts.

```mermaid
flowchart LR
    subgraph Streaming["While streaming"]
        S1["Raw text tokens\narriving via channel"]
        S2["Displayed as plain text\nwith blinking cursor"]
    end
    subgraph Complete["On stream complete"]
        C1["Full response string"]
        C2["Glamour renderer\ndark style · terminal width"]
        C3["Rendered markdown\nheaders · code blocks · tables"]
    end
    S1 --> S2
    S2 -->|stream done| C1
    C1 --> C2
    C2 --> C3
    style Streaming fill:#1e2430,color:#D8DEE9,stroke:#4C566A
    style Complete fill:#1a2535,color:#D8DEE9,stroke:#5E81AC
```

The renderer is recreated on every `WindowSizeMsg` so word-wrap tracks terminal width. A 6-column margin is subtracted to prevent Glamour from filling the full terminal width.
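The two-phase split can be modelled as a small state machine. This is a dependency-free sketch: the markdown renderer is passed in as a plain function standing in for the Glamour call, which is only safe to run on complete input:

```go
package main

import "fmt"

// viewport models the two rendering phases: raw text plus a cursor while
// tokens stream in, then a single full markdown render once the stream
// closes, so partially-received markdown is never parsed.
type viewport struct {
	buf       string
	streaming bool
}

func (v *viewport) OnToken(tok string) {
	v.streaming = true
	v.buf += tok
}

func (v *viewport) OnDone() { v.streaming = false }

// View returns what the terminal shows. renderMarkdown stands in for the
// Glamour renderer used by the real shell.
func (v *viewport) View(renderMarkdown func(string) string) string {
	if v.streaming {
		return v.buf + "▌" // plain text with cursor: no partial-markdown artefacts
	}
	return renderMarkdown(v.buf)
}

func main() {
	render := func(s string) string { return "[md] " + s }
	v := &viewport{}
	v.OnToken("# Hel")
	v.OnToken("lo")
	fmt.Println(v.View(render)) // streaming: raw text + cursor
	v.OnDone()
	fmt.Println(v.View(render)) // complete: rendered once
}
```

Deferring the markdown pass until `OnDone` is what avoids the half-parsed headers and broken code fences that naive per-token rendering produces.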

## Mythology naming convention

| Name | Role | Status |
| --- | --- | --- |
| Zeus | Master orchestrator — routes, compresses, tracks costs | Live |
| Hermes | Capability-aware provider router | Live |
| Argus | Observability — token tracking, cost metrics | Live |
| Athena | Structured multi-step reasoning engine | Planned |
| Hephaestus | Code generation with diff output and auto-apply | Planned |
| Mnemosyne | Persistent cross-session memory | Planned |
| Aegis | Security guardrails and prompt sanitisation | Planned |