Caching

CC-Relay includes a flexible caching layer that can significantly reduce latency and backend load by caching responses from LLM providers.

Overview

The cache subsystem supports three operating modes:

Mode      Backend    Description
single    Ristretto  High-performance local in-memory cache (default)
ha        Olric      Distributed cache for high-availability deployments
disabled  Noop       Passthrough mode with no caching

When to use each mode:

  • Single mode: Development, testing, or single-instance production deployments. Provides the lowest latency with zero network overhead.
  • HA mode: Multi-instance production deployments where cache consistency across nodes is required.
  • Disabled mode: Debugging, compliance requirements, or when caching is handled elsewhere.

Architecture

  graph TB
    subgraph "cc-relay"
        A[Proxy Handler] --> B{Cache Layer}
        B --> C[Cache Interface]
    end

    subgraph "Backends"
        C --> D[Ristretto<br/>Single Node]
        C --> E[Olric<br/>Distributed]
        C --> F[Noop<br/>Disabled]
    end

    style A fill:#6366f1,stroke:#4f46e5,color:#fff
    style B fill:#ec4899,stroke:#db2777,color:#fff
    style C fill:#f59e0b,stroke:#d97706,color:#000
    style D fill:#10b981,stroke:#059669,color:#fff
    style E fill:#8b5cf6,stroke:#7c3aed,color:#fff
    style F fill:#6b7280,stroke:#4b5563,color:#fff

The cache layer implements a unified Cache interface that abstracts over all backends:

type Cache interface {
    Get(ctx context.Context, key string) ([]byte, error)
    Set(ctx context.Context, key string, value []byte) error
    SetWithTTL(ctx context.Context, key string, value []byte, ttl time.Duration) error
    Delete(ctx context.Context, key string) error
    Exists(ctx context.Context, key string) (bool, error)
    Close() error
}
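
A round trip against this interface looks the same regardless of which backend is configured. The sketch below is illustrative, not part of cc-relay; only the interface methods shown above are used:

import (
    "context"
    "time"

    "github.com/anthropics/cc-relay/internal/cache"
)

func roundTrip(ctx context.Context, c cache.Cache) error {
    if err := c.SetWithTTL(ctx, "resp:abc123", []byte(`{"id":"msg_1"}`), 5*time.Minute); err != nil {
        return err
    }
    // Note: the Ristretto backend buffers writes, so an immediate read
    // after Set may miss (see Troubleshooting below).
    data, err := c.Get(ctx, "resp:abc123")
    if err != nil {
        return err
    }
    _ = data // cached response bytes
    return c.Delete(ctx, "resp:abc123")
}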

Cache Flow

  sequenceDiagram
    participant Client
    participant Proxy
    participant Cache
    participant Backend

    Client->>Proxy: POST /v1/messages
    Proxy->>Cache: Get(key)
    alt Cache Hit
        Cache-->>Proxy: Cached Response
        Proxy-->>Client: Response (fast)
        Note over Client,Proxy: Latency: ~1ms
    else Cache Miss
        Cache-->>Proxy: ErrNotFound
        Proxy->>Backend: Forward Request
        Backend-->>Proxy: LLM Response
        Proxy->>Cache: SetWithTTL(key, value, ttl)
        Proxy-->>Client: Response
        Note over Client,Backend: Latency: 500ms-30s
    end
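
In code, the proxy's read-through logic follows the diagram directly. This is a sketch of the pattern; fetchFromBackend is a hypothetical stand-in for the provider call, not a cc-relay function:

import (
    "context"
    "errors"
    "time"

    "github.com/anthropics/cc-relay/internal/cache"
)

func serve(ctx context.Context, c cache.Cache, key string, req []byte) ([]byte, error) {
    // Fast path: cache hit (~1 ms).
    data, err := c.Get(ctx, key)
    if err == nil {
        return data, nil
    }
    if !errors.Is(err, cache.ErrNotFound) {
        return nil, err // a real cache error, not a miss
    }

    // Slow path: forward to the LLM provider (500 ms - 30 s).
    resp, err := fetchFromBackend(ctx, req)
    if err != nil {
        return nil, err
    }

    // Populate the cache for subsequent requests; a failed write is
    // non-fatal, so ignoring (or merely logging) the error is reasonable.
    _ = c.SetWithTTL(ctx, key, resp, 5*time.Minute)
    return resp, nil
}

// fetchFromBackend is a hypothetical stand-in for the provider call.
func fetchFromBackend(ctx context.Context, req []byte) ([]byte, error) {
    return nil, errors.New("not implemented")
}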

Configuration

Single Mode (Ristretto)

Ristretto is a high-performance, concurrent cache based on research from the Caffeine library. It uses a TinyLFU admission policy to achieve high hit rates.

cache:
  mode: single

  ristretto:
    # Number of 4-bit access counters
    # Recommended: 10x expected max items for optimal admission policy
    # Example: For 100,000 items, use 1,000,000 counters
    num_counters: 1000000

    # Maximum memory for cached values (in bytes)
    # 104857600 = 100 MB
    max_cost: 104857600

    # Number of keys per Get buffer (default: 64)
    # Controls admission buffer size
    buffer_items: 64

Memory calculation:

The max_cost parameter controls how much memory the cache can use for values. To estimate the appropriate size:

  1. Estimate average response size (typically 1-10 KB for LLM responses)
  2. Multiply by the number of unique requests you want to cache
  3. Add 20% overhead for metadata

Example: 10,000 cached responses x 5 KB average = 50 MB; adding 20% overhead gives 60 MB, so set max_cost: 62914560
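
The same arithmetic as a small helper (purely illustrative, not part of cc-relay):

// estimateMaxCost returns a max_cost value for the given workload,
// including the 20% metadata overhead.
func estimateMaxCost(avgResponseBytes, uniqueResponses int64) int64 {
    raw := avgResponseBytes * uniqueResponses
    return raw + raw/5 // +20% overhead
}

// estimateMaxCost(5*1024, 10000) == 61440000, in the same ballpark
// as the worked example above.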

HA Mode (Olric)

Olric provides distributed caching with automatic cluster discovery and data replication.

Client Mode (connecting to external cluster):

cache:
  mode: ha

  olric:
    # Olric cluster member addresses
    addresses:
      - "olric-1:3320"
      - "olric-2:3320"
      - "olric-3:3320"

    # Distributed map name (default: "cc-relay")
    dmap_name: "cc-relay"

Embedded Mode (single-node HA or development):

cache:
  mode: ha

  olric:
    # Run embedded Olric node
    embedded: true

    # Address to bind the embedded node
    bind_addr: "0.0.0.0:3320"

    # Peer addresses for cluster discovery (optional)
    peers:
      - "cc-relay-2:3320"
      - "cc-relay-3:3320"

    dmap_name: "cc-relay"

Disabled Mode

cache:
  mode: disabled

All cache operations return immediately without storing data. Get operations always return ErrNotFound.
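
Behaviorally, the noop backend is equivalent to this sketch (not cc-relay's actual implementation, but it satisfies the same Cache interface; imports: context, time, and the cache package):

type noopCache struct{}

func (noopCache) Get(ctx context.Context, key string) ([]byte, error) { return nil, cache.ErrNotFound }
func (noopCache) Set(ctx context.Context, key string, value []byte) error { return nil }
func (noopCache) SetWithTTL(ctx context.Context, key string, value []byte, ttl time.Duration) error {
    return nil
}
func (noopCache) Delete(ctx context.Context, key string) error { return nil }
func (noopCache) Exists(ctx context.Context, key string) (bool, error) { return false, nil }
func (noopCache) Close() error { return nil }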

Cache Modes Comparison

Feature      Single (Ristretto)            HA (Olric)          Disabled (Noop)
Backend      Local memory                  Distributed         None
Use Case     Development, single instance  Production HA       Debugging
Persistence  No                            Optional            N/A
Multi-Node   No                            Yes                 N/A
Latency      ~1 microsecond                ~1-10 ms (network)  ~0
Memory       Local only                    Distributed         None
Consistency  N/A                           Eventual            N/A
Complexity   Low                           Medium              None

Optional Interfaces

Some cache backends support additional capabilities via optional interfaces:

Statistics

if sp, ok := c.(cache.StatsProvider); ok {
    stats := sp.Stats()
    fmt.Printf("Hits: %d, Misses: %d\n", stats.Hits, stats.Misses)
}

Statistics include:

  • Hits: Number of cache hits
  • Misses: Number of cache misses
  • KeyCount: Current number of keys
  • BytesUsed: Approximate memory used
  • Evictions: Keys evicted due to capacity

Health Check (Ping)

if p, ok := c.(cache.Pinger); ok {
    if err := p.Ping(ctx); err != nil {
        // Cache is unhealthy
    }
}

The Pinger interface is primarily useful for distributed caches (Olric) to verify cluster connectivity.
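
For example, a liveness endpoint can degrade gracefully when the backend does not implement Pinger. This is a sketch; the handler shape is an assumption, not cc-relay's actual API:

import (
    "net/http"

    "github.com/anthropics/cc-relay/internal/cache"
)

func cacheHealth(c cache.Cache) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if p, ok := c.(cache.Pinger); ok {
            if err := p.Ping(r.Context()); err != nil {
                http.Error(w, "cache unreachable: "+err.Error(), http.StatusServiceUnavailable)
                return
            }
        }
        w.WriteHeader(http.StatusOK) // healthy, or backend has no Ping
    }
}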

Batch Operations

// Batch get
if mg, ok := c.(cache.MultiGetter); ok {
    results, err := mg.GetMulti(ctx, []string{"key1", "key2", "key3"})
    _, _ = results, err // handle the batched values and any error
}

// Batch set
if ms, ok := c.(cache.MultiSetter); ok {
    err := ms.SetMultiWithTTL(ctx, items, 5*time.Minute) // items: the entries to store
    _ = err
}

Performance Tips

Optimizing Ristretto

  1. Set num_counters appropriately: Use 10x your expected max items. Too low reduces hit rate; too high wastes memory.

  2. Size max_cost based on response sizes: LLM responses vary widely. Monitor actual usage and adjust.

  3. Use TTL wisely: Short TTLs (1-5 min) for dynamic content, longer TTLs (1 hour+) for deterministic responses.

  4. Monitor metrics: Track the hit rate to validate cache effectiveness (see the helper after this list):

    hit_rate = hits / (hits + misses)

    Aim for >80% hit rate for effective caching.
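
A small helper for computing the hit rate via the optional StatsProvider interface (illustrative; imports: the cc-relay cache package):

func hitRate(c cache.Cache) (float64, bool) {
    sp, ok := c.(cache.StatsProvider)
    if !ok {
        return 0, false // backend does not expose stats
    }
    stats := sp.Stats()
    total := stats.Hits + stats.Misses
    if total == 0 {
        return 0, false // no traffic yet
    }
    return float64(stats.Hits) / float64(total), true
}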

Optimizing Olric

  1. Deploy close to cc-relay instances: Network latency dominates distributed cache performance.

  2. Use embedded mode for single-node deployments: Avoids external dependencies while keeping an HA-ready configuration.

  3. Size the cluster appropriately: Each node should have enough memory for the full dataset (Olric replicates data).

  4. Monitor cluster health: Use the Pinger interface in health checks.

General Tips

  1. Cache key design: Use deterministic keys based on request content. Include the model name, prompt hash, and relevant parameters (see the key-builder sketch after this list).

  2. Avoid caching streaming responses: Streaming SSE responses are not cached by default due to their incremental nature.

  3. Consider cache warming: For predictable workloads, pre-populate the cache with common queries.
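
A deterministic key builder for tip 1 might look like this. The field set is illustrative: include whatever parameters affect the response, and nothing that varies per request:

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// cacheKey hashes the response-determining fields into a stable key.
// Timestamps and request IDs are deliberately excluded.
func cacheKey(model, prompt string, temperature float64, maxTokens int) string {
    h := sha256.New()
    fmt.Fprintf(h, "%s|%s|%g|%d", model, prompt, temperature, maxTokens)
    return "resp:" + hex.EncodeToString(h.Sum(nil))
}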

Troubleshooting

Cache misses when expected hits

  1. Check key generation: Ensure cache keys are deterministic and don’t include timestamps or request IDs.

  2. Verify TTL settings: Items may have expired. Check if TTL is too short for your use case.

  3. Monitor evictions: High eviction counts indicate max_cost is too low:

    if sp, ok := c.(cache.StatsProvider); ok {
        stats := sp.Stats()
        if stats.Evictions > 0 {
            // Consider increasing max_cost
        }
    }

Ristretto not storing items

Ristretto uses an admission policy that may reject items in order to maintain a high hit rate. This is normal behavior:

  1. New items may be rejected: TinyLFU requires items to “prove” their value through repeated access.

  2. Wait for buffer flush: Ristretto buffers writes. Call cache.Wait() in tests to ensure writes are processed (see the test sketch after this list).

  3. Check cost calculation: Items with cost > max_cost are never stored.
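
When testing against the upstream library directly, the pattern looks like this (a sketch using the dgraph-io/ristretto v0.x API; cc-relay's own wrapper may expose the flush differently):

import (
    "testing"

    "github.com/dgraph-io/ristretto"
)

func TestCacheWrite(t *testing.T) {
    c, err := ristretto.NewCache(&ristretto.Config{
        NumCounters: 1_000_000, // 10x expected max items
        MaxCost:     100 << 20, // 100 MB
        BufferItems: 64,
    })
    if err != nil {
        t.Fatal(err)
    }
    c.Set("key", []byte("value"), 5) // cost = value size in bytes
    c.Wait()                         // flush buffered writes before reading
    if _, found := c.Get("key"); !found {
        t.Fatal("expected key to be admitted into an empty cache")
    }
}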

Olric cluster connectivity issues

  1. Verify network connectivity: Ensure all nodes can reach each other on port 3320 (or configured port).

  2. Check firewall rules: Olric requires bidirectional communication between nodes.

  3. Validate addresses: In client mode, ensure at least one address in the list is reachable.

  4. Monitor logs: Enable debug logging to see cluster membership events:

    logging:
      level: debug

Memory pressure

  1. Reduce max_cost: Lower the cache size to reduce memory usage.

  2. Use shorter TTLs: Expire items faster to free memory.

  3. Switch to Olric: Distribute memory pressure across multiple nodes.

  4. Monitor with metrics: Track BytesUsed to understand actual memory consumption.

Error Handling

The cache package defines standard errors for common conditions:

import "github.com/anthropics/cc-relay/internal/cache"

data, err := c.Get(ctx, key)
switch {
case errors.Is(err, cache.ErrNotFound):
    // Cache miss - fetch from backend
case errors.Is(err, cache.ErrClosed):
    // Cache was closed - recreate or fail
case err != nil:
    // Other error (network, serialization, etc.)
}

Next Steps