Caching

CC-Relay includes a flexible caching layer that can significantly reduce latency and backend load by caching responses from LLM providers.

Overview

The cache subsystem supports three operating modes:

Mode      Backend    Description
single    Ristretto  High-performance local in-memory cache (default)
ha        Olric      Distributed cache for high-availability deployments
disabled  Noop       Passthrough mode with no caching

When to use each mode:

  • Single mode: Development, testing, or single-instance production deployments. Provides the lowest latency with zero network overhead.
  • HA mode: Multi-instance production deployments where cache consistency across nodes is required.
  • Disabled mode: Debugging, compliance requirements, or when caching is handled elsewhere.

Architecture

  graph TB
    subgraph "cc-relay"
        A[Proxy Handler] --> B{Cache Layer}
        B --> C[Cache Interface]
    end

    subgraph "Backends"
        C --> D[Ristretto<br/>Single Node]
        C --> E[Olric<br/>Distributed]
        C --> F[Noop<br/>Disabled]
    end

    style A fill:#6366f1,stroke:#4f46e5,color:#fff
    style B fill:#ec4899,stroke:#db2777,color:#fff
    style C fill:#f59e0b,stroke:#d97706,color:#000
    style D fill:#10b981,stroke:#059669,color:#fff
    style E fill:#8b5cf6,stroke:#7c3aed,color:#fff
    style F fill:#6b7280,stroke:#4b5563,color:#fff

The cache layer implements a unified Cache interface that abstracts over all backends:

type Cache interface {
    Get(ctx context.Context, key string) ([]byte, error)
    Set(ctx context.Context, key string, value []byte) error
    SetWithTTL(ctx context.Context, key string, value []byte, ttl time.Duration) error
    Delete(ctx context.Context, key string) error
    Exists(ctx context.Context, key string) (bool, error)
    Close() error
}
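
A round trip against this interface looks the same regardless of which backend is configured. The sketch below is illustrative, not part of cc-relay; only the interface methods shown above are used:

import (
    "context"
    "time"

    "github.com/anthropics/cc-relay/internal/cache"
)

func roundTrip(ctx context.Context, c cache.Cache) error {
    if err := c.SetWithTTL(ctx, "resp:abc123", []byte(`{"id":"msg_1"}`), 5*time.Minute); err != nil {
        return err
    }
    // Note: the Ristretto backend buffers writes, so an immediate read
    // after Set may miss (see Troubleshooting below).
    data, err := c.Get(ctx, "resp:abc123")
    if err != nil {
        return err
    }
    _ = data // cached response bytes
    return c.Delete(ctx, "resp:abc123")
}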

Cache Flow

  sequenceDiagram
    participant Client
    participant Proxy
    participant Cache
    participant Backend

    Client->>Proxy: POST /v1/messages
    Proxy->>Cache: Get(key)
    alt Cache Hit
        Cache-->>Proxy: Cached Response
        Proxy-->>Client: Response (fast)
        Note over Client,Proxy: Latency: ~1ms
    else Cache Miss
        Cache-->>Proxy: ErrNotFound
        Proxy->>Backend: Forward Request
        Backend-->>Proxy: LLM Response
        Proxy->>Cache: SetWithTTL(key, value, ttl)
        Proxy-->>Client: Response
        Note over Client,Backend: Latency: 500ms-30s
    end
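
In code, the proxy's read-through logic follows the diagram directly. This is a sketch of the pattern; fetchFromBackend is a hypothetical stand-in for the provider call, not a cc-relay function:

import (
    "context"
    "errors"
    "time"

    "github.com/anthropics/cc-relay/internal/cache"
)

func serve(ctx context.Context, c cache.Cache, key string, req []byte) ([]byte, error) {
    // Fast path: cache hit (~1 ms).
    data, err := c.Get(ctx, key)
    if err == nil {
        return data, nil
    }
    if !errors.Is(err, cache.ErrNotFound) {
        return nil, err // a real cache error, not a miss
    }

    // Slow path: forward to the LLM provider (500 ms - 30 s).
    resp, err := fetchFromBackend(ctx, req)
    if err != nil {
        return nil, err
    }

    // Populate the cache for subsequent requests; a failed write is
    // non-fatal, so ignoring (or merely logging) the error is reasonable.
    _ = c.SetWithTTL(ctx, key, resp, 5*time.Minute)
    return resp, nil
}

// fetchFromBackend is a hypothetical stand-in for the provider call.
func fetchFromBackend(ctx context.Context, req []byte) ([]byte, error) {
    return nil, errors.New("not implemented")
}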

Configuration

Single Mode (Ristretto)

Ristretto is a high-performance, concurrent cache based on research from the Caffeine library. It uses a TinyLFU admission policy to achieve high hit rates.

cache:
  mode: single

  ristretto:
    # Number of 4-bit access counters
    # Recommended: 10x expected max items for optimal admission policy
    # Example: For 100,000 items, use 1,000,000 counters
    num_counters: 1000000

    # Maximum memory for cached values (in bytes)
    # 104857600 = 100 MB
    max_cost: 104857600

    # Number of keys per Get buffer (default: 64)
    # Controls admission buffer size
    buffer_items: 64

Memory calculation:

The max_cost parameter controls how much memory the cache can use for values. To estimate the appropriate size:

  1. Estimate average response size (typically 1-10 KB for LLM responses)
  2. Multiply by the number of unique requests you want to cache
  3. Add 20% overhead for metadata

Example: 10,000 cached responses x 5 KB average = 50 MB; adding 20% overhead gives 60 MB, so set max_cost: 62914560
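
The same arithmetic as a small helper (purely illustrative, not part of cc-relay):

// estimateMaxCost returns a max_cost value for the given workload,
// including the 20% metadata overhead.
func estimateMaxCost(avgResponseBytes, uniqueResponses int64) int64 {
    raw := avgResponseBytes * uniqueResponses
    return raw + raw/5 // +20% overhead
}

// estimateMaxCost(5*1024, 10000) == 61440000, in the same ballpark
// as the worked example above.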

HA Mode (Olric)

Olric provides distributed caching with automatic cluster discovery and data replication.

Client Mode (connecting to external cluster):

cache:
  mode: ha

  olric:
    # Olric cluster member addresses
    addresses:
      - "olric-1:3320"
      - "olric-2:3320"
      - "olric-3:3320"

    # Distributed map name (default: "cc-relay")
    dmap_name: "cc-relay"

Embedded Mode (single-node HA or development):

cache:
  mode: ha

  olric:
    # Run embedded Olric node
    embedded: true

    # Address to bind the embedded node
    bind_addr: "0.0.0.0:3320"

    # Peer addresses for cluster discovery (optional)
    peers:
      - "cc-relay-2:3320"
      - "cc-relay-3:3320"

    dmap_name: "cc-relay"

Disabled Mode

cache:
  mode: disabled

All cache operations return immediately without storing data. Get operations always return ErrNotFound.
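
Behaviorally, the noop backend is equivalent to this sketch (not cc-relay's actual implementation, but it satisfies the same Cache interface; imports: context, time, and the cache package):

type noopCache struct{}

func (noopCache) Get(ctx context.Context, key string) ([]byte, error) { return nil, cache.ErrNotFound }
func (noopCache) Set(ctx context.Context, key string, value []byte) error { return nil }
func (noopCache) SetWithTTL(ctx context.Context, key string, value []byte, ttl time.Duration) error {
    return nil
}
func (noopCache) Delete(ctx context.Context, key string) error { return nil }
func (noopCache) Exists(ctx context.Context, key string) (bool, error) { return false, nil }
func (noopCache) Close() error { return nil }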

Cache Modes Comparison

Feature      Single (Ristretto)            HA (Olric)          Disabled (Noop)
Backend      Local memory                  Distributed         None
Use Case     Development, single instance  Production HA       Debugging
Persistence  No                            Optional            N/A
Multi-Node   No                            Yes                 N/A
Latency      ~1 microsecond                ~1-10 ms (network)  ~0
Memory       Local only                    Distributed         None
Consistency  N/A                           Eventual            N/A
Complexity   Low                           Medium              None

Optional Interfaces

Some cache backends support additional capabilities via optional interfaces:

Statistics

if sp, ok := c.(cache.StatsProvider); ok {
    stats := sp.Stats()
    fmt.Printf("Hits: %d, Misses: %d\n", stats.Hits, stats.Misses)
}

Statistics include:

  • Hits: Number of cache hits
  • Misses: Number of cache misses
  • KeyCount: Current number of keys
  • BytesUsed: Approximate memory used
  • Evictions: Keys evicted due to capacity

Health Check (Ping)

if p, ok := c.(cache.Pinger); ok {
    if err := p.Ping(ctx); err != nil {
        // Cache is unhealthy
    }
}

The Pinger interface is primarily useful for distributed caches (Olric) to verify cluster connectivity.
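
For example, a liveness endpoint can degrade gracefully when the backend does not implement Pinger. This is a sketch; the handler shape is an assumption, not cc-relay's actual API:

import (
    "net/http"

    "github.com/anthropics/cc-relay/internal/cache"
)

func cacheHealth(c cache.Cache) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if p, ok := c.(cache.Pinger); ok {
            if err := p.Ping(r.Context()); err != nil {
                http.Error(w, "cache unreachable: "+err.Error(), http.StatusServiceUnavailable)
                return
            }
        }
        w.WriteHeader(http.StatusOK) // healthy, or backend has no Ping
    }
}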

Batch Operations

// Batch get
if mg, ok := c.(cache.MultiGetter); ok {
    results, err := mg.GetMulti(ctx, []string{"key1", "key2", "key3"})
    _, _ = results, err // handle the batched values and any error
}

// Batch set
if ms, ok := c.(cache.MultiSetter); ok {
    err := ms.SetMultiWithTTL(ctx, items, 5*time.Minute) // items: the entries to store
    _ = err
}

Performance Tips

Optimizing Ristretto

  1. Set num_counters appropriately: Use 10x your expected max items. Too low reduces hit rate; too high wastes memory.

  2. Size max_cost based on response sizes: LLM responses vary widely. Monitor actual usage and adjust.

  3. Use TTL wisely: Short TTLs (1-5 min) for dynamic content, longer TTLs (1 hour+) for deterministic responses.

  4. Monitor metrics: Track the hit rate to validate cache effectiveness (see the helper after this list):

    hit_rate = hits / (hits + misses)

    Aim for >80% hit rate for effective caching.
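
A small helper for computing the hit rate via the optional StatsProvider interface (illustrative; imports: the cc-relay cache package):

func hitRate(c cache.Cache) (float64, bool) {
    sp, ok := c.(cache.StatsProvider)
    if !ok {
        return 0, false // backend does not expose stats
    }
    stats := sp.Stats()
    total := stats.Hits + stats.Misses
    if total == 0 {
        return 0, false // no traffic yet
    }
    return float64(stats.Hits) / float64(total), true
}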

Optimizing Olric

  1. Deploy close to cc-relay instances: Network latency dominates distributed cache performance.

  2. Use embedded mode for single-node deployments: Avoids external dependencies while keeping an HA-ready configuration.

  3. Size the cluster appropriately: Each node should have enough memory for the full dataset (Olric replicates data).

  4. Monitor cluster health: Use the Pinger interface in health checks.

General Tips

  1. Cache key design: Use deterministic keys based on request content. Include the model name, prompt hash, and relevant parameters (see the key-builder sketch after this list).

  2. Avoid caching streaming responses: Streaming SSE responses are not cached by default due to their incremental nature.

  3. Consider cache warming: For predictable workloads, pre-populate the cache with common queries.
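
A deterministic key builder for tip 1 might look like this. The field set is illustrative: include whatever parameters affect the response, and nothing that varies per request:

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// cacheKey hashes the response-determining fields into a stable key.
// Timestamps and request IDs are deliberately excluded.
func cacheKey(model, prompt string, temperature float64, maxTokens int) string {
    h := sha256.New()
    fmt.Fprintf(h, "%s|%s|%g|%d", model, prompt, temperature, maxTokens)
    return "resp:" + hex.EncodeToString(h.Sum(nil))
}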

Troubleshooting

Cache misses when expected hits

  1. Check key generation: Ensure cache keys are deterministic and don’t include timestamps or request IDs.

  2. Verify TTL settings: Items may have expired. Check if TTL is too short for your use case.

  3. Monitor evictions: High eviction counts indicate max_cost is too low:

    if sp, ok := c.(cache.StatsProvider); ok {
        stats := sp.Stats()
        if stats.Evictions > 0 {
            // Consider increasing max_cost
        }
    }

Ristretto not storing items

Ristretto uses an admission policy that may reject items in order to maintain a high hit rate. This is normal behavior:

  1. New items may be rejected: TinyLFU requires items to “prove” their value through repeated access.

  2. Wait for buffer flush: Ristretto buffers writes. Call cache.Wait() in tests to ensure writes are processed (see the test sketch after this list).

  3. Check cost calculation: Items with cost > max_cost are never stored.
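
When testing against the upstream library directly, the pattern looks like this (a sketch using the dgraph-io/ristretto v0.x API; cc-relay's own wrapper may expose the flush differently):

import (
    "testing"

    "github.com/dgraph-io/ristretto"
)

func TestCacheWrite(t *testing.T) {
    c, err := ristretto.NewCache(&ristretto.Config{
        NumCounters: 1_000_000, // 10x expected max items
        MaxCost:     100 << 20, // 100 MB
        BufferItems: 64,
    })
    if err != nil {
        t.Fatal(err)
    }
    c.Set("key", []byte("value"), 5) // cost = value size in bytes
    c.Wait()                         // flush buffered writes before reading
    if _, found := c.Get("key"); !found {
        t.Fatal("expected key to be admitted into an empty cache")
    }
}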

Olric cluster connectivity issues

  1. Verify network connectivity: Ensure all nodes can reach each other on port 3320 (or configured port).

  2. Check firewall rules: Olric requires bidirectional communication between nodes.

  3. Validate addresses: In client mode, ensure at least one address in the list is reachable.

  4. Monitor logs: Enable debug logging to see cluster membership events:

    logging:
      level: debug

Memory pressure

  1. Reduce max_cost: Lower the cache size to reduce memory usage.

  2. Use shorter TTLs: Expire items faster to free memory.

  3. Switch to Olric: Distribute memory pressure across multiple nodes.

  4. Monitor with metrics: Track BytesUsed to understand actual memory consumption.

Error Handling

The cache package defines standard errors for common conditions:

import "github.com/anthropics/cc-relay/internal/cache"

data, err := c.Get(ctx, key)
switch {
case errors.Is(err, cache.ErrNotFound):
    // Cache miss - fetch from backend
case errors.Is(err, cache.ErrClosed):
    // Cache was closed - recreate or fail
case err != nil:
    // Other error (network, serialization, etc.)
}

Next Steps