# Caching
CC-Relay includes a flexible caching layer that can significantly reduce latency and backend load by caching responses from LLM providers.
## Overview
The cache subsystem supports three operating modes:
| Mode | Backend | Description |
|---|---|---|
| `single` | Ristretto | High-performance local in-memory cache (default) |
| `ha` | Olric | Distributed cache for high-availability deployments |
| `disabled` | Noop | Passthrough mode with no caching |
When to use each mode:
- Single mode: Development, testing, or single-instance production deployments. Provides the lowest latency with zero network overhead.
- HA mode: Multi-instance production deployments where cache consistency across nodes is required.
- Disabled mode: Debugging, compliance requirements, or when caching is handled elsewhere.
## Architecture
```mermaid
graph TB
    subgraph "cc-relay"
        A[Proxy Handler] --> B{Cache Layer}
        B --> C[Cache Interface]
    end

    subgraph "Backends"
        C --> D[Ristretto<br/>Single Node]
        C --> E[Olric<br/>Distributed]
        C --> F[Noop<br/>Disabled]
    end

    style A fill:#6366f1,stroke:#4f46e5,color:#fff
    style B fill:#ec4899,stroke:#db2777,color:#fff
    style C fill:#f59e0b,stroke:#d97706,color:#000
    style D fill:#10b981,stroke:#059669,color:#fff
    style E fill:#8b5cf6,stroke:#7c3aed,color:#fff
    style F fill:#6b7280,stroke:#4b5563,color:#fff
```
The cache layer implements a unified `Cache` interface that abstracts over all backends:
```go
type Cache interface {
    Get(ctx context.Context, key string) ([]byte, error)
    Set(ctx context.Context, key string, value []byte) error
    SetWithTTL(ctx context.Context, key string, value []byte, ttl time.Duration) error
    Delete(ctx context.Context, key string) error
    Exists(ctx context.Context, key string) (bool, error)
    Close() error
}
```

### Cache Flow
```mermaid
sequenceDiagram
    participant Client
    participant Proxy
    participant Cache
    participant Backend

    Client->>Proxy: POST /v1/messages
    Proxy->>Cache: Get(key)

    alt Cache Hit
        Cache-->>Proxy: Cached Response
        Proxy-->>Client: Response (fast)
        Note over Client,Proxy: Latency: ~1ms
    else Cache Miss
        Cache-->>Proxy: ErrNotFound
        Proxy->>Backend: Forward Request
        Backend-->>Proxy: LLM Response
        Proxy->>Cache: SetWithTTL(key, value, ttl)
        Proxy-->>Client: Response
        Note over Client,Backend: Latency: 500ms-30s
    end
```
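In code, this is a standard cache-aside pattern over the `Cache` interface. The following is a minimal sketch; the helper name, key derivation, and TTL value are illustrative, not cc-relay's actual handler:

```go
import (
    "context"
    "errors"
    "time"

    "github.com/anthropics/cc-relay/internal/cache"
)

// getOrFetch is a hypothetical helper illustrating the cache-aside flow:
// try the cache first, fall back to the backend on a miss, then populate.
func getOrFetch(ctx context.Context, c cache.Cache, key string,
    fetch func(context.Context) ([]byte, error)) ([]byte, error) {
    data, err := c.Get(ctx, key)
    if err == nil {
        return data, nil // cache hit: the ~1ms path
    }
    if !errors.Is(err, cache.ErrNotFound) {
        return nil, err // unexpected cache error
    }
    // Cache miss: forward the request to the LLM backend (500ms-30s path).
    data, err = fetch(ctx)
    if err != nil {
        return nil, err
    }
    // Best-effort populate; a failed Set should not fail the request.
    _ = c.SetWithTTL(ctx, key, data, 5*time.Minute)
    return data, nil
}
```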
## Configuration
### Single Mode (Ristretto)
Ristretto is a high-performance, concurrent cache based on research from the Caffeine library. It uses a TinyLFU admission policy for optimal hit rates.
```yaml
cache:
  mode: single
  ristretto:
    # Number of 4-bit access counters.
    # Recommended: 10x expected max items for optimal admission policy.
    # Example: for 100,000 items, use 1,000,000 counters.
    num_counters: 1000000

    # Maximum memory for cached values (in bytes).
    # 104857600 = 100 MB
    max_cost: 104857600

    # Number of keys per Get buffer (default: 64).
    # Controls admission buffer size.
    buffer_items: 64
```

**Memory calculation:**
The `max_cost` parameter controls how much memory the cache can use for values. To estimate the appropriate size:
- Estimate average response size (typically 1-10 KB for LLM responses)
- Multiply by the number of unique requests you want to cache
- Add 20% overhead for metadata
Example: 10,000 cached responses × 5 KB average ≈ 50 MB; adding the 20% overhead gives ≈60 MB, so set `max_cost: 62914560`.
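The same arithmetic in Go, as a quick sanity check (the values are illustrative):

```go
package main

import "fmt"

func main() {
    raw := int64(50) << 20 // 50 MiB of cached values (10,000 × ~5 KB)
    maxCost := raw + raw/5 // +20% metadata overhead
    fmt.Println(maxCost)   // 62914560
}
```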
### HA Mode (Olric)
Olric provides distributed caching with automatic cluster discovery and data replication.
**Client mode** (connecting to an external cluster):
```yaml
cache:
  mode: ha
  olric:
    # Olric cluster member addresses
    addresses:
      - "olric-1:3320"
      - "olric-2:3320"
      - "olric-3:3320"
    # Distributed map name (default: "cc-relay")
    dmap_name: "cc-relay"
```

**Embedded mode** (single-node HA or development):
```yaml
cache:
  mode: ha
  olric:
    # Run an embedded Olric node
    embedded: true
    # Address to bind the embedded node
    bind_addr: "0.0.0.0:3320"
    # Peer addresses for cluster discovery (optional)
    peers:
      - "cc-relay-2:3320"
      - "cc-relay-3:3320"
    dmap_name: "cc-relay"
```

### Disabled Mode
```yaml
cache:
  mode: disabled
```

All cache operations return immediately without storing data. `Get` operations always return `ErrNotFound`.
## Cache Modes Comparison
| Feature | Single (Ristretto) | HA (Olric) | Disabled (Noop) |
|---|---|---|---|
| Backend | Local memory | Distributed | None |
| Use Case | Development, single instance | Production HA | Debugging |
| Persistence | No | Optional | N/A |
| Multi-Node | No | Yes | N/A |
| Latency | ~1 microsecond | ~1-10 ms (network) | ~0 |
| Memory | Local only | Distributed | None |
| Consistency | N/A | Eventual | N/A |
| Complexity | Low | Medium | None |
## Optional Interfaces
Some cache backends support additional capabilities via optional interfaces:
### Statistics
```go
if sp, ok := c.(cache.StatsProvider); ok {
    stats := sp.Stats()
    fmt.Printf("Hits: %d, Misses: %d\n", stats.Hits, stats.Misses)
}
```

Statistics include:

- `Hits`: number of cache hits
- `Misses`: number of cache misses
- `KeyCount`: current number of keys
- `BytesUsed`: approximate memory used
- `Evictions`: keys evicted due to capacity
### Health Check (Ping)
```go
if p, ok := c.(cache.Pinger); ok {
    if err := p.Ping(ctx); err != nil {
        // Cache is unhealthy
    }
}
```

The `Pinger` interface is primarily useful for distributed caches (Olric) to verify cluster connectivity.
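For example, a readiness endpoint could surface the result. This handler is an illustrative sketch, not part of cc-relay:

```go
import (
    "net/http"

    "github.com/anthropics/cc-relay/internal/cache"
)

// readyHandler is a hypothetical readiness probe that reports cache health.
func readyHandler(c cache.Cache) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if p, ok := c.(cache.Pinger); ok {
            if err := p.Ping(r.Context()); err != nil {
                http.Error(w, "cache unreachable", http.StatusServiceUnavailable)
                return
            }
        }
        w.WriteHeader(http.StatusOK)
    }
}
```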
### Batch Operations
```go
// Batch get
if mg, ok := c.(cache.MultiGetter); ok {
    results, err := mg.GetMulti(ctx, []string{"key1", "key2", "key3"})
    if err != nil {
        // handle error
    }
    _ = results
}

// Batch set: items holds the key/value pairs to store
if ms, ok := c.(cache.MultiSetter); ok {
    if err := ms.SetMultiWithTTL(ctx, items, 5*time.Minute); err != nil {
        // handle error
    }
}
```

## Performance Tips
### Optimizing Ristretto
- **Set `num_counters` appropriately**: Use 10x your expected max items. Too low reduces hit rate; too high wastes memory.
- **Size `max_cost` based on response sizes**: LLM responses vary widely. Monitor actual usage and adjust.
- **Use TTL wisely**: Short TTLs (1-5 min) for dynamic content, longer TTLs (1 hour+) for deterministic responses.
- **Monitor metrics**: Track the hit rate, `hit_rate = hits / (hits + misses)`, to validate cache effectiveness (see the helper below). Aim for a hit rate above 80%.
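A small helper over the statistics shown earlier computes this; it assumes the stats struct is exported as `cache.Stats` (an assumption; adjust to the actual type name):

```go
// hitRate returns the fraction of lookups served from the cache.
// Assumes the stats type is named cache.Stats (hypothetical).
func hitRate(s cache.Stats) float64 {
    total := s.Hits + s.Misses
    if total == 0 {
        return 0 // no traffic yet
    }
    return float64(s.Hits) / float64(total)
}
```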
### Optimizing Olric
- **Deploy close to cc-relay instances**: Network latency dominates distributed cache performance.
- **Use embedded mode for single-node deployments**: Avoids external dependencies while keeping an HA-ready configuration.
- **Size the cluster appropriately**: Each node should have enough memory for the full dataset (Olric replicates data).
- **Monitor cluster health**: Use the `Pinger` interface in health checks.
### General Tips
- **Cache key design**: Use deterministic keys based on request content. Include the model name, prompt hash, and relevant parameters (see the sketch below).
- **Avoid caching streaming responses**: Streaming SSE responses are not cached by default due to their incremental nature.
- **Consider cache warming**: For predictable workloads, pre-populate the cache with common queries.
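For example, a deterministic key could be derived by hashing the request fields. The field set and `v1:` prefix below are illustrative assumptions, not cc-relay's actual scheme:

```go
import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// cacheKey derives a deterministic key from request content.
// Hypothetical scheme: hash the model, parameters, and prompt.
func cacheKey(model string, temperature float64, prompt string) string {
    h := sha256.New()
    fmt.Fprintf(h, "%s|%.2f|", model, temperature)
    h.Write([]byte(prompt))
    return "v1:" + hex.EncodeToString(h.Sum(nil))
}
```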
## Troubleshooting
### Cache misses when hits are expected
- **Check key generation**: Ensure cache keys are deterministic and don't include timestamps or request IDs.
- **Verify TTL settings**: Items may have expired. Check whether the TTL is too short for your use case.
- **Monitor evictions**: High eviction counts indicate `max_cost` is too low:

  ```go
  stats := sp.Stats() // sp from the StatsProvider assertion shown earlier
  if stats.Evictions > 0 {
      // Consider increasing max_cost
  }
  ```
### Ristretto not storing items
Ristretto uses an admission policy that may reject items to maintain high hit rates. This is normal behavior:
- **New items may be rejected**: TinyLFU requires items to "prove" their value through repeated access.
- **Wait for buffer flush**: Ristretto buffers writes. Call `cache.Wait()` in tests to ensure writes are processed (see the test sketch below).
- **Check cost calculation**: Items with a cost greater than `max_cost` are never stored.
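As an illustration, a test against the ristretto library directly (using `github.com/dgraph-io/ristretto`'s public API rather than cc-relay's wrapper) might look like:

```go
import (
    "testing"

    "github.com/dgraph-io/ristretto"
)

func TestCacheSet(t *testing.T) {
    c, err := ristretto.NewCache(&ristretto.Config{
        NumCounters: 1_000_000, // 10x expected max items
        MaxCost:     100 << 20, // 100 MiB
        BufferItems: 64,
    })
    if err != nil {
        t.Fatal(err)
    }
    defer c.Close()

    c.Set("key", []byte("value"), 5) // cost = 5
    c.Wait()                         // flush buffered writes before asserting
    if _, ok := c.Get("key"); !ok {
        t.Fatal("expected key to be admitted")
    }
}
```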
### Olric cluster connectivity issues
- **Verify network connectivity**: Ensure all nodes can reach each other on port 3320 (or the configured port).
- **Check firewall rules**: Olric requires bidirectional communication between nodes.
- **Validate addresses**: In client mode, ensure at least one address in the list is reachable.
- **Monitor logs**: Enable debug logging to see cluster membership events:

  ```yaml
  logging:
    level: debug
  ```
### Memory pressure
- **Reduce `max_cost`**: Lower the cache size to reduce memory usage.
- **Use shorter TTLs**: Expire items faster to free memory.
- **Switch to Olric**: Distribute memory pressure across multiple nodes.
- **Monitor with metrics**: Track `BytesUsed` to understand actual memory consumption.
## Error Handling
The cache package defines standard errors for common conditions:
import "github.com/anthropics/cc-relay/internal/cache"
data, err := c.Get(ctx, key)
switch {
case errors.Is(err, cache.ErrNotFound):
// Cache miss - fetch from backend
case errors.Is(err, cache.ErrClosed):
// Cache was closed - recreate or fail
case err != nil:
// Other error (network, serialization, etc.)
}