Response Caching
Beta
Response caching is currently in beta. The API and behavior may change.
Response caching allows you to cache responses for identical API requests. When a cached response is available, OpenRouter returns it immediately from cache with no billing (all billable usage counters are reported as 0), reducing both latency and cost.
Both streaming and non-streaming requests are eligible for caching. Only successful (200 OK) responses are cached. Error responses, rate limit responses, and partial results are never cached. Responses containing tool calls are cached normally since they are part of a successful completion. For streaming requests, the cached response is replayed through the same streaming pipeline, so the client receives the same content chunks on a cache hit. The id field, created timestamp, and X-Generation-Id response header in each chunk reflect the new cache-hit generation record, not the original.
Enabling Caching
There are two ways to enable response caching:
1. Per-Request via Headers
Add the X-OpenRouter-Cache header to enable caching for individual requests.
The first request results in a cache MISS; the response is stored and billed normally. Sending the same request again returns a cache HIT with zeroed usage and no billing. Each cache hit receives its own unique generation ID (for example, a new gen-def456 rather than the original gen-abc123).
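The flow above can be sketched in Python. The header names X-OpenRouter-Cache and X-OpenRouter-Cache-TTL are documented on this page; is_cache_hit is a hypothetical helper that relies on the zeroed usage counters described under Billing, not an official SDK function:

```python
# Per-request headers that enable caching; the TTL override is optional.
# <OPENROUTER_API_KEY> is a placeholder for your real key.
cache_headers = {
    "Authorization": "Bearer <OPENROUTER_API_KEY>",
    "Content-Type": "application/json",
    "X-OpenRouter-Cache": "true",
    "X-OpenRouter-Cache-TTL": "600",
}

def is_cache_hit(response_body):
    """Hypothetical helper: on a cache hit all billable usage counters are
    reported as 0, so fully zeroed usage indicates the response came from cache."""
    usage = response_body.get("usage") or {}
    counters = [v for v in usage.values() if isinstance(v, int)]
    return bool(counters) and all(v == 0 for v in counters)
```

Send the same body twice with these headers; the second response should report zeroed usage and satisfy is_cache_hit.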
2. Via Presets
You can enable caching for all requests that use a specific preset by configuring the cache_enabled and cache_ttl_seconds fields in the preset. When cache_enabled is set on a preset, caching is automatically applied to every request that references that preset; no X-OpenRouter-Cache header is required.
Example preset configuration:
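A minimal sketch of such a preset. The cache_enabled and cache_ttl_seconds field names are documented on this page; the name field and the overall shape of the preset object are illustrative assumptions:

```json
{
  "name": "my-cached-preset",
  "cache_enabled": true,
  "cache_ttl_seconds": 600
}
```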
How It Works
Two requests are considered identical when they share the same API key, model, endpoint type, streaming mode, and request body (including all parameters). When caching is enabled, OpenRouter generates a cache key from these inputs. If an identical request has been made before and the cached response has not expired, the cached response is returned immediately. Changing any of these inputs (including the model, endpoint, or switching between streaming and non-streaming) produces a different cache key and a cache miss.
Cache is scoped to your API key. Different API keys, even under the same account or organization, do not share cache. Rotating your API key will result in an empty cache for the new key.
Non-determinism: Cached responses are returned verbatim regardless of stochastic parameters like temperature. If you need fresh responses, use X-OpenRouter-Cache-Clear: true or a short TTL.
Cache Key Details
The cache key is derived from your API key, model, endpoint type, streaming mode, and a SHA-256 hash of the request body. Streaming and non-streaming requests are cached separately, so a stream: true request will not return a cached non-streaming response and vice versa. The request body is normalized before hashing, so extra whitespace does not affect the cache key. However, the property order of the JSON body is significant:
- Different property ordering in logically identical JSON (e.g. {"model":"x","messages":[]} vs {"messages":[],"model":"x"}) will produce different cache keys
- Omitting optional fields vs. explicitly sending defaults (e.g. temperature: 1.0) produces different keys
- Attribution headers (e.g. HTTP-Referer, X-Title) and provider-specific headers are not part of the cache key
- Multimodal requests (images, audio, video, file attachments) are eligible for caching. The full request body, including base64-encoded content, is included in the hash
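The actual key construction is internal to OpenRouter, but a sketch under the documented rules (whitespace is normalized, property order is preserved, the body is hashed with SHA-256, and the key also covers API key, model, endpoint type, and streaming mode) might look like:

```python
import hashlib
import json

def cache_key(api_key, model, endpoint, stream, body):
    """Illustrative sketch of the documented cache-key inputs; not the
    real implementation."""
    # Normalize whitespace only: re-serialize compactly without reordering
    # properties (Python dicts preserve insertion order, and order matters).
    normalized = json.dumps(json.loads(body), separators=(",", ":"))
    body_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    mode = "stream" if stream else "sync"
    return "|".join([api_key, model, endpoint, mode, body_hash])
```

With this sketch, bodies that differ only in whitespace map to the same key, while reordered properties, a different streaming mode, or a different API key each produce a distinct key.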
Precedence
Request headers and preset configuration interact as follows:
- If a preset explicitly sets cache_enabled: false, caching is disabled regardless of request headers; the header cannot override a preset opt-out
- An X-OpenRouter-Cache: false header disables caching even if the preset enables it
- An X-OpenRouter-Cache: true header enables caching when the preset does not configure caching (i.e. cache_enabled is absent), but cannot override a preset that explicitly sets cache_enabled: false (rule 1 takes precedence)
- The X-OpenRouter-Cache-TTL header overrides the preset cache_ttl_seconds (default: 300 seconds)
- If neither header nor preset is set, caching is off
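The precedence rules above can be expressed as a small resolver. This is a sketch of the documented behavior, not OpenRouter's implementation:

```python
def caching_enabled(header, preset_cache_enabled):
    """Resolve the documented header/preset precedence.
    header: the X-OpenRouter-Cache value ("true", "false", or None if absent).
    preset_cache_enabled: the preset's cache_enabled field (True, False, or
    None when the preset does not configure caching)."""
    if preset_cache_enabled is False:    # rule 1: a preset opt-out always wins
        return False
    if header == "false":                # header opt-out beats a preset opt-in
        return False
    if header == "true":                 # header opt-in when the preset is silent
        return True
    return preset_cache_enabled is True  # fall back to the preset; default off
```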
Concurrent Requests
If two identical requests arrive simultaneously before the first response is written to cache, both result in a cache MISS and are billed independently. There is no request coalescing.
Supported Endpoints
Response caching applies across the supported endpoints (the Billing section below describes the zeroed usage fields for chat completions, Responses, Embeddings, and the Anthropic Messages endpoint). Cache keys include an endpoint type discriminator, so requests to different endpoints with identical bodies will not collide.
Provider caching: Some providers offer their own prompt caching (e.g. Anthropic prompt caching, OpenAI cached context). Provider caching is separate from OpenRouter response caching and the two can be used together. OpenRouter caching operates at the request level before the call reaches the provider, while provider caching operates within the provider’s infrastructure.
Request Headers
TTL values that cannot be parsed as an integer (i.e., do not begin with digits) are ignored and fall through to the preset or default TTL. Values beginning with digits are accepted even if they contain trailing non-numeric characters (e.g., 60abc is treated as 60); decimal values are truncated (e.g., 1.5 is treated as 1). Numeric values outside the valid range are clamped to [1, 86400].
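A sketch of the TTL parsing rules just described, with the constants taken from the TTL section below (default 300 seconds, valid range 1 to 86400); the function name and signature are illustrative:

```python
import re

DEFAULT_TTL = 300
MIN_TTL, MAX_TTL = 1, 86400

def resolve_ttl(header_value, fallback=DEFAULT_TTL):
    """Mirror the documented parsing: leading digits are taken (so "60abc"
    becomes 60 and "1.5" truncates to 1), values that do not begin with
    digits fall through to the fallback, and results are clamped to
    [1, 86400]."""
    if header_value is not None:
        m = re.match(r"\d+", header_value)
        if m:
            return min(max(int(m.group()), MIN_TTL), MAX_TTL)
    return fallback
```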
Response Headers
The X-Generation-Id header is also present on every response (cached or not) and is not specific to caching. On a cache hit, the generation ID is unique to that hit; it is not reused from the original response.
TTL (Time-to-Live)
The TTL controls how long a cached response remains valid.
- Default: 300 seconds (5 minutes)
- Range: 1 second to 86400 seconds (24 hours)
You can customize the TTL per-request using the X-OpenRouter-Cache-TTL header, or set a default TTL in your preset configuration.
Cache Clearing
To force a fresh response for a specific request, send the X-OpenRouter-Cache-Clear: true header alongside X-OpenRouter-Cache: true (or with a preset that has cache_enabled: true). This deletes the existing cached entry for that cache key, makes a new request to the provider, and stores the new response. X-OpenRouter-Cache-Clear has no effect unless caching is enabled for the request. This does not clear all cached entries, only the one matching the current request.
The new cache entry uses the TTL from the current request’s X-OpenRouter-Cache-TTL header, the preset cache_ttl_seconds, or the default (300 seconds), following the standard precedence rules.
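A sketch of the headers for a clear-and-refresh request (header names as documented on this page; the TTL value is illustrative):

```python
# Headers for a clear-and-refresh request. Caching must be enabled
# (X-OpenRouter-Cache: true) for X-OpenRouter-Cache-Clear to take effect.
clear_headers = {
    "X-OpenRouter-Cache": "true",
    "X-OpenRouter-Cache-Clear": "true",
    "X-OpenRouter-Cache-TTL": "120",  # optional: TTL for the newly stored entry
}
```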
Billing
Cache hits are free. No tokens are consumed and all billable usage counters are reported as 0. For chat completions and Responses endpoints, usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens are zeroed. For the Embeddings endpoint, usage.prompt_tokens and usage.total_tokens are zeroed (completion_tokens is not present in embeddings responses). For the Anthropic Messages endpoint, usage.input_tokens and usage.output_tokens are zeroed. You are only billed for the original request that populates the cache (a cache MISS).
Cache hits do not count toward provider rate limits since the request never reaches a provider.
Limitations
- Disabled for account-level Zero Data Retention (ZDR): Response caching is not available when account-level ZDR is enforced, since caching requires temporarily storing response data. Per-request provider.zdr does not affect cache eligibility.
- Concurrent identical requests: If two identical requests arrive before the first response is cached, both result in a MISS. See Concurrent Requests.
- Cache eviction: Cached responses may be evicted before TTL expiry under memory pressure. There is no limit on the number of entries you can cache, but eviction under pressure means entries are not guaranteed to survive their full TTL.
Data Retention
Cached responses are stored in edge infrastructure, retained only for the TTL duration, and automatically evicted upon expiry. Cached data is accessible only via the API key that triggered the caching–no other key, account, or organization can retrieve it. Cached data is not used for training or shared with third parties.
Use Cases
Agent Workflows
When an agent workflow fails partway through, you can resume from the point of failure without re-running and re-paying for identical earlier requests. Enable caching at the start of the workflow and all prior steps return immediately from cache on retry.
Unit Testing
Get repeatable responses for your test suite. After the initial run populates the cache, subsequent identical requests return the same cached response every time at zero cost. For deterministic first-run results, use temperature: 0 or a fixed seed.
Repeated Identical Requests
If your application makes the same request multiple times (same model, same messages, same parameters), caching ensures only the first call hits the provider. Subsequent identical calls return immediately from cache at zero cost.
Monitoring Cache Effectiveness
Cache hit and miss status is visible in your Activity log. Each cached request appears as a separate entry with a cache indicator, and you can filter the log to show only cached or non-cached requests. Every cache hit receives its own unique generation ID, so you can track individual cached responses independently.