Prompt caching control
Prompt caching control lets you instruct a supported LLM provider to store a portion of your prompt server-side, so that subsequent requests that share the same prefix can be served from the cache instead of reprocessing the tokens. This reduces both latency and cost for repetitive workloads such as multi-turn conversations, large system prompts, or fixed tool definitions.
Prompt caching vs. response caching
Prompt caching control is a provider-side feature: the provider stores the prompt prefix,
not the response. This is different from CachedPromptExecutor,
which stores complete LLM responses locally so that identical prompts skip the network call entirely.
Koog supports prompt caching control for Anthropic and Amazon Bedrock.
Anthropic
Anthropic supports two complementary approaches to prompt caching.
Automatic caching (request-level)
Set the cacheControl property on AnthropicParams and pass it to your prompt.
Anthropic will automatically place the cache breakpoint at the last cacheable block in the
request, without you having to annotate individual messages.
This is the recommended approach for multi-turn conversations.
// Enable automatic caching with the default 5-minute TTL
val params = AnthropicParams(cacheControl = AnthropicCacheControl.Default)
val prompt = prompt("assistant", params = params) {
system("You are a helpful assistant with a very long system prompt...")
user("What can you help me with?")
}
val response = client.execute(prompt, AnthropicModels.Sonnet_4)
println(response)
// Enable automatic caching with the default 5-minute TTL
AnthropicParams params = new AnthropicParams(
null, null, null, null, null, null, null, null,
null, null, null, null, null, null, null,
AnthropicCacheControl.Default.INSTANCE
);
Prompt prompt = Prompt.builder("assistant")
.system("You are a helpful assistant with a very long system prompt...")
.user("What can you help me with?")
.build()
.withParams(params);
Manual caching (block-level)
Attach a cacheControl argument to individual messages or tool definitions to place the
cache breakpoint at a specific position. Everything up to and including the annotated block
is eligible for caching.
System messages
User and assistant messages
val prompt = prompt("conversation") {
system("You are a helpful assistant.")
// Cache after a large user message (e.g. document content)
user(listOf(MessagePart.Text("Here is a long document: ...", cacheControl = AnthropicCacheControl.Default)))
assistant(listOf(MessagePart.Text("I have read the document.")))
user("Summarize it.")
}
val response = client.execute(prompt, AnthropicModels.Sonnet_4)
println(response)
Prompt prompt = Prompt.builder("conversation")
.system("You are a helpful assistant.")
// Cache after a large user message (e.g. document content)
.user(List.of(new ContentPart.Text("Here is a long document: ...")), AnthropicCacheControl.Default.INSTANCE)
.assistant("I have read the document.", AnthropicCacheControl.Default.INSTANCE)
.user("Summarize it.")
.build();
Tool definitions
When a tool list is fixed across many requests, caching the last tool definition means all tool schemas are cached together.
val searchTool = ToolDescriptor(
name = "web_search",
description = "Search the web for information.",
requiredParameters = listOf(
ToolParameterDescriptor("query", "Search query", ToolParameterType.String)
),
// Cache all tool definitions up to and including this one
cacheControl = AnthropicCacheControl.Default
)
ToolDescriptor searchTool = new ToolDescriptor(
"web_search",
"Search the web for information.",
List.of(
new ToolParameterDescriptor("query", "Search query", ToolParameterType.String.INSTANCE)
),
Collections.emptyList(),
// Cache all tool definitions up to and including this one
AnthropicCacheControl.Default.INSTANCE
);
Cache TTL options
| Option | TTL | Price multiplier |
|---|---|---|
AnthropicCacheControl.Default |
5 minutes | 1.25× base input price |
AnthropicCacheControl.OneHour |
1 hour | 2× base input price |
Cache writes are charged at a higher rate than regular input tokens, but cache reads are cheaper. See the Anthropic prompt caching docs for current pricing.
Monitoring cache usage
Anthropic reports cache statistics in the response usage. These are accessible via the raw API response and can be observed through tracing or logging features.
| Field | Meaning |
|---|---|
cacheReadInputTokens |
Tokens read from an existing cache entry |
cacheCreationInputTokens |
Tokens written to a new cache entry |
Combining automatic and block-level caching
Both modes can be used simultaneously. Block-level cacheControl markers give you fine-grained
control over breakpoint positions, while the request-level cacheControl in AnthropicParams
handles the tail of the conversation automatically.
// Block-level: pin the system prompt in the 1-hour cache tier
// Automatic: let Anthropic manage breakpoints for the conversation tail
val params = AnthropicParams(cacheControl = AnthropicCacheControl.Default)
val prompt = prompt("combined", params = params) {
system("You are a helpful assistant...", AnthropicCacheControl.OneHour)
user("Hello!")
}
// Block-level: pin the system prompt in the 1-hour cache tier
// Automatic: let Anthropic manage breakpoints for the conversation tail
AnthropicParams params = new AnthropicParams(
AnthropicCacheControl.Default.INSTANCE
);
Prompt prompt = Prompt.builder("combined")
.system("You are a helpful assistant...", AnthropicCacheControl.OneHour.INSTANCE)
.user("Hello!")
.build()
.withParams(params);
Amazon Bedrock
Amazon Bedrock uses a block-level caching model via the Converse API.
When cacheControl is set on a message or tool, Bedrock inserts a CachePoint block
immediately after the annotated element.
Note
Bedrock prompt caching is a JVM-only feature, as the Bedrock client itself is JVM-only.
System messages
User and assistant messages
val prompt = prompt("conversation") {
system("You are a helpful assistant.")
// Cache after the large context message
user("Here is the document: ...", BedrockCacheControl.FiveMinutes)
assistant(listOf(MessagePart.Text("I have read the document.")))
user("Summarize it.")
}
val response = client.execute(prompt, BedrockModels.AnthropicClaude4Sonnet)
println(response)
Prompt prompt = Prompt.builder("conversation")
.system("You are a helpful assistant.")
// Cache after the large context message
.user("Here is the document: ...", BedrockCacheControl.FiveMinutes.INSTANCE)
.assistant("I have read the document.", BedrockCacheControl.Default.INSTANCE)
.user("Summarize it.")
.build();
Tool definitions
val searchTool = ToolDescriptor(
name = "web_search",
description = "Search the web for information.",
requiredParameters = listOf(
ToolParameterDescriptor("query", "Search query", ToolParameterType.String)
),
// Cache all tool definitions up to and including this one
cacheControl = BedrockCacheControl.Default
)
ToolDescriptor searchTool = new ToolDescriptor(
"web_search",
"Search the web for information.",
List.of(
new ToolParameterDescriptor("query", "Search query", ToolParameterType.String.INSTANCE)
),
Collections.emptyList(),
// Cache all tool definitions up to and including this one
BedrockCacheControl.Default.INSTANCE
);
Cache TTL options
| Option | TTL |
|---|---|
BedrockCacheControl.Default |
Provider default (no explicit TTL sent) |
BedrockCacheControl.FiveMinutes |
5 minutes |
BedrockCacheControl.OneHour |
1 hour |
See the Amazon Bedrock prompt caching docs for supported models and pricing.
Choosing a caching strategy
| Situation | Recommended approach |
|---|---|
| Multi-turn chat with a large, fixed system prompt | Anthropic automatic caching or Bedrock block-level on system |
| Stable tool definitions reused across requests | Block-level cacheControl on the last tool definition |
| Long document passed as user context | Block-level cacheControl on the user message |
| Arbitrary multi-turn conversation (Anthropic) | Automatic caching via AnthropicParams.cacheControl |
| Need 1-hour cache retention | AnthropicCacheControl.OneHour / BedrockCacheControl.OneHour |