LLM response caching
For repeated requests that you run with a prompt executor,
you can cache LLM responses to optimize performance and reduce costs.
In Koog, caching is available for all prompt executors through CachedPromptExecutor,
which is a wrapper around PromptExecutor that adds caching functionality.
It lets you store responses from previously executed prompts and retrieve them when the same prompts are run again.
To create a cached prompt executor, perform the following:
- Create a prompt executor for which you want to cache responses.
- Create a CachedPromptExecutor instance, providing the desired cache and the prompt executor you created.
- Run the created CachedPromptExecutor with the desired prompt and model.
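The example below assumes that the prompt to run has already been built and that the code executes inside a coroutine, since prompt execution is a suspending call. A minimal sketch of such a prompt, assuming Koog's prompt DSL (the prompt ID and message text are placeholders):

// Build the prompt to execute (placeholder ID and message)
val prompt = prompt("cached-prompt-example") {
    user("Hello")
}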
Here is an example:
// Create a prompt executor
val client = OpenAILLMClient(System.getenv("OPENAI_API_KEY"))
val promptExecutor = SingleLLMPromptExecutor(client)

// Create a cached prompt executor
val cachedExecutor = CachedPromptExecutor(
    cache = FilePromptCache(Path("path/to/your/cache/directory")),
    nested = promptExecutor
)

// Run the cached prompt executor for the first time.
// This performs an actual LLM request.
val firstTime = measureTimeMillis {
    val firstResponse = cachedExecutor.execute(prompt, OpenAIModels.Chat.GPT4o)
    println("First response: ${firstResponse.first().content}")
}
println("First execution took: ${firstTime}ms")

// Run the cached prompt executor for the second time.
// This returns the result immediately from the cache.
val secondTime = measureTimeMillis {
    val secondResponse = cachedExecutor.execute(prompt, OpenAIModels.Chat.GPT4o)
    println("Second response: ${secondResponse.first().content}")
}
println("Second execution took: ${secondTime}ms")
The example produces the following output:
First response: Hello! It seems like we're starting a new conversation. What can I help you with today?
First execution took: 48ms
Second response: Hello! It seems like we're starting a new conversation. What can I help you with today?
Second execution took: 1ms
Note
- If you call executeStreaming() with the cached prompt executor, it produces the response as a single chunk (see the sketch after this list).
- If you call moderate() with the cached prompt executor, it forwards the request to the nested prompt executor and does not use the cache.
- Caching of multiple choice responses (executeMultipleChoice()) is not supported.
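For example, streaming through the cached executor is consumed the same way as with any other executor, but a cached response arrives as one chunk. A minimal sketch, assuming executeStreaming() returns a flow of string chunks and that the code runs inside a coroutine:

// Stream through the cached executor; a cached response is emitted as a single chunk
cachedExecutor.executeStreaming(prompt, OpenAIModels.Chat.GPT4o).collect { chunk ->
    println("Received chunk: $chunk")
}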