Description
I’d like to report a possible mismatch between the documentation and the actual behavior of prompt caching for Claude Sonnet.
The documentation states that prompt caching activates at 1024 tokens, but in practice, caching does not appear to take effect until the cached prefix reaches 2048 tokens.
Observed behavior
~1024-token prefix → no cache hit observed
≥2048-token prefix → cache creation / read tokens appear in usage metrics
This behavior has been consistently reproducible.
Could you clarify whether this behavior is intentional, model-specific, or an undocumented change?
Below is the relevant section from the developer documentation:
4096 tokens for Claude Opus 4.6, Claude Opus 4.5
1024 tokens for Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, and Claude Sonnet 3.7 (deprecated)
4096 tokens for Claude Haiku 4.5
2048 tokens for Claude Haiku 3.5 (deprecated) and Claude Haiku 3
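For quick reference, the documented minimums above can be expressed as a lookup table. The API model IDs below are my assumption, mapped from the display names in the doc excerpt:

```python
# Documented minimum cacheable prompt lengths (tokens), per the doc excerpt.
# The API model IDs here are assumptions mapped from the display names.
MIN_CACHEABLE_TOKENS = {
    "claude-opus-4-6": 4096,
    "claude-opus-4-5": 4096,
    "claude-sonnet-4-6": 1024,
    "claude-sonnet-4-5": 1024,
    "claude-opus-4-1": 1024,
    "claude-haiku-4-5": 4096,
    "claude-3-5-haiku-latest": 2048,  # deprecated
    "claude-3-haiku-20240307": 2048,  # deprecated
}

def documented_minimum(model_id: str) -> int:
    """Return the documented minimum cacheable prefix length for a model."""
    return MIN_CACHEABLE_TOKENS[model_id]

print(documented_minimum("claude-sonnet-4-6"))  # documented: 1024; observed threshold: ~2048
```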
Code to reproduce:
import anthropic

# Build a long prompt by repeating a short string; 120 repetitions puts the
# cached prefix near the documented 1024-token minimum.
llm_definition = """
Claude and the mission of Anthropic
""".strip()
llm_definition *= 120

client = anthropic.Anthropic()

token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6", messages=[{"role": "user", "content": llm_definition}]
)
print("Token Count:", token_count.input_tokens)
print("*" * 100)

# Mark the long prefix for caching via cache_control.
system_prompt = [
    {
        "type": "text",
        "text": "Based on the provided content, give a concise answer in three sentences or less.",
    },
    {"type": "text", "text": llm_definition, "cache_control": {"type": "ephemeral"}},
]
user_messages = [
    {"role": "user", "content": "Please explain what an LLM is."},
    {"role": "user", "content": "Please explain the Transformer model."},
]
messages = []
for num, user_message in enumerate(user_messages, start=1):
    messages.append(user_message)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        system=system_prompt,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content[0].text})
    print(f"\nDialogue Round {num}.{'-' * 100}")
    print(response.content[0].text)
    print(response.usage.model_dump_json())

Out:
{"cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"inference_geo":"global","input_tokens":1351,"output_tokens":103,"server_tool_use":null,"service_tier":"standard"}
{"cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"inference_geo":"global","input_tokens":1464,"output_tokens":135,"server_tool_use":null,"service_tier":"standard"}
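The usage payloads above can also be checked programmatically. A minimal sketch (plain `json`, no SDK required) that flags whether a request created or read any cache, based on the `cache_creation_input_tokens` and `cache_read_input_tokens` fields shown in the output:

```python
import json

def cache_active(usage_json: str) -> bool:
    """True if the usage record shows any cache write or cache read tokens."""
    usage = json.loads(usage_json)
    return (
        usage.get("cache_creation_input_tokens", 0) > 0
        or usage.get("cache_read_input_tokens", 0) > 0
    )

# Fields excerpted from the Dialogue Round 1 usage record above:
round_1 = '{"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"input_tokens":1351,"output_tokens":103}'
print(cache_active(round_1))  # False: no cache creation or read at a ~1024-token prefix
```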
Environment:
Tested with anthropic versions 0.80.0 and 0.83.0.