Prompt caching for Claude Sonnet appears to start at 2048 tokens instead of the documented 1024 #1194

@storybite

Description

I’d like to report a possible mismatch between the documentation and the actual behavior of prompt caching for Claude Sonnet.

The documentation states that prompt caching activates at 1024 tokens, but in practice, caching does not appear to take effect until the cached prefix reaches 2048 tokens.

Observed behavior

~1024-token prefix → no cache hit observed

≥2048-token prefix → cache creation / read tokens appear in usage metrics

This behavior has been consistently reproducible.

Could you clarify whether this behavior is intentional, model-specific, or an undocumented change?

Below is the relevant section from the developer documentation:

4096 tokens for Claude Opus 4.6, Claude Opus 4.5
1024 tokens for Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, and Claude Sonnet 3.7 (deprecated)
4096 tokens for Claude Haiku 4.5
2048 tokens for Claude Haiku 3.5 (deprecated) and Claude Haiku 3
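To make the comparison concrete, the documented thresholds can be tabulated in a small lookup (a sketch only; the dictionary keys are my guesses at the API model IDs, except `claude-sonnet-4-6`, which is the ID used in the repro below):

```python
# Documented minimum cacheable prompt lengths (tokens), transcribed from the
# docs excerpt above. Keys are assumed API model IDs, not confirmed.
DOCUMENTED_MIN_CACHEABLE = {
    "claude-opus-4-6": 4096,
    "claude-opus-4-5": 4096,
    "claude-sonnet-4-6": 1024,
    "claude-sonnet-4-5": 1024,
    "claude-opus-4-1": 1024,
    "claude-haiku-4-5": 4096,
    "claude-haiku-3-5": 2048,
}

def min_cacheable_tokens(model: str) -> int:
    """Return the documented caching threshold for a model ID."""
    return DOCUMENTED_MIN_CACHEABLE[model]
```

Per this table the ~1351-token prefix in the repro should qualify for caching on `claude-sonnet-4-6`, but the observed behavior matches the 2048 threshold documented for Haiku 3.5 / Haiku 3.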

Code to reproduce:

import anthropic

llm_definition = """
Claude and the mission of Anthropic
""".strip()

llm_definition *= 120


client = anthropic.Anthropic()

token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6", messages=[{"role": "user", "content": llm_definition}]
)
print("Token Count:", token_count.input_tokens)
print("*" * 100)

system_prompt = [
    {
        "type": "text",
        "text": "Based on the provided content, give a concise answer in three sentences or less.",
    },
    {"type": "text", "text": llm_definition, "cache_control": {"type": "ephemeral"}},
]

user_messages = [
    {"role": "user", "content": "Please explain what an LLM is."},
    {"role": "user", "content": "Please explain the Transformer model."},
]

messages = []
for num, user_message in enumerate(user_messages, start=1):
    messages.append(user_message)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        system=system_prompt,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content[0].text})
    print(f"\nDialogue Round {num}.{'-' * 100}")
    print(response.content[0].text)
    print(response.usage.model_dump_json())

Out:

{"cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"inference_geo":"global","input_tokens":1351,"output_tokens":103,"server_tool_use":null,"service_tier":"standard"}
{"cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"inference_geo":"global","input_tokens":1464,"output_tokens":135,"server_tool_use":null,"service_tier":"standard"}
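For reference, a tiny helper (my own naming, not part of the SDK) can classify a usage payload from the output above; both rounds come back as a cache miss despite the `cache_control` breakpoint:

```python
import json

def cache_status(usage_json: str) -> str:
    """Classify a Messages API usage payload as a cache read, write, or miss."""
    usage = json.loads(usage_json)
    if usage.get("cache_read_input_tokens", 0):
        return "read"
    if usage.get("cache_creation_input_tokens", 0):
        return "write"
    return "miss"

# First round's usage from the output above (trimmed to the relevant fields).
round_1 = '{"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"input_tokens":1351,"output_tokens":103}'
print(cache_status(round_1))  # prints "miss"
```

With a working cache I would expect round 1 to report "write" (cache creation) and round 2 to report "read"; instead both show zero cache activity until the prefix exceeds ~2048 tokens.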

Environment:

Tested with anthropic versions 0.80.0 and 0.83.0.
