Description
I’d like to report a possible mismatch between the documentation and the actual behavior of prompt caching for Claude Sonnet.
The documentation states that prompt caching activates at 1024 tokens, but in practice, caching does not appear to take effect until the cached prefix reaches 2048 tokens.
Observed behavior
~1024-token prefix → no cache hit observed
≥2048-token prefix → cache creation / read tokens appear in usage metrics
This behavior has been consistently reproducible.
Could you clarify whether this behavior is intentional, model-specific, or an undocumented change?
Below is the relevant section from the developer documentation:
4096 tokens for Claude Opus 4.6, Claude Opus 4.5
1024 tokens for Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, and Claude Sonnet 3.7 (deprecated)
4096 tokens for Claude Haiku 4.5
2048 tokens for Claude Haiku 3.5 (deprecated) and Claude Haiku 3
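For quick reference, the documented minimums above can be expressed as a lookup table. The API model IDs below are my assumption, mapped from the display names in the doc excerpt:

```python
# Documented minimum cacheable prompt lengths (tokens), per the doc excerpt.
# The API model IDs here are assumptions mapped from the display names.
MIN_CACHEABLE_TOKENS = {
    "claude-opus-4-6": 4096,
    "claude-opus-4-5": 4096,
    "claude-sonnet-4-6": 1024,
    "claude-sonnet-4-5": 1024,
    "claude-opus-4-1": 1024,
    "claude-haiku-4-5": 4096,
    "claude-3-5-haiku-latest": 2048,  # deprecated
    "claude-3-haiku-20240307": 2048,  # deprecated
}

def documented_minimum(model_id: str) -> int:
    """Return the documented minimum cacheable prefix length for a model."""
    return MIN_CACHEABLE_TOKENS[model_id]

print(documented_minimum("claude-sonnet-4-6"))  # documented: 1024; observed threshold: ~2048
```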
Code to reproduce:
import anthropic

# Build a long prompt by repeating a short string; 120 repetitions puts the
# cached prefix near the documented 1024-token minimum.
llm_definition = """
Claude and the mission of Anthropic
""".strip()
llm_definition *= 120

client = anthropic.Anthropic()

token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6", messages=[{"role": "user", "content": llm_definition}]
)
print("Token Count:", token_count.input_tokens)
print("*" * 100)

# Mark the long prefix for caching via cache_control.
system_prompt = [
    {
        "type": "text",
        "text": "Based on the provided content, give a concise answer in three sentences or less.",
    },
    {"type": "text", "text": llm_definition, "cache_control": {"type": "ephemeral"}},
]
user_messages = [
    {"role": "user", "content": "Please explain what an LLM is."},
    {"role": "user", "content": "Please explain the Transformer model."},
]
messages = []
for num, user_message in enumerate(user_messages, start=1):
    messages.append(user_message)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0,
        system=system_prompt,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content[0].text})
    print(f"\nDialogue Round {num}.{'-' * 100}")
    print(response.content[0].text)
    print(response.usage.model_dump_json())

Out:
{"cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"inference_geo":"global","input_tokens":1351,"output_tokens":103,"server_tool_use":null,"service_tier":"standard"}
{"cache_creation":{"ephemeral_1h_input_tokens":0,"ephemeral_5m_input_tokens":0},"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"inference_geo":"global","input_tokens":1464,"output_tokens":135,"server_tool_use":null,"service_tier":"standard"}
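The usage payloads above can also be checked programmatically. A minimal sketch (plain `json`, no SDK required) that flags whether a request created or read any cache, based on the `cache_creation_input_tokens` and `cache_read_input_tokens` fields shown in the output:

```python
import json

def cache_active(usage_json: str) -> bool:
    """True if the usage record shows any cache write or cache read tokens."""
    usage = json.loads(usage_json)
    return (
        usage.get("cache_creation_input_tokens", 0) > 0
        or usage.get("cache_read_input_tokens", 0) > 0
    )

# Fields excerpted from the Dialogue Round 1 usage record above:
round_1 = '{"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"input_tokens":1351,"output_tokens":103}'
print(cache_active(round_1))  # False: no cache creation or read at a ~1024-token prefix
```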
Environment:
Tested with anthropic versions 0.80.0 and 0.83.0.