
feat: slow plan cache#2611

Open
SkArchon wants to merge 56 commits into main from
milinda/eng-9010-router-investigate-customizing-our-planner-cache-to-consider

Conversation

@SkArchon
Contributor

@SkArchon SkArchon commented Mar 9, 2026

This PR adds an expensive query cache. We do not use ristretto for this cache; instead we use a simple custom cache, because we do not want the frequency-based eviction rules that ristretto applies. When this cache is full, a query is admitted only if it is more expensive than the cheapest ("fastest min") query currently in the cache, which is then evicted.


This feature is enabled by default when the in-memory fallback cache warmer is enabled.

Summary by CodeRabbit

  • New Features

    • Added an expensive-query cache with in-memory fallback, warmup integration, and graceful planner shutdown to protect slow-to-plan queries and repopulate the main cache.
  • Observability

    • Expose expensive-cache status in debug headers and record hits in spans, OTEL attributes/histograms, and Prometheus metrics.
  • Chores

    • New configuration options: ExpensiveQueryCacheSize and ExpensiveQueryThreshold.
  • Tests

    • Comprehensive unit and integration tests covering caching, eviction, reloads, telemetry, shutdown, and concurrency.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.

@github-actions github-actions bot added the router label Mar 9, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a bounded in-memory "expensive query" plan cache and integrates it across planning, warmup, reload persistence, telemetry, debug headers, graceful shutdown, and tests to retain long-running query plans outside the main cache.

Changes

Cohort / File(s) Summary
Expensive Cache Core & Tests
router/core/expensive_query_cache.go, router/core/expensive_query_cache_test.go
New expensivePlanCache and expensivePlanEntry types with newExpensivePlanCache, Get, Set (evict-by-shortest-duration), IterValues, and Close. Unit tests cover eviction, updates, iteration, concurrency, lifecycle, and edge cases.
Integration Tests
router-tests/expensive_query_cache_test.go
New integration test TestExpensiveQueryCache with multiple subtests validating main vs expensive cache behavior, cache overflow, config reloads, shutdown, OTEL/Prometheus/span telemetry, debug header emission, feature toggling, and threshold gating.
Operation Planner & Planning Flow
router/core/operation_planner.go, router/core/graph_server.go
OperationPlanner gains in-memory fallback and expensive-cache support: new constructor signature (logger, executor, planCache, inMemoryFallback bool, expensiveCacheSize int, threshold), records planning duration, conditionally caches expensive plans, adds Close, and is constructed/closed by graphMux. Evicted plans may be propagated to the expensive cache when warmup/fallback enabled.
Persistent State / Reload
router/core/reload_persistent_state.go
InMemoryPlanCacheFallback now stores per-feature-flag expensiveCaches, adds setExpensiveCacheForFF, merges operations from expensive caches into persisted operations, and removes entries when feature flags are cleaned up.
Request Context & Debug Headers
router/core/context.go, router/core/graphql_handler.go
Added expensivePlanCacheHit and expensiveCacheEnabled fields to operationContext; new ExpensivePlanCacheHeader constant and conditional debug header emission (HIT/MISS) when enabled.
Telemetry & Prehandler
router/core/graphql_prehandler.go, router/pkg/otel/attributes.go
New OTEL attribute key wg.engine.expensive_plan_cache_hit; conditionally set span attribute and include expensive cache hit in planning telemetry attributes when expensive cache is enabled.
Configuration & Schema
router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/config/testdata/*
Adds ExpensiveQueryCacheSize (int) and ExpensiveQueryThreshold (duration) to EngineExecutionConfiguration, updates JSON schema and testdata (defaults: 100 entries, 5s threshold).
Warmup / Fallback Wiring
router/core/graph_server.go, related warmup code
Propagates expensive cache reference into warmup/in-memory fallback paths and augments warmup metrics attributes to include expensive cache hit when fallback enabled.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 5.26%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check ⚠️ Warning — The title "feat: slow plan cache" is partially related to the changeset but does not reflect the main focus. The PR implements an expensive query cache, and the terminology used throughout the codebase is "expensive" (not "slow"), as evidenced by file names and constants like ExpensiveQueryCacheSize, expensivePlanCache, and WgEngineExpensivePlanCacheHit. Resolution: update the title to "feat: expensive query cache".

✅ Passed checks (1 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions bot commented Mar 9, 2026

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-b6ca29cc0cf181fd2584355b1876802f41e714cb-nonroot

@codecov

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 96.79487% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.07%. Comparing base (3426dd3) to head (6cfe7cf).

Files with missing lines:
  • router/core/graph_server.go — 86.66% patch coverage (1 missing, 1 partial ⚠️)
  • router/core/operation_planner.go — 93.93% patch coverage (1 missing, 1 partial ⚠️)
  • router/pkg/slowplancache/slow_plan_cache.go — 99.00% patch coverage (1 missing ⚠️)
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2611       +/-   ##
===========================================
- Coverage   89.54%   63.07%   -26.47%     
===========================================
  Files          20      245      +225     
  Lines        4360    26260    +21900     
  Branches     1199        0     -1199     
===========================================
+ Hits         3904    16564    +12660     
- Misses        456     8354     +7898     
- Partials        0     1342     +1342     
Files with missing lines (coverage Δ):
  • router/core/reload_persistent_state.go — 93.42% <100.00%> (ø)
  • router/core/router.go — 69.97% <100.00%> (ø)
  • router/core/router_config.go — 93.75% <ø> (ø)
  • router/pkg/config/config.go — 80.51% <ø> (ø)
  • router/pkg/slowplancache/slow_plan_cache.go — 99.00% <99.00%> (ø)
  • router/core/graph_server.go — 84.57% <86.66%> (ø)
  • router/core/operation_planner.go — 71.42% <93.93%> (ø)

... and 258 files with indirect coverage changes


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
router/core/expensive_query_cache_test.go (1)

224-260: Prefer sync.WaitGroup.Go for this goroutine fan-out.

The current completion channel works, but WaitGroup.Go (available in Go 1.25) makes this test shorter and removes the manual 25 send/receive bookkeeping.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/expensive_query_cache_test.go` around lines 224 - 260, Replace
the manual done channel fan-out with a sync.WaitGroup using WaitGroup.Go for
each goroutine; specifically remove the done channel and its 25 send/receive
bookkeeping and instead call wg.Add or use wg.Go (Go 1.25+) for each writer
goroutine that calls c.Set, each reader goroutine that calls c.Get, and each
iterator goroutine that calls c.IterValues, then call wg.Wait() at the end to
wait for completion; ensure deferred wg.Done or rely on wg.Go to handle
completion so there are no leftover sends or magic numeric counts.
router/core/operation_planner.go (1)

199-201: Clarify the 0s threshold semantics.

p.threshold > 0 makes 0s a silent disable, not “cache every plan.” The tests already need 1ns as an effective zero, so this public knob is easy to misread. Either document 0s as disabled or treat it as “no minimum” and use a separate enable/disable switch.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router-tests/expensive_query_cache_test.go`:
- Around line 509-520: The test can false-pass because the loop only checks
"X-WG-Expensive-Plan-Cache" while a surviving main-cache entry
(X-WG-Execution-Plan-Cache) can hide an expensive-cache miss; update the
verification in the re-query loop (where distinctQueries are re-requested via
xEnv.MakeGraphQLRequestOK) to also assert that the main-cache header
"X-WG-Execution-Plan-Cache" equals "MISS" for each response (or first evict the
remaining main-cache entry before this loop), ensuring both caches report MISS
so re-planning is truly validated.

In `@router/core/expensive_query_cache.go`:
- Around line 22-25: The newExpensivePlanCache constructor currently calls
make(..., maxSize) which panics for negative sizes and treats 0 as a usable
cache; update newExpensivePlanCache/expensivePlanCache to explicitly handle
non-positive maxSize by either returning a nil/disabled cache or by rejecting
with an error (choose one consistent behavior across codebase) — e.g., if
maxSize <= 0 set a disabled flag on expensivePlanCache or return nil so no
entries are ever inserted and eviction is skipped; then update the
schema/validation for the expensive_query_cache_size setting to enforce the
intended lower bound (either >=1 or allow 0 as "disabled") so invalid values are
rejected at startup. Include references to newExpensivePlanCache,
expensivePlanCache, expensivePlanEntry, and the expensive_query_cache_size
schema/validation when making these changes.

In `@router/core/graph_server.go`:
- Around line 1384-1401: The warmup metric is getting wg.feature_flag injected
twice because attrs unconditionally includes
otel.WgFeatureFlag.String(opts.FeatureFlagName) and then baseMetricAttributes
(which may already contain the feature-flag) is appended before calling
gm.metricStore.MeasureOperationPlanningTime; update the construction of attrs
used by MeasureOperationPlanningTime to avoid duplication by removing the
unconditional otel.WgFeatureFlag entry (or conditionally adding it only when
baseMetricAttributes does not already contain wg.feature_flag), keeping the
existing fallback handling for operationPlanner.useFallback and ensuring
gm.metricStore.MeasureOperationPlanningTime is called with the deduplicated
attributes.

---

Nitpick comments:
In `@router/core/expensive_query_cache_test.go`:
- Around line 224-260: Replace the manual done channel fan-out with a
sync.WaitGroup using WaitGroup.Go for each goroutine; specifically remove the
done channel and its 25 send/receive bookkeeping and instead call wg.Add or use
wg.Go (Go 1.25+) for each writer goroutine that calls c.Set, each reader
goroutine that calls c.Get, and each iterator goroutine that calls c.IterValues,
then call wg.Wait() at the end to wait for completion; ensure deferred wg.Done
or rely on wg.Go to handle completion so there are no leftover sends or magic
numeric counts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 96f31844-57cc-4523-ba0f-e72c125fb412

📥 Commits

Reviewing files that changed from the base of the PR and between ea70e91 and 797b9a9.

📒 Files selected for processing (14)
  • router-tests/expensive_query_cache_test.go
  • router/core/context.go
  • router/core/expensive_query_cache.go
  • router/core/expensive_query_cache_test.go
  • router/core/graph_server.go
  • router/core/graphql_handler.go
  • router/core/graphql_prehandler.go
  • router/core/operation_planner.go
  • router/core/reload_persistent_state.go
  • router/pkg/config/config.go
  • router/pkg/config/config.schema.json
  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/testdata/config_full.json
  • router/pkg/otel/attributes.go

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
router/core/expensive_query_cache_test.go (1)

234-288: Consider using sync.WaitGroup.Go for concurrent test goroutines.

The concurrent access test uses manual goroutine spawning with a done channel for synchronization. Per repo conventions for Go 1.25+, prefer using sync.WaitGroup.Go(func()) to let the WaitGroup manage Add/Done automatically.

♻️ Suggested refactor using WaitGroup.Go
 func TestExpensivePlanCache_ConcurrentAccess(t *testing.T) {
 	c, err := newExpensivePlanCache(100)
 	require.NoError(t, err)
-	done := make(chan struct{})
+	var wg sync.WaitGroup

 	// Concurrent writers — each goroutine writes to its own key range
 	for i := 0; i < 10; i++ {
-		go func(id int) {
-			defer func() { done <- struct{}{} }()
+		wg.Go(func() {
+			id := i
 			for j := 0; j < 100; j++ {
 				key := uint64(id*100 + j)
 				c.Set(key, &planWithMetaData{content: "q"}, time.Duration(j)*time.Millisecond)
 			}
-		}(i)
+		})
 	}

 	// Concurrent readers
 	for i := 0; i < 10; i++ {
-		go func(id int) {
-			defer func() { done <- struct{}{} }()
+		wg.Go(func() {
+			id := i
 			for j := 0; j < 100; j++ {
 				c.Get(uint64(id*100 + j))
 			}
-		}(i)
+		})
 	}

 	// Concurrent iterators
 	for i := 0; i < 5; i++ {
-		go func() {
-			defer func() { done <- struct{}{} }()
+		wg.Go(func() {
 			c.IterValues(func(v *planWithMetaData) bool {
 				return false
 			})
-		}()
+		})
 	}

 	// Wait for all goroutines
-	for i := 0; i < 25; i++ {
-		<-done
-	}
+	wg.Wait()

Note: You'll need to add "sync" to the imports.

Based on learnings: "In Go code (Go 1.25+), prefer using sync.WaitGroup.Go(func()) to run a function in a new goroutine, letting the WaitGroup manage Add/Done automatically."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/expensive_query_cache_test.go` around lines 234 - 288, The test
TestExpensivePlanCache_ConcurrentAccess currently uses a manual done channel to
synchronize 25 goroutines; replace that pattern with sync.WaitGroup and use
wg.Go(...) (Go 1.25+) so the WaitGroup manages Add/Done automatically.
Specifically, import "sync", create a wg := new(sync.WaitGroup), call wg.Add(25)
once or rely on wg.Go for each goroutine, and convert each goroutine that calls
c.Set, c.Get, and c.IterValues (and currently signals done) to use wg.Go(func(){
... }) so you no longer write to the done channel; finally replace the 25
receive loop with wg.Wait() and keep the subsequent IterValues checks using
planWithMetaData, newExpensivePlanCache, Set, Get, and IterValues unchanged.
router/core/expensive_query_cache.go (1)

66-81: Eviction has O(n) complexity per insert at capacity.

The eviction logic iterates all entries to find the minimum duration. For the intended use case (small cache of expensive queries), this is acceptable. However, if maxSize grows large, consider using a min-heap for O(log n) eviction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@router/core/expensive_query_cache.go` around lines 66 - 81, Current eviction
scans c.entries to find minKey/minDur (O(n)) which becomes costly as maxSize
grows; replace this with a min-heap (priority queue) keyed by duration so
eviction becomes O(log n). Change the cache structure to maintain a heap of
(duration, key) alongside c.entries (map key->expensivePlanEntry), push new
entries onto the heap when inserting, and pop the heap to evict the smallest
duration when at capacity; ensure you update or mark stale heap entries if an
existing entry is replaced (or maintain an index map key->heapIndex for in-place
updates). Update functions that insert/delete to keep the heap and c.entries in
sync and use expensivePlanEntry and the key to correlate heap items to map
entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@router/core/expensive_query_cache_test.go`:
- Around line 234-288: The test TestExpensivePlanCache_ConcurrentAccess
currently uses a manual done channel to synchronize 25 goroutines; replace that
pattern with sync.WaitGroup and use wg.Go(...) (Go 1.25+) so the WaitGroup
manages Add/Done automatically. Specifically, import "sync", create a wg :=
new(sync.WaitGroup), call wg.Add(25) once or rely on wg.Go for each goroutine,
and convert each goroutine that calls c.Set, c.Get, and c.IterValues (and
currently signals done) to use wg.Go(func(){ ... }) so you no longer write to
the done channel; finally replace the 25 receive loop with wg.Wait() and keep
the subsequent IterValues checks using planWithMetaData, newExpensivePlanCache,
Set, Get, and IterValues unchanged.

In `@router/core/expensive_query_cache.go`:
- Around line 66-81: Current eviction scans c.entries to find minKey/minDur
(O(n)) which becomes costly as maxSize grows; replace this with a min-heap
(priority queue) keyed by duration so eviction becomes O(log n). Change the
cache structure to maintain a heap of (duration, key) alongside c.entries (map
key->expensivePlanEntry), push new entries onto the heap when inserting, and pop
the heap to evict the smallest duration when at capacity; ensure you update or
mark stale heap entries if an existing entry is replaced (or maintain an index
map key->heapIndex for in-place updates). Update functions that insert/delete to
keep the heap and c.entries in sync and use expensivePlanEntry and the key to
correlate heap items to map entries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5c81c12a-6f95-4519-89e0-bcc59ccc0968

📥 Commits

Reviewing files that changed from the base of the PR and between 797b9a9 and 4378c80.

📒 Files selected for processing (4)
  • router/core/expensive_query_cache.go
  • router/core/expensive_query_cache_test.go
  • router/core/graph_server.go
  • router/core/operation_planner.go

@SkArchon SkArchon marked this pull request as ready for review March 9, 2026 15:00
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@router/core/graph_server.go`:
- Around line 1347-1357: If NewOperationPlanner returns an error, you must tear
down any resources created earlier (caches/metric stores, pubsub providers,
connector) to avoid leaks; modify the error path after the NewOperationPlanner
call to call the appropriate shutdown/Close methods for the resources you
started (e.g., cache.Close(), metricStore.Close(), pubsub.Stop(),
connector.Shutdown()) before returning the formatted error from this function,
or implement a deferred cleanup (e.g., defer cleanupStartupResources() that is
cancelled/nulled on success) so that on err != nil the partially-initialized
resources are cleaned up prior to returning the error from
NewOperationPlanner/operationPlanner creation.

In `@router/core/operation_planner.go`:
- Around line 197-205: The code currently calls preparePlan(opContext,
operationPlannerOpts{operationContent: p.useFallback}) which causes
prepared.content to be populated for the main cache whenever p.useFallback is
true; change the flow so preparePlan is first called with operationContent:
false (so the main cached PreparedPlan won't hold the full operation text), then
after measuring planningDuration and determining the expensive-cache condition
(the same checks using p.useFallback, p.threshold, prepared.planningDuration,
and prepared.content), attach or populate the operation text only for the
expensive cache entry (for example by calling preparePlan again or by loading
the content into prepared.content) before calling
p.expensiveCache.Set(operationID, prepared, prepared.planningDuration); keep
p.planCache.Set(operationID, prepared, 1) as-is but ensure that the prepared
placed into p.planCache does not include the full content.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee109107-a2bb-422e-a39f-802dc1e9d9fe

📥 Commits

Reviewing files that changed from the base of the PR and between 4f644cf and 2515b07.

📒 Files selected for processing (3)
  • router/core/graph_server.go
  • router/core/operation_planner.go
  • router/pkg/config/config.schema.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • router/pkg/config/config.schema.json

Member

@endigma endigma left a comment


placeholder for external discussion about caching mechanism

@SkArchon SkArchon requested review from a team as code owners March 15, 2026 16:59
@SkArchon SkArchon requested a review from wilsonrivera March 15, 2026 16:59
Contributor

@StarpTech StarpTech left a comment


LGTM

Member

@endigma endigma left a comment


few things

Member

@endigma endigma left a comment


LGTM, one nit
