Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your settings.
Walkthrough
Adds a bounded in-memory "expensive query" plan cache and integrates it across planning, warmup, reload persistence, telemetry, debug headers, graceful shutdown, and tests to retain long-running query plans outside the main cache.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
📝 Coding Plan
Router-nonroot image scan passed
✅ No security vulnerabilities found in image:
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2611        +/-   ##
============================================
- Coverage   89.54%   63.07%    -26.47%
============================================
  Files          20      245      +225
  Lines        4360    26260    +21900
  Branches     1199        0     -1199
============================================
+ Hits         3904    16564    +12660
- Misses        456     8354     +7898
- Partials        0     1342     +1342
```
Actionable comments posted: 3
🧹 Nitpick comments (2)
router/core/expensive_query_cache_test.go (1)
224-260: Prefer `sync.WaitGroup.Go` for this goroutine fan-out.
The current completion channel works, but `WaitGroup.Go` (available in Go 1.25) makes this test shorter and removes the manual 25-send/receive bookkeeping.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/expensive_query_cache_test.go` around lines 224 - 260, Replace the manual done channel fan-out with a sync.WaitGroup using WaitGroup.Go for each goroutine; specifically remove the done channel and its 25 send/receive bookkeeping and instead call wg.Add or use wg.Go (Go 1.25+) for each writer goroutine that calls c.Set, each reader goroutine that calls c.Get, and each iterator goroutine that calls c.IterValues, then call wg.Wait() at the end to wait for completion; ensure deferred wg.Done or rely on wg.Go to handle completion so there are no leftover sends or magic numeric counts.
router/core/operation_planner.go (1)
199-201: Clarify the `0s` threshold semantics.
`p.threshold > 0` makes `0s` a silent disable, not "cache every plan." The tests already need `1ns` as an effective zero, so this public knob is easy to misread. Either document `0s` as disabled, or treat it as "no minimum" and use a separate enable/disable switch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router-tests/expensive_query_cache_test.go`:
- Around line 509-520: The test can false-pass because the loop only checks
"X-WG-Expensive-Plan-Cache" while a surviving main-cache entry
(X-WG-Execution-Plan-Cache) can hide an expensive-cache miss; update the
verification in the re-query loop (where distinctQueries are re-requested via
xEnv.MakeGraphQLRequestOK) to also assert that the main-cache header
"X-WG-Execution-Plan-Cache" equals "MISS" for each response (or first evict the
remaining main-cache entry before this loop), ensuring both caches report MISS
so re-planning is truly validated.
In `@router/core/expensive_query_cache.go`:
- Around line 22-25: The newExpensivePlanCache constructor currently calls
make(..., maxSize) which panics for negative sizes and treats 0 as a usable
cache; update newExpensivePlanCache/expensivePlanCache to explicitly handle
non-positive maxSize by either returning a nil/disabled cache or by rejecting
with an error (choose one consistent behavior across codebase) — e.g., if
maxSize <= 0 set a disabled flag on expensivePlanCache or return nil so no
entries are ever inserted and eviction is skipped; then update the
schema/validation for the expensive_query_cache_size setting to enforce the
intended lower bound (either >=1 or allow 0 as "disabled") so invalid values are
rejected at startup. Include references to newExpensivePlanCache,
expensivePlanCache, expensivePlanEntry, and the expensive_query_cache_size
schema/validation when making these changes.
In `@router/core/graph_server.go`:
- Around line 1384-1401: The warmup metric is getting wg.feature_flag injected
twice because attrs unconditionally includes
otel.WgFeatureFlag.String(opts.FeatureFlagName) and then baseMetricAttributes
(which may already contain the feature-flag) is appended before calling
gm.metricStore.MeasureOperationPlanningTime; update the construction of attrs
used by MeasureOperationPlanningTime to avoid duplication by removing the
unconditional otel.WgFeatureFlag entry (or conditionally adding it only when
baseMetricAttributes does not already contain wg.feature_flag), keeping the
existing fallback handling for operationPlanner.useFallback and ensuring
gm.metricStore.MeasureOperationPlanningTime is called with the deduplicated
attributes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 96f31844-57cc-4523-ba0f-e72c125fb412
📒 Files selected for processing (14)
- router-tests/expensive_query_cache_test.go
- router/core/context.go
- router/core/expensive_query_cache.go
- router/core/expensive_query_cache_test.go
- router/core/graph_server.go
- router/core/graphql_handler.go
- router/core/graphql_prehandler.go
- router/core/operation_planner.go
- router/core/reload_persistent_state.go
- router/pkg/config/config.go
- router/pkg/config/config.schema.json
- router/pkg/config/testdata/config_defaults.json
- router/pkg/config/testdata/config_full.json
- router/pkg/otel/attributes.go
🧹 Nitpick comments (2)
router/core/expensive_query_cache_test.go (1)
234-288: Consider using `sync.WaitGroup.Go` for concurrent test goroutines.
The concurrent access test uses manual goroutine spawning with a done channel for synchronization. Per repo conventions for Go 1.25+, prefer using `sync.WaitGroup.Go(func())` to let the WaitGroup manage Add/Done automatically.
♻️ Suggested refactor using WaitGroup.Go
```diff
 func TestExpensivePlanCache_ConcurrentAccess(t *testing.T) {
 	c, err := newExpensivePlanCache(100)
 	require.NoError(t, err)
-	done := make(chan struct{})
+	var wg sync.WaitGroup

 	// Concurrent writers — each goroutine writes to its own key range
 	for i := 0; i < 10; i++ {
-		go func(id int) {
-			defer func() { done <- struct{}{} }()
+		wg.Go(func() {
+			id := i
 			for j := 0; j < 100; j++ {
 				key := uint64(id*100 + j)
 				c.Set(key, &planWithMetaData{content: "q"}, time.Duration(j)*time.Millisecond)
 			}
-		}(i)
+		})
 	}

 	// Concurrent readers
 	for i := 0; i < 10; i++ {
-		go func(id int) {
-			defer func() { done <- struct{}{} }()
+		wg.Go(func() {
+			id := i
 			for j := 0; j < 100; j++ {
 				c.Get(uint64(id*100 + j))
 			}
-		}(i)
+		})
 	}

 	// Concurrent iterators
 	for i := 0; i < 5; i++ {
-		go func() {
-			defer func() { done <- struct{}{} }()
+		wg.Go(func() {
 			c.IterValues(func(v *planWithMetaData) bool { return false })
-		}()
+		})
 	}

 	// Wait for all goroutines
-	for i := 0; i < 25; i++ {
-		<-done
-	}
+	wg.Wait()
```
"sync"to the imports.Based on learnings: "In Go code (Go 1.25+), prefer using sync.WaitGroup.Go(func()) to run a function in a new goroutine, letting the WaitGroup manage Add/Done automatically."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/expensive_query_cache_test.go` around lines 234 - 288, The test TestExpensivePlanCache_ConcurrentAccess currently uses a manual done channel to synchronize 25 goroutines; replace that pattern with sync.WaitGroup and use wg.Go(...) (Go 1.25+) so the WaitGroup manages Add/Done automatically. Specifically, import "sync", create a wg := new(sync.WaitGroup), call wg.Add(25) once or rely on wg.Go for each goroutine, and convert each goroutine that calls c.Set, c.Get, and c.IterValues (and currently signals done) to use wg.Go(func(){ ... }) so you no longer write to the done channel; finally replace the 25 receive loop with wg.Wait() and keep the subsequent IterValues checks using planWithMetaData, newExpensivePlanCache, Set, Get, and IterValues unchanged.
router/core/expensive_query_cache.go (1)
66-81: Eviction has O(n) complexity per insert at capacity.
The eviction logic iterates all entries to find the minimum duration. For the intended use case (a small cache of expensive queries), this is acceptable. However, if `maxSize` grows large, consider using a min-heap for O(log n) eviction.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@router/core/expensive_query_cache.go` around lines 66 - 81, Current eviction scans c.entries to find minKey/minDur (O(n)) which becomes costly as maxSize grows; replace this with a min-heap (priority queue) keyed by duration so eviction becomes O(log n). Change the cache structure to maintain a heap of (duration, key) alongside c.entries (map key->expensivePlanEntry), push new entries onto the heap when inserting, and pop the heap to evict the smallest duration when at capacity; ensure you update or mark stale heap entries if an existing entry is replaced (or maintain an index map key->heapIndex for in-place updates). Update functions that insert/delete to keep the heap and c.entries in sync and use expensivePlanEntry and the key to correlate heap items to map entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5c81c12a-6f95-4519-89e0-bcc59ccc0968
📒 Files selected for processing (4)
- router/core/expensive_query_cache.go
- router/core/expensive_query_cache_test.go
- router/core/graph_server.go
- router/core/operation_planner.go
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@router/core/graph_server.go`:
- Around line 1347-1357: If NewOperationPlanner returns an error, you must tear
down any resources created earlier (caches/metric stores, pubsub providers,
connector) to avoid leaks; modify the error path after the NewOperationPlanner
call to call the appropriate shutdown/Close methods for the resources you
started (e.g., cache.Close(), metricStore.Close(), pubsub.Stop(),
connector.Shutdown()) before returning the formatted error from this function,
or implement a deferred cleanup (e.g., defer cleanupStartupResources() that is
cancelled/nulled on success) so that on err != nil the partially-initialized
resources are cleaned up prior to returning the error from
NewOperationPlanner/operationPlanner creation.
In `@router/core/operation_planner.go`:
- Around line 197-205: The code currently calls preparePlan(opContext,
operationPlannerOpts{operationContent: p.useFallback}) which causes
prepared.content to be populated for the main cache whenever p.useFallback is
true; change the flow so preparePlan is first called with operationContent:
false (so the main cached PreparedPlan won't hold the full operation text), then
after measuring planningDuration and determining the expensive-cache condition
(the same checks using p.useFallback, p.threshold, prepared.planningDuration,
and prepared.content), attach or populate the operation text only for the
expensive cache entry (for example by calling preparePlan again or by loading
the content into prepared.content) before calling
p.expensiveCache.Set(operationID, prepared, prepared.planningDuration); keep
p.planCache.Set(operationID, prepared, 1) as-is but ensure that the prepared
placed into p.planCache does not include the full content.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ee109107-a2bb-422e-a39f-802dc1e9d9fe
📒 Files selected for processing (3)
- router/core/graph_server.go
- router/core/operation_planner.go
- router/pkg/config/config.schema.json
🚧 Files skipped from review as they are similar to previous changes (1)
- router/pkg/config/config.schema.json
endigma
left a comment
placeholder for external discussion about caching mechanism
This PR adds an expensive query cache. We do not use ristretto for this cache; instead we use a simple custom cache, because we do not want ristretto's frequency-based eviction rules. When full, this cache admits a query only if it is more expensive than the cheapest ("fastest min") query currently in the cache.
This feature is enabled by default when the in-memory fallback cache warmer is enabled.
Summary by CodeRabbit
New Features
Observability
Chores
Tests
Checklist