Replace async-disabling mechanism with retry backoff on refresh failure#696
Replace async-disabling mechanism with retry backoff on refresh failure#696mihaimitrea-db merged 5 commits intomainfrom
Conversation
…on behavior Replace the previous staleness tracking with a cached staleAfter timestamp that is computed when a token is stored. This makes the cache state easier to reason about because callers now classify tokens by comparing the current time against staleAfter and the expiry buffer, instead of deriving the stale window ad hoc at read time. Preserve backward compatibility for callers that set staleDuration through the builder by keeping that configuration on the legacy fixed-window path. This ensures that existing integrations which explicitly pass staleDuration continue to get the behavior they configured, both for the initial cached token and for tokens obtained through later refreshes. Update async refresh handling so failed async refreshes push staleAfter forward by a one-minute backoff instead of repeatedly retrying on every stale read. Also prevent an older async refresh result from overwriting a newer token that is already in the cache, and expand the tests to cover staleAfter computation as well as retry behavior before and after the backoff window.
- Update field comment on staleAfter to reflect its Instant type - Guard against null newToken in async refresh before calling cachedTokenIsNewer - Remove redundant synchronized block in handleFailedAsyncRefresh - Collapse duplicate null-check branches in updateToken - Use imperative tense in changelog entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
The PR looks great!
A few things I'd like to see addressed:
refreshInProgressshould likely be reset in afinallyblock. Right now, ifupdateTokenorcachedTokenIsNewerthrow unexpectedly,refreshInProgressstaystruepermanently and all future async refreshes are deadlocked. Something like:
try {
Token newToken = tokenSource.getToken();
synchronized (this) {
if (newToken != null && !cachedTokenIsNewer(newToken)) {
updateToken(newToken);
}
}
} catch (Exception e) {
synchronized (this) {
handleFailedAsyncRefresh();
logger.error("Asynchronous token refresh failed", e);
}
} finally {
synchronized (this) {
refreshInProgress = false;
}
}-
The memory ordering comment in
updateToken("The stale threshold is written before the volatile token write...") is accurate for the success path, buthandleFailedAsyncRefreshwrites tostaleAfterwithout a subsequent volatile write totoken. Could a thread callinggetToken(unsynchronized path) read a stalestaleAfterafterhandleFailedAsyncRefreshran? In practice the consequence is just one extra async trigger, so not a real bug, but the comment should acknowledge the failure path too. -
I think a test for the
cachedTokenIsNewerdiscard path would be valuable. This is a real concurrent scenario and the current test suite doesn't cover it. -
Minor:
triggerAsyncRefreshchanged from!= FRESHto== STALE. This is correct (theEXPIREDcase is handled bygetTokenBlocking), but it's a subtle behavioral change — a one-line comment explaining whyEXPIREDis excluded would help.
- Move refreshInProgress reset to a finally block so a thrown exception in updateToken or cachedTokenIsNewer cannot permanently deadlock future async refreshes. - Update the memory-ordering comment in updateToken to acknowledge that handleFailedAsyncRefresh writes staleAfter without a subsequent volatile token write (consequence is at most one extra async trigger). - Add a comment in triggerAsyncRefresh explaining why only STALE (not EXPIRED) triggers an async attempt. - Add testAsyncRefreshDiscardsOlderToken to cover the cachedTokenIsNewer discard path where a blocking refresh installs a newer token while an async refresh is in flight.
Fix two CI failures introduced by the previous commit: - Apply spotless-expected formatting to the new test (Javadoc line wrap, assertTrue collapse). - Wait for both staleAfter update and refreshInProgress reset before proceeding in backoff tests. Moving refreshInProgress = false to a finally block created a window where staleAfter was already updated but refreshInProgress was still true, causing triggerAsyncRefresh to bail out on Java 8 macOS. - Extract getRefreshInProgress/getRefreshInProgressUnchecked helpers and reuse them in testAsyncRefreshDiscardsOlderToken.
|
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success. |
Summary
Replace the cache's staleness tracking with a cached
staleAftertimestamp and preserve the legacystaleDurationbuilder path. Async refresh failures now use a short retry backoff instead of suppressing background refresh until the token expires.Why
CachedTokenSourcepreviously mixed relative stale-window calculations with token expiry checks, which made the cache state harder to reason about and made it easier to regress callers that explicitly configurestaleDurationthrough the builder. At the same time, a failed async refresh effectively disabled further async refreshes until a blocking refresh happened on expiry, which meant a brief transient failure could prevent the SDK from recovering proactively for the rest of the token lifetime.This PR moves the cache to an absolute
staleAfterthreshold that is computed whenever a token is stored. That makes the state model much clearer: callers now classify tokens by comparing the current time againststaleAfterand the expiry buffer. It also preserves the legacy fixed-window behavior for callers that setstaleDuration, and replaces the old async-failure suppression behavior with a one-minute retry backoff so transient failures can recover without waiting for full expiry.What changed
Interface changes
None.
Behavioral changes
Calls that set
staleDurationthroughCachedTokenSource.Builder#setStaleDuration(...)continue to get the legacy fixed-window behavior. The cache now recomputesstaleAfterfrom that configured duration each time a token is stored, so both the initial cached token and later refreshed tokens honor the same caller-provided setting.Failed async refreshes no longer block future async refresh attempts until a blocking refresh on expiry. Instead, the cache moves
staleAfterone minute into the future, treats the token as fresh during that cooldown, and retries async refresh the next time the token becomes stale again.Older async refresh results are also ignored when the cache already holds a newer token. This prevents a late async refresh from overwriting a token with a later expiry that was installed by another refresh path.
Internal changes
CachedTokenSourcenow stores an absolutestaleAfterinstant rather than reasoning about staleness from relative durations at read time. When callers do not providestaleDuration, the stale threshold is derived from the token's remaining TTL at the moment the token is stored and capped at 20 minutes. This centralizes stale-threshold computation inupdateToken()and reduces token-state checks to direct time comparisons.The implementation adds
_ASYNC_REFRESH_RETRY_BACKOFF-equivalent behavior for Java viaASYNC_REFRESH_RETRY_BACKOFF, introduceshandleFailedAsyncRefresh()to apply the retry cooldown, and addscachedTokenIsNewer()to discard async refresh results that would otherwise roll the cache back to an older token.CachedTokenSourceTestwas updated to replace the old async-failure fallback coverage with parameterizedstaleAfterinitialization coverage and explicit retry-backoff tests.How is this tested?
Ran the focused unit test suite for
CachedTokenSourceand the module formatting check locally:mvn --errors -pl databricks-sdk-java -Dtest=CachedTokenSourceTest testmvn --errors -pl databricks-sdk-java spotless:checkTest coverage now includes:
testAsyncRefreshParametrizedtestStaleAfterComputationParametrizedforstaleAfterinitialization across legacy builder-providedstaleDuration, default computed thresholds, null initial tokens, and expired tokenstestGetTokenDoesNotRetryBeforeAsyncBackoffElapsesto verify repeated reads during the cooldown do not trigger additional refreshestestGetTokenRetriesAfterAsyncBackoffElapsesAndUpdatesTokento verify a read after the cooldown elapses starts a new async refresh and updates the cached token