Skip to content

Replace async-disabling mechanism with retry backoff on refresh failure#696

Merged
mihaimitrea-db merged 5 commits intomainfrom
mihaimitrea-db/async-refresh-retry
Mar 11, 2026
Merged

Replace async-disabling mechanism with retry backoff on refresh failure#696
mihaimitrea-db merged 5 commits intomainfrom
mihaimitrea-db/async-refresh-retry

Conversation

@mihaimitrea-db
Copy link
Contributor

Summary

Replace the cache's staleness tracking with a cached staleAfter timestamp and preserve the legacy staleDuration builder path. Async refresh failures now use a short retry backoff instead of suppressing background refresh until the token expires.

Why

CachedTokenSource previously mixed relative stale-window calculations with token expiry checks, which made the cache state harder to reason about and made it easier to regress callers that explicitly configure staleDuration through the builder. At the same time, a failed async refresh effectively disabled further async refreshes until a blocking refresh happened on expiry, which meant a brief transient failure could prevent the SDK from recovering proactively for the rest of the token lifetime.

This PR moves the cache to an absolute staleAfter threshold that is computed whenever a token is stored. That makes the state model much clearer: callers now classify tokens by comparing the current time against staleAfter and the expiry buffer. It also preserves the legacy fixed-window behavior for callers that set staleDuration, and replaces the old async-failure suppression behavior with a one-minute retry backoff so transient failures can recover without waiting for full expiry.

What changed

Interface changes

None.

Behavioral changes

Calls that set staleDuration through CachedTokenSource.Builder#setStaleDuration(...) continue to get the legacy fixed-window behavior. The cache now recomputes staleAfter from that configured duration each time a token is stored, so both the initial cached token and later refreshed tokens honor the same caller-provided setting.

Failed async refreshes no longer block future async refresh attempts until a blocking refresh on expiry. Instead, the cache moves staleAfter one minute into the future, treats the token as fresh during that cooldown, and retries async refresh the next time the token becomes stale again.

Older async refresh results are also ignored when the cache already holds a newer token. This prevents a late async refresh from overwriting a token with a later expiry that was installed by another refresh path.

Internal changes

CachedTokenSource now stores an absolute staleAfter instant rather than reasoning about staleness from relative durations at read time. When callers do not provide staleDuration, the stale threshold is derived from the token's remaining TTL at the moment the token is stored and capped at 20 minutes. This centralizes stale-threshold computation in updateToken() and reduces token-state checks to direct time comparisons.

The implementation adds _ASYNC_REFRESH_RETRY_BACKOFF-equivalent behavior for Java via ASYNC_REFRESH_RETRY_BACKOFF, introduces handleFailedAsyncRefresh() to apply the retry cooldown, and adds cachedTokenIsNewer() to discard async refresh results that would otherwise roll the cache back to an older token.

CachedTokenSourceTest was updated to replace the old async-failure fallback coverage with parameterized staleAfter initialization coverage and explicit retry-backoff tests.

How is this tested?

Ran the focused unit test suite for CachedTokenSource and the module formatting check locally:

  • mvn --errors -pl databricks-sdk-java -Dtest=CachedTokenSourceTest test
  • mvn --errors -pl databricks-sdk-java spotless:check

Test coverage now includes:

  • Existing async refresh state coverage in testAsyncRefreshParametrized
  • testStaleAfterComputationParametrized for staleAfter initialization across legacy builder-provided staleDuration, default computed thresholds, null initial tokens, and expired tokens
  • testGetTokenDoesNotRetryBeforeAsyncBackoffElapses to verify repeated reads during the cooldown do not trigger additional refreshes
  • testGetTokenRetriesAfterAsyncBackoffElapsesAndUpdatesToken to verify a read after the cooldown elapses starts a new async refresh and updates the cached token

…on behavior

Replace the previous staleness tracking with a cached staleAfter timestamp that is computed when a token is stored. This makes the cache state easier to reason about because callers now classify tokens by comparing the current time against staleAfter and the expiry buffer, instead of deriving the stale window ad hoc at read time.

Preserve backward compatibility for callers that set staleDuration through the builder by keeping that configuration on the legacy fixed-window path. This ensures that existing integrations which explicitly pass staleDuration continue to get the behavior they configured, both for the initial cached token and for tokens obtained through later refreshes.

Update async refresh handling so failed async refreshes push staleAfter forward by a one-minute backoff instead of repeatedly retrying on every stale read. Also prevent an older async refresh result from overwriting a newer token that is already in the cache, and expand the tests to cover staleAfter computation as well as retry behavior before and after the backoff window.
mihaimitrea-db and others added 2 commits March 10, 2026 12:27
- Update field comment on staleAfter to reflect its Instant type
- Guard against null newToken in async refresh before calling cachedTokenIsNewer
- Remove redundant synchronized block in handleFailedAsyncRefresh
- Collapse duplicate null-check branches in updateToken
- Use imperative tense in changelog entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Contributor

@renaudhartert-db renaudhartert-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks great!

A few things I'd like to see addressed:

  1. refreshInProgress should likely be reset in a finally block. Right now, if updateToken or cachedTokenIsNewer throw unexpectedly, refreshInProgress stays true permanently and all future async refreshes are deadlocked. Something like:
try {
  Token newToken = tokenSource.getToken();
  synchronized (this) {
    if (newToken != null && !cachedTokenIsNewer(newToken)) {
      updateToken(newToken);
    }
  }
} catch (Exception e) {
  synchronized (this) {
    handleFailedAsyncRefresh();
    logger.error("Asynchronous token refresh failed", e);
  }
} finally {
  synchronized (this) {
    refreshInProgress = false;
  }
}
  1. The memory ordering comment in updateToken ("The stale threshold is written before the volatile token write...") is accurate for the success path, but handleFailedAsyncRefresh writes to staleAfter without a subsequent volatile write to token. Could a thread calling getToken (unsynchronized path) read a stale staleAfter after handleFailedAsyncRefresh ran? In practice the consequence is just one extra async trigger, so not a real bug, but the comment should acknowledge the failure path too.

  2. I think a test for the cachedTokenIsNewer discard path would be valuable. This is a real concurrent scenario and the current test suite doesn't cover it.

  3. Minor: triggerAsyncRefresh changed from != FRESH to == STALE. This is correct (the EXPIRED case is handled by getTokenBlocking), but it's a subtle behavioral change — a one-line comment explaining why EXPIRED is excluded would help.

- Move refreshInProgress reset to a finally block so a thrown exception
  in updateToken or cachedTokenIsNewer cannot permanently deadlock future
  async refreshes.
- Update the memory-ordering comment in updateToken to acknowledge that
  handleFailedAsyncRefresh writes staleAfter without a subsequent volatile
  token write (consequence is at most one extra async trigger).
- Add a comment in triggerAsyncRefresh explaining why only STALE (not
  EXPIRED) triggers an async attempt.
- Add testAsyncRefreshDiscardsOlderToken to cover the cachedTokenIsNewer
  discard path where a blocking refresh installs a newer token while an
  async refresh is in flight.
Fix two CI failures introduced by the previous commit:

- Apply spotless-expected formatting to the new test (Javadoc line wrap,
  assertTrue collapse).
- Wait for both staleAfter update and refreshInProgress reset before
  proceeding in backoff tests. Moving refreshInProgress = false to a
  finally block created a window where staleAfter was already updated
  but refreshInProgress was still true, causing triggerAsyncRefresh to
  bail out on Java 8 macOS.
- Extract getRefreshInProgress/getRefreshInProgressUnchecked helpers and
  reuse them in testAsyncRefreshDiscardsOlderToken.
@github-actions
Copy link

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-java

Inputs:

  • PR number: 696
  • Commit SHA: edacf5d0c73ea0f607ac59917b9af64e21940ebf

Checks will be approved automatically on success.

@mihaimitrea-db mihaimitrea-db added this pull request to the merge queue Mar 11, 2026
Merged via the queue into main with commit 214c90b Mar 11, 2026
16 checks passed
@mihaimitrea-db mihaimitrea-db deleted the mihaimitrea-db/async-refresh-retry branch March 11, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants