Replace async-disabling mechanism with retry backoff on refresh failure by mihaimitrea-db · Pull Request #696 · databricks/databricks-sdk-java

mihaimitrea-db · 2026-03-10T12:06:09Z

Summary

Replace the cache's staleness tracking with a cached staleAfter timestamp and preserve the legacy staleDuration builder path. Async refresh failures now use a short retry backoff instead of suppressing background refresh until the token expires.

Why

CachedTokenSource previously mixed relative stale-window calculations with token expiry checks, which made the cache state harder to reason about and made it easier to regress callers that explicitly configure staleDuration through the builder. At the same time, a failed async refresh effectively disabled further async refreshes until a blocking refresh happened on expiry, which meant a brief transient failure could prevent the SDK from recovering proactively for the rest of the token lifetime.

This PR moves the cache to an absolute staleAfter threshold that is computed whenever a token is stored. That makes the state model much clearer: callers now classify tokens by comparing the current time against staleAfter and the expiry buffer. It also preserves the legacy fixed-window behavior for callers that set staleDuration, and replaces the old async-failure suppression behavior with a one-minute retry backoff so transient failures can recover without waiting for full expiry.
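The classification described above can be sketched as a pair of time comparisons. This is a minimal illustration, not the SDK's code: the `TokenState` enum, the `TokenStateSketch` class, the 40-second `EXPIRY_BUFFER` value, and the handling of exact boundary instants are all assumptions made for the example.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical state names for this sketch.
enum TokenState { FRESH, STALE, EXPIRED }

final class TokenStateSketch {
  // Assumed safety margin before the real expiry; the actual value is not given in this PR.
  static final Duration EXPIRY_BUFFER = Duration.ofSeconds(40);

  // Classifies a token using only absolute instants, with no relative-window math at read time.
  static TokenState classify(Instant now, Instant staleAfter, Instant expiry) {
    // Within the expiry buffer (or past expiry): the token must be refreshed before use.
    if (!now.isBefore(expiry.minus(EXPIRY_BUFFER))) {
      return TokenState.EXPIRED;
    }
    // Past the stale threshold but still usable: a background refresh is worthwhile.
    if (now.isAfter(staleAfter)) {
      return TokenState.STALE;
    }
    return TokenState.FRESH;
  }
}
```

Because the stale threshold is stored as an instant, the read path needs no knowledge of how that threshold was derived.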

What changed

Interface changes

None.

Behavioral changes

Callers that set staleDuration through CachedTokenSource.Builder#setStaleDuration(...) continue to get the legacy fixed-window behavior. The cache now recomputes staleAfter from that configured duration each time a token is stored, so both the initial cached token and later refreshed tokens honor the same caller-provided setting.

Failed async refreshes no longer block future async refresh attempts until a blocking refresh on expiry. Instead, the cache moves staleAfter one minute into the future, treats the token as fresh during that cooldown, and retries async refresh the next time the token becomes stale again.
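A rough sketch of the cooldown follows, assuming a simplified cache that tracks only staleAfter. The `AsyncBackoffSketch` class, the `shouldTriggerAsyncRefresh` method, and the explicit `now` parameters are hypothetical and exist only to make the example deterministic; the constant and handler names mirror the ones mentioned in this PR, but the bodies are illustrative. Expiry handling is left out.

```java
import java.time.Duration;
import java.time.Instant;

final class AsyncBackoffSketch {
  // One-minute cooldown applied after a failed background refresh.
  static final Duration ASYNC_REFRESH_RETRY_BACKOFF = Duration.ofMinutes(1);

  private Instant staleAfter;

  AsyncBackoffSketch(Instant staleAfter) {
    this.staleAfter = staleAfter;
  }

  // A background refresh is only worth starting once the token is stale.
  synchronized boolean shouldTriggerAsyncRefresh(Instant now) {
    return now.isAfter(staleAfter);
  }

  // On failure, push the stale threshold forward so the token reads as fresh
  // during the cooldown and the next attempt happens after it elapses.
  synchronized void handleFailedAsyncRefresh(Instant now) {
    staleAfter = now.plus(ASYNC_REFRESH_RETRY_BACKOFF);
  }
}
```

With this shape, reads during the cooldown see a fresh token and start nothing, while the first read after the cooldown sees a stale token and retries.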

Older async refresh results are also ignored when the cache already holds a newer token. This prevents a late async refresh from overwriting a token with a later expiry that was installed by another refresh path.
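One way to picture the discard rule is a comparison of expiry instants under the cache lock. The sketch below assumes that "newer" means "expires later"; the `DiscardOlderSketch` and `SketchToken` types and the `offerAsyncResult` method are hypothetical stand-ins, not the SDK's classes.

```java
import java.time.Instant;

final class DiscardOlderSketch {
  // Minimal stand-in for a token: an opaque value plus its expiry.
  static final class SketchToken {
    final String value;
    final Instant expiry;

    SketchToken(String value, Instant expiry) {
      this.value = value;
      this.expiry = expiry;
    }
  }

  private SketchToken cached;

  DiscardOlderSketch(SketchToken initial) {
    this.cached = initial;
  }

  // True when the cache already holds a token that outlives the candidate.
  synchronized boolean cachedTokenIsNewer(SketchToken candidate) {
    return cached != null && cached.expiry.isAfter(candidate.expiry);
  }

  // Installs an async refresh result unless it is null or would roll the cache back.
  synchronized void offerAsyncResult(SketchToken candidate) {
    if (candidate != null && !cachedTokenIsNewer(candidate)) {
      cached = candidate;
    }
  }

  synchronized SketchToken current() {
    return cached;
  }
}
```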

Internal changes

CachedTokenSource now stores an absolute staleAfter instant rather than reasoning about staleness from relative durations at read time. When callers do not provide staleDuration, the stale threshold is derived from the token's remaining TTL at the moment the token is stored and capped at 20 minutes. This centralizes stale-threshold computation in updateToken() and reduces token-state checks to direct time comparisons.
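The threshold computation might look roughly like the sketch below. The PR text gives the 20-minute cap but not the exact formula, so the "half of the remaining TTL" fraction is an assumption, as are the `StaleAfterSketch` class, the `computeStaleAfter` signature, and the choice to place the legacy threshold at expiry minus the configured staleDuration.

```java
import java.time.Duration;
import java.time.Instant;

final class StaleAfterSketch {
  // Upper bound on the default stale window, as stated in the PR description.
  static final Duration MAX_STALE_DURATION = Duration.ofMinutes(20);

  // Computes the absolute stale threshold once, at the moment a token is stored.
  // configuredStaleDuration is null when the caller did not set staleDuration.
  static Instant computeStaleAfter(Instant now, Instant expiry, Duration configuredStaleDuration) {
    if (configuredStaleDuration != null) {
      // Legacy path: fixed window before expiry, as configured by the caller.
      return expiry.minus(configuredStaleDuration);
    }
    Duration ttl = Duration.between(now, expiry);
    if (ttl.isNegative()) {
      ttl = Duration.ZERO; // already expired: no stale window remains
    }
    // Assumed fraction: half of the remaining TTL, capped at 20 minutes.
    Duration window = ttl.dividedBy(2);
    if (window.compareTo(MAX_STALE_DURATION) > 0) {
      window = MAX_STALE_DURATION;
    }
    return expiry.minus(window);
  }
}
```

Under these assumptions a token with a one-hour TTL becomes stale 20 minutes before expiry, while a token with a ten-minute TTL becomes stale 5 minutes before expiry.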

The implementation adds the ASYNC_REFRESH_RETRY_BACKOFF constant, which gives Java behavior equivalent to _ASYNC_REFRESH_RETRY_BACKOFF, introduces handleFailedAsyncRefresh() to apply the retry cooldown, and adds cachedTokenIsNewer() to discard async refresh results that would otherwise roll the cache back to an older token.

CachedTokenSourceTest was updated to replace the old async-failure fallback coverage with parameterized staleAfter initialization coverage and explicit retry-backoff tests.

How is this tested?

Ran the focused unit test suite for CachedTokenSource and the module formatting check locally:

mvn --errors -pl databricks-sdk-java -Dtest=CachedTokenSourceTest test
mvn --errors -pl databricks-sdk-java spotless:check

Test coverage now includes:

- Existing async refresh state coverage in testAsyncRefreshParametrized
- testStaleAfterComputationParametrized for staleAfter initialization across legacy builder-provided staleDuration, default computed thresholds, null initial tokens, and expired tokens
- testGetTokenDoesNotRetryBeforeAsyncBackoffElapses to verify repeated reads during the cooldown do not trigger additional refreshes
- testGetTokenRetriesAfterAsyncBackoffElapsesAndUpdatesToken to verify a read after the cooldown elapses starts a new async refresh and updates the cached token

…on behavior

Replace the previous staleness tracking with a cached staleAfter timestamp that is computed when a token is stored. This makes the cache state easier to reason about because callers now classify tokens by comparing the current time against staleAfter and the expiry buffer, instead of deriving the stale window ad hoc at read time.

Preserve backward compatibility for callers that set staleDuration through the builder by keeping that configuration on the legacy fixed-window path. This ensures that existing integrations which explicitly pass staleDuration continue to get the behavior they configured, both for the initial cached token and for tokens obtained through later refreshes.

Update async refresh handling so failed async refreshes push staleAfter forward by a one-minute backoff instead of repeatedly retrying on every stale read. Also prevent an older async refresh result from overwriting a newer token that is already in the cache, and expand the tests to cover staleAfter computation as well as retry behavior before and after the backoff window.

- Update field comment on staleAfter to reflect its Instant type
- Guard against null newToken in async refresh before calling cachedTokenIsNewer
- Remove redundant synchronized block in handleFailedAsyncRefresh
- Collapse duplicate null-check branches in updateToken
- Use imperative tense in changelog entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

renaudhartert-db

The PR looks great!

A few things I'd like to see addressed:

refreshInProgress should likely be reset in a finally block. Right now, if updateToken or cachedTokenIsNewer throw unexpectedly, refreshInProgress stays true permanently and all future async refreshes are deadlocked. Something like:

try {
  Token newToken = tokenSource.getToken();
  synchronized (this) {
    if (newToken != null && !cachedTokenIsNewer(newToken)) {
      updateToken(newToken);
    }
  }
} catch (Exception e) {
  synchronized (this) {
    handleFailedAsyncRefresh();
    logger.error("Asynchronous token refresh failed", e);
  }
} finally {
  synchronized (this) {
    refreshInProgress = false;
  }
}

The memory ordering comment in updateToken ("The stale threshold is written before the volatile token write...") is accurate for the success path, but handleFailedAsyncRefresh writes to staleAfter without a subsequent volatile write to token. Could a thread calling getToken (unsynchronized path) read a stale staleAfter after handleFailedAsyncRefresh ran? In practice the consequence is just one extra async trigger, so not a real bug, but the comment should acknowledge the failure path too.
I think a test for the cachedTokenIsNewer discard path would be valuable. This is a real concurrent scenario and the current test suite doesn't cover it.
Minor: triggerAsyncRefresh changed from != FRESH to == STALE. This is correct (the EXPIRED case is handled by getTokenBlocking), but it's a subtle behavioral change — a one-line comment explaining why EXPIRED is excluded would help.

- Move refreshInProgress reset to a finally block so a thrown exception in updateToken or cachedTokenIsNewer cannot permanently deadlock future async refreshes.
- Update the memory-ordering comment in updateToken to acknowledge that handleFailedAsyncRefresh writes staleAfter without a subsequent volatile token write (consequence is at most one extra async trigger).
- Add a comment in triggerAsyncRefresh explaining why only STALE (not EXPIRED) triggers an async attempt.
- Add testAsyncRefreshDiscardsOlderToken to cover the cachedTokenIsNewer discard path where a blocking refresh installs a newer token while an async refresh is in flight.

Fix two CI failures introduced by the previous commit:

- Apply spotless-expected formatting to the new test (Javadoc line wrap, assertTrue collapse).
- Wait for both staleAfter update and refreshInProgress reset before proceeding in backoff tests. Moving refreshInProgress = false to a finally block created a window where staleAfter was already updated but refreshInProgress was still true, causing triggerAsyncRefresh to bail out on Java 8 macOS.
- Extract getRefreshInProgress/getRefreshInProgressUnchecked helpers and reuse them in testAsyncRefreshDiscardsOlderToken.

github-actions · 2026-03-11T11:19:03Z

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-java

Inputs:

PR number: 696
Commit SHA: edacf5d0c73ea0f607ac59917b9af64e21940ebf

Checks will be approved automatically on success.

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 12:06 — with GitHub Actions Inactive

mihaimitrea-db requested a review from renaudhartert-db March 10, 2026 12:19

mihaimitrea-db and others added 2 commits March 10, 2026 12:27

Merge remote-tracking branch 'origin/main' into async-refresh-retry

1ab6c27

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 12:29 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 12:30 — with GitHub Actions Inactive

mihaimitrea-db self-assigned this Mar 10, 2026

renaudhartert-db requested changes Mar 10, 2026

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 09:18 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 09:20 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 11:18 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 11:20 — with GitHub Actions Inactive

renaudhartert-db approved these changes Mar 11, 2026

mihaimitrea-db added this pull request to the merge queue Mar 11, 2026

Merged via the queue into main with commit 214c90b Mar 11, 2026
16 checks passed

mihaimitrea-db deleted the mihaimitrea-db/async-refresh-retry branch March 11, 2026 12:02

mihaimitrea-db merged 5 commits into main from mihaimitrea-db/async-refresh-retry
