
Improve short-query precision in Algolia search #741

Closed
Flamki wants to merge 2 commits into precice:master from Flamki:issue-733-search-short-token-precision

@Flamki commented Feb 22, 2026

Summary
Improve short-query precision in Algolia by applying a minimal query-time typo-threshold tweak in both Algolia entry points.

What changed

  • _includes/algolia.html
  • js/algolia-search.js

Both now set:

  • minWordSizefor1Typo = 5
  • minWordSizefor2Typos = 9

This is intentionally unconditional and minimal. Algolia applies these as per-word thresholds, so the behavior is naturally scoped to short words without extra helper logic.
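
To illustrate the per-word behavior, here is a minimal sketch. It uses a simplified model of the thresholds (an assumption for illustration, not Algolia's actual matching code): a word may match with two typos once it reaches `minWordSizefor2Typos` characters, with one typo once it reaches `minWordSizefor1Typo`, and with none otherwise. The helper name `allowedTypos` is hypothetical.

```javascript
// Simplified model of per-word typo thresholds (assumption for illustration,
// not Algolia's implementation). `allowedTypos` is a hypothetical helper.
function allowedTypos(word, minWordSizefor1Typo, minWordSizefor2Typos) {
  if (word.length >= minWordSizefor2Typos) return 2;
  if (word.length >= minWordSizefor1Typo) return 1;
  return 0;
}

const defaults = { one: 4, two: 8 }; // default thresholds mentioned in this thread
const proposed = { one: 5, two: 9 }; // values set by this PR

for (const word of ["gsoc", "sockets", "configuration"]) {
  console.log(
    word,
    "default:", allowedTypos(word, defaults.one, defaults.two),
    "proposed:", allowedTypos(word, proposed.one, proposed.two)
  );
}
```

Under this model, only the 4-character word `gsoc` changes (from one allowed typo to none); `sockets` and `configuration` keep the same budget.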

Why
This reduces false positives for short queries (for example gsoc matching unrelated XML tokens) while avoiding broader search/index configuration changes.

Context

  • The indexing/root-cause workflow typo has already been fixed separately in #813.
  • This PR focuses only on query-time typo-tolerance behavior in the frontend search code.

Validation

  • pre-commit run --files _includes/algolia.html js/algolia-search.js (passed)
  • Attempted Docker build:
    • docker run --rm -v "${PWD}:/srv/jekyll" -w /srv/jekyll jekyll/jekyll:4 bash -lc "bundle exec jekyll build"
    • Could not run in this environment because Docker daemon is not available (dockerDesktopLinuxEngine pipe missing).

Related: #733

@MuhammadAashirAslam
Contributor

Hi @Flamki, thanks for submitting the PR! I just have one question about the scope of the fix.

The getStrictShortTokens function only activates for single-word alphanumeric queries between 3-5 characters. What happens in a multi-word query like "gsoc projects" where the short token is part of a longer search?

Since tokens.length !== 1 returns early with an empty array, disableTypoToleranceOnWords would receive [] and the typo tolerance would remain at default — meaning "gsoc" could still match "sockets" in that context.
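
The early return described above can be sketched as follows. This is a hypothetical reconstruction based only on the description in this comment (single-word, alphanumeric, 3-5 characters); the helper's real body in the PR may differ.

```javascript
// Hypothetical reconstruction from the review comment; not the PR's actual code.
function getStrictShortTokens(query) {
  const tokens = query.trim().split(/\s+/).filter(Boolean);
  // Multi-word queries bail out here, so no token is treated strictly.
  if (tokens.length !== 1) return [];
  const token = tokens[0];
  return /^[a-z0-9]{3,5}$/i.test(token) ? [token] : [];
}

console.log(getStrictShortTokens("gsoc"));          // the single short token
console.log(getStrictShortTokens("gsoc projects")); // empty array
```

With an empty array, nothing would be passed on as a strict word, which is the gap the question is about.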

Was this intentional to keep the fix narrow?

(Also, can you attach a screenshot of your solution working locally?)

Also, can you take a look at my PR #744?

@Flamki
Author

Flamki commented Feb 23, 2026

Thanks for the careful review, great catch.

You were right: the earlier helper was too narrow for multi-word input. I updated the PR so short tokens are handled even when part of a longer query (for example gsoc projects).

Follow-up changes in this PR:

  • Detect short alphanumeric tokens (3-5 chars) anywhere in the query, not only single-word queries.
  • Use an Algolia-compatible query-time approach:
    • if a short token exists: minWordSizefor1Typo=5, minWordSizefor2Typos=9
    • otherwise: defaults 4 / 8

This keeps the scope narrow (only short-token queries), avoids global index/config changes, and keeps XML search behavior for normal XML queries.
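
A minimal sketch of the conditional approach described above. `hasStrictShortToken` is the helper name mentioned later in this thread, but its body here is an assumption; `typoThresholds` is a hypothetical name used only for illustration.

```javascript
// Assumed implementation: true if any whitespace-separated token is
// alphanumeric and 3-5 characters long.
function hasStrictShortToken(query) {
  return query
    .trim()
    .split(/\s+/)
    .some((token) => /^[a-z0-9]{3,5}$/i.test(token));
}

// Hypothetical helper: pick query-time thresholds based on the query.
function typoThresholds(query) {
  return hasStrictShortToken(query)
    ? { minWordSizefor1Typo: 5, minWordSizefor2Typos: 9 } // stricter
    : { minWordSizefor1Typo: 4, minWordSizefor2Typos: 8 }; // defaults
}

console.log(typoThresholds("gsoc projects"));
console.log(typoThresholds("configuration"));
```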

Validation:

  • pre-commit run --files _includes/algolia.html js/algolia-search.js
  • docker run --rm -v "${PWD}:/srv/jekyll" -w /srv/jekyll jekyll/jekyll:4 bash -lc "bundle install && bundle exec jekyll build"

Local screenshots:

  1. gsoc (no noisy XML matches)

[screenshot: local-gsoc]

  2. sockets (XML results still available)

[screenshot: local-sockets]

@Flamki force-pushed the issue-733-search-short-token-precision branch from d33a06c to 6629092 on February 23, 2026 06:08
@Flamki
Author

Flamki commented Feb 23, 2026

Also yes, I saw #744. The main difference is scope: #744 applies typo-threshold changes globally, while this PR applies stricter thresholds only when a short token is present in the query.

@MuhammadAashirAslam
Contributor

Hey @Flamki, thanks for iterating on this! One thing I wanted to point out about the updated approach:

minWordSizefor1Typo and minWordSizefor2Typos are per-word thresholds, not per-query settings. Raising minWordSizefor1Typo from the default of 4 to 5 only changes behavior for 4-character words, which lose their one-typo allowance; shorter words get no typo tolerance either way. If the query has no such words, that setting has no effect compared to the default of 4.

This means the conditional check via hasStrictShortToken produces identical results to always setting the values. For example, searching "configuration" (13 chars) gets 2 typos regardless of whether the thresholds are 4/8 or 5/9, because 13 is above every threshold either way.

So the helper function adds ~20 lines of logic that doesn't change any search behavior compared to the simpler unconditional approach.

I think the minimal 2-line version keeps things cleaner and easier to maintain. Happy to discuss if I'm missing something though! 🙂
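
A small check of this comparison, as a sketch under stated assumptions: the per-word model below is a simplification (not Algolia's implementation), and `hasStrictShortToken` is reconstructed from the description earlier in the thread. All other names are hypothetical.

```javascript
// Simplified per-word model (assumption): 2 typos at or above t2, 1 at or above t1.
function allowedTypos(word, t1, t2) {
  if (word.length >= t2) return 2;
  if (word.length >= t1) return 1;
  return 0;
}

// Assumed reconstruction: any alphanumeric token of 3-5 characters.
function hasStrictShortToken(query) {
  return query.trim().split(/\s+/).some((t) => /^[a-z0-9]{3,5}$/i.test(t));
}

// Per-word typo budgets for the conditional and the unconditional variant.
function budgets(query, conditional) {
  const strict = !conditional || hasStrictShortToken(query);
  const [t1, t2] = strict ? [5, 9] : [4, 8];
  return query.trim().split(/\s+/).map((w) => allowedTypos(w, t1, t2));
}

for (const q of ["gsoc", "gsoc projects", "configuration", "coupling"]) {
  console.log(q, "conditional:", budgets(q, true), "unconditional:", budgets(q, false));
}
```

Under this model the two variants agree for `gsoc`, `gsoc projects`, and `configuration`. The one case that differs is a query with an 8-character word and no short token (here `coupling`), where minWordSizefor2Typos = 8 versus 9 decides between two typos and one.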

@MakisH added the GSoC (Contributed in the context of the Google Summer of Code) and technical (Technical issues on the website) labels Feb 23, 2026
@Flamki force-pushed the issue-733-search-short-token-precision branch from 6629092 to 05a7bd6 on February 24, 2026 20:34
@precice-bot
Contributor

This pull request has been mentioned on preCICE Forum on Discourse. There might be relevant details there:

https://precice.discourse.group/t/gsoc-2026-website-modernization-interest-ayush-singh-flamki/2766/1

@Flamki force-pushed the issue-733-search-short-token-precision branch from 05a7bd6 to fbe90d1 on February 26, 2026 10:59
@Flamki
Author

Flamki commented Feb 26, 2026

@MakisH @MuhammadAashirAslam
I simplified this PR by removing the helper/conditional logic entirely.

It now unconditionally sets minWordSizefor1Typo = 5 and minWordSizefor2Typos = 9 at query time in both _includes/algolia.html and js/algolia-search.js.

I also added a short inline comment in both files to explain why these values are used (to reduce short-query false positives like gsoc matching unrelated XML tokens).
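
As a rough sketch of what the unconditional version amounts to (the helper name `withTypoThresholds` and the constant are hypothetical, and how the two files actually pass their search options is not shown in this thread):

```javascript
// Hypothetical constant; values are the ones stated in this PR.
const TYPO_THRESHOLDS = Object.freeze({
  // Raise the per-word typo thresholds to reduce short-query false positives
  // (for example "gsoc" matching unrelated XML tokens).
  minWordSizefor1Typo: 5,
  minWordSizefor2Typos: 9,
});

// Hypothetical helper: merge the thresholds into whatever query-time
// parameters the caller already uses.
function withTypoThresholds(searchParams = {}) {
  return { ...searchParams, ...TYPO_THRESHOLDS };
}

console.log(withTypoThresholds({ hitsPerPage: 10 }));
```

The merged object would then be passed wherever each entry point supplies its query-time search parameters.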

Note: the indexing root-cause fix (typo in update-algolia.yml) was already merged separately in #813, so this PR only handles the typo-tolerance side.

Validation update: I ran a full local Jekyll build in Docker.

Command:
docker run --rm -v "${PWD}:/srv/jekyll" -w /srv/jekyll jekyll/jekyll:4 bash -lc "bundle install && bundle exec jekyll build"

Result:
bundle exec jekyll build completed successfully (done in 27.344 seconds).

@MuhammadAashirAslam
Contributor

Hey @Flamki, I see you have implemented exactly the same solution as my PR, but the issue turned out to be something else (I am already working on it and have found the root cause). This change is something to revisit after that, so for now you can look into other issues. Thanks 😊

@Flamki
Author

Flamki commented Feb 26, 2026

@MakisH @MuhammadAashirAslam thanks!
Since #733 was resolved and closed by #813, this PR is no longer addressing the primary issue.

It now serves as a small hardening improvement: adding query-time typo threshold tuning in _includes/algolia.html and js/algolia-search.js (minWordSizefor1Typo=5, minWordSizefor2Typos=9) to reduce fuzzy false positives for short tokens.

I’m happy to keep this as an optional follow-up enhancement, or close it if you prefer to keep things minimal post-#813. Let me know your preference.

@MakisH
Member

MakisH commented Feb 26, 2026

Will look into this after we manage to update the Algolia record again (see #388).

Let's see if this is still needed then (happy to merge it if yes).

@MakisH
Member

MakisH commented Feb 26, 2026

Regarding the original issue, this has now been solved. I would prefer to leave the settings as-is for now, unless there is a separate good reason to change them.

Thanks a lot for contributing anyway!

@MakisH closed this Feb 26, 2026
@Flamki
Author

Flamki commented Feb 26, 2026

Thank you for the clarification and for taking the time to review this thoroughly.

That makes sense — since the original issue has now been resolved and there isn’t a strong need to adjust the typo thresholds at this point, I’m happy to leave the current settings unchanged and close this PR.

I appreciate the feedback and the discussion. It was helpful to dig into how Algolia’s per-word typo thresholds behave in practice.

Thanks again, and I’ll look into other issues where I can contribute.

