Add position bias detection and Cohen's kappa to llm_judge #3810

Open
joaquinhuigomez wants to merge 1 commit into lm-sys:main from joaquinhuigomez:feature/position-bias-cohens-kappa

joaquinhuigomez commented Mar 16, 2026

Summary

  • Adds compute_position_bias(), compute_cohens_kappa(), and interpret_kappa() to compute_agreement.py, working with the standard _pair.jsonl format
  • Extends display_result_pairwise() in show_result.py with a one-line consistency summary (position bias rate + kappa) printed after win rates
  • Adds display_consistency_metrics() function and --show-consistency CLI flag for detailed breakdown (agreement rate, bias direction, kappa with Landis & Koch interpretation)
  • Includes 13 unit tests using synthetic judgment data (no API calls)
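For readers unfamiliar with the metric, here is a minimal sketch of Cohen's kappa with the Landis & Koch bands. It is an illustration only, not the code in this PR: the names `cohens_kappa_sketch` and `landis_koch_label` are hypothetical, and treating the two label sequences as the verdicts from the two presentation orders is an assumption.

```python
from collections import Counter


def cohens_kappa_sketch(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Hypothetical helper for illustration; not the PR's compute_cohens_kappa().
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n == 0:
        raise ValueError("need at least one pair of labels")

    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # formally 0/0, so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Map kappa to the Landis & Koch (1977) descriptive bands."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Synthetic verdicts, for illustration only.
order_1 = ["model_1", "model_1", "model_2", "tie"]
order_2 = ["model_1", "model_2", "model_2", "tie"]
kappa = cohens_kappa_sketch(order_1, order_2)
print(f"kappa = {kappa:.3f} ({landis_koch_label(kappa)})")
```

Kappa corrects raw agreement for the agreement expected by chance, so a judge that almost always returns the same label can show high raw agreement but a low kappa.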

Test plan

  • All 13 new tests pass (python3 -m pytest tests/test_consistency_metrics.py -v)
  • Existing imports verified intact
  • End-to-end tested with synthetic JSONL data
  • Manual verification with real MT-Bench judgment output
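A synthetic-data check of the kind described above might look like the following sketch. It makes several assumptions: each record carries `g1_winner` and `g2_winner` fields holding `model_1`, `model_2`, or `tie`; the second game swaps the answer order; and winners are already mapped back to model names. The helper `position_bias_sketch` is hypothetical and is not the PR's `compute_position_bias()`.

```python
import json

# Synthetic judgment lines; the field names are assumptions about the
# pairwise JSONL format, not taken from this PR.
SYNTHETIC_JSONL = """\
{"question_id": 1, "g1_winner": "model_1", "g2_winner": "model_1"}
{"question_id": 2, "g1_winner": "model_1", "g2_winner": "model_2"}
{"question_id": 3, "g1_winner": "model_2", "g2_winner": "model_1"}
{"question_id": 4, "g1_winner": "tie", "g2_winner": "model_1"}
"""


def position_bias_sketch(records):
    """Count verdicts that flip when the answer order is swapped.

    Assumes model_1 is shown first in game 1 and model_2 first in game 2.
    """
    counts = {"consistent": 0, "favors_first": 0, "favors_second": 0, "other": 0}
    for rec in records:
        g1, g2 = rec["g1_winner"], rec["g2_winner"]
        if g1 == g2:
            counts["consistent"] += 1
        elif g1 == "model_1" and g2 == "model_2":
            # The first-shown answer won in both orders.
            counts["favors_first"] += 1
        elif g1 == "model_2" and g2 == "model_1":
            # The second-shown answer won in both orders.
            counts["favors_second"] += 1
        else:
            # A tie in one order and a win in the other.
            counts["other"] += 1
    total = len(records)
    inconsistent = total - counts["consistent"]
    counts["bias_rate"] = inconsistent / total if total else 0.0
    return counts


records = [json.loads(line) for line in SYNTHETIC_JSONL.splitlines() if line.strip()]
result = position_bias_sketch(records)
print(result)
```

Because the input is an in-memory string, a test built this way needs no API calls and no judgment files on disk.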

Extends compute_agreement.py with compute_position_bias(),
compute_cohens_kappa(), and interpret_kappa() functions that work
with the standard _pair.jsonl output format.

Modifies show_result.py to print a one-line consistency summary
at the bottom of pairwise results and adds --show-consistency flag
for detailed metrics (bias rate, direction, kappa, Landis & Koch).

Includes 13 unit tests with synthetic data in
tests/test_consistency_metrics.py.
joaquinhuigomez force-pushed the feature/position-bias-cohens-kappa branch from 43abf58 to 02dbd00 on March 27, 2026 23:58


2 participants