For the LLM-as-a-judge evaluation setting, this library systematically addresses two long-standing consistency issues—Score–Comparison inconsistency (lower-rated responses winning in pairwise ...
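
To make the first issue concrete, here is a minimal sketch of how a Score–Comparison inconsistency could be detected, assuming the judge has already produced an absolute score per response and a winner per pair. It is not this library's API: the function name `find_score_comparison_conflicts`, the data layout, and the sample values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def find_score_comparison_conflicts(scores, pairwise_winner):
    """Return (winner, higher_scored) pairs where the pairwise winner
    received the strictly lower absolute score.

    scores: dict mapping response id -> numeric score from the judge.
    pairwise_winner: dict mapping frozenset({a, b}) -> id of the pairwise winner.
    Both structures are illustrative assumptions, not a real library format.
    """
    conflicts = []
    for a, b in combinations(sorted(scores), 2):
        winner = pairwise_winner.get(frozenset((a, b)))
        if winner is None or scores[a] == scores[b]:
            # No comparison recorded, or tied scores: nothing to contradict.
            continue
        higher = a if scores[a] > scores[b] else b
        if winner != higher:
            conflicts.append((winner, higher))
    return conflicts


# Hypothetical judge outputs for three responses.
scores = {"r1": 8.0, "r2": 6.5, "r3": 6.5}
pairwise_winner = {
    frozenset(("r1", "r2")): "r2",  # lower-scored r2 beats r1: a conflict
    frozenset(("r1", "r3")): "r1",  # consistent with the scores
    frozenset(("r2", "r3")): "r3",  # tied scores, so this pair is skipped
}

conflicts = find_score_comparison_conflicts(scores, pairwise_winner)
print(conflicts)
```

In this sketch, tied scores are treated as compatible with either pairwise outcome. A stricter check could flag them as well, depending on how ties should be interpreted.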