
TL;DR: We release Decentralized Arena that automates and scales “Chatbot Arena” for LLM evaluation across various fine-grained dimensions (e.g., math – algebra, geometry, probability; logical reasoning, social reasoning, biology, chemistry, …). The evaluation is decentralized and democratic, with all LLMs participating in evaluating others. It achieves a 95% correlation with Chatbot Arena's overall rankings, while being fully transparent and reproducible.
The community has built, and continues to build, thousands of large language models (LLMs) with increasingly strong reasoning and generation capabilities. As these models are deployed ever more widely, benchmarking their performance across diverse application scenarios has become a major challenge. The most popular benchmark to date, Chatbot Arena, ranks LLMs by collecting users’ preferences over the models’ outputs. However, for both targeted industrial applications and in-depth scientific understanding, it is crucial to evaluate LLMs’ capabilities along fine-grained dimensions, such as math (algebra, geometry, probability), logical reasoning, social reasoning, and specific science and coding domains.
Such large-scale fine-grained (and even customized) evaluation is challenging for Chatbot Arena or similar benchmarks that rely on human crowd-sourcing—it’s simply impractical to gather enough user votes for 1000+ models (or millions of model pairs) across 1000+ dimensions! In addition, as the human querying and voting process is noisy and uncontrollable, the evaluation results are largely irreproducible.
Researchers have also studied automatic evaluation schemes, typically by selecting one (or a few) “strongest” model (usually GPT-4) as the judge to evaluate all other models. However, the judge model can be biased, e.g., by favoring outputs that resemble its own style. Optimizing models against such evaluations could lead to every model overfitting to GPT-4’s biases.
Can we combine the best of both schemes to achieve more robust and less biased evaluations, leveraging the “wisdom of the crowds” (as Chatbot Arena does with human crowds) while also making the process automatic and scalable for comparing models across numerous dimensions? To this end, we release Decentralized Arena. Figure 1 illustrates the main differences between these benchmarking paradigms. The core idea behind Decentralized Arena is to use the collective intelligence of all LLMs to evaluate and compare themselves. This forms a decentralized, democratic system in which every LLM under evaluation also serves as a judge of the others (with adaptive weights), leading to fairer rankings than relying on a single centralized “authority” model as the judge.
Before diving into more technical details, we summarize the key advantages of Decentralized Arena:

- Robust and less biased: all LLMs act as judges (with adaptive weights), so no single “authority” model dominates the evaluation, and the rankings become more stable as more models participate.
- Automatic and scalable: no human voting is required, making it practical to rank many models across numerous fine-grained (and even customized) dimensions.
- Transparent and reproducible: unlike noisy, uncontrollable human crowd-sourcing, the queries, comparison procedure, and results are fully open.
- Aligned with human judgments: a 95% correlation with Chatbot Arena’s overall rankings.
Figure 3 shows a screenshot of the resulting leaderboard. We’re continuing to add more models and dimensions, and welcome contributions and submissions from the community!
Figure 3: Decentralized Arena leaderboard, including rankings for different dimensions.
The decentralization concept is to use all LLMs as judges that vote on model pairs (i.e., deciding which model’s output “wins”, just as human judges do on Chatbot Arena). A naive method in which every model votes on all other model pairs has a complexity on the order of O(N³K), where N is the number of models and K is the number of queries: roughly N²/2 model pairs, each judged by the remaining models on K queries. This becomes prohibitively slow when N and K are large. We instead design a significantly more efficient method based on incremental ranking with binary-search insertion and coarse-to-fine adjustment.
We start with a small set of “seed” models (e.g., 15 models) which are quickly ranked with the above naive method. Other models are then incrementally inserted into the rank list, one by one, through both coarse- and fine-grained steps. All models in the ranking list act as judges to help the new model find its position. The video illustrates the process.
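To make the insertion step concrete, below is a minimal sketch of the binary-search stage, assuming a hypothetical `judge_vote(judge, model_a, model_b, queries)` helper that returns the fraction of queries on which `model_a`’s outputs are preferred; the actual implementation may differ.

```python
def insert_model(new_model, ranked, weights, queries, judge_vote):
    """Insert `new_model` into the ranked list (strongest first) via binary search.

    ranked  : current ranking, strongest model first.
    weights : per-model judge weights (e.g., Bradley-Terry scores).
    judge_vote(judge, a, b, queries) -> fraction of queries where `a` wins
    according to `judge` (hypothetical interface).
    """
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi) // 2
        opponent = ranked[mid]
        judges = [m for m in ranked if m != opponent]
        total = sum(weights[j] for j in judges)
        # Weighted aggregate of all judges' verdicts on (new_model vs. opponent).
        win_rate = sum(weights[j] * judge_vote(j, new_model, opponent, queries)
                       for j in judges) / total
        if win_rate > 0.5:   # new model beats this opponent on aggregate
            hi = mid         # it belongs in the stronger (upper) half
        else:
            lo = mid + 1     # it belongs in the weaker (lower) half
    ranked.insert(lo, new_model)
    return ranked
```

This sketch covers only the coarse binary-search stage; Decentralized Arena additionally applies the fine-grained adjustment mentioned above to refine the new model’s position among its neighbors.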
During the above ranking process, we collect pairwise comparison results between models. We then use the Bradley-Terry (BT) approach to estimate a score for each model in the ranking. These scores are used as weights when the models act as judges: models with higher scores have greater influence when evaluating other model pairs. (We also tried other simple weighting methods, such as weights that decay linearly with a model’s rank, which will be discussed further in the upcoming tech report.) The scores are adjusted automatically throughout the ranking process, with the final scores determined upon completion of the ranking.
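As one illustration, BT strengths can be estimated from a pairwise win-count matrix with standard minorization-maximization updates; the sketch below shows the general idea (not necessarily the exact fitting procedure used here).

```python
import numpy as np

def bradley_terry(wins, n_iter=100, tol=1e-8):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1.
    """
    n = wins.shape[0]
    p = np.ones(n)
    matches = wins + wins.T                      # total comparisons per pair
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            denom = sum(matches[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p_new[i] = wins[i].sum() / max(denom, 1e-12)
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return p
```

The resulting strengths can then serve as the judge weights (e.g., the `weights` argument in the insertion sketch above).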
A key advantage of the decentralized evaluation system is that rankings become more stable and robust as more models participate, as shown in Figure 4.
We apply the above automated evaluation approach to a number of evaluation dimensions to get fine-grained rankings of popular LLMs (see the leaderboard page).
Our approach achieves high correlation with Chatbot Arena, which relies on extensive human voting (95% in the “Overall” dimension). Figures 2 and 5 visualize the correlations, showing that our approach outperforms other popular benchmarks and how the rankings for different dimensions relate to each other.
Figure 5: Correlations between rankings for different dimensions.
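As an aside on how such agreement can be quantified: a rank correlation between two leaderboards takes only a few lines with SciPy. The snippet below uses Spearman correlation on hypothetical rank lists; the exact metric behind the reported numbers may differ.

```python
from scipy.stats import spearmanr

# Hypothetical ranks of the same six models on two leaderboards.
de_arena_ranks      = [1, 2, 3, 4, 5, 6]
chatbot_arena_ranks = [1, 3, 2, 4, 6, 5]

rho, _ = spearmanr(de_arena_ranks, chatbot_arena_ranks)
print(f"Spearman correlation: {rho:.2f}")   # about 0.89 for these toy ranks
```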
A key advantage of the automated Decentralized Arena is its scalability in adding arbitrary new evaluation dimensions for benchmarking LLMs. Users can easily create rankings for any new dimension they care about. As a demonstration, we’ve created rankings for a range of dimensions in math, reasoning, science, and coding (see the leaderboard).
To build rankings for a new dimension, we first prepare a set of queries for that dimension; the LLMs are then compared on this set. For the dimensions above (e.g., math-algebra), we start with a large initial set of queries extracted and merged from various relevant open-source datasets, and then sub-sample a smaller core set of queries for efficient ranking. A naive way to do this is to randomly sample queries from the initial set: the more queries sampled, the more stable the final rankings.
To derive stable rankings with fewer queries (and thus more efficient ranking), we also design a new way of automatically selecting queries as illustrated in Figure 6. The intuition is to select queries that lead to consistent rankings (on a small set of LLMs). We’ll introduce more details in an upcoming technical report.
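The exact selection criterion will appear in the technical report; below is one plausible instantiation of the stated intuition, assuming we already have, for each candidate query, a ranking of a small probe set of LLMs.

```python
import numpy as np
from scipy.stats import spearmanr

def select_consistent_queries(per_query_ranks, k):
    """Pick the k queries whose induced rankings of a small probe set of
    LLMs agree best with the consensus ranking across all candidate queries.

    per_query_ranks: (num_queries, num_models) array; row q gives the rank
    position of each probe model when judged on query q alone (hypothetical input).
    """
    per_query_ranks = np.asarray(per_query_ranks, dtype=float)
    # Consensus: order models by their mean rank across all candidate queries.
    consensus = per_query_ranks.mean(axis=0).argsort().argsort()
    agreement = []
    for row in per_query_ranks:
        rho, _ = spearmanr(row, consensus)
        agreement.append(rho)
    # Keep the k queries whose single-query rankings agree most with the consensus.
    return np.argsort(agreement)[::-1][:k]
```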
Figure 7 shows that our query selection method leads to better and more consistent rankings than random query sampling.
We conducted further analyses to better understand the Decentralized Arena results.
Figure 8 shows the scores and confidence intervals of LLMs in the ranking.
Figure 8: Scores and confidence intervals of LLMs
Figures 9 and 10 visualize the distributions of win rates and comparison counts for LLM pairs in our ranking process (the “Overall” dimension).
As shown in Figures 9 and 10, the collective LLM intelligence automatically focuses primarily on the hard-to-distinguish neighboring LLM pairs (those close to the diagonal in Figure 10, or, equivalently, those with near 50% win rates in Figure 9). In contrast, comparisons between LLMs with large performance gaps are sparse (or even omitted), reducing the overall computation cost.
Zhiting Hu (core advising)
Contact: Zhen Wang, Kun Zhou, and Zhiting Hu
For attribution in academic contexts, please cite this work as
@misc{decentralized2024,
  title  = {Decentralized Arena via Collective LLM Intelligence: Building Automated, Robust, and Transparent LLM Evaluation for Numerous Dimensions},
  author = {Yanbin Yin and Zhen Wang and Kun Zhou and Xiangdong Zhang and Shibo Hao and Yi Gu and Jieyuan Liu and Somanshu Singla and Tianyang Liu and Eric P. Xing and Zhengzhong Liu and Haojian Jin and Zhiting Hu},
  year   = {2024},
  month  = {10},
  url    = {https://de-arena.maitrix.org/}
}