***TL;DR:** We release **Decentralized Arena**, which automates and scales “Chatbot Arena” for LLM evaluation across various fine-grained dimensions (e.g., math – algebra, geometry, probability; logical reasoning, social reasoning, biology, chemistry, …). The evaluation is decentralized and democratic, with all LLMs participating in evaluating others. It achieves a 95% correlation with Chatbot Arena's overall rankings, while being fully transparent and reproducible.*
Introducing Decentralized Arena
The community has built, and continues to build, thousands of large language models (LLMs) with increasingly strong reasoning and generation capabilities. As these models are deployed more and more widely, benchmarking their performance across diverse application scenarios has become a major challenge. The most popular benchmark to date, [Chatbot Arena](https://lmarena.ai/), ranks LLMs by collecting users’ preferences over the models’ outputs. However, for both targeted industrial applications and in-depth scientific understanding, it is crucial to evaluate LLMs’ capabilities along fine-grained dimensions, such as:
+ Math and, further, specialized branches like algebra, geometry, probability, and calculus.
+ Different types of reasoning, like symbolic, analogical, counterfactual, and social reasoning.
+ Coding in different programming languages, like Python, C++, JavaScript, and SQL.
+ Various science domains, like physics, biology, and chemistry.
+ Or any other nuanced problems relevant to your own use cases.
Such large-scale fine-grained (and even customized) evaluation is challenging for Chatbot Arena or similar benchmarks that rely on human crowd-sourcing—**it’s simply impractical to gather enough user votes for 1000+ models (or millions of model pairs) across 1000+ dimensions!** In addition, as the human querying and voting process is noisy and uncontrollable, the evaluation results are largely **irreproducible**.
People have also studied automatic evaluation schemes, typically by selecting one (or a few) “strongest” model (usually GPT-4) as a judge to evaluate all other models. However, the **judge model can be biased**, e.g., by favoring outputs that resemble its own style. Optimizing models against such evaluations could lead all models to overfit to GPT-4's biases.
Can we combine the best of both schemes, achieving more robust and less biased evaluations by leveraging the “wisdom of the crowds” (Chatbot Arena relies on human crowds), while also making the process automatic and scalable for comparing models across numerous dimensions? To this end, we release **Decentralized Arena**. [Figure 1](#figure-1) illustrates the main differences between these benchmarking paradigms. The core idea behind Decentralized Arena is to use the **collective intelligence of *all* LLMs** to evaluate and compare themselves. This forms a decentralized, democratic system in which every LLM being evaluated also serves as a judge of the others (with adaptive weights), leading to fairer rankings than relying on a centralized “authority” model as the judge.
Figure 2: Decentralized Arena shows the strongest correlation with Chatbot Arena (Overall)
Key Advantages of Decentralized Arena
Before diving into more technical details, we summarize the **key advantages** of Decentralized Arena below:
+ **Robust, unbiased:** Decentralization avoids the bias introduced by a single judge model or a small committee of judges, and is harder to game by overfitting to particular judge models. The more LLMs that participate in the arena, the more robust the evaluation becomes ([Figure 4](#figure-4))! Moreover, Decentralized Arena achieves a very high correlation (95%) with Chatbot Arena in the “Overall” dimension on 50+ models ([Figure 2](#figure-2)).
+ **Automatic, easily scalable, and customizable to any evaluation dimension:** While Chatbot Arena is limited to evaluating a few dimensions due to the limited number of meaningful user votes it can gather, Decentralized Arena is fully automatic and can scale to arbitrarily many dimensions. We also provide guidelines for automatically selecting dimension-specific queries for customized evaluation in a [later section](#build).
+ **Fast, instant ranking of new models:** Similarly, thanks to the automation and the efficient *binary-search* ranking algorithm described in [Method](#method), we can obtain evaluation results for a new model almost instantly, without waiting weeks to gather user votes.
+ **Transparent, fully reproducible:** All algorithms, implementations, and inputs/outputs will be made open, making the results fully reproducible.
+ **Trustworthy:** Ultimately, with its robustness, strong alignment with existing human evaluation results, fine-grained dimensional analysis, and transparency, Decentralized Arena aims to provide a benchmark the community can trust.
Public Leaderboard
[Figure 3](#figure-3) shows a screenshot of the resulting [leaderboard](https://huggingface.co/spaces/LLM360/de-arena). We’re continuing to add more models and dimensions, and welcome contributions and submissions from the community!
Figure 3: Decentralized Arena **leaderboard**, including rankings for different dimensions.
Sorting LLMs Quickly
Benchmarking LLMs via Collective LLM Intelligence
The decentralization concept is to use *all* LLMs as judges that vote on model pairs (i.e., deciding which model's output “wins”, as human judges do on Chatbot Arena). A naive method in which every model votes on all pairs of the other models has a complexity of $O(n^3k)$, where $n$ is the number of models and $k$ is the number of queries. This becomes prohibitively slow when $n$ and $k$ are large. We design a significantly more efficient method based on incremental ranking with *binary-search* insertion and *coarse-to-fine* adjustment.
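For intuition, here is a minimal sketch of the naive scheme in Python. The `judge_vote(judge, model_a, model_b, query)` helper is a hypothetical stand-in for prompting a judge LLM to pick the better of two candidate answers; the three nested loops (judges × model pairs × queries) make the $O(n^3k)$ cost explicit.

```python
from itertools import combinations
from collections import defaultdict

def naive_all_pairs_voting(models, queries, judge_vote):
    """Every model judges every other model pair on every query: O(n^3 * k).

    `judge_vote(judge, model_a, model_b, query)` is assumed to return
    "a", "b", or "tie" for the two candidates' answers to `query`.
    """
    wins = defaultdict(int)  # (winner, loser) -> number of won comparisons
    for model_a, model_b in combinations(models, 2):          # O(n^2) pairs
        for judge in models:                                   # O(n) judges
            if judge in (model_a, model_b):
                continue  # a model never judges its own comparisons
            for query in queries:                              # O(k) queries
                verdict = judge_vote(judge, model_a, model_b, query)
                if verdict == "a":
                    wins[(model_a, model_b)] += 1
                elif verdict == "b":
                    wins[(model_b, model_a)] += 1
    return wins
```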
We start with a small set of “seed” models (e.g., 15 models), which are quickly ranked with the naive method above. The remaining models are then incrementally inserted into the ranked list, one by one, through both coarse- and fine-grained steps. All models already in the ranking act as judges to help the new model find its position. The [video](#video) illustrates the process.
+ **Step-1: Coarse-grained Ranking with Binary Search Insertion.** This step finds the rough position of a new model within the current ranking. The idea is to use binary search to quickly narrow down the position. When comparing the new model with an existing one, the other models in the ranking serve as judges. The time complexity of this binary search is $O(kn\log n)$.
+ **Step-2: Fine-grained In-window Ranking and Sliding.** To further refine the new model's position, we compare it with neighboring models within a window (e.g., the two models before and after it in the ranking). The rationale is that these nearby LLMs are often the hardest to distinguish, warranting closer comparison. All models outside this window serve as judges. If the in-window comparison changes the new LLM's position, the process is repeated within the updated window until the ranking stabilizes. This functions like a sliding window, guiding the LLM crowd to focus on the most ambiguous comparison pairs, ensuring accurate ranking while minimizing computational cost. Both steps are sketched in code right after this list.
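Below is a minimal sketch of both steps under simplifying assumptions: `pairwise_win_rate(a, b, judges, queries)` is a hypothetical helper that aggregates the (possibly weighted) votes of the judge models into model `a`'s win rate against model `b`, and the in-window re-ranking is approximated by a simple sort on within-window win rates rather than the exact procedure.

```python
def insert_model(ranking, new_model, queries, pairwise_win_rate, window=2):
    """Insert `new_model` into an existing `ranking` (best model first).

    Step 1: binary search for a coarse position; the models already in the
    ranking (minus the one being compared) act as judges.
    Step 2: re-rank within a sliding window of nearby models until stable.
    """
    # ---- Step 1: coarse position via binary search: O(k * n * log n) ----
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        judges = [m for m in ranking if m != ranking[mid]]
        if pairwise_win_rate(new_model, ranking[mid], judges, queries) > 0.5:
            hi = mid       # new model beats ranking[mid]: search upper half
        else:
            lo = mid + 1   # otherwise: search lower half
    ranking.insert(lo, new_model)

    # ---- Step 2: fine-grained in-window ranking with sliding ----
    pos = lo
    for _ in range(len(ranking)):          # cap the number of window slides
        start = max(0, pos - window)
        end = min(len(ranking), pos + window + 1)
        neighbors = ranking[start:end]
        judges = [m for m in ranking if m not in neighbors]
        # Re-sort the window by each member's total win rate against the
        # other window members (a simple proxy for full in-window ranking).
        neighbors.sort(
            key=lambda m: sum(
                pairwise_win_rate(m, other, judges, queries)
                for other in neighbors if other is not m
            ),
            reverse=True,
        )
        ranking[start:end] = neighbors
        new_pos = ranking.index(new_model)
        if new_pos == pos:                 # position stabilized: stop sliding
            break
        pos = new_pos
    return ranking
```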
During the above ranking process, we collect pairwise comparison results between models. We then use the [Bradley-Terry (BT) approach](https://en.wikipedia.org/wiki/Bradley–Terry_model) to estimate a score for each model in the ranking. These scores are used as weights when the models act as judges: models with higher scores have greater influence when evaluating other model pairs. (We have also experimented with other simple weighting methods, such as weights that decay linearly with a model's rank, which will be discussed further in the upcoming tech report.) The scores are automatically adjusted throughout the ranking process, with the final scores determined upon completion of the ranking.
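As a concrete illustration, the snippet below fits Bradley-Terry strengths from a pairwise win-count matrix using the classic minorization-maximization (Zermelo) iteration; the normalized strengths could then serve as judge weights. This is a generic BT estimator, not necessarily the exact weighting implementation used in Decentralized Arena.

```python
import numpy as np

def bradley_terry_scores(win_counts, num_iters=200, tol=1e-8):
    """Fit Bradley-Terry strengths from a win-count matrix.

    win_counts[i, j] = number of comparisons model i won against model j
    (diagonal assumed zero). Returns strengths that sum to 1.
    """
    n = win_counts.shape[0]
    wins = win_counts.sum(axis=1)              # total wins per model
    totals = win_counts + win_counts.T         # comparisons played per pair
    p = np.ones(n) / n                         # uniform initialization
    for _ in range(num_iters):
        # Standard MM update: p_i <- W_i / sum_j N_ij / (p_i + p_j)
        denom = (totals / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = wins / denom
        new_p /= new_p.sum()
        if np.abs(new_p - p).max() < tol:
            break
        p = new_p
    return p

# Judge weights could simply be the normalized BT strengths:
# weights = bradley_terry_scores(win_counts)
```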
A key advantage of the decentralized evaluation system is that rankings become more stable and robust as more models participate, as shown in [Figure 4](#figure-4).
Figure 4: The variance (shaded area) in the rankings decreases as the number of models increases, indicating progressively more robust rankings.
We apply the above automated evaluation approach to a number of evaluation dimensions to get fine-grained rankings of popular LLMs (see the [leaderboard](https://huggingface.co/spaces/LLM360/de-arena) page).
Our approach achieves a high correlation with Chatbot Arena, which relies on extensive human judging (95% in the “Overall” dimension). Figures [2](#figure-2) and [5](#figure-5) visualize the correlations, showing that our approach outperforms other popular benchmarks and how the rankings for different dimensions relate to each other.
Figure 5: **Correlations** between rankings for different dimensions.
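For reference, a rank correlation of this kind can be computed with a standard coefficient such as Spearman's, as in the snippet below; the scores shown are purely illustrative placeholders, and the exact correlation metric used for the figures may differ.

```python
from scipy.stats import spearmanr

# Hypothetical example: scores for the same five models under two benchmarks,
# aligned by model name (these numbers are made up for illustration).
de_arena_scores = [1250, 1180, 1135, 1090, 1042]
chatbot_arena_scores = [1260, 1170, 1140, 1085, 1050]

rho, p_value = spearmanr(de_arena_scores, chatbot_arena_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3g})")
```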
Building Your Own Dimension
Selecting High-value Queries
A key advantage of the automatic Decentralized Arena is its scalability in adding arbitrary new evaluation dimensions for benchmarking LLMs. Users can easily create rankings for any new dimension they care about. For demonstration, we’ve created rankings for various dimensions in math, reasoning, science, and coding ([leaderboard](https://huggingface.co/spaces/LLM360/de-arena)).
To build rankings for a new dimension, we need to prepare a set of queries for that dimension; the LLMs are then compared on this set. For the dimensions above (e.g., math-algebra), we start with a large initial set of queries extracted and merged from various relevant open-source datasets, and then sub-sample a smaller core set of queries for efficient ranking. A naive way to do so is to randomly sample queries from the initial set: the more queries sampled, the more stable the final rankings become.
To derive stable rankings with fewer queries (and thus rank more efficiently), we also design a new method for automatically selecting queries, as illustrated in [Figure 6](#figure-6). The intuition is to select queries that lead to consistent rankings (on a small set of LLMs). We’ll introduce more details in an upcoming technical report.
Figure 6: Automatic query selection for a new dimension
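Since the full selection criterion is deferred to the tech report, the following is only a rough sketch of the stated intuition: score each candidate query by how well the ranking it induces on a small probe set of LLMs agrees with a reference (aggregate) ranking, and keep the most consistent queries. `rank_models_on_query` and the Kendall-tau agreement measure are illustrative assumptions, not the released method.

```python
from scipy.stats import kendalltau

def select_core_queries(candidate_queries, probe_models, rank_models_on_query,
                        reference_ranking, budget=500):
    """Pick queries whose induced rankings agree most with a reference ranking.

    `rank_models_on_query(query, probe_models)` is a hypothetical helper that
    orders the probe models by their performance on that single query;
    `reference_ranking` is, e.g., the aggregate ranking of the probe models.
    """
    ref_pos = {m: i for i, m in enumerate(reference_ranking)}
    scored = []
    for query in candidate_queries:
        induced = rank_models_on_query(query, probe_models)
        induced_pos = [ref_pos[m] for m in induced]
        # Kendall's tau between the induced order and the reference order
        # (tau = 1 means the query ranks the probe models identically).
        tau, _ = kendalltau(induced_pos, range(len(induced_pos)))
        scored.append((tau, query))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [query for _, query in scored[:budget]]
```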
[Figure 7](#figure-7) shows our query selection method leads to better and more consistent rankings than random query sampling.
Figure 7: Using queries selected by our method achieves higher correlation and lower variance than using randomly-sampled queries.
More Statistical Results for Decentralized Arena
We conducted additional analyses to better understand the Decentralized Arena results.
**[Figure 8](#figure-8) shows the scores and confidence intervals of LLMs in the ranking.**
Figure 8: Scores and **confidence** intervals of LLMs
**Figures [9](#figure-9) and [10](#figure-10) visualize the distributions of win rates and comparison counts for LLM pairs in our ranking process (“Overall” dimension).**
As shown in Figures [9](#figure-9) and [10](#figure-10), the collective LLM intelligence automatically focuses primarily on the hard-to-distinguish neighboring LLM pairs (those close to the diagonal in [Figure 10](#figure-10), or, equivalently, those with near 50% win rates in [Figure 9](#figure-9)). In contrast, comparisons between LLMs with large performance gaps are sparse (or even omitted), reducing the overall computation cost.
Figure 9: Win-rate distribution map
Figure 10: Comparison-count distribution map
Contributors
Yanbin Yin, [Zhen Wang](https://zhenwang9102.github.io/), [Kun Zhou](https://lancelot39.github.io/), Xiangdong Zhang (core contribution)
[Shibo Hao](https://ber666.github.io/), [Yi Gu](https://www.yigu.page/), [Jieyuan Liu](https://www.linkedin.com/in/jieyuan-liu/), [Somanshu Singla](https://www.linkedin.com/in/somanshu-singla-105636214/), [Tianyang Liu](https://leolty.github.io/),
[Eric P. Xing](https://www.cs.cmu.edu/~epxing/), [Zhengzhong Liu](https://hunterhector.github.io/), [Haojian Jin](https://www.haojianj.in/),
[Zhiting Hu](https://zhiting.ucsd.edu/) (core advising)
**Contact us**: [Zhen Wang](mailto:zhenwang9102@gmail.com), [Kun Zhou](mailto:franciskunzhou@gmail.com), and [Zhiting Hu](mailto:zhitinghu@gmail.com)