An Elo-like System for Massive Multiplayer Competitions
ABSTRACT
Skill estimation mechanisms, colloquially known as rating systems, play an important role in competitive sports and games. They provide a measure of player skill, which incentivizes competitive performances and enables balanced match-ups. In this paper, we present a novel Bayesian rating system for contests with many participants. It is widely applicable to competition formats with discrete ranked matches, such as online programming competitions, obstacle course races, and video games. The system's simplicity allows us to prove theoretical bounds on its robustness and runtime. In addition, we show that it is incentive-compatible: a player who seeks to maximize their rating will never want to underperform. Experimentally, the rating system surpasses existing systems in prediction accuracy, and computes faster than existing systems by up to an order of magnitude.

CCS CONCEPTS
• Information systems → Learning to rank; • Computing methodologies → Learning in probabilistic graphical models.

KEYWORDS
rating system, skill estimation, mechanism design, competition, bayesian inference, robust, incentive-compatible, elo, glicko, trueskill

ACM Reference Format:
Aram Ebtekar and Paul Liu. 2021. An Elo-like System for Massive Multiplayer Competitions. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3442381.3450091

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. WWW '21, April 19–23, 2021, Ljubljana, Slovenia. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04. https://doi.org/10.1145/3442381.3450091

1 INTRODUCTION
Competitions, in the form of sports, games, and examinations, have been with us since antiquity. Many competitions grade performances along a numerical scale, such as a score on a test or a completion time in a race. In the case of a college admissions exam or a track race, scores are standardized so that a given score on two different occasions carries the same meaning. However, in events that feature novelty, subjectivity, or close interaction, standardization is difficult. The Spartan Races, completed by millions of runners, feature a variety of obstacles placed on hiking trails around the world [11]. Rock climbing, a sport to be added to the 2020 Olympics, likewise has routes set specifically for each competition. DanceSport, gymnastics, and figure skating competitions have a panel of judges who rank contestants against one another; these subjective scores are known to be noisy [32]. In all these cases, scores can only be used to compare and rank participants at the same event. Players, spectators, and contest organizers who are interested in comparing players' skill levels across different competitions will need to aggregate the entire history of such rankings. A strong player, then, is one who consistently wins against weaker players. To quantify skill, we need a rating system.

Good rating systems are difficult to create, as they must balance several mutually constraining objectives. First and foremost, rating systems must be accurate, in that ratings provide useful predictors of contest outcomes. Second, the ratings must be efficient to compute: within video game applications, rating systems are predominantly used for matchmaking in massively multiplayer online games (such as Halo, CounterStrike, League of Legends, etc.) [25, 29, 36]. These games have hundreds of millions of players playing tens of millions of games per day, necessitating certain latency and memory requirements for the rating system [12]. Third, rating systems must be incentive-compatible: a player's rating should never increase had they scored worse, and never decrease had they scored better. This is to prevent players from regretting a win, or from throwing matches to game the system. Rating systems that can be gamed often create disastrous consequences for the player-base, potentially leading to the loss of players [3]. Finally, the ratings provided by the system must be human-interpretable: ratings are typically represented to players as a single number encapsulating their overall skill, and many players want to understand and predict how their performances affect their rating [21].

Classically, rating systems were designed for two-player games. The famous Elo system [18], as well as its Bayesian successors Glicko and Glicko-2, have been widely applied to games such as Chess and Go [21–23]. Both Glicko versions model each player's skill as a real random variable that evolves with time according to Brownian motion. Inference is done by entering these variables into the Bradley-Terry model [14], which predicts probabilities of game outcomes. Glicko-2 refines the Glicko system by adding a rating volatility parameter. Unfortunately, Glicko-2 is known to be flawed in practice, potentially incentivizing players to lose in what's known as "volatility farming". In some cases, these attacks can inflate a user's rating several hundred points above its natural value, producing ratings that are essentially impossible to beat via honest play. This was most notably exploited in the popular game of Pokemon Go [3]. See Section 5.1 for a discussion of this issue, as well as an application of this attack to the Topcoder rating system.

The family of Elo-like methods just described only utilizes the binary outcome of a match. In settings where a scoring system provides a more fine-grained measure of match performance, Kovalchik [27] has shown variants of Elo that are able to take advantage of score information.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu
tasks, such as academic olympiads, Forišek [19] developed a model allows us to rigorously analyze its properties: the “MMR” in the
in which each task gives a different “response” to the player: the to- name stands for “Massive”, “Monotonic”, and “Robust”. “Massive”
tal response then predicts match outcomes. However, such systems means that it supports any number of players with a runtime that
are often highly application-dependent and hard to calibrate. scales linearly; “monotonic” is a synonym for incentive-compatible,
Though Elo-like systems are widely used in two-player settings, ensuring that a rating-maximizing player always wants to perform
one needn’t look far to find competitions that involve much more well; “robust” means that rating changes are bounded, with the
than two players. In response to the popularity of team-based games bound being smaller for more consistent players than for volatile
such as CounterStrike and Halo, many recent works focus on com- players. Robustness turns out to be a natural byproduct of accurately
petitions that are between two teams [15, 24, 26, 28]. Another pop- modeling performances with heavy-tailed distributions, such as
ular setting is many-player contests such as academic olympiads: the logistic. TrueSkill is believed to satisfy the first two properties,
notably, programming contest platforms such as Codeforces, Top- albeit without proof, but fails robustness. Codeforces only satisfies
coder, and Kaggle [6, 8, 10]. As with the aforementioned Spartan incentive-compatibility, and Topcoder only satisfies robustness.
races, a typical event attracts thousands of contestants. Program- Experimentally, we show that Elo-MMR achieves state-of-the-art
ming contest platforms have seen exponential growth over the past performance in terms of both prediction accuracy and runtime on
decade, collectively boasting millions of users [5]. As an example, industry datasets. In particular, we process the entire Codeforces
Codeforces gained over 200K new users in 2019 alone [2]. database of over 400K rated users and 1000 contests in well under a
In “free-for-all” settings, where 𝑁 players are ranked individually, minute, beating the existing Codeforces system by more than an or-
the Bayesian Approximation Ranking (BAR) algorithm [34] models der of magnitude while improving upon its accuracy. Furthermore,
the competition as a series of 𝑁2 independent two-player contests. we show that the well-known Topcoder system is severely vulnera-
In reality, of course, the pairwise match outcomes are far from ble to volatility farming, whereas Elo-MMR is immune to such at-
independent. Thus, TrueSkill [25] and its variants [17, 29, 31] model tacks. A difficulty we faced was the scarcity of efficient open-source
a player’s performance during each contest as a single random rating system implementations. In an effort to aid researchers and
variable. The overall rankings are assumed to reveal the total order practitioners alike, we provide open-source implementations of all
among these hidden performance variables, with various methods rating systems, dataset mining, and additional processing used in
used to model ties and teams. For a textbook treatment of these our experiments at https://github.com/EbTech/Elo-MMR.
methods, see [35]. These rating systems are efficient in practice, We note that since releasing our preprint, Elo-MMR has already
successfully rating userbases that number well into the millions (the been put in production in industry settings [9].
Halo series, for example, has over 60 million sales since 2001 [4]).
Organization. In Section 2, we formalize the details of our Bayesian
The main disadvantage of TrueSkill is its complexity: originally
model. We then show how to estimate player skill under this model
developed by Microsoft for the popular Halo video game, TrueSkill
in Section 3, and develop some intuitions of the resulting formulas.
performs approximate belief propagation, which consists of mes-
As a further refinement, Section 4 models skill evolutions from
sage passing on a factor graph, iterated until convergence. Aside
players training or atrophying between competitions. This mod-
from being less human-interpretable, this complexity means that,
eling is quite tricky as we choose to retain players’ momentum
to our knowledge, there are no proofs of key properties such as run-
while preserving incentive-compatibility. While our modeling and
time and incentive-compatibility. Even when these properties are
derivations occupy multiple sections, the system itself is succinctly
discussed [29], no rigorous justification is provided. In addition, we
presented in Algorithms 1 to 3. In Section 5, we perform a volatility
are not aware of any work that extends TrueSkill to non-Gaussian
farming attack on the Topcoder system and prove that, in contrast,
performance models, which might be desirable to limit the influence
Elo-MMR satisfies several salient properties, the most critical of
of outlier performances (see Section 5.2).
which is incentive-compatibility. Finally, in Section 6, we present
It might be for these reasons that popular platforms such as
experimental evaluations, showing improvements over industry
Codeforces and Topcoder opted for their own custom rating sys-
standards in both accuracy and speed.
tems. These systems are not published in academia and do not come
with Bayesian justifications. However, they retain the formulaic
2 A BAYESIAN MODEL FOR MASSIVE
simplicity of Elo and Glicko, extending them to settings with much
more than two players. The Codeforces system includes ad hoc COMPETITIONS
heuristics to distinguish top players, while curbing rampant infla- We now describe the setting formally, denoting random variables
tion. Topcoder’s formulas are more principled from a statistical by capital letters. A series of competitive rounds, indexed by 𝑡 =
perspective; however, it has a volatility parameter similar to Glicko- 1, 2, 3, . . ., take place sequentially in time. Each round has a set of
2, and hence suffers from similar exploits [19]. Despite their flaws, participating players P𝑡 , which may in general overlap between
these systems have been in place for over a decade, and have more rounds. A player’s skill is likely to change with time, so we repre-
recently gained adoption by additional platforms such as CodeChef sent the skill of player 𝑖 at time 𝑡 by a real random variable 𝑆𝑖,𝑡 .
and LeetCode [1, 7]. In round 𝑡, each player 𝑖 ∈ P𝑡 competes at some performance
level 𝑃𝑖,𝑡 , typically close to their current skill 𝑆𝑖,𝑡 . The deviations
Our contributions. In this paper, we describe the Elo-MMR rating {𝑃𝑖,𝑡 −𝑆𝑖,𝑡 }𝑖 ∈ P𝑡 are assumed to be i.i.d. and independent of {𝑆𝑖,𝑡 }𝑖 ∈ P𝑡 .
system, obtained by a principled approximation of a Bayesian model Performances are not observed directly; instead, a ranking gives
similar to Glicko and TrueSkill. It is fast, embarrassingly parallel, the relative order among all performances {𝑃𝑖,𝑡 }𝑖 ∈ P𝑡 . In particular,
and makes accurate predictions. Most interesting of all, its simplicity ties are modelled to occur when performances are exactly equal,
a zero-probability event when their distributions are continuous.¹ This ranking constitutes the observational evidence 𝐸𝑡 for our Bayesian updates. The rating system seeks to estimate the skill 𝑆𝑖,𝑡 of every player at the present time 𝑡, given the historical round rankings 𝐸_{≤𝑡} := {𝐸₁, . . . , 𝐸𝑡}.

We overload the notation Pr for both probabilities and probability densities: the latter interpretation applies to zero-probability events, such as in Pr(𝑆𝑖,𝑡 = 𝑠). We also use colons as wildcards to denote collections of variables differing only in a subscript: for instance, 𝑃_{:,𝑡} := {𝑃𝑖,𝑡}_{𝑖∈P𝑡}. The joint distribution described by our Bayesian model factorizes as follows:

  Pr(𝑆_{:,:}, 𝑃_{:,:}, 𝐸_:) = ∏_𝑖 Pr(𝑆𝑖,₀) ∏_{𝑖,𝑡} Pr(𝑆𝑖,𝑡 | 𝑆𝑖,𝑡−1) ∏_{𝑖,𝑡} Pr(𝑃𝑖,𝑡 | 𝑆𝑖,𝑡) ∏_𝑡 Pr(𝐸𝑡 | 𝑃_{:,𝑡}),   (1)

where Pr(𝑆𝑖,₀) is the initial skill prior,
Pr(𝑆𝑖,𝑡 | 𝑆𝑖,𝑡−1) is the skill evolution model (Section 4),
Pr(𝑃𝑖,𝑡 | 𝑆𝑖,𝑡) is the performance model, and
Pr(𝐸𝑡 | 𝑃_{:,𝑡}) is the evidence model.

For the first three factors, we will specify log-concave distributions (see Definition 3.1). The evidence model, on the other hand, is a deterministic indicator. It equals one when 𝐸𝑡 is consistent with the relative ordering among 𝑃_{:,𝑡}, and zero otherwise.

Finally, our model assumes that the number of participants |P𝑡| is large. The main idea behind our algorithm is that, in sufficiently massive competitions, the evidence 𝐸𝑡 contains enough information to infer very precise estimates for 𝑃_{:,𝑡}. Hence, we can treat these performances as if they were observed directly.

With that in mind, we'll often discuss the distributions of variables whose round subscript is 𝑡, conditioned on either the prior context 𝑃𝑖,<𝑡 or the posterior context 𝑃𝑖,≤𝑡: these are called prior and posterior distributions, respectively. In particular, suppose we have the skill prior:

  𝜋𝑖,𝑡(𝑠) := Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡).   (2)

Now, we observe 𝐸𝑡. By Equation (1), it is conditionally independent of 𝑆𝑖,𝑡, given 𝑃𝑖,≤𝑡. By the law of total probability,

  Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡, 𝐸𝑡) = ∫ Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡, 𝑃𝑖,𝑡 = 𝑝) Pr(𝑃𝑖,𝑡 = 𝑝 | 𝑃𝑖,<𝑡, 𝐸𝑡) d𝑝.

This integral is intractable in general, since the performance posterior Pr(𝑃𝑖,𝑡 = 𝑝 | 𝑃𝑖,<𝑡, 𝐸𝑡) depends not only on player 𝑖, but also on our beliefs regarding the skills of all 𝑗 ∈ P𝑡. However, in the limit of infinite participants, Doob's consistency theorem [20] implies that the posterior concentrates at the true value 𝑃𝑖,𝑡. That is, with probability one, as |P𝑡| → ∞,

  Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡, 𝐸𝑡) → Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,≤𝑡).

Since our posteriors are continuous, the convergence holds for all 𝑠 simultaneously. Moreover, we don't even need the full evidence 𝐸𝑡. Let 𝐸ᴸ𝑖,𝑡 = {𝑗 ∈ P𝑡 : 𝑃𝑗,𝑡 > 𝑃𝑖,𝑡} be the set of players against whom 𝑖 lost, and 𝐸ᵂ𝑖,𝑡 = {𝑗 ∈ P𝑡 : 𝑃𝑗,𝑡 < 𝑃𝑖,𝑡} be the set of players against whom 𝑖 won. That is, we only look at who wins, draws, and loses against 𝑖. 𝑃𝑖,𝑡 remains identifiable using only (𝐸ᴸ𝑖,𝑡, 𝐸ᵂ𝑖,𝑡), which will be more convenient for our purposes.

In practice, we should care about the rate of convergence. Suppose we want our estimate to be within 𝜀 of 𝑃𝑖,𝑡, with probability at least 1 − 𝛿. By asymptotic normality of the posterior [20], it suffices to have 𝑂((1/𝜀²) log(1/𝛿)) participants. Experimentally, we see in Section 6.5 that Elo-MMR is competitive on all sizes of contests.

Bayesian rating systems, such as Glicko and TrueSkill, make several simplifying assumptions to render their posterior updates tractable. Typically these are chosen ad hoc for convenience; however, having passed to a limit in which 𝑃𝑖,≤𝑡 is identified, our framework is able to rigorously justify such simplifications. Firstly, since 𝑃𝑖,≤𝑡 is a sufficient statistic for predicting 𝑆𝑖,𝑡, it may be said that (𝐸ᴸ𝑖,≤𝑡, 𝐸ᵂ𝑖,≤𝑡) are "almost sufficient" for 𝑆𝑖,𝑡: any additional information, such as from domain-specific scoring systems, becomes redundant for the purposes of skill estimation. Secondly, conditioned on 𝑃_{:,≤𝑡}, the posterior skills 𝑆_{:,𝑡} are independent of one another. As a result, there are no inter-player correlations to model, and a player's posterior is unaffected by rounds in which they are not a participant. Finally, if we've truly identified 𝑃𝑖,𝑡, then rounds later than 𝑡 should not prompt revisions in our estimate for 𝑃𝑖,𝑡. This obviates the need for expensive whole-history update procedures [16, 17], for the purposes of present skill estimation.²

Thus, when the initial prior, performance model, and evolution model are all Gaussian, treating 𝑃𝑖,𝑡 as certain is the only simplifying approximation we will make; that is, in the limit |P𝑡| → ∞, our method performs exact inference on Equation (1). In the following sections, we focus some attention on generalizing the performance model to non-Gaussian log-concave families, parametrized by location and scale; here, a few minor approximations keep the derivations tractable. We will use the logistic distribution as a running example and see that it induces robustness; however, our framework is agnostic to the specific distributions used.

The prior rating 𝜇^𝜋_{𝑖,𝑡} and posterior rating 𝜇_{𝑖,𝑡} of player 𝑖 at round 𝑡 should be statistics that summarize the player's prior and posterior skill distribution, respectively. We'll use the mode: thus, 𝜇_{𝑖,𝑡} is the maximum a posteriori (MAP) estimate, obtained by setting 𝑠 to maximize the posterior Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,≤𝑡). By Bayes' rule,

  𝜇^𝜋_{𝑖,𝑡} := arg max_𝑠 𝜋𝑖,𝑡(𝑠),
  𝜇_{𝑖,𝑡} := arg max_𝑠 𝜋𝑖,𝑡(𝑠) Pr(𝑃𝑖,𝑡 | 𝑆𝑖,𝑡 = 𝑠).   (3)

This objective suggests a two-phase algorithm to update each player 𝑖 ∈ P𝑡 in response to the results of round 𝑡. In phase one, we estimate 𝑃𝑖,𝑡 from (𝐸ᴸ𝑖,𝑡, 𝐸ᵂ𝑖,𝑡). By Doob's consistency theorem, our estimate is extremely precise when |P𝑡| is large, so we assume it to be exact. In phase two, we update our posterior for 𝑆𝑖,𝑡 and the rating 𝜇_{𝑖,𝑡} according to Equation (3).

¹ The relevant limiting procedure is to treat performances within 𝜀-width buckets as ties, and let 𝜀 → 0. This technicality appears in the proof of Theorem 3.2.
² As opposed to historical skill estimation, which is concerned with Pr(𝑆𝑖,𝑡 | 𝑃𝑖,≤𝑡′) for 𝑡′ > 𝑡. Whole-history methods can take advantage of future information.
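The generative side of this model is easy to simulate. The following sketch is illustrative Python, separate from the paper's reference implementation; the Gaussian choices and all numeric parameters are arbitrary. It samples skills and performances according to the factors of Equation (1), reveals only the ranking evidence, and extracts the sets (𝐸ᴸ𝑖,𝑡, 𝐸ᵂ𝑖,𝑡) that phase one consumes:

```python
import random

def simulate_round(num_players=1000, skill_sd=1.0, perf_sd=0.5, seed=0):
    """Sample skills S_i, performances P_i = S_i + deviation, and the
    evidence E_t: only the relative order of performances is revealed."""
    rng = random.Random(seed)
    skills = [rng.gauss(0.0, skill_sd) for _ in range(num_players)]
    perfs = [s + rng.gauss(0.0, perf_sd) for s in skills]
    ranking = sorted(range(num_players), key=lambda i: -perfs[i])  # best first
    return skills, perfs, ranking

skills, perfs, ranking = simulate_round()

# Phase one only needs, for each player i, the set E^L (players who beat i)
# and the set E^W (players whom i beat); extract them for a mid-ranked player.
i = ranking[len(ranking) // 2]
lost_to = {j for j, p in enumerate(perfs) if p > perfs[i]}
beat = {j for j, p in enumerate(perfs) if p < perfs[i]}
```

Since the sampled performances are continuous, exact ties occur with probability zero, matching the modelling assumption above.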
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu
L2 LR Normal Logistic
3 SKILL ESTIMATION IN TWO PHASES
12
3.1 Performance estimation 0.4
10
In this section, we describe the first phase of Elo-MMR. For nota- 8 0.3
tional convenience, we assume all probability expressions to be 6 0.2
conditioned on the prior context 𝑃𝑖,<𝑡 , and omit the subscript 𝑡. 4
Our prior belief on each player’s skill 𝑆𝑖 implies a prior distri- 0.1
2
bution on 𝑃𝑖 . Let’s denote its probability density function (pdf)
by -4 -2 2 4 -4 -2 2 4
∫
𝑓𝑖 (𝑝) := Pr(𝑃𝑖 = 𝑝) = 𝜋𝑖 (𝑠) Pr(𝑃𝑖 = 𝑝 | 𝑆𝑖 = 𝑠) d𝑠, (4) Figure 1: 𝐿2 versus 𝐿𝑅 for typical values (left). Gaussian ver-
sus logistic probability density functions (right).
where 𝜋𝑖 (𝑠) was defined in Equation (2). Let
∫ 𝑝
𝐹𝑖 (𝑝) := Pr(𝑃𝑖 ≤ 𝑝) = 𝑓𝑖 (𝑥) d𝑥,
−∞ Theorem 3.2. Suppose that for all 𝑗, 𝑓 𝑗 is continuously differen-
be the corresponding cumulative distribution function (cdf). We’ll tiable and log-concave. Then the maximizer of Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 , 𝐸𝑊𝑖 )
also define the following functions, which will be associated with is unique and given by the unique zero of
losses, draws, and wins, respectively: Õ Õ Õ
𝑄𝑖 (𝑝) := 𝑙 𝑗 (𝑝) + 𝑑 𝑗 (𝑝) + 𝑣 𝑗 (𝑝).
d −𝑓𝑖 (𝑝)
𝑙𝑖 (𝑝) := ln(1 − 𝐹𝑖 (𝑝)) = , 𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖
d𝑝 1 − 𝐹𝑖 (𝑝)
d 𝑓 ′ (𝑝) The proof appears in the appendix. Intuitively, we’re saying
𝑑𝑖 (𝑝) := ln 𝑓𝑖 (𝑝) = 𝑖 ,
d𝑝 𝑓𝑖 (𝑝) that the performance is the balance point between appropriately
d 𝑓𝑖 (𝑝) weighted wins, draws, and losses. Let’s look at two specializations
𝑣𝑖 (𝑝) := ln 𝐹𝑖 (𝑝) = . of our general model, to serve as running examples in this paper.
d𝑝 𝐹𝑖 (𝑝)
Evidently, 𝑙𝑖 (𝑝) < 0 < 𝑣𝑖 (𝑝). Now we define what it means for Gaussian performance model. If both 𝑆 𝑗 and 𝑃 𝑗 − 𝑆 𝑗 are assumed
the deviation 𝑃𝑖 − 𝑆𝑖 to be log-concave. to be Gaussian with known means and variances, then their inde-
pendent sum 𝑃 𝑗 will also be a known Gaussian. It is analytic and
Definition 3.1. An absolutely continuous random variable on a log-concave, so Theorem 3.2 applies.
convex domain is log-concave if its probability density function 𝑓 is We substitute the well-known Gaussian pdf and cdf for 𝑓 𝑗 and 𝐹 𝑗 ,
positive on its domain and satisfies respectively. A simple binary search, or faster numerical techniques
𝑓 (𝜃𝑥 + (1 − 𝜃 )𝑦) > 𝑓 (𝑥)𝜃 𝑓 (𝑦) 1−𝜃 , ∀𝜃 ∈ (0, 1), 𝑥 ≠ 𝑦. such as the Illinois algorithm or Newton’s method, can be employed
to solve for the unique zero of 𝑄𝑖 .
Log-concave distributions appear widely, and include the Gauss-
ian and logistic distributions used in Glicko, TrueSkill, and many Logistic performance model. Now we assume the performance
others. We’ll see inductively that our prior 𝜋𝑖 is log-concave at deviation 𝑃 𝑗 − 𝑆 𝑗 has a logistic distribution with mean 0 and vari-
every round. Since log-concave densities are closed under convolu- ance 𝛽 2 . In general, the rating system administrator is free to set 𝛽
tion [13], the independent sum 𝑃𝑖 = 𝑆𝑖 + (𝑃𝑖 −𝑆𝑖 ) is also log-concave. differently for each contest. Since shorter contests tend to be more
Log-concavity is made very convenient by the following lemma, variable, one reasonable choice might be to make 1/𝛽 2 proportional
proved in the appendix: to the contest duration.
Given the mean and variance of the skill prior, the independent
Lemma 3.1. If 𝑓𝑖 is continuously differentiable and log-concave, sum 𝑃 𝑗 = 𝑆 𝑗 + (𝑃 𝑗 − 𝑆 𝑗 ) would have the same mean, and a variance
then the functions 𝑙𝑖 , 𝑑𝑖 , 𝑣𝑖 are continuous, strictly decreasing, and that’s increased by 𝛽 2 . Unfortunately, we’ll see that the logistic
𝑙𝑖 (𝑝) < 𝑑𝑖 (𝑝) < 𝑣𝑖 (𝑝) for all 𝑝. performance model implies a form of skill prior from which it’s
tough to extract a mean and variance. Even if we could, the sum
For the remainder of this section, we fix the analysis with respect does not yield a simple distribution.
to some player 𝑖. As argued in Section 2, 𝑃𝑖 concentrates very For experienced players, we expect 𝑆 𝑗 to contribute much less
narrowly in the posterior. Hence, we can estimate 𝑃𝑖 by its MAP, variance than 𝑃 𝑗 − 𝑆 𝑗 ; thus, in our heuristic approximation, we take
choosing 𝑝 so as to maximize: 𝑃 𝑗 to have the same form of distribution as the latter. That is, we
take 𝑃 𝑗 to be logistic, centered at the prior rating 𝜇 𝜋𝑗 = arg max 𝜋 𝑗 ,
Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 , 𝐸𝑊 𝐿 𝑊
𝑖 ) ∝ 𝑓𝑖 (𝑝) Pr(𝐸𝑖 , 𝐸𝑖 | 𝑃𝑖 = 𝑝). with variance 𝛿 2𝑗 = 𝜎 2𝑗 + 𝛽 2 , where 𝜎 𝑗 will be given by Equation (8).
Define 𝑗 ≻ 𝑖, 𝑗 ≺ 𝑖, 𝑗 ∼ 𝑖 as shorthand for 𝑗 ∈ 𝐸𝑖𝐿 ,
𝑗 ∈ 𝐸𝑊
𝑖 ,
This distribution is analytic and log-concave, so the same methods
𝑗 ∈ P \ (𝐸𝑖𝐿 ∪ 𝐸𝑊
𝑖 ) (that is, 𝑃 𝑗 > 𝑃 ,
𝑖 𝑗𝑃 < 𝑃 ,
𝑖 𝑗𝑃 = 𝑃𝑖 ), respectively. based on Theorem 3.2 apply.
The following theorem yields our MAP estimate: Let’s derive 𝑄𝑖 explicitly in this case, since it has a rather intuitive
form. The logistic distribution with variance 𝛿 2𝑗 has scale parameter
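To make the root-finding concrete, here is a minimal sketch of phase one under the Gaussian specialization. It is illustrative Python, not the paper's reference implementation, and the example opponents and parameter values are arbitrary; the `max(..., tiny)` floors are a purely numerical guard against underflow deep in the tails:

```python
import math

def gauss_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def gauss_cdf(x, mu, sd):
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def q_i(p, lost_to, tied_with, beat):
    """Q_i(p) from Theorem 3.2 with Gaussian f_j, F_j. Each opponent j is
    a (mean, sd) pair describing its performance prior."""
    tiny = 1e-12
    total = 0.0
    for mu, sd in lost_to:    # l_j(p) = -f_j(p) / (1 - F_j(p))
        total -= gauss_pdf(p, mu, sd) / max(1.0 - gauss_cdf(p, mu, sd), tiny)
    for mu, sd in tied_with:  # d_j(p) = f_j'(p) / f_j(p) = -(p - mu) / sd^2
        total -= (p - mu) / sd**2
    for mu, sd in beat:       # v_j(p) = f_j(p) / F_j(p)
        total += gauss_pdf(p, mu, sd) / max(gauss_cdf(p, mu, sd), tiny)
    return total

def map_performance(lost_to, tied_with, beat, lo=-10.0, hi=10.0):
    """Q_i is strictly decreasing (Lemma 3.1), so bisection finds its zero."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if q_i(mid, lost_to, tied_with, beat) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Player i beat an opponent centered at -1 and lost to one centered at +1,
# both with sd 1; by symmetry, the MAP performance estimate lands at 0.
p_star = map_performance(lost_to=[(1.0, 1.0)], tied_with=[], beat=[(-1.0, 1.0)])
```

The same skeleton handles the logistic specialization by swapping in the logistic pdf and cdf; as noted above, faster solvers such as the Illinois algorithm or Newton's method can replace the bisection loop.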
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
√
𝛿¯𝑗 := 𝜋3 𝛿 𝑗 ; its cdf and pdf are: Logistic performance model. When the performance model is non-
! Gaussian, the pointwise product of pdfs does not simplify so easily.
1 1 𝑝 − 𝜇 𝜋𝑗
𝐹 𝑗 (𝑝) = = 1 + tanh , By Equation (3), each round contributes an additional factor to the
−(𝑝−𝜇 𝜋𝑗 )/𝛿¯𝑗 2 2𝛿¯𝑗 belief distribution. In general, we allow it to consist of a collection
1+𝑒
(𝑝−𝜇 𝜋𝑗 )/𝛿¯𝑗 𝑝 − 𝜇 𝜋𝑗 of simple log-concave factors, one for each round in which player 𝑖
𝑒 1 2
𝑓 𝑗 (𝑝) = = sech . has participated. Denote 𝑖’s participation history by
(𝑝−𝜇 𝜋𝑗 )/𝛿¯𝑗 2 4𝛿¯𝑗 2𝛿¯𝑗
𝛿¯𝑗 1 + 𝑒 H𝑖,𝑡 := {𝑘 ∈ {1, . . . , 𝑡 } : 𝑖 ∈ P𝑘 }.
They satisfy two very convenient relations: Since the factors deal with only a single player, we’ll omit the
𝐹 𝑗′ (𝑝) = 𝑓 𝑗 (𝑝) = 𝐹 𝑗 (𝑝)(1 − 𝐹 𝑗 (𝑝))/𝛿¯𝑗 , subscript 𝑖. Specializing to the logistic setting, each 𝑘 ∈ H𝑡 con-
tributes a logistic factor to the posterior, with mean 𝑝𝑘 and variance
𝑓 ′ (𝑝) = 𝑓 𝑗 (𝑝)(1 − 2𝐹 𝑗 (𝑝))/𝛿¯𝑗 ,
𝑗 𝛽𝑘2 . We still use a Gaussian initial prior, with mean and variance
from which it follows that denoted by 𝑝 0 and 𝛽 02 , respectively. Postponing the discussion of
1 − 2𝐹 𝑗 (𝑝) −𝐹 𝑗 (𝑝) 1 − 𝐹 𝑗 (𝑝) skill evolution to Section 4, for the moment we assume that 𝑆𝑘 = 𝑆 0
𝑑 𝑗 (𝑝) = = + = 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝). for all 𝑘. The posterior pdf, up to normalization, is then
𝛿¯ 𝛿¯ 𝛿¯
Ö
In other words, a tie counts as the sum of a win and a loss. 𝜋 0 (𝑠) Pr(𝑃𝑘 = 𝑝𝑘 | 𝑆𝑘 = 𝑠)
This can be compared to the approach (used in Elo, Glicko, BAR, 𝑘 ∈H𝑡
Topcoder, and Codeforces) of treating each tie as half a win plus !
half a loss.3 (𝑠 − 𝑝 0 ) 2 Ö 2 𝜋 𝑠 − 𝑝𝑘
∝ exp − sech √ . (5)
Finally, putting everything together: 2𝛽 02 𝑘 ∈H𝑡 12 𝛽𝑘
Õ Õ Õ
𝑄𝑖 (𝑝) = 𝑙 𝑗 (𝑝) + 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝) + 𝑣 𝑗 (𝑝) Maximizing the posterior density amounts to minimizing its
𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖 negative logarithm. Up to a constant offset, this is given by
Õ Õ
= 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝)
𝑠 − 𝑝0
Õ
𝑠 − 𝑝𝑘
𝑗 ⪰𝑖 𝑗 ⪯𝑖 𝐿(𝑠) := 𝐿2 + 𝐿𝑅 ,
𝛽0 𝛽𝑘
Õ −𝐹 𝑗 (𝑝) Õ 1 − 𝐹 𝑗 (𝑝) 𝑘 ∈H𝑡
= + . 1
𝜋𝑥
𝑗 ⪰𝑖 𝛿¯𝑗 𝑗 ⪯𝑖𝛿¯𝑗 where 𝐿2 (𝑥) := 𝑥 2 and 𝐿𝑅 (𝑥) := 2 ln cosh √ .
2 12
Our estimate for 𝑃𝑖 is the zero of this expression. Its terms cor-
respond to probabilities, weighted by 1/𝛿¯𝑗 , of losing and winning 𝑠 − 𝑝0 Õ 𝜋 (𝑠 − 𝑝𝑘 )𝜋
Í Thus, 𝐿 ′ (𝑠) = + √ tanh √ . (6)
against each player 𝑗. Accordingly, we can interpret 𝑗 ∈ P (1 − 𝛽02
𝛽
𝑘 ∈H 𝑘 3 𝛽𝑘 12
𝑡
𝐹 𝑗 (𝑝))/𝛿¯𝑗 as a weighted expected rank of a player whose perfor-
𝐿 ′ is continuous and strictly increasing in 𝑠, so its zero is unique:
mance is 𝑝. 𝑃𝑖 can thus be viewed as the performance level at
it is the MAP 𝜇𝑡 . Similar to what we did in the first phase, we can
which one’s expected rank would equal 𝑖’s actual rank. While the
solve for 𝜇𝑡 with binary search or other root-solving methods.
Codeforces and Topcoder systems compute performance values in
Furthermore, Equation (6) reveals a rather intuitive interpreta-
a similar manner, here we’ve derived the formula from Bayesian
tion for the rating 𝜇𝑡 as an aggregate of the historical performances
principles.
𝑝 ≤𝑡 : Gaussian factors in 𝐿 become 𝐿2 penalty terms, whereas logis-
tic factors appear as the more interesting 𝐿𝑅 terms. In Figure 1, we
3.2 Belief update
see that 𝐿𝑅 behaves quadratically near the origin, but linearly at the
Having estimated 𝑃𝑖,𝑡 in the first phase, the second phase is more extremities. It’s essentially a smoothed Huber loss, interpolating
straightforward. Ignoring normalizing constants, Equation (3) tells between 𝐿2 and 𝐿1 over a scale of magnitude 𝛽𝑘 .
us that the pdf of the skill posterior can be obtained as the pointwise It is well-known that minimizing a sum of 𝐿2 terms pushes the
product of the pdfs of the skill prior and the performance model. argument towards a weighted mean, while minimizing a sum of
When both factors are differentiable and log-concave, then so is 𝐿1 terms pushes the argument towards a weighted median. With
their product. Its maximum is the new rating 𝜇𝑖,𝑡 ; let’s see how to 𝐿𝑅 terms, the net effect is that 𝜇𝑡 acts like a robust average of the
compute it for the same two specializations of our model. historical performances 𝑝 ≤𝑡 . Specifically, one can check that
Gaussian performance model. When the skill prior and perfor-
Í
𝑘 𝑤 𝑘 𝑝𝑘 1
mance model are Gaussian with known means and variances, multi- 𝜇𝑡 = Í , where 𝑤 0 := 2 and
𝑘 𝑤𝑘 𝛽0
plying their pdfs yields another known Gaussian. Hence, the poste-
rior is compactly represented by its mean 𝜇𝑖,𝑡 , which coincides with 𝜋 (𝜇𝑡 − 𝑝𝑘 )𝜋
2 , which is our uncertainty
the MAP and rating; and its variance 𝜎𝑖,𝑡 𝑤𝑘 := √ tanh √ for 𝑘 ∈ H𝑡 . (7)
(𝜇𝑡 − 𝑝𝑘 )𝛽𝑘 3 𝛽𝑘 12
regarding the player’s skill.
𝑤𝑘 is close to 1/𝛽𝑘2 for typical performances, but can be up to
3 Elo-MMR, 2
too, can be modified to split ties into half win plus half loss. It’s easy to 𝜋 /6 times more as |𝜇𝑡 − 𝑝𝑘 | → 0, or vanish entirely as |𝜇𝑡 − 𝑝𝑘 | →
check that Lemma 3.1 still holds if 𝑑 𝑗 (𝑝) is replaced by 𝑤𝑙 𝑙 𝑗 (𝑝) + 𝑤𝑣 𝑣 𝑗 (𝑝) , provided
that 𝑤𝑙 , 𝑤𝑣 ∈ [0, 1] and |𝑤𝑙 − 𝑤𝑣 | < 1. In particular, we can set 𝑤𝑙 = 𝑤𝑣 = 0.5. The ∞. The latter feature is due to the thicker tails of the logistic dis-
results in Section 5 won’t be altered by this change. tribution, as compared to the Gaussian, resulting in an algorithm
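The belief update just described can be sketched end-to-end. The following illustrative Python (not the paper's reference implementation; the prior and the example performances, including the deliberate outlier at 50, are arbitrary) finds 𝜇𝑡 as the unique zero of 𝐿′(𝑠) from Equation (6), then checks the weighted-average identity of Equation (7):

```python
import math

SQ3, SQ12 = math.sqrt(3.0), math.sqrt(12.0)

def L_prime(s, p0, beta0, perfs):
    """L'(s) from Equation (6): one Gaussian (L2) prior term, plus one
    tanh (LR) term per historical performance (p_k, beta_k)."""
    total = (s - p0) / beta0**2
    for p_k, b_k in perfs:
        total += (math.pi / (b_k * SQ3)) * math.tanh((s - p_k) * math.pi / (b_k * SQ12))
    return total

def rating(p0, beta0, perfs, lo=-100.0, hi=100.0):
    """L' is strictly increasing, so bisection finds its unique zero mu_t."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if L_prime(mid, p0, beta0, perfs) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def weight(mu, p_k, b_k):
    """w_k from Equation (7); its limit as mu -> p_k is (pi^2/6) / b_k^2."""
    d = mu - p_k
    if abs(d) < 1e-9:
        return (math.pi**2 / 6.0) / b_k**2
    return (math.pi / (d * b_k * SQ3)) * math.tanh(d * math.pi / (b_k * SQ12))

# Prior p0 = 0 with beta0 = 2; three performances, one an extreme outlier.
p0, beta0 = 0.0, 2.0
perfs = [(1.0, 1.0), (1.5, 1.0), (50.0, 1.0)]
mu = rating(p0, beta0, perfs)

# Equation (7): mu is the weighted average of p0 and the p_k.
ws = [1.0 / beta0**2] + [weight(mu, p, b) for p, b in perfs]
ps = [p0] + [p for p, _ in perfs]
avg = sum(w * p for w, p in zip(ws, ps)) / sum(ws)
```

The outlier at 50 receives a nearly vanishing weight, so the rating stays close to the typical performances; this is the robustness behaviour described above, which a pure 𝐿2 (Gaussian) model would not exhibit.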
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu
that resists drastic rating changes in the presence of a few unusu- overall performance. By simply distributing the credit equally, we
ally good or bad performances. We’ll formally state this robustness ensure that every individual’s incentive is perfectly aligned with
property in Theorem 5.7. winning as a team.
Estimating skill uncertainty. While there is no easy way to com-
4 SKILL EVOLUTION OVER TIME
pute the variance of a posterior in the form of Equation (5), it will
be useful to have some estimate 𝜎𝑡2 of uncertainty. There is a simple Over time, as a player trains or rests, a player’s skill can change. If
formula in the case where all factors are Gaussian. Since moment- we model skill as a static variable, our system will eventually grow
matched logistic and normal distributions are relatively close (cf. so confident in its estimate that it will refuse to admit substantial
Figure 1), we apply the same formula: changes. To remedy this, we introduce a skill evolution model, so
1 Õ 1 that in general 𝑆𝑡 ≠ 𝑆𝑡 ′ for 𝑡 ≠ 𝑡 ′ . Rather than simply being equal
2
:= . (8) to the previous round’s posterior, now the skill prior at round 𝑡 is
𝜎𝑡 𝛽2
𝑘 ∈ {0}∪H 𝑘 given by
𝑡
∫
3.3 Team competitions 𝜋𝑡 (𝑠) = Pr(𝑆𝑡 = 𝑠 | 𝑆𝑡 −1 = 𝑥) Pr(𝑆𝑡 −1 = 𝑥 | 𝑃 <𝑡 ) d𝑥 . (9)
While our main focus is on ranked competitions between a large
number of individuals, Elo-MMR can be adapted to ranked compe- The factors in the integrand are the skill evolution model and the
titions between a large number of teams. In this setting, round 𝑡’s previous round’s posterior, respectively. Following other Bayesian
set of participants P𝑡 is partitioned into a disjoint union of teams rating systems (e.g., Glicko, Glicko-2, and TrueSkill [22, 23, 25]),
Ã
𝜏 ∈ T𝑡 : formally, P𝑡 = 𝜏 ∈ T𝑡 𝜏. we model the skill changes 𝑆𝑡 − 𝑆𝑡 −1 as independent zero-mean
Instead of ranking individual 𝑖 by their performance 𝑃𝑖 , the com- Gaussians. That is, Pr(𝑆𝑡 | 𝑆𝑡 −1 = 𝑥) is a Gaussian with mean 𝑥
petition ranks an entire team 𝜏 by a performance variable 𝑃𝜏 , which and some variance 𝛾𝑡2 .
depends on the skills {𝑆𝑖 : 𝑖 ∈ 𝜏 } of all its members. In general, the There is some flexibility in how 𝛾𝑡 is set. Glicko, in its origi-
probabilistic team performance model should be domain-specific: nal presentation, sets 𝛾𝑡2 proportionally to the time elapsed since
depending, for instance, on whether game outcomes are most heav- the last update, corresponding to a continuous Brownian motion.
ily influenced by a team’s weakest or strongest player. A default Codeforces and Topcoder simply set 𝛾𝑡 to a constant when a player
choice that credits team members equally is the sum of their indi- participates, and zero otherwise, corresponding to changes that are
vidual performances: in proportion to how often the player competes. Now we are ready
Õ Õ Õ to complete the two specializations of our rating system.
𝑃𝜏 := 𝑃𝑖 = 𝑆𝑖 + (𝑃𝑖 − 𝑆𝑖 ).
𝑖 ∈𝜏 𝑖 ∈𝜏 𝑖 ∈𝜏 Gaussian performance model. If the performance model and the
Thus, 𝑃𝜏 is a sum of 2|𝜏 | independently distributed terms. Just prior on 𝑆𝑡 −1 are both Gaussian, then the posterior on 𝑆𝑡 −1 is also
as before, we approximate this sum by a single Gaussian or logistic Gaussian. Since 𝑆𝑡 = 𝑆𝑡 −1 + (𝑆𝑡 − 𝑆𝑡 −1 ) is a sum of independent
term with matching moments. Instead of the moments (𝜇𝑖𝜋 , 𝛿𝑖 ) of Gaussians, its prior is Gaussian as well. By induction, the skill belief
𝑃𝑖 in Algorithm 1, we’ll have distribution forever remains Gaussian. As we’ll see in Section 5.2,
Õ this Gaussian specialization of the Elo-MMR framework lacks the
𝜇𝜏𝜋 ← 𝜇𝑖 , R for robustness, so we call it Elo-MM𝜒.
𝑖 ∈𝜏
s Õ Logistic performance model. After a player’s first participation,
𝛿𝜏 ← |𝜏 |𝛽 2 + 𝜎𝑖2 . the posterior in Equation (5) becomes non-Gaussian, rendering the
𝑖 ∈𝜏 integral in Equation (9) intractable.
With this change, the algorithm proceeds almost exactly as be- A very simple approach would be to replace the full posterior in
fore, with the performance estimation step operating at the level of Equation (5) by a Gaussian approximation with mean 𝜇𝑡 (equal to
teams instead of individuals, 𝑃𝜏 , 𝜇𝜏𝜋 , 𝛿𝜏 replacing 𝑃𝑖 , 𝜇𝑖𝜋 , 𝛿𝑖 . the posterior MAP) and variance 𝜎𝑡2 (given by Equation (8)). Then,
The main caveat is that, in our limit of large competitions, we as in the previous case, the intractable integral specializes to a
only obtain precise estimates of the team performance 𝑃𝜏 . To esti- simple addition of Gaussian random variables.
mate the individual performance 𝑃𝑖 , which in turn approximates With this approximation, no memory is kept of the individual
𝑆𝑖 , we subtract all of 𝑖’s teammates’ ratings from 𝑃𝜏 . Since performances 𝑃𝑡 . Priors are simply Gaussian, while the pdf of a
Õ Õ skill posterior is the product of two factors: the Gaussian prior, and
𝑆𝑖 = 𝑃𝜏 − 𝑆𝑗 − (𝑃 𝑗 − 𝑆 𝑗 ), a logistic factor corresponding to the latest performance. To ensure
𝑗 ∈𝜏,𝑗≠𝑖 𝑗 ∈𝜏
Í robustness (see Section 5.2), 𝜇𝑡 is computed as the arg max of this
the variance of this estimate is not 𝛽 2 , but |𝜏 |𝛽 2 + 𝑗 ∈𝜏,𝑗≠𝑖 𝜎 2𝑗 . posterior before replacement by its Gaussian approximation. We
Since we don’t know who to credit for a team outcome, it’s impossi- call the rating system that takes this approach Elo-MMR(∞).
ble to precisely estimate 𝑃𝑖 . As a result, the independence argument As the name implies, it turns out to be a limiting case of Elo-
in Section 2 ceases to hold. Nonetheless, Elo-MMR for team contests MMR(𝜌). In the general setting with 𝜌 ∈ [0, ∞), we keep the full
continues to enjoy the properties described in Section 5. posterior from Equation (5). Since we cannot tractably compute the
While smarter credit-assignment schemes may be considered in effect of a Gaussian diffusion, we seek a heuristic derivation of the
future work, one should be wary of the risk that such mechanisms next round’s prior, retaining a form similar to Equation (5) while
may motivate players to seek credit, even at the expense of a team’s satisfying many of the same properties as the intended diffusion.
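In the Gaussian specialization, both the diffusion and the Bayesian update stay in closed form. The following is a minimal sketch of one Elo-MM𝜒 round for a single player, assuming a scalar performance observation; the helper name and calling convention are ours, not the paper's reference implementation:

```python
import math

def elo_mmchi_round(mu, sigma, p_t, beta_t, gamma_t):
    """One Gaussian (Elo-MM-chi) round: diffuse the skill belief per the
    evolution model, then condition on performance P_t = p_t with noise beta_t."""
    # Skill evolution: Equation (9) with Gaussian factors widens the prior.
    prior_var = sigma ** 2 + gamma_t ** 2
    # Product of two Gaussian factors: precisions add, and the posterior
    # mean is the precision-weighted combination of prior mean and p_t.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / beta_t ** 2)
    post_mu = post_var * (mu / prior_var + p_t / beta_t ** 2)
    return post_mu, math.sqrt(post_var)

# A strong performance pulls the rating toward p_t; uncertainty shrinks.
mu, sigma = elo_mmchi_round(1500.0, 350.0, p_t=1900.0, beta_t=200.0, gamma_t=35.0)
```

The posterior standard deviation follows the same precision-sum form as Equation (8), with the prior widened by 𝛾𝑡² before each update.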
An Elo-like System for Massive Multiplayer Competitions WWW '21, April 19–23, 2021, Ljubljana, Slovenia

[Figure 2: Volatility farming attack on the Topcoder system. Two panels plot Rating against Contest #, comparing Elo-MMR (honest) and Elo-MMR (adversarial).]

To this end, we need a few lemmas. Recall that, for the purposes of the algorithm, the performance 𝑝𝑖 is defined to be the unique zero of the function 𝑄𝑖(𝑝) := Σ_{𝑗≻𝑖} 𝑙𝑗(𝑝) + Σ_{𝑗∼𝑖} 𝑑𝑗(𝑝) + Σ_{𝑗≺𝑖} 𝑣𝑗(𝑝), whose terms 𝑙𝑗, 𝑑𝑗, 𝑣𝑗 are contributed by opponents against whom 𝑖 lost, drew, or won, respectively. Wins always contribute positively to a player's performance score, while losses contribute negatively:

Lemma 5.1. Adding a win term to 𝑄𝑖, or replacing a tie term by a win term, always increases its zero. Conversely, adding a loss term, or replacing a tie term by a loss term, always decreases it.

Proof. By Lemma 3.1, 𝑄𝑖(𝑝) is decreasing in 𝑝. Thus, adding a positive term will increase its zero whereas adding a negative term will decrease it. The desired conclusion follows by noting that, for all 𝑗 and 𝑝,

𝑣𝑗(𝑝) > 0,  𝑣𝑗(𝑝) − 𝑑𝑗(𝑝) > 0,
𝑙𝑗(𝑝) < 0,  𝑙𝑗(𝑝) − 𝑑𝑗(𝑝) < 0.  □

While not needed for our main result, a similar argument shows that performance scores are monotonic across the round standings:

Theorem 5.2. If 𝑖 ≻ 𝑗 (that is, 𝑖 beats 𝑗) in a given round, then the players' performance estimates satisfy 𝑝𝑖 > 𝑝𝑗.

[...] 𝑝𝑖 > 𝑝𝑗. By induction, the conclusion also holds for 𝑖, 𝑗 that are not adjacent in the rankings. □

What matters for incentives is that performance scores be counterfactually monotonic; meaning, if we were to alter the round standings, a strategic player will always prefer to place higher:

Lemma 5.3. In any given round, holding fixed the relative rankings among all players other than 𝑖 (and holding fixed all preceding rounds), the performance 𝑝𝑖 is a monotonic function of player 𝑖's prior rating and of player 𝑖's rank in this round.

Proof. 𝑄𝑖(𝑝) depends on the prior rating 𝜇𝑖^𝜋 only through the self-tie term 𝑑𝑖, which in turn depends only on 𝑝 − 𝜇𝑖^𝜋. Thus, a change in 𝜇𝑖^𝜋 has the same effect as an opposite change in 𝑝. By Lemma 3.1, 𝑑𝑖 is monotonically increasing in 𝜇𝑖^𝜋, from which it follows that 𝑝𝑖 is also monotonically increasing in 𝜇𝑖^𝜋.

Now, since an upward shift in 𝑖's ranking can only convert losses to ties and ties to wins, Lemma 5.1 implies that 𝑝𝑖 is also monotonically increasing in improvements to 𝑖's rank. □

Having established the relationship between round rankings and performance scores, the next step is to prove that, even with hindsight, players will always prefer their performance scores to be as high as possible:

Lemma 5.4. Holding fixed the set of contest rounds in which a player has participated, their current rating is monotonic in each of their past performance scores.

Proof. The player's rating is given by the zero of 𝐿′ in Equation (10). This expression contains the variables 𝛽:, 𝑤:, 𝑝:, and 𝑠. As 𝑝𝑘 is varied, 𝛽: and 𝑤: do not change: although the pseudodiffusions of Section 4 do modify 𝑤:, these changes are agnostic to 𝑝𝑘. On the other hand, 𝐿′(𝑠) is monotonically increasing in 𝑠 and decreasing in each of the 𝑝𝑘. Therefore, its zero is monotonically increasing in each of the 𝑝𝑘.

This is almost what we wanted to prove, except that 𝑝0 is not a performance. Due to the pseudodiffusion's transfer step (or the actual diffusion, in the case of Elo-MM𝜒), 𝑝0 is a weighted average of its previous value and the prior rating, and so it is monotonic in both. Using this same lemma in the previous round as an inductive hypothesis, it follows that 𝑝0 is monotonic in past performances. By induction, the proof is complete. □

Finally, we conclude that a rating-maximizing player is always motivated to improve their round rankings:

Theorem 5.5 (Incentive-compatibility). Holding fixed the set of contest rounds in which each player has participated, and the historical ratings and relative rankings among all players other than 𝑖, player 𝑖's current rating is monotonic in each of 𝑖's past rankings.

Proof. Choose any contest round in player 𝑖's history, and consider improving player 𝑖's rank in that round while holding everything else fixed. It suffices to show that player 𝑖's current rating would necessarily increase as a result.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu
In the altered round, by Lemma 5.3, 𝑝𝑖 is increased; and by Lemma 5.4, player 𝑖's post-round rating is increased. By Lemma 5.3 again, this increases player 𝑖's performance score in the following round. Proceeding inductively, we find that performance scores and ratings from this point onward are all increased. □

In the special cases of Elo-MM𝜒 or Elo-MMR(∞), the rating system is "memoryless": the only data retained for each player are the current rating 𝜇𝑖,𝑡 and uncertainty 𝜎𝑖,𝑡; detailed performance history is not saved. In this setting, we present a natural monotonicity theorem. A similar theorem was previously stated for the Codeforces system, albeit in an informal context without proof [8].

Theorem 5.6 (Memoryless Monotonicity). In either the Elo-MM𝜒 or Elo-MMR(∞) system, suppose 𝑖 and 𝑗 are two participants of round 𝑡. Suppose that the ratings and corresponding uncertainties satisfy 𝜇𝑖,𝑡−1 ≥ 𝜇𝑗,𝑡−1, 𝜎𝑖,𝑡−1 = 𝜎𝑗,𝑡−1. Then, 𝜎𝑖,𝑡 = 𝜎𝑗,𝑡. Furthermore:

If 𝑖 ≻ 𝑗 in round 𝑡, then 𝜇𝑖,𝑡 > 𝜇𝑗,𝑡.
If 𝑗 ≻ 𝑖 in round 𝑡, then 𝜇𝑗,𝑡 − 𝜇𝑗,𝑡−1 > 𝜇𝑖,𝑡 − 𝜇𝑖,𝑡−1.

Proof. The new contest round will add a rating perturbation with variance 𝛾𝑡², followed by a new performance with variance 𝛽𝑡². As a result,

𝜎𝑖,𝑡 = (1/(𝜎𝑖,𝑡−1² + 𝛾𝑡²) + 1/𝛽𝑡²)^(−1/2) = (1/(𝜎𝑗,𝑡−1² + 𝛾𝑡²) + 1/𝛽𝑡²)^(−1/2) = 𝜎𝑗,𝑡.

The remaining conclusions are consequences of three properties: memorylessness, incentive-compatibility (Theorem 5.5), and translation-invariance (ratings, skills, and performances are quantified on a common interval scale relative to one another).

Since the Elo-MM𝜒 or Elo-MMR(∞) systems are memoryless, we may replace the initial prior and performance histories of players with any alternate histories of our choosing, as long as our choice is compatible with their current rating and uncertainty. In particular, both 𝑖 and 𝑗 can be considered to have participated in the same set of rounds, with 𝑖 always performing at 𝜇𝑖,𝑡−1 and 𝑗 always performing at 𝜇𝑗,𝑡−1. Round 𝑡 is unchanged.

Suppose 𝑖 ≻ 𝑗. Since 𝑖's historical performances are all equal or stronger than 𝑗's, Theorem 5.5 implies 𝜇𝑖,𝑡 > 𝜇𝑗,𝑡.

Suppose 𝑗 ≻ 𝑖 instead. By translation-invariance, if we shift each of 𝑗's performances, up to round 𝑡 and including the initial prior, upward by 𝜇𝑖,𝑡−1 − 𝜇𝑗,𝑡−1, the rating changes between rounds will be unaffected. Players 𝑖 and 𝑗 now have identical histories, except that we still have 𝑗 ≻ 𝑖 at round 𝑡. Therefore, 𝜇𝑗,𝑡−1 = 𝜇𝑖,𝑡−1 and, by Theorem 5.5, 𝜇𝑗,𝑡 > 𝜇𝑖,𝑡. Subtracting the equation from the inequality proves the second conclusion. □

5.2 Robust response

Another desirable property in many settings is robustness: a player's rating should not change too much in response to any one contest, no matter how extreme their performance. The Codeforces and TrueSkill systems lack this property, allowing for unbounded rating changes. Topcoder achieves robustness by clamping any changes that exceed a cap, which is initially high for new players but decreases with experience.

When 𝜌 > 0, Elo-MMR(𝜌) achieves robustness in a natural, smoother manner. To understand how, we look at the interplay between Gaussian and logistic factors in the posterior. Recall the notation in Equation (10), describing the loss function and weights.

Theorem 5.7. In the Elo-MMR(𝜌) rating system, let

Δ+ := lim_{𝑝𝑡→+∞} (𝜇𝑡 − 𝜇𝑡−1),   Δ− := lim_{𝑝𝑡→−∞} (𝜇𝑡−1 − 𝜇𝑡).

Then, for Δ± ∈ {Δ+, Δ−},

(𝜋/(𝛽𝑡√3)) (𝑤0 + (𝜋²/6) Σ_{𝑘∈H𝑡−1} 𝑤𝑘)^(−1) ≤ Δ± ≤ 𝜋/(𝛽𝑡√3 𝑤0).

Proof. The limits exist, by monotonicity. Using the fact that 0 < (d/d𝑥) tanh(𝑥) ≤ 1, differentiating 𝐿′ in Equation (10) yields

∀𝑠 ∈ ℝ,  𝑤0 ≤ 𝐿″(𝑠) ≤ 𝑤0 + (𝜋²/6) Σ_{𝑘∈H𝑡−1} 𝑤𝑘.

Now, the performance at round 𝑡 adds a new term with multiplicity one to 𝐿′(𝑠): its value is (𝜋/(𝛽𝑡√3)) tanh((𝑠 − 𝑝𝑡)𝜋/(𝛽𝑡√12)). As a result, for every 𝑠 ∈ ℝ, lim_{𝑝𝑡→±∞} 𝐿′(𝑠) increases by ∓𝜋/(𝛽𝑡√3), while lim_{𝑝𝑡→±∞} 𝐿″(𝑠) does not change at all. Since we had 𝐿′(𝜇𝑡−1) = 0 without this new term, after adding the term we have

lim_{𝑝𝑡→±∞} 𝐿′(𝜇𝑡−1) = ∓𝜋/(𝛽𝑡√3).

Dividing by the former inequalities yields the desired result. □

The proof reveals that the magnitude of Δ± depends inversely on that of 𝐿″ in the vicinity of the current rating, which in turn is related to the derivative of the tanh terms. If a player's performances vary wildly, the tanh terms will be widely dispersed, so any 𝑠 ∈ ℝ will necessarily be in the tail ends of most of the terms. Tails contribute very little to 𝐿″(𝑠), enabling a larger rating change. Conversely, the tanh terms of a player with a very consistent performance history will contribute large derivatives, so the bound on their rating change will be small.

Thus, Elo-MMR naturally caps the rating changes of all players, and the cap is smaller for consistent performers. The cap will increase after an extreme performance, providing a similar "momentum" to the Topcoder and Glicko-2 systems, but without sacrificing incentive-compatibility (Theorem 5.5).

Let's compare the lower and upper bound in Theorem 5.7: within a factor of 𝜋²/6, their ratio corresponds to the normal term's weight 𝑤0 relative to the total Σ_𝑘 𝑤𝑘. Recall that 𝜌 is the weight transfer rate: larger 𝜌 results in more weight being transferred into 𝑤0; in this case, the lower and upper bound tend to stay close together. Conversely, the momentum effect is more pronounced when 𝜌 is small. In the extreme case 𝜌 = 0, 𝑤0 vanishes for experienced players, so a sufficiently volatile player would be subject to correspondingly large rating updates.

In general, according to Algorithm 2, the asymptotic steady-state values of 𝑤0 and 𝑊 := Σ_𝑘 𝑤𝑘 must jointly solve the fixpoint equation

𝑤0 = 𝜅𝑤0 + (𝜅 − 𝜅^(1+𝜌))(𝑊 − 𝑤0).

Rearranging yields an expression for the steady-state ratio:

𝑤0/𝑊 = (𝜅 − 𝜅^(1+𝜌)) / (1 − 𝜅^(1+𝜌)).
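The steady-state ratio can be sanity-checked numerically: the closed-form expression should zero out the fixpoint residual for any valid 𝜅 and 𝜌. A small sketch (helper names are ours, for illustration only):

```python
def fixpoint_residual(w0, W, kappa, rho):
    """Residual of the steady-state equation
    w0 = kappa*w0 + (kappa - kappa^(1+rho)) * (W - w0)."""
    return kappa * w0 + (kappa - kappa ** (1 + rho)) * (W - w0) - w0

def steady_state_ratio(kappa, rho):
    """Closed-form w0/W obtained by rearranging the fixpoint equation."""
    return (kappa - kappa ** (1 + rho)) / (1 - kappa ** (1 + rho))

W = 1.0
for kappa in (0.5, 0.9, 0.99):
    w0 = steady_state_ratio(kappa, 0.5) * W
    assert abs(fixpoint_residual(w0, W, kappa, 0.5)) < 1e-12
```

As 𝜅 approaches 1, the computed ratio approaches 1/(1 + 1/𝜌); for example, with 𝜌 = 0.5 and 𝜅 = 0.9999 it comes out near 1/3.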
If we don’t expect player skill to change too rapidly, then the Dataset # contests avg. # participants / contest
system parameters should be set in such a way that 𝜅 ≈ 1. In this Codeforces 1257 3899
limit, using 1 − 𝜅 𝑥 ≈ (1 − 𝜅)𝑥 yields Topcoder 2115 391
Reddit 1000 20
𝑤0 (1 − 𝜅)𝜌 1
≈ = . CTF 1100 354
𝑊 (1 − 𝜅)(1 + 𝜌) 1 + 1/𝜌 DanceSport 18292 6
Thus, the upper bound in Theorem 5.7 is approximately propor- Synth-large 50 10000
tional to 1 + 1/𝜌. Loosely speaking, therefore, the additive term 1/𝜌 Synth-small 15000 5
may be interpreted as a momentum parameter. Table 1: Summary of test datasets.
5.3 Runtime analysis and optimizations If the contests are extremely large, so that Ω(1/𝜀 2 ) opponents
Let’s look at the computation time needed to process a round with have a rating and uncertainty in the same 𝜀-width bucket as player
participant set P, where we again omit the round subscript. Each 𝑖, then it’s possible to do even better: up to the allowed precision 𝜀,
player 𝑖 has a participation history H𝑖 . the corresponding terms can be treated as duplicates. Hence, their
Estimating 𝑃𝑖 entails finding the zero of a monotonic function sum can be determined by counting how many of these opponents
with 𝑂 (|P |) terms, and then obtaining the rating 𝜇𝑖 entails finding win, lose, or tie against player 𝑖. Given the pre-sorted list of ranks of
the zero of another monotonic function with 𝑂 (|H𝑖 |) terms. Using players in the bucket, two binary searches would yield the answer.
either of the Illinois or Newton methods, solving these equations In practice, a single bucket might not contain enough participants,
to precision 𝜀 takes 𝑂 (log log 𝜀1 ) iterations. As a result, the total so we sample enough buckets to yield the desired precision.
runtime needed to process one round of competition is Simple parallelism. Since each player’s rating computation is
! independent, the algorithm is embarrassingly parallel. Threads can
Õ 1
𝑂 (|P | + |H𝑖 |) log log . read the same global data structures, so each additional thread
𝜀 contributes only 𝑂 (1) memory overhead.
𝑖 ∈P
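The bucketing optimization reduces each bucket to three counts obtained from two binary searches over the bucket members' pre-sorted ranks. A sketch (hypothetical helper, not the paper's code; lower rank means better placement):

```python
import bisect

def bucket_outcomes(sorted_ranks, my_rank):
    """Count opponents in one epsilon-width bucket who beat, tie with,
    or lose to a player placed at `my_rank`."""
    beat_me = bisect.bisect_left(sorted_ranks, my_rank)
    tie_me = bisect.bisect_right(sorted_ranks, my_rank) - beat_me
    lose_to_me = len(sorted_ranks) - beat_me - tie_me
    return beat_me, tie_me, lose_to_me

# All members of a bucket share one (rating, uncertainty) up to precision
# epsilon, so their terms enter the sum with these counts as multiplicities.
counts = bucket_outcomes([3, 7, 7, 12, 40, 41], my_rank=7)
```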
solves. They may also attempt to "hack" one another's submissions for bonus points, identifying test cases that break their solutions.

Topcoder contest history. This dataset contains the current entire history of algorithm contests ever run on topcoder.com. Topcoder is a predecessor to Codeforces, with over 1.4 million registered users, and a long history as a pioneering platform for programming contests. It hosts a variety of contest types, including over 2000 algorithm contests to date. The scoring system is similar to Codeforces, but with shorter rounds: typically 75 minutes allotted for a set of 3 problems.

SubredditSimulator threads. This dataset contains data scraped from the current top 1000 most upvoted threads on the website reddit.com/r/SubredditSimulator. Reddit is a social news aggregation website with over 400 million monthly active users. The site itself is broken down into sub-sites called subreddits. Users then post and comment to the subreddits, where the posts and comments receive votes from other users. In the subreddit SubredditSimulator, users are language generation bots trained on text from other subreddits. Automated posts are made by these bots to SubredditSimulator every 3 minutes, and real users of Reddit vote on the best bot. Each post (and its associated comments) can thus be interpreted as a round of competition between the bots who commented.

Capture the Flag competition history. This dataset contains data scraped from ctftime.org, an archive site for Capture the Flag (CTF) style computer security contests. Teams are scored based on the digital "flags" that they find by cracking computer security challenges. CTFtime tracks over 150K teams and 1000 competitions. Since these competitions are organized by a variety of groups, they come in a wide range of sizes.

DanceSport competition history. This dataset contains data scraped from results.o2cm.com. O2CM is the dominant software package for hosting and managing competitive ballroom dance competitions in North America. Its freely accessible online database includes an average of one competition per week. Each competition is divided into events based on age category, syllabus level, and dance style. Since these events are judged and ranked separately, we process them as distinct rounds, in the order listed by O2CM. Since modeling the chemistry between dance partners is beyond this paper's scope, we simply treat each dance couple as a distinct contestant.

Synthetic datasets (small and large). The small and large datasets contain 1K and 10K players respectively, with skills and performances generated according to the logistic generative model in Section 2. Players' initial skills are drawn i.i.d. with mean 1500 and variance 350². Players compete in all rounds, and are ranked according to independent performances with variance 200². Between rounds, we add i.i.d. Gaussian increments with variance 35² to each of their skills. In the small dataset, each round consists of just 5 players. In the large dataset, all 10K players participate in each round.

We implemented versions of all the algorithms in the safe subset of Rust, parallelized using the Rayon crate; as such, the Rust compiler verifies that they contain no data races [33]. The only exception is TrueSkill: the inherent sequentiality of its message-passing procedure prevented us from parallelizing it.

Elo-MMR. We specialize our rating system into two types: Elo-MM𝜒 with a Gaussian performance model, and Elo-MMR(𝜌) with a logistic performance model and pseudodiffusion rate 𝜌. We make use of the optimizations in Section 5.3, bounding both the number of sampled opponents and the history length by 500.

Topcoder system. The Topcoder website provides not only one of the oldest datasets of programming competitions, but also one of the oldest massively multiplayer deployments of a rating system. The Topcoder system [10] generalizes Glicko-2, and suffers from the same lack of incentive-compatibility [19]. Close variants of this system are used by other contest sites, such as CodeChef [1].

Codeforces system. In response to the main drawback of Topcoder, the Codeforces rating system [8] was specifically designed to be incentive-compatible. It features more ad hoc choices than the other systems: for instance, its rating updates target the geometric mean of a player's expected and actual ranks. Close variants of this system are used by other contest sites, such as LeetCode [7].

TrueSkill. We use the improved TrueSkill algorithm of [31], basing our code on an open-source implementation of the same algorithm. Developed for the purpose of video game matchmaking on Microsoft's Xbox Live platform, TrueSkill [25] is a Bayesian rating system, implemented using a powerful probabilistic programming framework. Its update rules are rather complex, requiring iterations of approximate message passing. It's very effective on games with moderate numbers of players (typically 2 to 16), but struggles in our experiments involving hundreds to thousands of players.

Glicko. The Glicko rating system [22] is a classic extension of Elo which, unlike Glicko-2, is incentive-compatible. While the Bayesian mathematics of Glicko was derived only for 2-player games, a naive baseline for 𝑁-player games can be obtained by decomposing the game into its 𝑁² pairwise matchups (including self-draws). Since these outcomes are far from independent, we normalize the collective weight of all 𝑁 updates applying to each player, to match that of a hypothetical maximally informative 2-player game, i.e., against an equally skilled player whose skill is completely certain.

BAR. Bayesian Approximation Ranking [34] shares our goal of combining the accuracy of TrueSkill with the simplicity of Glicko. By a judicious application of simplifying approximations, it derives analytical formulas similar to the pairwise decomposition of Glicko. The normalization in the original paper performs poorly on our datasets' large matches. To improve accuracy, just as with Glicko, we normalize the collective weight of the batched updates to equal that of one maximally informative 2-player game.
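The synthetic generative process above is straightforward to reproduce. A sketch using the paper's parameter values (the helper names are ours; logistic performance noise is sampled by inverse CDF, with scale chosen to match the stated variance):

```python
import math
import random

def logistic_noise(rng, sigma):
    """Zero-mean logistic sample with standard deviation sigma
    (scale s = sigma * sqrt(3) / pi), drawn via the inverse CDF."""
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # avoid log(0)
    return sigma * math.sqrt(3) / math.pi * math.log(u / (1 - u))

def synthetic_history(num_players, num_rounds, round_size, seed=0):
    """Ranked rounds: initial skills ~ N(1500, 350^2), logistic performance
    noise with std 200, and N(0, 35^2) skill drift between rounds."""
    rng = random.Random(seed)
    skills = [rng.gauss(1500, 350) for _ in range(num_players)]
    rounds = []
    for _ in range(num_rounds):
        entrants = rng.sample(range(num_players), round_size)
        perfs = {i: skills[i] + logistic_noise(rng, 200) for i in entrants}
        rounds.append(sorted(entrants, key=lambda i: -perfs[i]))  # best first
        for i in range(num_players):
            skills[i] += rng.gauss(0, 35)  # i.i.d. skill drift
    return rounds

rounds = synthetic_history(num_players=100, num_rounds=20, round_size=5)
```

Feeding such a history to a rating system and checking how well recovered ratings track the hidden skills mirrors the synthetic evaluation setup.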
6.3 Evaluation metrics

To compare the different algorithms, we define two measures of predictive accuracy. Each metric will be defined on individual contestants in each round, and then averaged:

aggregate(metric) := (Σ𝑡 Σ_{𝑖∈P𝑡} metric(𝑖, 𝑡)) / (Σ𝑡 |P𝑡|).

Pair inversion metric [25]. Our first metric computes the fraction of opponents against whom our ratings predict the correct pairwise result, defined as the higher-rated player either winning or tying:

pair_inversion(𝑖, 𝑡) := (# correctly predicted matchups) / (|P𝑡| − 1) × 100%.

This metric was used in the original evaluation of TrueSkill [25] and is related to the Kendall's 𝜏 rank correlation coefficient.

Rank deviation. Our second metric compares the rankings with the total ordering that would be obtained by sorting players according to their prior rating. The penalty is proportional to how much these ranks differ for player 𝑖:

rank_deviation(𝑖, 𝑡) := |actual rank − predicted rank| / (|P𝑡| − 1) × 100%.

In the event of ties, among the ranks within the tied range, we use the one that comes closest to the rating-based prediction.

6.4 Empirical results

Recall that Elo-MM𝜒 has a Gaussian performance model, matching the modeling assumptions of Topcoder and TrueSkill. Elo-MMR(𝜌), on the other hand, has a logistic performance model, matching the modeling assumptions of Codeforces and Glicko. While 𝜌 was included in the hyperparameter search, in practice we found that all values between 0 and 1 produce very similar results.

To ensure that errors due to the unknown skills of new players don't dominate our metrics, we excluded players who had competed in less than 5 total contests. In most of the datasets, this reduced the performance of our method relative to the others, as our method seems to converge more accurately. Despite this, we see in Table 2 that both versions of Elo-MMR outperform the other rating systems in both the pairwise inversion metric and the ranking deviation metric.

We highlight a few key observations. First, significant performance gains are observed on the Codeforces and Topcoder datasets, despite these platforms' rating systems having been designed specifically for their needs. Our gains are smallest on the synthetic dataset, for which all algorithms perform similarly. This might be in part due to the close correspondence between the generative process and the assumptions of these rating systems. Furthermore, the synthetic players compete in all rounds, enabling the system to converge to near-optimal ratings for every player. Finally, the improved TrueSkill performed well below our expectations, despite our best efforts to improve it. We suspect that the message-passing numerics break down in contests with a large number of individual participants. The difficulties persisted in all TrueSkill implementations that we tried, including on Microsoft's popular Infer.NET framework [30]. To our knowledge, we are the first to present experiments with TrueSkill on contests where the number of participants is in the hundreds or thousands. One case where TrueSkill outperformed is in the DanceSport dataset, where the average number of participants per contest is just 3. In preliminary experiments, TrueSkill and Elo-MMR score about equally when the number of ranks is less than about 60.

Now, we turn our attention to Table 3, which showcases the computational efficiency of Elo-MMR. On smaller datasets, it performs comparably to the Codeforces, TrueSkill, and Topcoder algorithms. However, the latter suffer from a quadratic time dependency on the number of contestants; as a result, Elo-MMR outperforms them by one to two orders of magnitude on the larger Codeforces dataset. Finally, in comparisons between the two Elo-MMR variants, we note that while Elo-MMR(𝜌) is more accurate, Elo-MM𝜒 is always faster. This has to do with the skill drift modeling described in Section 4, as every update in Elo-MMR(𝜌) must process 𝑂(log(1/𝜀)) terms of a player's competition history.

6.5 Elo-MMR on small and large contests

The derivation in Section 2 depended on taking a limit in which the number of participants in each contest went to infinity. In practice, one might wonder how well Elo-MMR handles smaller contests. To find out, we simulate what would happen if each Codeforces contest was administered separately to smaller groups of contestants. That is, for every chosen contest size 𝑁, the participants of each contest are split into groups of at most 𝑁. Each group is placed in a round, and ranked according to their relative placement in the original contest.

In Figure 3, we see that Elo-MMR continues to beat the other systems, regardless of contest size.

[Figure 3: Number of participants vs. accuracy for various rating systems. Accuracy against contest size (log scale), comparing Elo-MMR, Codeforces, Topcoder, and TrueSkill.]
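The two metrics of Section 6.3 can be sketched directly from their definitions. In this illustration, the function names and tie handling are ours (the paper's tie adjustment for rank deviation is omitted), and ranks are 0-indexed with lower meaning better:

```python
def pair_inversion(ratings, ranks):
    """Per-player % of opponents for whom the higher-rated side won or tied."""
    n = len(ratings)
    scores = []
    for i in range(n):
        correct = 0
        for j in range(n):
            if j == i:
                continue
            # Predict that the higher-rated of the pair does not lose.
            hi, lo = (i, j) if ratings[i] >= ratings[j] else (j, i)
            if ranks[hi] <= ranks[lo]:
                correct += 1
        scores.append(100.0 * correct / (n - 1))
    return scores

def rank_deviation(ratings, ranks):
    """Per-player % gap between the actual rank and the rank predicted by
    sorting players on their prior ratings."""
    n = len(ratings)
    order = sorted(range(n), key=lambda i: -ratings[i])
    predicted = {player: r for r, player in enumerate(order)}
    return [100.0 * abs(ranks[i] - predicted[i]) / (n - 1) for i in range(n)]

# Ratings that perfectly predict the standings score 100% and deviate 0%.
ratings, ranks = [1800, 1600, 1400], [0, 1, 2]
```

Averaging these per-player scores over all rounds, weighted by round size, gives the aggregate form defined above.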
performances, asymptotically fast, and embarrassingly parallel. To our knowledge, our system is the first to rigorously prove all these properties in a setting with more than two individually ranked players. In terms of practical performance, we saw that it outperforms existing industry systems in both prediction accuracy and computation speed.

This work can be extended in several directions. First, the choices we made in modeling ties, pseudodiffusions, teams, and opponent subsampling are by no means the only possibilities consistent with our Bayesian model of skills and performances. Second, it may be possible to further improve accuracy by fitting more flexible performance and skill evolution models to domain-specific data. Third, it would be useful to analyze convergence in realistic settings, where the Bayesian model is not completely accurate. In particular, controlling long-term rating inflation or deflation is a challenge, since we can't directly compare players at different times.

Over the past decade, online competition communities such as Codeforces have grown exponentially. As such, considerable work has gone into engineering scalable and reliable rating systems. Unfortunately, many of these systems have not been rigorously analyzed in the academic community. We hope that our paper and open-source release will open new explorations in this area.

ACKNOWLEDGEMENTS

The authors are indebted to Daniel Sleator and Danica J. Sutherland for initial discussions that helped inspire this work, and to Nikita Gaevoy for the open-source improved TrueSkill upon which our implementation is based. Experiments in this paper are funded by a Google Cloud Research Grant. The second author is supported by a VMware Fellowship and the Natural Sciences and Engineering Research Council of Canada.

𝑣𝑖′(𝑝) = 𝑓𝑖′(𝑝)/𝐹𝑖(𝑝) − 𝑓𝑖(𝑝)²/𝐹𝑖(𝑝)²,   𝑙𝑖′(𝑝) = −𝑓𝑖′(𝑝)/(1 − 𝐹𝑖(𝑝)) − 𝑓𝑖(𝑝)²/(1 − 𝐹𝑖(𝑝))²,

are negative for all 𝑝, so we conclude that

𝑑𝑖(𝑝) − 𝑣𝑖(𝑝) = 𝑓𝑖′(𝑝)/𝑓𝑖(𝑝) − 𝑓𝑖(𝑝)/𝐹𝑖(𝑝) = (𝐹𝑖(𝑝)/𝑓𝑖(𝑝)) 𝑣𝑖′(𝑝) < 0,
𝑙𝑖(𝑝) − 𝑑𝑖(𝑝) = −𝑓𝑖′(𝑝)/𝑓𝑖(𝑝) − 𝑓𝑖(𝑝)/(1 − 𝐹𝑖(𝑝)) = ((1 − 𝐹𝑖(𝑝))/𝑓𝑖(𝑝)) 𝑙𝑖′(𝑝) < 0.  □

Theorem 3.2. Suppose that for all 𝑗, 𝑓𝑗 is continuously differentiable and log-concave. Then the unique maximizer of Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖^𝐿, 𝐸𝑖^𝑊) is given by the unique zero of

𝑄𝑖(𝑝) = Σ_{𝑗≻𝑖} 𝑙𝑗(𝑝) + Σ_{𝑗∼𝑖} 𝑑𝑗(𝑝) + Σ_{𝑗≺𝑖} 𝑣𝑗(𝑝).

Proof. First, we rank the players by their buckets according to ⌊𝑃𝑗/𝜖⌋, and take the limiting probabilities as 𝜖 → 0:

Pr(⌊𝑃𝑗/𝜖⌋ > ⌊𝑝/𝜖⌋) = Pr(𝑃𝑗 ≥ 𝜖⌊𝑝/𝜖⌋ + 𝜖) = 1 − 𝐹𝑗(𝜖⌊𝑝/𝜖⌋ + 𝜖) → 1 − 𝐹𝑗(𝑝),

Pr(⌊𝑃𝑗/𝜖⌋ < ⌊𝑝/𝜖⌋) = Pr(𝑃𝑗 < 𝜖⌊𝑝/𝜖⌋) = 𝐹𝑗(𝜖⌊𝑝/𝜖⌋) → 𝐹𝑗(𝑝),

(1/𝜖) Pr(⌊𝑃𝑗/𝜖⌋ = ⌊𝑝/𝜖⌋) = (1/𝜖) Pr(𝜖⌊𝑝/𝜖⌋ ≤ 𝑃𝑗 < 𝜖⌊𝑝/𝜖⌋ + 𝜖) = (1/𝜖)(𝐹𝑗(𝜖⌊𝑝/𝜖⌋ + 𝜖) − 𝐹𝑗(𝜖⌊𝑝/𝜖⌋)) → 𝑓𝑗(𝑝).

Let 𝐿𝑗𝑝^𝜖, 𝑊𝑗𝑝^𝜖, and 𝐷𝑗𝑝^𝜖 be shorthand for the events ⌊𝑃𝑗/𝜖⌋ > ⌊𝑝/𝜖⌋, ⌊𝑃𝑗/𝜖⌋ < ⌊𝑝/𝜖⌋, and ⌊𝑃𝑗/𝜖⌋ = ⌊𝑝/𝜖⌋, respectively. These correspond to a
player who performs at 𝑝 losing, winning, and drawing against 𝑗, respectively, when outcomes are determined by 𝜖-buckets. Then,

Pr(𝐸𝑖^𝑊, 𝐸𝑖^𝐿 | 𝑃𝑖 = 𝑝) = lim_{𝜖→0} Π_{𝑗≻𝑖} Pr(𝐿𝑗𝑝^𝜖) Π_{𝑗≺𝑖} Pr(𝑊𝑗𝑝^𝜖) Π_{𝑗∼𝑖,𝑗≠𝑖} (Pr(𝐷𝑗𝑝^𝜖)/𝜖)
= Π_{𝑗≻𝑖} (1 − 𝐹𝑗(𝑝)) Π_{𝑗≺𝑖} 𝐹𝑗(𝑝) Π_{𝑗∼𝑖,𝑗≠𝑖} 𝑓𝑗(𝑝),

Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖^𝐿, 𝐸𝑖^𝑊) ∝ 𝑓𝑖(𝑝) Pr(𝐸𝑖^𝐿, 𝐸𝑖^𝑊 | 𝑃𝑖 = 𝑝)
= Π_{𝑗≻𝑖} (1 − 𝐹𝑗(𝑝)) Π_{𝑗≺𝑖} 𝐹𝑗(𝑝) Π_{𝑗∼𝑖} 𝑓𝑗(𝑝),

(d/d𝑝) ln Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖^𝐿, 𝐸𝑖^𝑊) = Σ_{𝑗≻𝑖} 𝑙𝑗(𝑝) + Σ_{𝑗≺𝑖} 𝑣𝑗(𝑝) + Σ_{𝑗∼𝑖} 𝑑𝑗(𝑝) = 𝑄𝑖(𝑝).

Since Lemma 3.1 tells us that 𝑄𝑖 is strictly decreasing, it only remains to show that it has a zero. If the zero exists, it must be unique and it will be the unique maximum of Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖^𝐿, 𝐸𝑖^𝑊).

To start, we want to prove the existence of 𝑝* such that 𝑄𝑖(𝑝*) < 0. Note that it's not possible to have 𝑓𝑗′(𝑝) ≥ 0 for all 𝑝, as in that case the density would integrate to either zero or infinity. Thus, for each 𝑗 such that 𝑗 ∼ 𝑖, we can choose 𝑝𝑗 such that 𝑓𝑗′(𝑝𝑗) < 0, and so 𝑑𝑗(𝑝𝑗) < 0. Let 𝛼 = −Σ_{𝑗∼𝑖} 𝑑𝑗(𝑝𝑗) > 0.

Let 𝑛 = |{𝑗 : 𝑗 ≺ 𝑖}|. For each 𝑗 such that 𝑗 ≺ 𝑖, since

[7] LeetCode New Contest Rating Algorithm. leetcode.com/discuss/general-discussion/468851/New-Contest-Rating-Algorithm-(Coming-Soon)
[8] Open Codeforces Rating System. codeforces.com/blog/entry/20762
[9] Ratings migrated to Elo-MMR. dmoj.ca/post/206-ratings-migrated-to-elo-mmr
[10] Topcoder Algorithm Competition Rating System. topcoder.com/community/competitive-programming/how-to-compete/ratings
[11] Why Are Obstacle-Course Races So Popular? theatlantic.com/health/archive/2018/07/why-are-obstacle-course-races-so-popular/565130/
[12] Sharad Agarwal and Jacob R. Lorch. 2009. Matchmaking for online games and other latency-sensitive P2P systems. In SIGCOMM 2009. 315–326.
[13] Mark Yuying An. 1997. Log-concave probability distributions: Theory and statistical testing. (1997).
[14] Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika (1952), 324–345.
[15] Shuo Chen and Thorsten Joachims. 2016. Modeling Intransitivity in Matchup and Comparison Data. In WSDM 2016. 227–236.
[16] Rémi Coulom. 2008. Whole-history rating: A Bayesian rating system for players of time-varying strength. In CG 2008. Springer, 113–124.
[17] Pierre Dangauthier, Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. TrueSkill Through Time: Revisiting the History of Chess. In NeurIPS 2007. 337–344.
[18] Arpad E. Elo. 1961. New USCF rating system. Chess Life (1961), 160–161.
[19] Michal Forišek. 2009. Theoretical and Practical Aspects of Programming Contest Ratings. (2009).
[20] David A. Freedman. 1963. On the asymptotic behavior of Bayes' estimates in the discrete case. The Annals of Mathematical Statistics (1963), 1386–1403.
[21] Mark E. Glickman. 1995. A comprehensive guide to chess ratings. American Chess Journal (1995), 59–102.
[22] Mark E. Glickman. 1999. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics (1999), 377–394.
[23] Mark E. Glickman. 2012. Example of the Glicko-2 system. Boston University (2012), 1–6.
lim𝑝→∞ 𝑣 𝑗 (𝑝) = 0/1 = 0, we can choose 𝑝 𝑗 such that 𝑣 𝑗 (𝑝 𝑗 ) < 𝛼/𝑛. [24] Linxia Gong, Xiaochuan Feng, Dezhi Ye, Hao Li, Runze Wu, Jianrong Tao,
Let 𝑝 ∗ = max 𝑗 ⪯𝑖 𝑝 𝑗 . Then, Changjie Fan, and Peng Cui. 2020. OptMatch: Optimized Matchmaking via
Modeling the High-Order Interactions on the Arena. In KDD 2020. 2300–2310.
Õ Õ Õ
𝑙 𝑗 (𝑝 ∗ ) ≤ 0, 𝑑 𝑗 (𝑝 ∗ ) ≤ −𝛼, 𝑣 𝑗 (𝑝 ∗ ) < 𝛼 . [25] Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkillTM : A Bayesian
Skill Rating System. In NeurIPS 2006. 569–576.
𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖 [26] Tzu-Kuo Huang, Chih-Jen Lin, and Ruby C. Weng. 2006. Ranking individuals by
Therefore, group comparisons. In ICML 2006. 425–432.
Õ Õ Õ [27] Stephanie Kovalchik. 2020. Extension of the Elo rating system to margin of
𝑄𝑖 (𝑝 ∗ ) = 𝑙 𝑗 (𝑝 ∗ ) + 𝑑 𝑗 (𝑝 ∗ ) + 𝑣 𝑗 (𝑝 ∗ ) victory. Int. J. Forecast. (2020).
[28] Yao Li, Minhao Cheng, Kevin Fujii, Fushing Hsieh, and Cho-Jui Hsieh. 2018.
𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖 Learning from Group Comparisons: Exploiting Higher Order Interactions. In
< 0 − 𝛼 + 𝛼 = 0. NeurIPS 2018. 4986–4995.
[29] Tom Minka, Ryan Cleven, and Yordan Zaykov. 2018. TrueSkill 2: An improved
By a symmetric argument, there also exists some 𝑞 ∗ for which Bayesian skill rating system. Technical Report MSR-TR-2018-8. Microsoft.
𝑄𝑖 (𝑞 ∗ ) > 0. By the intermediate value theorem with 𝑄𝑖 continuous, [30] T. Minka, J.M. Winn, J.P. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill. /Infer.NET
0.3. Microsoft Research Cambridge. http://dotnet.github.io/infer.
there exists 𝑝 ∈ (𝑞 ∗, 𝑝 ∗ ) such that 𝑄𝑖 (𝑝) = 0, as desired. □ [31] Sergey I. Nikolenko, Alexander, and V. Sirotkin. 2010. Extensions of the TrueSkill
TM rating system. In In Proceedings of the 9th International Conference on Appli-
REFERENCES cations of Fuzzy Systems and Soft Computing. 151–160.
[32] Jerneja Premelč, Goran Vučković, Nic James, and Bojan Leskošek. 2019. Reliability
[1] CodeChef Rating Mechanism. codechef.com/ratings of judging in DanceSport. Front. Psychol. (2019), 1001.
[2] Codeforces: Results of 2019. codeforces.com/blog/entry/73683 [33] Josh Stone and Nicholas D Matsakis. The Rayon library (Rust Crate). crates.io/
[3] Farming Volatility: How a major flaw in a well-known rating system takes over crates/rayon
the GBL leaderboard. reddit.com/r/TheSilphRoad/comments/hwff2d/farming_ [34] Ruby C. Weng and Chih-Jen Lin. 2011. A Bayesian Approximation Method for
volatility_how_a_major_flaw_in_a/ Online Ranking. J. Mach. Learn. Res. (2011), 267–300.
[4] Halo Xbox video game franchise: in numbers. telegraph.co.uk/technology/video- [35] John Michael Winn. 2019. Model-based machine learning.
games/11223730/Halo-in-numbers.html [36] Lin Yang, Stanko Dimitrov, and Benny Mantin. 2014. Forecasting sales of new
[5] Kaggle milestone: 5 million registered users! kaggle.com/general/164795 virtual goods with the Elo rating system. RPM (2014), 457–469.
[6] Kaggle Progression System. kaggle.com/progression
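Because $Q_i$ is strictly decreasing, the posterior mode in the proof above can be found numerically by bisection on its unique zero. The sketch below is not the authors' implementation; it assumes, for illustration, logistic performance models $F_j(p) = \sigma((p-\mu_j)/s)$, under which the summands have the closed forms $l_j = -\sigma_j/s$, $v_j = (1-\sigma_j)/s$, and $d_j = (1-2\sigma_j)/s$, each decreasing in $p$. The names `Q`, `posterior_mode`, and the scale parameter `s` are hypothetical.

```python
import math

# Hedged sketch, assuming logistic models F_j(p) = sigma((p - mu_j)/s).
# Then the terms of Q_i(p) reduce to:
#   l_j(p) = -f_j/(1-F_j)  = -sigma_j / s        (opponents j who beat i)
#   v_j(p) =  f_j/F_j      = (1 - sigma_j) / s   (opponents j whom i beat)
#   d_j(p) =  f_j'/f_j     = (1 - 2*sigma_j) / s (draws; include i's own prior here)
# Each term is decreasing in p, so Q is strictly decreasing and has one zero.

def sigma(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def Q(p, losses, wins, draws, s=1.0):
    """Derivative of the log-posterior at performance p.
    losses/wins/draws are lists of opponent means mu_j."""
    total = sum(-sigma((p - mu) / s) / s for mu in losses)          # l_j terms
    total += sum((1 - sigma((p - mu) / s)) / s for mu in wins)      # v_j terms
    total += sum((1 - 2 * sigma((p - mu) / s)) / s for mu in draws) # d_j terms
    return total

def posterior_mode(losses, wins, draws, s=1.0, tol=1e-9):
    """Bisection on the strictly decreasing Q: keep Q(lo) > 0 > Q(hi)."""
    mus = losses + wins + draws
    lo, hi = min(mus) - 50 * s, max(mus) + 50 * s
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Q(mid, losses, wins, draws, s) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, a player whose prior mean is 0 (passed via `draws`, matching the $j \sim i$ product absorbing $f_i$), who beats an opponent at $\mu = -1$ and loses to one at $\mu = +1$, lands at mode 0 by symmetry.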