0% found this document useful (0 votes)

2 views

EloMMR

This paper presents a novel Bayesian rating system called Elo-MMR for massive multiplayer competitions, which is designed to accurately estimate player skill while being efficient and incentive-compatible. The system outperforms existing methods in prediction accuracy and computational speed, making it suitable for high-participant events like programming contests and online games. The authors provide open-source implementations and demonstrate the system's robustness against common vulnerabilities found in traditional rating systems.

Uploaded by

yamode1559

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

2 views

EloMMR

Uploaded by

yamode1559

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

An Elo-like System for Massive Multiplayer Competitions

Aram Ebtekar Paul Liu

Vancouver, BC, Canada Stanford University
[email protected] Stanford, CA, USA
[email protected]

ABSTRACT have a panel of judges who rank contestants against one another;
Skill estimation mechanisms, colloquially known as rating sys- these subjective scores are known to be noisy [32]. In all these cases,
tems, play an important role in competitive sports and games. They scores can only be used to compare and rank participants at the
provide a measure of player skill, which incentivizes competitive same event. Players, spectators, and contest organizers who are
performances and enables balanced match-ups. In this paper, we interested in comparing players’ skill levels across different compe-
present a novel Bayesian rating system for contests with many titions will need to aggregate the entire history of such rankings. A
participants. It is widely applicable to competition formats with strong player, then, is one who consistently wins against weaker
discrete ranked matches, such as online programming competitions, players. To quantify skill, we need a rating system.
obstacle courses races, and video games. The system’s simplicity Good rating systems are difficult to create, as they must bal-
allows us to prove theoretical bounds on its robustness and runtime. ance several mutually constraining objectives. First and foremost,
In addition, we show that it is incentive-compatible: a player who rating systems must be accurate, in that ratings provide useful pre-
seeks to maximize their rating will never want to underperform. dictors of contest outcomes. Second, the ratings must be efficient
Experimentally, the rating system surpasses existing systems in to compute: within video game applications, rating systems are
prediction accuracy, and computes faster than existing systems by predominantly used for matchmaking in massively multiplayer
up to an order of magnitude. online games (such as Halo, CounterStrike, League of Legends,
etc.) [25, 29, 36]. These games have hundreds of millions of players
CCS CONCEPTS playing tens of millions of games per day, necessitating certain
latency and memory requirements for the rating system [12]. Third,
• Information systems → Learning to rank; • Computing me-
rating systems must be incentive-compatible: a player’s rating
thodologies → Learning in probabilistic graphical models.
should never increase had they scored worse, and never decrease
KEYWORDS had they scored better. This is to prevent players from regretting
a win, or from throwing matches to game the system. Rating sys-
rating system, skill estimation, mechanism design, competition, tems that can be gamed often create disastrous consequences to
bayesian inference, robust, incentive-compatible, elo, glicko, trueskill player-base, potentially leading to the loss of players [3]. Finally,
ACM Reference Format: the ratings provided by the system must be human-interpretable:
Aram Ebtekar and Paul Liu. 2021. An Elo-like System for Massive Multi- ratings are typically represented to players as a single number en-
player Competitions. In Proceedings of the Web Conference 2021 (WWW ’21), capsulating their overall skill, and many players want to understand
April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 15 pages.
and predict how their performances affect their rating [21].
https://doi.org/10.1145/3442381.3450091
Classically, rating systems were designed for two-player games.
The famous Elo system [18], as well as its Bayesian successors
1 INTRODUCTION
Glicko and Glicko-2, have been widely applied to games such as
Competitions, in the form of sports, games, and examinations, have Chess and Go [21–23]. Both Glicko versions model each player’s
been with us since antiquity. Many competitions grade perfor- skill as a real random variable that evolves with time according
mances along a numerical scale, such as a score on a test or a to Brownian motion. Inference is done by entering these variables
completion time in a race. In the case of a college admissions exam into the Bradley-Terry model [14], which predicts probabilities of
or a track race, scores are standardized so that a given score on game outcomes. Glicko-2 refines the Glicko system by adding a
two different occasions carries the same meaning. However, in rating volatility parameter. Unfortunately, Glicko-2 is known to
events that feature novelty, subjectivity, or close interaction, stan- be flawed in practice, potentially incentivizing players to lose in
dardization is difficult. The Spartan Races, completed by millions what’s known as “volatility farming”. In some cases, these attacks
of runners, feature a variety of obstacles placed on hiking trails can inflate a user’s rating several hundred points above its natural
around the world [11]. Rock climbing, a sport to be added to the value, producing ratings that are essentially impossible to beat via
2020 Olympics, likewise has routes set specifically for each com- honest play. This was most notably exploited in the popular game
petition. DanceSport, gymnastics, and figure skating competitions of Pokemon Go [3]. See Section 5.1 for a discussion of this issue, as
This paper is published under the Creative Commons Attribution 4.0 International well as an application of this attack to the Topcoder rating system.
(CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their The family of Elo-like methods just described only utilize the
personal and corporate Web sites with the appropriate attribution.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
binary outcome of a match. In settings where a scoring system
© 2021 IW3C2 (International World Wide Web Conference Committee), published provides a more fine-grained measure of match performance, Ko-
under Creative Commons CC-BY 4.0 License. valchik [27] has shown variants of Elo that are able to take advan-
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3450091 tage of score information. For competitions consisting of several set
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

tasks, such as academic olympiads, Forišek [19] developed a model allows us to rigorously analyze its properties: the “MMR” in the
in which each task gives a different “response” to the player: the to- name stands for “Massive”, “Monotonic”, and “Robust”. “Massive”
tal response then predicts match outcomes. However, such systems means that it supports any number of players with a runtime that
are often highly application-dependent and hard to calibrate. scales linearly; “monotonic” is a synonym for incentive-compatible,
Though Elo-like systems are widely used in two-player settings, ensuring that a rating-maximizing player always wants to perform
one needn’t look far to find competitions that involve much more well; “robust” means that rating changes are bounded, with the
than two players. In response to the popularity of team-based games bound being smaller for more consistent players than for volatile
such as CounterStrike and Halo, many recent works focus on com- players. Robustness turns out to be a natural byproduct of accurately
petitions that are between two teams [15, 24, 26, 28]. Another pop- modeling performances with heavy-tailed distributions, such as
ular setting is many-player contests such as academic olympiads: the logistic. TrueSkill is believed to satisfy the first two properties,
notably, programming contest platforms such as Codeforces, Top- albeit without proof, but fails robustness. Codeforces only satisfies
coder, and Kaggle [6, 8, 10]. As with the aforementioned Spartan incentive-compatibility, and Topcoder only satisfies robustness.
races, a typical event attracts thousands of contestants. Program- Experimentally, we show that Elo-MMR achieves state-of-the-art
ming contest platforms have seen exponential growth over the past performance in terms of both prediction accuracy and runtime on
decade, collectively boasting millions of users [5]. As an example, industry datasets. In particular, we process the entire Codeforces
Codeforces gained over 200K new users in 2019 alone [2]. database of over 400K rated users and 1000 contests in well under a
In “free-for-all” settings, where 𝑁 players are ranked individually, minute, beating the existing Codeforces system by more than an or-
the Bayesian Approximation Ranking (BAR) algorithm [34] models der of magnitude while improving upon its accuracy. Furthermore,

the competition as a series of 𝑁2 independent two-player contests. we show that the well-known Topcoder system is severely vulnera-
In reality, of course, the pairwise match outcomes are far from ble to volatility farming, whereas Elo-MMR is immune to such at-
independent. Thus, TrueSkill [25] and its variants [17, 29, 31] model tacks. A difficulty we faced was the scarcity of efficient open-source
a player’s performance during each contest as a single random rating system implementations. In an effort to aid researchers and
variable. The overall rankings are assumed to reveal the total order practitioners alike, we provide open-source implementations of all
among these hidden performance variables, with various methods rating systems, dataset mining, and additional processing used in
used to model ties and teams. For a textbook treatment of these our experiments at https://github.com/EbTech/Elo-MMR.
methods, see [35]. These rating systems are efficient in practice, We note that since releasing our preprint, Elo-MMR has already
successfully rating userbases that number well into the millions (the been put in production in industry settings [9].
Halo series, for example, has over 60 million sales since 2001 [4]).
Organization. In Section 2, we formalize the details of our Bayesian
The main disadvantage of TrueSkill is its complexity: originally
model. We then show how to estimate player skill under this model
developed by Microsoft for the popular Halo video game, TrueSkill
in Section 3, and develop some intuitions of the resulting formulas.
performs approximate belief propagation, which consists of mes-
As a further refinement, Section 4 models skill evolutions from
sage passing on a factor graph, iterated until convergence. Aside
players training or atrophying between competitions. This mod-
from being less human-interpretable, this complexity means that,
eling is quite tricky as we choose to retain players’ momentum
to our knowledge, there are no proofs of key properties such as run-
while preserving incentive-compatibility. While our modeling and
time and incentive-compatibility. Even when these properties are
derivations occupy multiple sections, the system itself is succinctly
discussed [29], no rigorous justification is provided. In addition, we
presented in Algorithms 1 to 3. In Section 5, we perform a volatility
are not aware of any work that extends TrueSkill to non-Gaussian
farming attack on the Topcoder system and prove that, in contrast,
performance models, which might be desirable to limit the influence
Elo-MMR satisfies several salient properties, the most critical of
of outlier performances (see Section 5.2).
which is incentive-compatibility. Finally, in Section 6, we present
It might be for these reasons that popular platforms such as
experimental evaluations, showing improvements over industry
Codeforces and Topcoder opted for their own custom rating sys-
standards in both accuracy and speed.
tems. These systems are not published in academia and do not come
with Bayesian justifications. However, they retain the formulaic
2 A BAYESIAN MODEL FOR MASSIVE
simplicity of Elo and Glicko, extending them to settings with much
more than two players. The Codeforces system includes ad hoc COMPETITIONS
heuristics to distinguish top players, while curbing rampant infla- We now describe the setting formally, denoting random variables
tion. Topcoder’s formulas are more principled from a statistical by capital letters. A series of competitive rounds, indexed by 𝑡 =
perspective; however, it has a volatility parameter similar to Glicko- 1, 2, 3, . . ., take place sequentially in time. Each round has a set of
2, and hence suffers from similar exploits [19]. Despite their flaws, participating players P𝑡 , which may in general overlap between
these systems have been in place for over a decade, and have more rounds. A player’s skill is likely to change with time, so we repre-
recently gained adoption by additional platforms such as CodeChef sent the skill of player 𝑖 at time 𝑡 by a real random variable 𝑆𝑖,𝑡 .
and LeetCode [1, 7]. In round 𝑡, each player 𝑖 ∈ P𝑡 competes at some performance
level 𝑃𝑖,𝑡 , typically close to their current skill 𝑆𝑖,𝑡 . The deviations
Our contributions. In this paper, we describe the Elo-MMR rating {𝑃𝑖,𝑡 −𝑆𝑖,𝑡 }𝑖 ∈ P𝑡 are assumed to be i.i.d. and independent of {𝑆𝑖,𝑡 }𝑖 ∈ P𝑡 .
system, obtained by a principled approximation of a Bayesian model Performances are not observed directly; instead, a ranking gives
similar to Glicko and TrueSkill. It is fast, embarrassingly parallel, the relative order among all performances {𝑃𝑖,𝑡 }𝑖 ∈ P𝑡 . In particular,
and makes accurate predictions. Most interesting of all, its simplicity ties are modelled to occur when performances are exactly equal,
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

a zero-probability event when their distributions are continuous.1 Since our posteriors are continuous, the convergence holds for
This ranking constitutes the observational evidence 𝐸𝑡 for our all 𝑠 simultaneously. Moreover, we don’t even need the full evidence
𝐿 = {𝑗 ∈ P : 𝑃
Bayesian updates. The rating system seeks to estimate the skill 𝑆𝑖,𝑡 𝐸𝑡 . Let 𝐸𝑖,𝑡 𝑗,𝑡 > 𝑃𝑖,𝑡 } be the set of players against
of every player at the present time 𝑡, given the historical round 𝑊
whom 𝑖 lost, and 𝐸𝑖,𝑡 = { 𝑗 ∈ P : 𝑃 𝑗,𝑡 < 𝑃𝑖,𝑡 } be the set of players
rankings 𝐸 ≤𝑡 := {𝐸 1, . . . , 𝐸𝑡 }. against whom 𝑖 won. That is, we only look at who wins, draws,
We overload the notation Pr for both probabilities and probability and loses against 𝑖. 𝑃𝑖,𝑡 remains identifiable using only (𝐸𝑖,𝑡 𝐿 , 𝐸𝑊 ),
𝑖,𝑡
densities: the latter interpretation applies to zero-probability events, which will be more convenient for our purposes.
such as in Pr(𝑆𝑖,𝑡 = 𝑠). We also use colons as wildcards to denote In practice, we should care about the rate of convergence. Sup-
collections of variables differing only in a subscript: for instance, pose we want our estimate to be within 𝜀 of 𝑃𝑖,𝑡 , with probability
𝑃:,𝑡 := {𝑃𝑖,𝑡 }𝑖 ∈ P𝑡 . The joint distribution described by our Bayesian at least 1 − 𝛿. By asymptotic normality of the posterior [20], it
model factorizes as follows: suffices to have 𝑂 ( 𝜀12 log 𝛿1 ) participants. Experimentally, we see
Pr(𝑆 :,:, 𝑃:,:, 𝐸 : ) (1) in Section 6.5 that Elo-MMR is competitive on all sizes of contests.
Ö Ö Ö Ö Bayesian ratings systems, such as Glicko and TrueSkill, make
= Pr(𝑆𝑖,0 ) Pr(𝑆𝑖,𝑡 | 𝑆𝑖,𝑡 −1 ) Pr(𝑃𝑖,𝑡 | 𝑆𝑖,𝑡 ) Pr(𝐸𝑡 | 𝑃:,𝑡 ), several simplifying assumptions to render their posterior updates
𝑖 𝑖,𝑡 𝑖,𝑡 𝑡
tractable. Typically these are chosen ad hoc for convenience; how-
where Pr(𝑆𝑖,0 ) is the initial skill prior, ever, having passed to a limit in which 𝑃𝑖,≤𝑡 is identified, our frame-
Pr(𝑆𝑖,𝑡 | 𝑆𝑖,𝑡 −1 ) is the skill evolution model (Section 4), work is able to rigorously justify such simplifications. Firstly, since
Pr(𝑃𝑖,𝑡 | 𝑆𝑖,𝑡 ) is the performance model, and 𝑃𝑖, ≤𝑡 is a sufficient statistic for predicting 𝑆𝑖,𝑡 , it may be said that
Pr(𝐸𝑡 | 𝑃:,𝑡 ) is the evidence model. (𝐸𝑖,𝐿 ≤𝑡 , 𝐸𝑊
𝑖, ≤𝑡 ) are “almost sufficient” for 𝑆𝑖,𝑡 : any additional informa-
tion, such as from domain-specific scoring systems, becomes redun-
For the first three factors, we will specify log-concave distributions dant for the purposes of skill estimation. Secondly, conditioned on
(see Definition 3.1). The evidence model, on the other hand, is a 𝑃:, ≤𝑡 , the posterior skills 𝑆 :,𝑡 are independent of one another. As a
deterministic indicator. It equals one when 𝐸𝑡 is consistent with result, there are no inter-player correlations to model, and a player’s
the relative ordering among 𝑃:,𝑡 , and zero otherwise. posterior is unaffected by rounds in which they are not a partici-
Finally, our model assumes that the number of participants |P𝑡 | pant. Finally, if we’ve truly identified 𝑃𝑖,𝑡 , then rounds later than 𝑡
is large. The main idea behind our algorithm is that, in sufficiently should not prompt revisions in our estimate for 𝑃𝑖,𝑡 . This obviates
massive competitions, the evidence 𝐸𝑡 contains enough information the need for expensive whole-history update procedures [16, 17],
to infer very precise estimates for 𝑃:,𝑡 . Hence, we can treat these for the purposes of present skill estimation.2
performances as if they were observed directly. Thus, when the initial prior, performance model, and evolution
With that in mind, we’ll often discuss the distributions of vari- model are all Gaussian, treating 𝑃𝑖,𝑡 as certain is the only simplify-
ables whose round subscript is 𝑡, conditioned on either the prior ing approximation we will make; that is, in the limit |P𝑡 | → ∞, our
context 𝑃𝑖,<𝑡 or the posterior context 𝑃𝑖, ≤𝑡 : these are called prior method performs exact inference on Equation (1). In the following
and posterior distributions, respectively. In particular, suppose we sections, we focus some attention on generalizing the performance
have the skill prior: model to non-Gaussian log-concave families, parametrized by loca-
𝜋𝑖,𝑡 (𝑠) := Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡 ). (2) tion and scale; here, a few minor approximations keep the deriva-
tions tractable. We will use the logistic distribution as a running
Now, we observe 𝐸𝑡 . By Equation (1), it is conditionally indepen- example and see that it induces robustness; however, our framework
dent of 𝑆𝑖,𝑡 , given 𝑃𝑖, ≤𝑡 . By the law of total probability, is agnostic to the specific distributions used.
The prior rating 𝜇𝑖,𝑡 𝜋 and posterior rating 𝜇 of player 𝑖 at round
Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡 , 𝐸𝑡 ) 𝑖,𝑡
∫ 𝑡 should be statistics that summarize the player’s prior and posterior
= Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡 , 𝑃𝑖,𝑡 = 𝑝) Pr(𝑃𝑖,𝑡 = 𝑝 | 𝑃𝑖,<𝑡 , 𝐸𝑡 ) d𝑝. skill distribution, respectively. We’ll use the mode: thus, 𝜇𝑖,𝑡 is the
maximum a posteriori (MAP) estimate, obtained by setting 𝑠 to
This integral is intractable in general, since the performance maximize the posterior Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,≤𝑡 ). By Bayes’ rule,
posterior Pr(𝑃𝑖,𝑡 = 𝑝 | 𝑃𝑖,<𝑡 , 𝐸𝑡 ) depends not only on player 𝑖, but 𝜋
also on our beliefs regarding the skills of all 𝑗 ∈ P𝑡 . However, in 𝜇𝑖,𝑡 := arg max 𝜋𝑖,𝑡 (𝑠),
𝑠
the limit of infinite participants, Doob’s consistency theorem [20]
𝜇𝑖,𝑡 := arg max 𝜋𝑖,𝑡 (𝑠) Pr(𝑃𝑖,𝑡 | 𝑆𝑖,𝑡 = 𝑠). (3)
implies that the posterior concentrates at the true value 𝑃𝑖,𝑡 . That 𝑠
is, with probability one, as |P𝑡 | → ∞,
This objective suggests a two-phase algorithm to update each
player 𝑖 ∈ P𝑡 in response to the results of round 𝑡. In phase one,
Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,<𝑡 , 𝐸𝑡 ) 𝐿 , 𝐸𝑊 ). By Doob’s consistency theorem,
we estimate 𝑃𝑖,𝑡 from (𝐸𝑖,𝑡 𝑖,𝑡
∫ our estimate is extremely precise when |P𝑡 | is large, so we assume
→ Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖, ≤𝑡 ) Pr(𝑃𝑖,𝑡 = 𝑝 | 𝑃𝑖,<𝑡 , 𝐸𝑡 ) d𝑝 it to be exact. In phase two, we update our posterior for 𝑆𝑖,𝑡 and
= Pr(𝑆𝑖,𝑡 = 𝑠 | 𝑃𝑖,≤𝑡 ). the rating 𝜇𝑖,𝑡 according to Equation (3).

1 The relevant limiting procedure is to treat performances within 𝜀 -width buckets as 2 As opposed to historical skill estimation, which is concerned with 𝑃 (𝑆𝑖,𝑡 | 𝑃𝑖,≤𝑡 ′ )
ties, and letting 𝜀 → 0. This technicality appears in the proof of Theorem 3.2. for 𝑡 ′ > 𝑡 . Whole-history methods can take advantage of future information.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

L2 LR Normal Logistic
3 SKILL ESTIMATION IN TWO PHASES
12
3.1 Performance estimation 0.4
10
In this section, we describe the first phase of Elo-MMR. For nota- 8 0.3
tional convenience, we assume all probability expressions to be 6 0.2
conditioned on the prior context 𝑃𝑖,<𝑡 , and omit the subscript 𝑡. 4
Our prior belief on each player’s skill 𝑆𝑖 implies a prior distri- 0.1
2
bution on 𝑃𝑖 . Let’s denote its probability density function (pdf)
by -4 -2 2 4 -4 -2 2 4
∫
𝑓𝑖 (𝑝) := Pr(𝑃𝑖 = 𝑝) = 𝜋𝑖 (𝑠) Pr(𝑃𝑖 = 𝑝 | 𝑆𝑖 = 𝑠) d𝑠, (4) Figure 1: 𝐿2 versus 𝐿𝑅 for typical values (left). Gaussian ver-
sus logistic probability density functions (right).
where 𝜋𝑖 (𝑠) was defined in Equation (2). Let
∫ 𝑝
𝐹𝑖 (𝑝) := Pr(𝑃𝑖 ≤ 𝑝) = 𝑓𝑖 (𝑥) d𝑥,
−∞ Theorem 3.2. Suppose that for all 𝑗, 𝑓 𝑗 is continuously differen-
be the corresponding cumulative distribution function (cdf). We’ll tiable and log-concave. Then the maximizer of Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 , 𝐸𝑊𝑖 )
also define the following functions, which will be associated with is unique and given by the unique zero of
losses, draws, and wins, respectively: Õ Õ Õ
𝑄𝑖 (𝑝) := 𝑙 𝑗 (𝑝) + 𝑑 𝑗 (𝑝) + 𝑣 𝑗 (𝑝).
d −𝑓𝑖 (𝑝)
𝑙𝑖 (𝑝) := ln(1 − 𝐹𝑖 (𝑝)) = , 𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖
d𝑝 1 − 𝐹𝑖 (𝑝)
d 𝑓 ′ (𝑝) The proof appears in the appendix. Intuitively, we’re saying
𝑑𝑖 (𝑝) := ln 𝑓𝑖 (𝑝) = 𝑖 ,
d𝑝 𝑓𝑖 (𝑝) that the performance is the balance point between appropriately
d 𝑓𝑖 (𝑝) weighted wins, draws, and losses. Let’s look at two specializations
𝑣𝑖 (𝑝) := ln 𝐹𝑖 (𝑝) = . of our general model, to serve as running examples in this paper.
d𝑝 𝐹𝑖 (𝑝)
Evidently, 𝑙𝑖 (𝑝) < 0 < 𝑣𝑖 (𝑝). Now we define what it means for Gaussian performance model. If both 𝑆 𝑗 and 𝑃 𝑗 − 𝑆 𝑗 are assumed
the deviation 𝑃𝑖 − 𝑆𝑖 to be log-concave. to be Gaussian with known means and variances, then their inde-
pendent sum 𝑃 𝑗 will also be a known Gaussian. It is analytic and
Definition 3.1. An absolutely continuous random variable on a log-concave, so Theorem 3.2 applies.
convex domain is log-concave if its probability density function 𝑓 is We substitute the well-known Gaussian pdf and cdf for 𝑓 𝑗 and 𝐹 𝑗 ,
positive on its domain and satisfies respectively. A simple binary search, or faster numerical techniques
𝑓 (𝜃𝑥 + (1 − 𝜃 )𝑦) > 𝑓 (𝑥)𝜃 𝑓 (𝑦) 1−𝜃 , ∀𝜃 ∈ (0, 1), 𝑥 ≠ 𝑦. such as the Illinois algorithm or Newton’s method, can be employed
to solve for the unique zero of 𝑄𝑖 .
Log-concave distributions appear widely, and include the Gauss-
ian and logistic distributions used in Glicko, TrueSkill, and many Logistic performance model. Now we assume the performance
others. We’ll see inductively that our prior 𝜋𝑖 is log-concave at deviation 𝑃 𝑗 − 𝑆 𝑗 has a logistic distribution with mean 0 and vari-
every round. Since log-concave densities are closed under convolu- ance 𝛽 2 . In general, the rating system administrator is free to set 𝛽
tion [13], the independent sum 𝑃𝑖 = 𝑆𝑖 + (𝑃𝑖 −𝑆𝑖 ) is also log-concave. differently for each contest. Since shorter contests tend to be more
Log-concavity is made very convenient by the following lemma, variable, one reasonable choice might be to make 1/𝛽 2 proportional
proved in the appendix: to the contest duration.
Given the mean and variance of the skill prior, the independent
Lemma 3.1. If 𝑓𝑖 is continuously differentiable and log-concave, sum 𝑃 𝑗 = 𝑆 𝑗 + (𝑃 𝑗 − 𝑆 𝑗 ) would have the same mean, and a variance
then the functions 𝑙𝑖 , 𝑑𝑖 , 𝑣𝑖 are continuous, strictly decreasing, and that’s increased by 𝛽 2 . Unfortunately, we’ll see that the logistic
𝑙𝑖 (𝑝) < 𝑑𝑖 (𝑝) < 𝑣𝑖 (𝑝) for all 𝑝. performance model implies a form of skill prior from which it’s
tough to extract a mean and variance. Even if we could, the sum
For the remainder of this section, we fix the analysis with respect does not yield a simple distribution.
to some player 𝑖. As argued in Section 2, 𝑃𝑖 concentrates very For experienced players, we expect 𝑆 𝑗 to contribute much less
narrowly in the posterior. Hence, we can estimate 𝑃𝑖 by its MAP, variance than 𝑃 𝑗 − 𝑆 𝑗 ; thus, in our heuristic approximation, we take
choosing 𝑝 so as to maximize: 𝑃 𝑗 to have the same form of distribution as the latter. That is, we
take 𝑃 𝑗 to be logistic, centered at the prior rating 𝜇 𝜋𝑗 = arg max 𝜋 𝑗 ,
Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 , 𝐸𝑊 𝐿 𝑊
𝑖 ) ∝ 𝑓𝑖 (𝑝) Pr(𝐸𝑖 , 𝐸𝑖 | 𝑃𝑖 = 𝑝). with variance 𝛿 2𝑗 = 𝜎 2𝑗 + 𝛽 2 , where 𝜎 𝑗 will be given by Equation (8).
Define 𝑗 ≻ 𝑖, 𝑗 ≺ 𝑖, 𝑗 ∼ 𝑖 as shorthand for 𝑗 ∈ 𝐸𝑖𝐿 ,
𝑗 ∈ 𝐸𝑊
𝑖 ,
This distribution is analytic and log-concave, so the same methods
𝑗 ∈ P \ (𝐸𝑖𝐿 ∪ 𝐸𝑊
𝑖 ) (that is, 𝑃 𝑗 > 𝑃 ,
𝑖 𝑗𝑃 < 𝑃 ,
𝑖 𝑗𝑃 = 𝑃𝑖 ), respectively. based on Theorem 3.2 apply.
The following theorem yields our MAP estimate: Let’s derive 𝑄𝑖 explicitly in this case, since it has a rather intuitive
form. The logistic distribution with variance 𝛿 2𝑗 has scale parameter
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

√
𝛿¯𝑗 := 𝜋3 𝛿 𝑗 ; its cdf and pdf are: Logistic performance model. When the performance model is non-
! Gaussian, the pointwise product of pdfs does not simplify so easily.
1 1 𝑝 − 𝜇 𝜋𝑗
𝐹 𝑗 (𝑝) = = 1 + tanh , By Equation (3), each round contributes an additional factor to the
−(𝑝−𝜇 𝜋𝑗 )/𝛿¯𝑗 2 2𝛿¯𝑗 belief distribution. In general, we allow it to consist of a collection
1+𝑒
(𝑝−𝜇 𝜋𝑗 )/𝛿¯𝑗 𝑝 − 𝜇 𝜋𝑗 of simple log-concave factors, one for each round in which player 𝑖
𝑒 1 2
𝑓 𝑗 (𝑝) = = sech . has participated. Denote 𝑖’s participation history by
(𝑝−𝜇 𝜋𝑗 )/𝛿¯𝑗 2 4𝛿¯𝑗 2𝛿¯𝑗

𝛿¯𝑗 1 + 𝑒 H𝑖,𝑡 := {𝑘 ∈ {1, . . . , 𝑡 } : 𝑖 ∈ P𝑘 }.
They satisfy two very convenient relations: Since the factors deal with only a single player, we’ll omit the
𝐹 𝑗′ (𝑝) = 𝑓 𝑗 (𝑝) = 𝐹 𝑗 (𝑝)(1 − 𝐹 𝑗 (𝑝))/𝛿¯𝑗 , subscript 𝑖. Specializing to the logistic setting, each 𝑘 ∈ H𝑡 con-
tributes a logistic factor to the posterior, with mean 𝑝𝑘 and variance
𝑓 ′ (𝑝) = 𝑓 𝑗 (𝑝)(1 − 2𝐹 𝑗 (𝑝))/𝛿¯𝑗 ,
𝑗 𝛽𝑘2 . We still use a Gaussian initial prior, with mean and variance
from which it follows that denoted by 𝑝 0 and 𝛽 02 , respectively. Postponing the discussion of
1 − 2𝐹 𝑗 (𝑝) −𝐹 𝑗 (𝑝) 1 − 𝐹 𝑗 (𝑝) skill evolution to Section 4, for the moment we assume that 𝑆𝑘 = 𝑆 0
𝑑 𝑗 (𝑝) = = + = 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝). for all 𝑘. The posterior pdf, up to normalization, is then
𝛿¯ 𝛿¯ 𝛿¯
Ö
In other words, a tie counts as the sum of a win and a loss. 𝜋 0 (𝑠) Pr(𝑃𝑘 = 𝑝𝑘 | 𝑆𝑘 = 𝑠)
This can be compared to the approach (used in Elo, Glicko, BAR, 𝑘 ∈H𝑡
Topcoder, and Codeforces) of treating each tie as half a win plus !
half a loss.3 (𝑠 − 𝑝 0 ) 2 Ö 2 𝜋 𝑠 − 𝑝𝑘
∝ exp − sech √ . (5)
Finally, putting everything together: 2𝛽 02 𝑘 ∈H𝑡 12 𝛽𝑘
Õ Õ Õ
𝑄𝑖 (𝑝) = 𝑙 𝑗 (𝑝) + 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝) + 𝑣 𝑗 (𝑝) Maximizing the posterior density amounts to minimizing its
𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖 negative logarithm. Up to a constant offset, this is given by
Õ Õ
= 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝)
𝑠 − 𝑝0
Õ
𝑠 − 𝑝𝑘

𝑗 ⪰𝑖 𝑗 ⪯𝑖 𝐿(𝑠) := 𝐿2 + 𝐿𝑅 ,
𝛽0 𝛽𝑘
Õ −𝐹 𝑗 (𝑝) Õ 1 − 𝐹 𝑗 (𝑝) 𝑘 ∈H𝑡
= + . 1

𝜋𝑥

𝑗 ⪰𝑖 𝛿¯𝑗 𝑗 ⪯𝑖𝛿¯𝑗 where 𝐿2 (𝑥) := 𝑥 2 and 𝐿𝑅 (𝑥) := 2 ln cosh √ .
2 12
Our estimate for 𝑃𝑖 is the zero of this expression. Its terms cor-
respond to probabilities, weighted by 1/𝛿¯𝑗 , of losing and winning 𝑠 − 𝑝0 Õ 𝜋 (𝑠 − 𝑝𝑘 )𝜋
Í Thus, 𝐿 ′ (𝑠) = + √ tanh √ . (6)
against each player 𝑗. Accordingly, we can interpret 𝑗 ∈ P (1 − 𝛽02
𝛽
𝑘 ∈H 𝑘 3 𝛽𝑘 12
𝑡
𝐹 𝑗 (𝑝))/𝛿¯𝑗 as a weighted expected rank of a player whose perfor-
𝐿 ′ is continuous and strictly increasing in 𝑠, so its zero is unique:
mance is 𝑝. 𝑃𝑖 can thus be viewed as the performance level at
it is the MAP 𝜇𝑡 . Similar to what we did in the first phase, we can
which one’s expected rank would equal 𝑖’s actual rank. While the
solve for 𝜇𝑡 with binary search or other root-solving methods.
Codeforces and Topcoder systems compute performance values in
Furthermore, Equation (6) reveals a rather intuitive interpreta-
a similar manner, here we’ve derived the formula from Bayesian
tion for the rating 𝜇𝑡 as an aggregate of the historical performances
principles.
𝑝 ≤𝑡 : Gaussian factors in 𝐿 become 𝐿2 penalty terms, whereas logis-
tic factors appear as the more interesting 𝐿𝑅 terms. In Figure 1, we
3.2 Belief update
see that 𝐿𝑅 behaves quadratically near the origin, but linearly at the
Having estimated 𝑃𝑖,𝑡 in the first phase, the second phase is more extremities. It’s essentially a smoothed Huber loss, interpolating
straightforward. Ignoring normalizing constants, Equation (3) tells between 𝐿2 and 𝐿1 over a scale of magnitude 𝛽𝑘 .
us that the pdf of the skill posterior can be obtained as the pointwise It is well-known that minimizing a sum of 𝐿2 terms pushes the
product of the pdfs of the skill prior and the performance model. argument towards a weighted mean, while minimizing a sum of
When both factors are differentiable and log-concave, then so is 𝐿1 terms pushes the argument towards a weighted median. With
their product. Its maximum is the new rating 𝜇𝑖,𝑡 ; let’s see how to 𝐿𝑅 terms, the net effect is that 𝜇𝑡 acts like a robust average of the
compute it for the same two specializations of our model. historical performances 𝑝 ≤𝑡 . Specifically, one can check that
Gaussian performance model. When the skill prior and perfor-
Í
𝑘 𝑤 𝑘 𝑝𝑘 1
mance model are Gaussian with known means and variances, multi- 𝜇𝑡 = Í , where 𝑤 0 := 2 and
𝑘 𝑤𝑘 𝛽0
plying their pdfs yields another known Gaussian. Hence, the poste-
rior is compactly represented by its mean 𝜇𝑖,𝑡 , which coincides with 𝜋 (𝜇𝑡 − 𝑝𝑘 )𝜋
2 , which is our uncertainty
the MAP and rating; and its variance 𝜎𝑖,𝑡 𝑤𝑘 := √ tanh √ for 𝑘 ∈ H𝑡 . (7)
(𝜇𝑡 − 𝑝𝑘 )𝛽𝑘 3 𝛽𝑘 12
regarding the player’s skill.
𝑤𝑘 is close to 1/𝛽𝑘2 for typical performances, but can be up to
3 Elo-MMR, 2
too, can be modified to split ties into half win plus half loss. It’s easy to 𝜋 /6 times more as |𝜇𝑡 − 𝑝𝑘 | → 0, or vanish entirely as |𝜇𝑡 − 𝑝𝑘 | →
check that Lemma 3.1 still holds if 𝑑 𝑗 (𝑝) is replaced by 𝑤𝑙 𝑙 𝑗 (𝑝) + 𝑤𝑣 𝑣 𝑗 (𝑝) , provided
that 𝑤𝑙 , 𝑤𝑣 ∈ [0, 1] and |𝑤𝑙 − 𝑤𝑣 | < 1. In particular, we can set 𝑤𝑙 = 𝑤𝑣 = 0.5. The ∞. The latter feature is due to the thicker tails of the logistic dis-
results in Section 5 won’t be altered by this change. tribution, as compared to the Gaussian, resulting in an algorithm
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

that resists drastic rating changes in the presence of a few unusu- overall performance. By simply distributing the credit equally, we
ally good or bad performances. We’ll formally state this robustness ensure that every individual’s incentive is perfectly aligned with
property in Theorem 5.7. winning as a team.
Estimating skill uncertainty. While there is no easy way to com-
4 SKILL EVOLUTION OVER TIME
pute the variance of a posterior in the form of Equation (5), it will
be useful to have some estimate 𝜎𝑡2 of uncertainty. There is a simple Over time, as a player trains or rests, a player’s skill can change. If
formula in the case where all factors are Gaussian. Since moment- we model skill as a static variable, our system will eventually grow
matched logistic and normal distributions are relatively close (cf. so confident in its estimate that it will refuse to admit substantial
Figure 1), we apply the same formula: changes. To remedy this, we introduce a skill evolution model, so
1 Õ 1 that in general 𝑆𝑡 ≠ 𝑆𝑡 ′ for 𝑡 ≠ 𝑡 ′ . Rather than simply being equal
2
:= . (8) to the previous round’s posterior, now the skill prior at round 𝑡 is
𝜎𝑡 𝛽2
𝑘 ∈ {0}∪H 𝑘 given by
𝑡
∫
3.3 Team competitions 𝜋𝑡 (𝑠) = Pr(𝑆𝑡 = 𝑠 | 𝑆𝑡 −1 = 𝑥) Pr(𝑆𝑡 −1 = 𝑥 | 𝑃 <𝑡 ) d𝑥 . (9)
While our main focus is on ranked competitions between a large
number of individuals, Elo-MMR can be adapted to ranked compe- The factors in the integrand are the skill evolution model and the
titions between a large number of teams. In this setting, round 𝑡’s previous round’s posterior, respectively. Following other Bayesian
set of participants P𝑡 is partitioned into a disjoint union of teams rating systems (e.g., Glicko, Glicko-2, and TrueSkill [22, 23, 25]),
Ã
𝜏 ∈ T𝑡 : formally, P𝑡 = 𝜏 ∈ T𝑡 𝜏. we model the skill changes 𝑆𝑡 − 𝑆𝑡 −1 as independent zero-mean
Instead of ranking individual 𝑖 by their performance 𝑃𝑖 , the com- Gaussians. That is, Pr(𝑆𝑡 | 𝑆𝑡 −1 = 𝑥) is a Gaussian with mean 𝑥
petition ranks an entire team 𝜏 by a performance variable 𝑃𝜏 , which and some variance 𝛾𝑡2 .
depends on the skills {𝑆𝑖 : 𝑖 ∈ 𝜏 } of all its members. In general, the There is some flexibility in how 𝛾𝑡 is set. Glicko, in its origi-
probabilistic team performance model should be domain-specific: nal presentation, sets 𝛾𝑡2 proportionally to the time elapsed since
depending, for instance, on whether game outcomes are most heav- the last update, corresponding to a continuous Brownian motion.
ily influenced by a team’s weakest or strongest player. A default Codeforces and Topcoder simply set 𝛾𝑡 to a constant when a player
choice that credits team members equally is the sum of their indi- participates, and zero otherwise, corresponding to changes that are
vidual performances: in proportion to how often the player competes. Now we are ready
Õ Õ Õ to complete the two specializations of our rating system.
𝑃𝜏 := 𝑃𝑖 = 𝑆𝑖 + (𝑃𝑖 − 𝑆𝑖 ).
𝑖 ∈𝜏 𝑖 ∈𝜏 𝑖 ∈𝜏 Gaussian performance model. If the performance model and the
Thus, 𝑃𝜏 is a sum of 2|𝜏 | independently distributed terms. Just prior on 𝑆𝑡 −1 are both Gaussian, then the posterior on 𝑆𝑡 −1 is also
as before, we approximate this sum by a single Gaussian or logistic Gaussian. Since 𝑆𝑡 = 𝑆𝑡 −1 + (𝑆𝑡 − 𝑆𝑡 −1 ) is a sum of independent
term with matching moments. Instead of the moments (𝜇𝑖𝜋 , 𝛿𝑖 ) of Gaussians, its prior is Gaussian as well. By induction, the skill belief
𝑃𝑖 in Algorithm 1, we’ll have distribution forever remains Gaussian. As we’ll see in Section 5.2,
Õ this Gaussian specialization of the Elo-MMR framework lacks the
𝜇𝜏𝜋 ← 𝜇𝑖 , R for robustness, so we call it Elo-MM𝜒.
𝑖 ∈𝜏
s Õ Logistic performance model. After a player’s first participation,
𝛿𝜏 ← |𝜏 |𝛽 2 + 𝜎𝑖2 . the posterior in Equation (5) becomes non-Gaussian, rendering the
𝑖 ∈𝜏 integral in Equation (9) intractable.
With this change, the algorithm proceeds almost exactly as be- A very simple approach would be to replace the full posterior in
fore, with the performance estimation step operating at the level of Equation (5) by a Gaussian approximation with mean 𝜇𝑡 (equal to
teams instead of individuals, 𝑃𝜏 , 𝜇𝜏𝜋 , 𝛿𝜏 replacing 𝑃𝑖 , 𝜇𝑖𝜋 , 𝛿𝑖 . the posterior MAP) and variance 𝜎𝑡2 (given by Equation (8)). Then,
The main caveat is that, in our limit of large competitions, we as in the previous case, the intractable integral specializes to a
only obtain precise estimates of the team performance 𝑃𝜏 . To esti- simple addition of Gaussian random variables.
mate the individual performance 𝑃𝑖 , which in turn approximates With this approximation, no memory is kept of the individual
𝑆𝑖 , we subtract all of 𝑖’s teammates’ ratings from 𝑃𝜏 . Since performances 𝑃𝑡 . Priors are simply Gaussian, while the pdf of a
Õ Õ skill posterior is the product of two factors: the Gaussian prior, and
𝑆𝑖 = 𝑃𝜏 − 𝑆𝑗 − (𝑃 𝑗 − 𝑆 𝑗 ), a logistic factor corresponding to the latest performance. To ensure
𝑗 ∈𝜏,𝑗≠𝑖 𝑗 ∈𝜏
Í robustness (see Section 5.2), 𝜇𝑡 is computed as the arg max of this
the variance of this estimate is not 𝛽 2 , but |𝜏 |𝛽 2 + 𝑗 ∈𝜏,𝑗≠𝑖 𝜎 2𝑗 . posterior before replacement by its Gaussian approximation. We
Since we don’t know who to credit for a team outcome, it’s impossi- call the rating system that takes this approach Elo-MMR(∞).
ble to precisely estimate 𝑃𝑖 . As a result, the independence argument As the name implies, it turns out to be a limiting case of Elo-
in Section 2 ceases to hold. Nonetheless, Elo-MMR for team contests MMR(𝜌). In the general setting with 𝜌 ∈ [0, ∞), we keep the full
continues to enjoy the properties described in Section 5. posterior from Equation (5). Since we cannot tractably compute the
While smarter credit-assignment schemes may be considered in effect of a Gaussian diffusion, we seek a heuristic derivation of the
future work, one should be wary of the risk that such mechanisms next round’s prior, retaining a form similar to Equation (5) while
may motivate players to seek credit, even at the expense of a team’s satisfying many of the same properties as the intended diffusion.
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

4.1 Desirable properties of a “pseudodiffusion” Algorithm 1 Elo-MMR(𝜌, 𝛽, 𝛾, 𝜇𝑖𝑛𝑖𝑡 , 𝜎𝑖𝑛𝑖𝑡 )

We begin by listing some properties that our skill evolution algo- for all rounds 𝑡 do
rithm, henceforth called a “pseudodiffusion”, should satisfy. It will P, ⪯, ⪰ ← outcome of round 𝑡
have a size parameter 𝛾 2 , analogous to the variance of a Gaussian for all players 𝑖 ∈ P in parallel do
diffusion. The first two properties are natural: if 𝑖 has never competed before then
𝜇𝑖 , 𝜎𝑖 ← 𝜇𝑖𝑛𝑖𝑡 , 𝜎𝑖𝑛𝑖𝑡
• Incentive-compatibility. First and foremost, the pseudodiffusion
𝑝𝑖 , 𝑤𝑖 ← [𝜇𝑖 ], [1/𝜎𝑖2 ]
must not break the incentive-compatibility of our rating system.
diffuse(𝑖) q
That is, a rating-maximizing player should never be motivated
to lose on purpose (see Theorem 5.5). 𝜇𝑖𝜋 , 𝛿𝑖 ← 𝜇𝑖 , 𝜎𝑖2 + 𝛽 2
• Rating preservation. The pseudodiffusion must not alter the arg max for all players 𝑖 ∈ P in parallel do
of the belief density. That is, the rating of a player should not update(𝑖)
change: 𝜇𝑡𝜋 = 𝜇𝑡 −1 .
In addition, we borrow four properties of Gaussian diffusions: To make the idea precise, we generalize the posterior from Equa-
tion (5) with fractional multiplicities 𝜔𝑘 : the 𝑘’th factor is raised
• Correct magnitude. A pseudodiffusion of size 𝛾 2
must increase to the power 𝜔𝑘 . As a result, Equations (6) and (8) become:
the skill uncertainty, as measured by Equation (8), by 𝛾 2 .
• Composability. Two pseudodiffusions applied in sequence, first 𝜔 0 (𝑠 − 𝑝 0 ) Õ 𝜔 𝜋 (𝑠 − 𝑝𝑘 )𝜋
with size 𝛾 12 and then with size 𝛾 22 , must have the same effect as 𝐿 ′ (𝑠) = 2
+ 𝑘
√ tanh √ ,
𝛽0 𝛽 3 𝛽𝑘 12
a single pseudodiffusion of size 𝛾 12 + 𝛾 22 . 𝑘 ∈H𝑡 𝑘
• Zero diffusion. In the limit as 𝛾 → 0, the effect of a pseudodiffusion 1 Õ 𝜔
:= 𝑤𝑘 , where 𝑤𝑘 := 𝑘2 . (10)
must vanish, i.e., not alter the belief distribution. 𝜎𝑡2 𝑘 ∈ {0}∪H
𝛽𝑘
𝑡
• Zero uncertainty. In the limit as 𝜎𝑡 −1 → 0 (i.e., when the previous
rating 𝜇𝑡 −1 is a perfect estimate of 𝑆𝑡 −1 ), our prior on 𝑆𝑡 must For 𝜌 ∈ [0, ∞), the Elo-MMR(𝜌) algorithm continuously and
become Gaussian with mean 𝜇𝑡 −1 and variance 𝛾 2 . Finer-grained simultaneously performs transfer and decay, with transfer proceed-
information regarding the history 𝑃 <𝑡 must be erased. ing at 𝜌 times the rate of decay. Of course, for 𝜌 = ∞, the transfer
is instantaneous and only the 0’th term survives. Holding 𝛽𝑘 fixed,
In particular, Elo-MMR(∞) fails the zero diffusion property because changes to 𝜔𝑘 can be described in terms of changes to 𝑤𝑘 :
it simplifies the belief distribution, even when 𝛾 = 0. In the proof of
Theorem 4.1, we’ll see that Elo-MMR(0) fails the zero uncertainty d𝑤 0 Õ
= −𝑟 (𝑡)𝑤 0 + 𝜌𝑟 (𝑡) 𝑤𝑘 ,
property. Thus, it is in fact necessary to have 𝜌 strictly positive and d𝑡
𝑘 ∈H𝑡
finite. In Section 5.2, we’ll come to interpret 𝜌 as a kind of inverse d𝑤𝑘
momentum. = −(1 + 𝜌)𝑟 (𝑡)𝑤𝑘 for 𝑘 ∈ H𝑡 ,
d𝑡
where the arbitrary decay rate 𝑟 (𝑡) can be eliminated by a change
4.2 A heuristic pseudodiffusion algorithm of variable d𝜏 = 𝑟 (𝑡)d𝑡. The evolution from the end of round 𝑡 − 1
Each factor in the posterior (see Equation (5)) has a parameter 𝛽𝑘 . to the start of round 𝑡 corresponds to some interval Δ𝜏, over which
Define a factor’s weight to be 𝑤𝑘 := 1/𝛽𝑘2 , which by Equation (8) the total weight will have decayed by a factor 𝜅𝑡 := 𝑒 −Δ𝜏 . Solving
Í
contributes to the total weight 𝑘 𝑤𝑘 = 1/𝜎𝑡2 . Here, unlike in the differential equations yields the new weights, distinguished by
Equation (7), 𝑤𝑘 does not depend on |𝜇𝑡 − 𝑝𝑘 |. their round 𝑡 subscripts:
Recall that the approximation step of Elo-MMR(∞) replaces all Õ
1+𝜌
𝑤 0,𝑡 = 𝜅𝑡 𝑤 0,𝑡 −1 + 𝜅𝑡 − 𝜅𝑡 𝑤𝑘,𝑡 −1,
the logistic factors by a single Gaussian whose variance is chosen
𝑘 ∈H𝑡
to ensure that the total weight is preserved. In addition, its mean is
1+𝜌
chosen to preserve the player’s rating, given by the unique zero of 𝑤𝑘,𝑡 = 𝜅𝑡 𝑤𝑘,𝑡 −1 for 𝑘 ∈ H𝑡 . (11)
Equation (6). Finally, the diffusion step of Elo-MMR(∞) increases The correct magnitude property requires the uncertainty to in-
the Gaussian’s variance, and hence the player’s skill uncertainty, crease from 𝜎𝑡2−1 to 𝜎𝑡2−1 + 𝛾𝑡2 . By Equations (10) and (11),
by 𝛾𝑡2 ; this corresponds to a decay in the weight.
To generalize the idea, we interleave the two steps in a contin- 1 Õ Õ 𝜅𝑡
= 𝑤 𝑘,𝑡 = 𝜅 𝑡 𝑤𝑘,𝑡 −1 = 2 ,
uous manner. The approximation step becomes a transfer step: 𝜎𝑡2−1 + 𝛾𝑡2 𝑘 ∈ {0}∪H 𝑘 ∈ {0}∪H
𝜎 𝑡 −1
𝑡 𝑡
rather than replace the logistic factors outright, we take away equal
Solving for the decay factor:
fractions from each of their weights, and place the sum of removed
! −1
weights onto a new Gaussian factor. In order for this operation to 𝛾𝑡2
preserve ratings, the new factor must be centered at 𝜇𝑡 −1 . Since 𝜅𝑡 = 1 + 2 .
𝜎𝑡 −1
Gaussian pdfs compose, the prior Gaussian factor can be combined
with the new one. The diffusion step becomes a decay step, re- Algorithm 1 details the full Elo-MMR(𝜌) rating system. Each
ducing each factor’s weight by equal fractions (possibly different round of competition yields a set of participants P𝑡 , along with
from the fractions in the transfer step), chosen such that the overall their rank-ordering. New players are initialized with a Gaussian
uncertainty is increased by 𝛾𝑡2 . prior. Changes in player skill are modeled by Algorithm 2; note
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

Algorithm 2 diffuse(𝑖) 5 THEORETICAL PROPERTIES

𝜅 ← (1 + 𝛾 2 /𝜎𝑖2 ) −1 In this section, we see how the simplicity of the Elo-MMR formulas
Í
𝑤𝐺 , 𝑤 𝐿 ← 𝜅 𝜌 𝑤𝑖,0, (1 − 𝜅 𝜌 ) 𝑘 ≥0 𝑤𝑖,𝑘 enables us to rigorously prove that the rating system is incentive-
𝑝𝑖,0 ← (𝑤𝐺 𝑝𝑖,0 + 𝑤 𝐿 𝜇𝑖 )/(𝑤𝐺 + 𝑤 𝐿 ) compatible, robust, and computationally efficient.
𝑤𝑖,0 ← 𝜅 (𝑤𝐺 + 𝑤 𝐿 ) 5.1 Incentive-compatibility
for all 𝑘 > 0 do
𝑤𝑖,𝑘 ← 𝜅 1+𝜌 𝑤𝑖,𝑘 To demonstrate the need for incentive-compatibility, let’s look at
√ the consequences of violating this property in the Topcoder and
𝜎𝑖 ← 𝜎𝑖 / 𝜅
Glicko-2 rating systems. These systems track a “volatility” for each
Algorithm 3 update(𝑖) player, which estimates the variance of their performances. A player
whose recent performance history is more consistent would be
Í 1 𝑥−𝜇 𝜋𝑗 Í 1 𝑥−𝜇 𝜋𝑗
𝑝 ← zero of 𝛿 tanh ¯ − 1 + 𝛿 tanh ¯ + 1 assigned a lower volatility score, than one with wild swings in
𝑥 ∈R 𝑗 ⪯𝑖 𝑗 2𝛿 𝑗 𝑗 ⪰𝑖 𝑗 2𝛿 𝑗
𝑝𝑖 .push(𝑝) performance. The volatility acts as a multiplier on rating changes;
𝑤𝑖 .push(1/𝛽 2 ) thus, players with an extremely low or high performance will have
Í 𝑤𝑖,𝑘 𝛽 2 𝑥−𝑝𝑖,𝑘 their subsequent rating changes amplified.
𝜇𝑖 ← zero of 𝑤𝑖,0 (𝑥 − 𝑝𝑖,0 ) + tanh
𝑥 ∈R 𝑘>0 𝛽¯ 2𝛽¯ While it may seem like a good idea to boost changes for players
whose ratings are poor predictors of their performance, this fea-
ture has an exploit. By intentionally performing at a weaker level, a
how the updated Gaussian term blends its old value with the new player can amplify future increases to an extent that more than com-
Gaussian term created by the transfer process. The first phase of pensates for the immediate hit to their rating. A player may even
Algorithm 3 estimates 𝑃𝑡 as the zero of a function of 𝑥. Finally, the “farm” volatility by alternating between very strong and very weak
second phase computes 𝜇𝑡 as the zero of another function. performances. After acquiring a sufficiently high volatility score,
The hyperparameters 𝜌, 𝛽, 𝛾 are domain-dependent, and can be the strategic player exerts their honest maximum performance over
set by standard hyperparameter search techniques. The system’s a series of contests. The amplification eventually results in a rat-
invariance to translation and scale allows 𝜇𝑖𝑛𝑖𝑡 , 𝜎𝑖𝑛𝑖𝑡 to be set ar- ing that exceeds what would have been obtained via honest play.
bitrarily; a common choice is 1500, 350 [23]. For convenience,
√
we This type of exploit was discovered in Glicko-2 as applied to the
¯ 3
assume 𝛽 and 𝛾 are fixed and use the shorthand 𝛽 := 𝜋 𝛽. Whereas Pokemon Go video game [3]. Table 5.3 of [19] presents a milder
our exposition used global round indices, here a subscript 𝑘 corre- violation in Topcoder competitions.
sponds to the 𝑘’th round in player 𝑖’s participation history. To get a realistic estimate of the severity of this exploit, we per-
formed a simple experiment on the first five years of the Codeforces
Theorem 4.1. Algorithm 2 with 𝜌 ∈ (0, ∞) meets all of the prop- contest dataset (see Section 6.1). In Figure 2, we plot the rating evo-
erties listed in Section 4.1. lution of the world’s #1 ranked competitive programmer, Gennady
Korotkevich, better known as tourist. In the control setting, we
plot his ratings according to the Topcoder and Elo-MMR(1) systems.
We contrast these against an adversarial setting, in which we have
Proof. We go through each of the six properties in order. tourist employ the following strategy: for his first 45 contests,
tourist plays normally (exactly as in the unaltered data). For his
• Incentive-compatibility. This property will be stated in Theo- next 45 contests, tourist purposely falls to last place whenever his
rem 5.5. To ensure that its proof carries through, the relevant facts Topcoder rating is above 2975. Finally, tourist returns to playing
to note here are that the pseudodiffusion algorithm ignores the normally for an additional 15 contests.
performances 𝑝𝑘 , and centers the transferred Gaussian weight at This strategy mirrors the Glicko-2 exploit documented in [3],
the rating 𝜇𝑡 −1 , which is trivially monotonic in 𝜇𝑡 −1 . and does not require unrealistic assumptions (e.g., we don’t demand
• Rating preservation. Recall that the rating is the unique zero of 𝐿 ′ tourist to exercise very precise control over his performances).
in Equation (10). To see that this zero is preserved, note that the Compared to a consistently honest tourist, the volatility farm-
decay and transfer operations multiply 𝐿 ′ by constants (𝜅𝑡 and ing tourist ended up 523 rating points ahead by the end of the
𝜌
𝜅𝑡 , respectively), before adding the new Gaussian term, whose experiment, with almost 1000 rating points gained in the last 15
contribution to 𝐿 ′ is zero at its center. contests alone. Transferring the same sequence of performances
• Correct magnitude. Follows from our derivation for 𝜅𝑡 . to the Elo-MMR(1) system, we see that it not only is immune to
• Composability. Follows from correct magnitude and the fact that such volatility-farming attacks, but it also penalizes the dishonest
every pseudodiffusion follows the same differential equations. strategy with a rating loss that decays exponentially once honest
• Zero diffusion. As 𝛾 → 0, 𝜅𝑡 → 1. Provided that 𝜌 < ∞, we also play resumes.
𝜌
have 𝜅𝑡 → 1. Hence, for all 𝑘 ∈ {0} ∪ H𝑡 , 𝑤𝑘,𝑡 → 𝑤𝑘,𝑡 −1 . Recall that a key purpose of modeling volatility in Topcoder
• Zero uncertainty. As 𝜎𝑡 −1 → 0, 𝜅𝑡 → 0. The total weight decays and Glicko-2 was to boost rating changes for inconsistent players.
from 1/𝜎𝑡2−1 , which becomes extremely large in this limit, to 𝛾 2 . Remarkably, Elo-MMR achieves the same effect: we’ll see in Sec-
𝜌
Provided that 𝜌 > 0, we also have 𝜅𝑡 → 0, so these weights tion 5.2 that, for 𝜌 ∈ [0, ∞), Elo-MMR(𝜌) also boosts changes to
transfer in their entirety, leaving behind a Gaussian with mean inconsistent players. And yet, we’ll now prove that no strategic
𝜇𝑡 −1 , variance 𝛾 2 , and no additional history. □ incentive for purposely losing exists in any version of Elo-MMR.
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

3750 Proof. If 𝑖 ≻ 𝑗 with 𝑖, 𝑗 adjacent in the rankings, then

Topcoder (honest) Õ Õ
3500 Topcoder (adversarial) 𝑄𝑖 (𝑝) − 𝑄 𝑗 (𝑝) = (𝑑𝑘 (𝑝) − 𝑙𝑘 (𝑝)) + (𝑣𝑘 (𝑝) − 𝑑𝑘 (𝑝)) > 0,
𝑘∼𝑖 𝑘∼𝑗
3250
for all 𝑝. Since 𝑄𝑖 and 𝑄 𝑗 are decreasing functions, it follows that
3000
Rating

𝑝𝑖 > 𝑝 𝑗 . By induction, the conclusion also holds for 𝑖, 𝑗 that are not
2750 adjacent in the rankings. □
2500 What matters for incentives is that performance scores be coun-
terfactually monotonic; meaning, if we were to alter the round
2250
standings, a strategic player will always prefer to place higher:
2000
Lemma 5.3. In any given round, holding fixed the relative rankings
1750 among all players other than 𝑖 (and holding fixed all preceding rounds),
0 20 40 60 80 100
Contest # the performance 𝑝𝑖 is a monotonic function of player i’s prior rating
and of player 𝑖’s rank in this round.
3750
Elo-MMR (honest) Proof. 𝑄𝑖 (𝑝) depends on the prior rating 𝜇𝑖𝜋 only through the
3500 Elo-MMR (adversarial) self-tie term 𝑑𝑖 , which in turn depends only on 𝑝 − 𝜇𝑖𝜋 . Thus, a
change in 𝜇𝑖𝜋 has the same effect as an opposite change in 𝑝. By
3250
Lemma 3.1, 𝑑𝑖 is monotonically increasing in 𝜇𝑖𝜋 , from which it
3000 follows that 𝑝𝑖 is also monotonically increasing in 𝜇𝑖𝜋 .
Rating

2750 Now, since an upward shift in 𝑖’s ranking can only convert losses
to ties and ties to wins, Lemma 5.1 implies that 𝑝𝑖 is also monotoni-
2500 cally increasing in improvements to 𝑖’s rank. □
2250
Having established the relationship between round rankings
2000 and performance scores, the next step is to prove that, even with
1750
hindsight, players will always prefer their performance scores to
0 20 40 60 80 100 be as high as possible:
Contest #
Lemma 5.4. Holding fixed the set of contest rounds in which a
player has participated, their current rating is monotonic in each of
Figure 2: Volatility farming attack on the Topcoder system. their past performance scores.

To this end, we need a few lemmas. Recall that, for the purposes Proof. The player’s rating is given by the zero of 𝐿 ′ in Equa-
of the algorithm, the performance 𝑝𝑖 is defined to be the unique tion (10). This expression contains the variables 𝛽 :, 𝜔 :, 𝑝 :, and 𝑠. As
Í Í Í
zero of the function 𝑄𝑖 (𝑝) := 𝑗 ≻𝑖 𝑙 𝑗 (𝑝) + 𝑗∼𝑖 𝑑 𝑗 (𝑝) + 𝑗 ≺𝑖 𝑣 𝑗 (𝑝), 𝑝𝑘 is varied, 𝛽 : and 𝜔 : do not change: although the pseudodiffusions
whose terms 𝑙 𝑗 , 𝑑 𝑗 , 𝑣 𝑗 are contributed by opponents against whom 𝑖 of Section 4 do modify 𝜔 : , these changes are agnostic to 𝑝𝑘 . On the
lost, drew, or won, respectively. Wins always contribute positively other hand, 𝐿 ′ (𝑠) is monotonically increasing in 𝑠 and decreasing
to a player’s performance score, while losses contribute negatively: in each of the 𝑝𝑘 . Therefore, its zero is monotonically increasing in
each of the 𝑝𝑘 .
Lemma 5.1. Adding a win term to 𝑄𝑖 , or replacing a tie term by a This is almost what we wanted to prove, except that 𝑝 0 is not
win term, always increases its zero. Conversely, adding a loss term, or a performance. Due to the pseudodiffusion’s transfer step (or the
replacing a tie term by a loss term, always decreases it. actual diffusion, in the case of Elo-MM𝜒), 𝑝 0 is a weighted average
of its previous value and the prior rating, and so it is monotonic in
Proof. By Lemma 3.1, 𝑄𝑖 (𝑝) is decreasing in 𝑝. Thus, adding a both. Using this same lemma in the previous round as an inductive
positive term will increase its zero whereas adding a negative term hypothesis, it follows that 𝑝 0 is monotonic in past performances.
will decrease it. The desired conclusion follows by noting that, for By induction, the proof is complete. □
all 𝑗 and 𝑝,
Finally, we conclude that a rating-maximizing player is always
𝑣 𝑗 (𝑝) > 0, 𝑣 𝑗 (𝑝) − 𝑑 𝑗 (𝑝) > 0, motivated to improve their round rankings:
𝑙 𝑗 (𝑝) < 0, 𝑙 𝑗 (𝑝) − 𝑑 𝑗 (𝑝) < 0. Theorem 5.5 (Incentive-compatibility). Holding fixed the set
of contest rounds in which each player has participated, and the
□ historical ratings and relative rankings among all players other than
𝑖, player 𝑖’s current rating is monotonic in each of 𝑖’s past rankings.
While not needed for our main result, a similar argument shows
that performance scores are monotonic across the round standings: Proof. Choose any contest round in player 𝑖’s history, and con-
sider improving player 𝑖’s rank in that round while holding every-
Theorem 5.2. If 𝑖 ≻ 𝑗 (that is, 𝑖 beats 𝑗) in a given round, then thing else fixed. It suffices to show that player 𝑖’s current rating
the players’ performance estimates satisfy 𝑝𝑖 > 𝑝 𝑗 . would necessarily increase as a result.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

In the altered round, by Lemma 5.3, 𝑝𝑖 is increased; and by between Gaussian and logistic factors in the posterior. Recall the
Lemma 5.4, player 𝑖’s post-round rating is increased. By Lemma 5.3 notation in Equation (10), describing the loss function and weights.
again, this increases player 𝑖’s performance score in the following
Theorem 5.7. In the Elo-MMR(𝜌) rating system, let
round. Proceeding inductively, we find that performance scores and
ratings from this point onward are all increased. □ Δ+ := lim 𝜇𝑡 − 𝜇𝑡 −1, Δ− := lim 𝜇𝑡 −1 − 𝜇𝑡 .
𝑝𝑡 →+∞ 𝑝𝑡 →−∞
In the special cases of Elo-MM𝜒 or Elo-MMR(∞), the rating sys- Then, for Δ± ∈ {Δ+, Δ− },
tem is “memoryless”: the only data retained for each player are −1
the current rating 𝜇𝑖,𝑡 and uncertainty 𝜎𝑖,𝑡 ; detailed performance 𝜋 © 𝜋2 Õ 𝜋 1
√ 𝑤 0 + ≤ Δ± ≤ √
ª
𝑤𝑘 ® .
history is not saved. In this setting, we present a natural mono- 𝛽𝑡 3 6 𝛽𝑡 3 𝑤 0
𝑘 ∈H𝑡 −1
tonicity theorem. A similar theorem was previously stated for the « ¬
Codeforces system, albeit in an informal context without proof [8]. Proof. The limits exist, by monotonicity. Using the fact that
𝑑 tanh(𝑥) ≤ 1, differentiating 𝐿 ′ in Equation (10) yields
0 < 𝑑𝑥
Theorem 5.6 (Memoryless Monotonicity). In either the Elo-
MM𝜒 or Elo-MMR(∞) system, suppose 𝑖 and 𝑗 are two participants of 𝜋2 Õ
round 𝑡. Suppose that the ratings and corresponding uncertainties sat- ∀𝑠 ∈ R, 𝑤 0 ≤ 𝐿 ′′ (𝑠) ≤ 𝑤 0 + 𝑤𝑘 .
6
isfy 𝜇𝑖,𝑡 −1 ≥ 𝜇 𝑗,𝑡 −1, 𝜎𝑖,𝑡 −1 = 𝜎 𝑗,𝑡 −1 . Then, 𝜎𝑖,𝑡 = 𝜎 𝑗,𝑡 . Furthermore: 𝑘 ∈H𝑡 −1
If 𝑖 ≻ 𝑗 in round 𝑡, then 𝜇𝑖,𝑡 > 𝜇 𝑗,𝑡 . Now, the performance at round 𝑡 adds a new term with multiplic-
(𝑠−𝑝𝑘 )𝜋
If 𝑗 ≻ 𝑖 in round 𝑡, then 𝜇 𝑗,𝑡 − 𝜇 𝑗,𝑡 −1 > 𝜇𝑖,𝑡 − 𝜇𝑖,𝑡 −1 . ity one to 𝐿 ′ (𝑠): its value is 𝜋√ tanh √ . As a result, for every
𝛽𝑘 3 𝛽𝑘 12
Proof. The new contest round will add a rating perturbation 𝑠 ∈ R, lim𝑝𝑡 →±∞ 𝐿 ′ (𝑠) increases by ∓ 𝜋√ , while lim𝑝𝑡 →±∞ 𝐿 ′′ (𝑠)
𝛽𝑡 3
with variance 𝛾𝑡2 , followed by a new performance with variance 𝛽𝑡2 . does not change at all. Since we had 𝐿 ′ (𝜇𝑡 −1 ) = 0 without this new
As a result, term, after adding the term we have
!− 1 !− 1 𝜋
1 1 2
1 1 2 lim 𝐿 ′ (𝜇𝑡 −1 ) → ∓ √ .
𝜎𝑖,𝑡 = 2 + 2 = 2 + 2 = 𝜎 𝑗,𝑡 . 𝑝𝑡 →±∞ 𝛽𝑡 3
𝜎𝑖,𝑡 −1 + 𝛾𝑡2 𝛽𝑡 2
𝜎 𝑗,𝑡 −1 + 𝛾𝑡 𝛽𝑡
Dividing by the former inequalities yields the desired result. □
The remaining conclusions are consequences of three proper-
ties: memorylessness, incentive-compatibility (Theorem 5.5), and The proof reveals that the magnitude of Δ± depends inversely
translation-invariance (ratings, skills, and performances are quanti- on that of 𝐿 ′′ in the vicinity of the current rating, which in turn
fied on a common interval scale relative to one another). is related to the derivative of the tanh terms. If a player’s perfor-
Since the Elo-MM𝜒 or Elo-MMR(∞) systems are memoryless, we mances vary wildly, the tanh terms will be widely dispersed, so
may replace the initial prior and performance histories of players any 𝑠 ∈ R will necessarily be in the tail ends of most of the terms.
with any alternate histories of our choosing, as long as our choice is Tails contribute very little to 𝐿 ′ (𝑠), enabling a larger rating change.
compatible with their current rating and uncertainty. In particular, Conversely, the tanh terms of a player with a very consistent per-
both 𝑖 and 𝑗 can be considered to have participated in the same formance history will contribute large derivatives, so the bound on
set of rounds, with 𝑖 always performing at 𝜇𝑖,𝑡 −1 . and 𝑗 always their rating change will be small.
performing at 𝜇 𝑗,𝑡 −1 . Round 𝑡 is unchanged. Thus, Elo-MMR naturally caps the rating changes of all play-
Suppose 𝑖 ≻ 𝑗. Since 𝑖’s historical performances are all equal or ers, and the cap is smaller for consistent performers. The cap will
stronger than 𝑗’s, Theorem 5.5 implies 𝜇𝑖,𝑡 > 𝜇 𝑗,𝑡 . increase after an extreme performance, providing a similar “momen-
Suppose 𝑗 ≻ 𝑖 instead. By translation-invariance, if we shift each tum” to the Topcoder and Glicko-2 systems, but without sacrificing
of 𝑗’s performances, up to round 𝑡 and including the initial prior, incentive-compatibility (Theorem 5.5).
upward by 𝜇𝑖,𝑡 −1 − 𝜇 𝑗,𝑡 −1 , the rating changes between rounds will Let’s compare the lower and upper bound in Theorem 5.7: within
be unaffected. Players 𝑖 and 𝑗 now have identical histories, except a factor of 𝜋 2 /6, their ratio corresponds to the normal term’s weight
Í
that we still have 𝑗 ≻ 𝑖 at round 𝑡. Therefore, 𝜇 𝑗,𝑡 −1 = 𝜇𝑖,𝑡 −1 and, 𝑤 0 relative to the total 𝑘 𝑤𝑘 . Recall that 𝜌 is the weight transfer
by Theorem 5.5, 𝜇 𝑗,𝑡 > 𝜇𝑖,𝑡 . Subtracting the equation from the rate: larger 𝜌 results in more weight being transferred into 𝑤 0 ; in
inequality proves the second conclusion. □ this case, the lower and upper bound tend to stay close together.
Conversely, the momentum effect is more pronounced when 𝜌
5.2 Robust response is small. In the extreme case 𝜌 = 0, 𝑤 0 vanishes for experienced
players, so a sufficiently volatile player would be subject to corre-
Another desirable property in many settings is robustness: a player’s
spondingly large rating updates.
rating should not change too much in response to any one con-
In general, according to Algorithm 2, the asymptotic steady-
test, no matter how extreme their performance. The Codeforces Í
state values of 𝑤 0 and 𝑊 := 𝑘 𝑤𝑘 must jointly solve the fixpoint
and TrueSkill systems lack this property, allowing for unbounded
equation
rating changes. Topcoder achieves robustness by clamping any
𝑤 0 = 𝜅𝑤 0 + (𝜅 − 𝜅 1+𝜌 )(𝑊 − 𝑤 0 ).
changes that exceed a cap, which is initially high for new players
but decreases with experience. Rearranging yields an expression for the steady-state ratio:
When 𝜌 > 0, Elo-MMR(𝜌) achieves robustness in a natural, 𝑤 0 𝜅 − 𝜅 1+𝜌
smoother manner. To understand how, we look at the interplay = .
𝑊 1 − 𝜅 1+𝜌
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

If we don’t expect player skill to change too rapidly, then the Dataset # contests avg. # participants / contest
system parameters should be set in such a way that 𝜅 ≈ 1. In this Codeforces 1257 3899
limit, using 1 − 𝜅 𝑥 ≈ (1 − 𝜅)𝑥 yields Topcoder 2115 391
Reddit 1000 20
𝑤0 (1 − 𝜅)𝜌 1
≈ = . CTF 1100 354
𝑊 (1 − 𝜅)(1 + 𝜌) 1 + 1/𝜌 DanceSport 18292 6
Thus, the upper bound in Theorem 5.7 is approximately propor- Synth-large 50 10000
tional to 1 + 1/𝜌. Loosely speaking, therefore, the additive term 1/𝜌 Synth-small 15000 5
may be interpreted as a momentum parameter. Table 1: Summary of test datasets.

5.3 Runtime analysis and optimizations If the contests are extremely large, so that Ω(1/𝜀 2 ) opponents
Let’s look at the computation time needed to process a round with have a rating and uncertainty in the same 𝜀-width bucket as player
participant set P, where we again omit the round subscript. Each 𝑖, then it’s possible to do even better: up to the allowed precision 𝜀,
player 𝑖 has a participation history H𝑖 . the corresponding terms can be treated as duplicates. Hence, their
Estimating 𝑃𝑖 entails finding the zero of a monotonic function sum can be determined by counting how many of these opponents
with 𝑂 (|P |) terms, and then obtaining the rating 𝜇𝑖 entails finding win, lose, or tie against player 𝑖. Given the pre-sorted list of ranks of
the zero of another monotonic function with 𝑂 (|H𝑖 |) terms. Using players in the bucket, two binary searches would yield the answer.
either of the Illinois or Newton methods, solving these equations In practice, a single bucket might not contain enough participants,
to precision 𝜀 takes 𝑂 (log log 𝜀1 ) iterations. As a result, the total so we sample enough buckets to yield the desired precision.
runtime needed to process one round of competition is Simple parallelism. Since each player’s rating computation is
! independent, the algorithm is embarrassingly parallel. Threads can
Õ 1
𝑂 (|P | + |H𝑖 |) log log . read the same global data structures, so each additional thread
𝜀 contributes only 𝑂 (1) memory overhead.
𝑖 ∈P

This complexity is more than adequate for Codeforces-style com- 6 EXPERIMENTS

petitions with thousands of contestants and history lengths up to a
few hundred. Indeed, we were able to process the entire history of In this section, we describe experiments on real-world datasets,
Codeforces on a small laptop in less than half an hour. Nonetheless, mined from several sources that will be described in Section 6.1. We
it may be cost-prohibitive in truly massive settings, where |P | or compare the rating systems described in Section 6.2, on the metrics
|H𝑖 | number in the millions. Fortunately, it turns out that both of runtime and predictive accuracy, as described in Section 6.3. All
functions may be compressed down to a bounded number of terms, experiments were run on a 2.3 GHz 8-core Skylake machine with
with negligible loss of precision. 32 GB of memory. Implementations of all rating systems, dataset
mining, and additional processing used in our experiments can be
Adaptive subsampling. In Section 2, we used Doob’s consistency found at https://github.com/EbTech/Elo-MMR.
theorem to argue that our estimate for 𝑃𝑖 is consistent. Specifically,
Hyperparameter search. To ensure fair comparisons, we ran a
we saw that 𝑂 (1/𝜀 2 ) opponents are needed to get the typical error
separate grid search for each triple of algorithm, dataset, and metric,
below 𝜀. Thus, we can subsample the set of opponents to include in
over all of the algorithm’s hyperparameters. The hyperparameter
the estimation, omitting the rest. Random sampling is one approach.
set that performed best on the first 10% of the dataset, was then
A more efficient approach chooses a fixed number of opponents
used to test the algorithm on the remaining 90% of the dataset.
whose ratings are closest to that of player 𝑖, as these are more likely
to provide informative match-ups. On the other hand, if the setting
6.1 Datasets
requires incentive-compatibility to hold exactly, then one must
avoid choosing different opponents for each player. Due to the scarcity of public domain datasets for rating systems, we
mined five datasets to analyze the effectiveness of our system. The
History compression. Similarly, it’s possible to bound the number datasets were mined using data from each source website’s incep-
of stored factors in the posterior. Our skill-evolution algorithm tion up to February 5th, 2022. We also created synthetic datasets to
decays the weights of old performances at an exponential rate. test our system’s performance when the data generating process
Thus, the contributions of all but the most recent 𝑂 (log 𝜀1 ) terms matches our theoretical model. Summary statistics of the datasets
are negligible. Rather than erase the older logistic terms outright, we are presented in Table 1.
recommend replacing them with moment-matched Gaussian terms,
Codeforces contest history. This dataset contains the current en-
similar to the transfers in Section 4 with 𝜅𝑡 = 0. Since Gaussians
tire history of rated contests ever run on codeforces.com, the dom-
compose easily, a single term can then summarize an arbitrarily
inant platform for online programming competitions. The Code-
long prefix of the history.
forces platform has over 1 million registered users, over 400K of
Substituting 1/𝜀 2 and log 𝜀1 for |P | and |H𝑖 |, respectively, the
whom are rated, and has hosted over 1000 contests to date. Typically,
runtime of Elo-MMR with both optimizations becomes
each contest has a few thousand participants, takes 2 to 3 hours, and

|P | 1 contains 5 to 8 problems. Players are ranked by total points, with
𝑂 log log .
𝜀2 𝜀 more points typically awarded for tougher problems and for early
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

solves. They may also attempt to “hack” one another’s submissions versions of all the algorithms in the safe subset of Rust, parellelized
for bonus points, identifying test cases that break their solutions. using the Rayon crate; as such, the Rust compiler verifies that they
contain no data races [33]. The only exception is TrueSkill: the
Topcoder contest history. This dataset contains the current en-
inherent seqentiality of its message-passing procedure prevented
tire history of algorithm contests ever run on the topcoder.com.
us from parallelizing it.
Topcoder is a predecessor to Codeforces, with over 1.4 million
registered users, and a long history as a pioneering platform for Elo-MMR. We specialize our rating system into two types: Elo-
programming contests. It hosts a variety of contest types, including MM𝜒 with a Gaussian performance model, and Elo-MMR(𝜌) with
over 2000 algorithm contests to date. The scoring system is sim- a logistic performance model and pseudodiffusion rate 𝜌. We make
ilar to Codeforces, but with shorter rounds: typically 75 minutes use of the optimizations in Section 5.3, bounding both the number
allotted for a set of 3 problems. of sampled opponents and the history length by 500.
SubredditSimulator threads. This dataset contains data scraped
Topcoder system. The Topcoder website provides not only one
from the current top 1000 most upvoted threads on the website
of the oldest dataset of programming competitions, but also one of
reddit.com/r/SubredditSimulator. Reddit is a social news ag-
the oldest massively multiplayer deployments of a rating system.
gregation website with over 400 million monthly active users. The
The Topcoder system [10] generalizes Glicko-2, and suffers from
site itself is broken down into sub-sites called subreddits. Users
the same lack of incentive-compatibility [19]. Close variants of this
then post and comment to the subreddits, where the posts and
system are used by other contest sites, such as CodeChef [1].
comments receive votes from other users. In the subreddit Subred-
ditSimulator, users are language generation bots trained on text Codeforces system. In response to the main drawback of Topcoder,
from other subreddits. Automated posts are made by these bots to the Codeforces rating system [8] was specifically designed to be
SubredditSimulator every 3 minutes, and real users of Reddit vote incentive-compatible. It features more ad hoc choices than the other
on the best bot. Each post (and its associated comments) can thus systems: for instance, its rating updates target the geometric mean
be interpreted as a round of competition between the bots who of a player’s expected and actual ranks. Close variants of this system
commented. are used by other contest sites, such as LeetCode [7].
Capture the Flag competition history. This dataset contains data
scraped from ctftime.org, an archive site for Capture the Flag (CTF) TrueSkill. We use the improved TrueSkill algorithm of [31], bas-
style computer security contests. Teams are scored based on the ing our code on an open-source implementation of the same algo-
digital “flags" that they find by cracking computer security chal- rithm. Developed for the purpose of video game matchmaking on
lenges. CTFtime tracks over 150K teams and 1000 competitions. Microsoft’s Xbox Live platform, TrueSkill [25] is a Bayesian rating
Since these competitions are organized by a variety of groups, they system, implemented using a powerful probabilistic programming
come in a wide range of sizes. framework. Its update rules are rather complex, requiring iterations
of approximate message passing. It’s very effective on games with
DanceSport competition history. This dataset contains data scraped moderate numbers of players (typically 2 to 16), but struggles in
from results.o2cm.com. O2 CM is the dominant software package our experiments involving hundreds to thousands of players.
for hosting and managing competitive ballroom dance competitions
in North America. Its freely accessible online database includes an Glicko. The Glicko rating system [22] is a classic extension of Elo
average of one competition per week. Each competition is divided which, unlike Glicko-2, is incentive-compatible. While the Bayesian
into events based on age category, syllabus level, and dance style. mathematics of Glicko was derived only for 2-player games, a
Since these events are judged and ranked separately, we process naive baseline for 𝑁 -player games can be obtained by decomposing
them as distinct rounds, in the order listed by O2 CM. Since model- the game into its 𝑁 2 pairwise matchups (including self-draws).
ing the chemistry between dance partners is beyond this paper’s Since these outcomes are far from independent, we normalize the
scope, we simply treat each dance couple as a distinct contestant. collective weight of all 𝑁 updates applying to each player, to match
that of a hypothetical maximally informative 2-player game, i.e.,
Synthetic datasets (small and large). The small and large datasets
against an equally skilled player whose skill is completely certain.
contain 1K and 10K players respectively, with skills and perfor-
mances generated according to the logistic generative model in BAR. Bayesian Approximation Ranking [34] shares our goal of
Section 2. Players’ initial skills are drawn i.i.d. with mean 1500 and combining the accuracy of TrueSkill with the simplicity of Glicko.
variance 3502 . Players compete in all rounds, and are ranked ac- By a judicious application of simplifying approximations, it de-
cording to independent performances with variance 2002 . Between rives analytical formulas similar to the pairwise decomposition of
rounds, we add i.i.d. Gaussian increments with variance 352 to each Glicko4 . The normalization in the original paper performs poorly
of their skills. In the small dataset, each round consists of just 5 on our datasets’ large matches. To improve accuracy, just as with
players. In the large dataset, all 10K players participate in each Glicko, we normalize the collective weight of the batched updates
round. to equal that of one maximally informative 2-player game.

6.2 Rating systems

We compare our rating system against several academic and industry- 4 Specifically,
we use the Bradley-Terry model with full-pair, listed under Algorithm 1
tested alternatives. For a fairer comparison, we hand-coded efficient in the source paper [34].
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

6.3 Evaluation metrics is in the hundreds or thousands. One case where TrueSkill outper-
To compare the different algorithms, we define two measures of formed is in the DanceSport dataset, where the average number
predictive accuracy. Each metric will be defined on individual con- of participants per contest is just 3. In preliminary experiments,
testants in each round, and then averaged: TrueSkill and Elo-MMR score about equally when the number of
Í Í ranks is less than about 60.
𝑡 𝑖 ∈ P𝑡 metric(𝑖, 𝑡) Now, we turn our attention to Table 3, which showcases the com-
aggregate(metric) := Í .
𝑡 |P𝑡 | putational efficiency of Elo-MMR. On smaller datasets, it performs
Pair inversion metric [25]. Our first metric computes the fraction comparably to the Codeforces, TrueSkill, and Topcoder algorithms.
of opponents against whom our ratings predict the correct pairwise However, the latter suffer from a quadratic time dependency on the
result, defined as the higher-rated player either winning or tying: number of contestants; as a result, Elo-MMR outperforms them by
one to two orders of magnitude on the larger Codeforces dataset.
# correctly predicted matchups Finally, in comparisons between the two Elo-MMR variants, we
pair_inversion(𝑖, 𝑡) := × 100%.
|P𝑡 | − 1 note that while Elo-MMR(𝜌) is more accurate, Elo-MM𝜒 is always
This metric was used in the original evaluation of TrueSkill [25] faster. This has to do with the skill drift modeling described in
and is related to the Kendall’s 𝜏 rank correlation coefficient. Section 4, as every update in Elo-MMR(𝜌) must process 𝑂 (log 𝜀1 )
terms of a player’s competition history.
Rank deviation. Our second metric compares the rankings with
the total ordering that would be obtained by sorting players accord- 6.5 Elo-MMR on small and large contests
ing to their prior rating. The penalty is proportional to how much The derivation in Section 2 depended on taking a limit in which the
these ranks differ for player 𝑖: number of participants in each contest went to infinity. In practice,
|actual rank − predicted rank| one might wonder how well Elo-MMR handles smaller contests. To
rank_deviation(𝑖, 𝑡) := × 100%.
|P𝑡 | − 1 find out, we simulate what would happen if each Codeforces contest
was administered separately to smaller groups of contestants. That
In the event of ties, among the ranks within the tied range, we use
is, for every chosen contest size 𝑁 , the participants of each contest
the one that comes closest to the rating-based prediction.
are split into groups of at most 𝑁 . Each group is placed in a round,
and ranked according to their relative placement in the original
6.4 Empirical results contest.
Recall that Elo-MM𝜒 has a Gaussian performance model, matching In Figure 3, we see that Elo-MMR continues to beat the other
the modeling assumptions of Topcoder and TrueSkill. Elo-MMR(𝜌), systems, regardless of contest size.
on the other hand, has a logistic performance model, matching
the modeling assumptions of Codeforces and Glicko. While 𝜌 was 79.2
included in the hyperparameter search, in practice we found that
all values between 0 and 1 produce very similar results. 79.0
To ensure that errors due to the unknown skills of new players 78.8
don’t dominate our metrics, we excluded players who had competed 78.6
Accuracy

in less than 5 total contests. In most of the datasets, this reduced the
78.4
performance of our method relative to the others, as our method
78.2
seems to converge more accurately. Despite this, we see in Table 2
Elo-MMR
78.0
Codeforces
that both versions of Elo-MMR outperform the other rating systems
in both the pairwise inversion metric and the ranking deviation
metric.
77.8 Topcoder
We highlight a few key observations. First, significant perfor- 77.6 TrueSkill
mance gains are observed on the Codeforces and Topcoder datasets, 101 102
despite these platforms’ rating systems having been designed specif- Contest size
ically for their needs. Our gains are smallest on the synthetic dataset,
for which all algorithms perform similarly. This might be in part
Figure 3: Number of participants vs. accuracy for various rat-
due to the close correspondence between the generative process
ing systems.
and the assumptions of these rating systems. Furthermore, the
synthetic players compete in all rounds, enabling the system to
converge to near-optimal ratings for every player. Finally, the im-
proved TrueSkill performed well below our expectations, despite 7 CONCLUSIONS
our best efforts to improve it. We suspect that the message-passing This paper introduces the Elo-MMR rating system, which is in part
numerics break down in contests with a large number of individual a generalization of the two-player Glicko system, allowing any
participants. The difficulties persisted in all TrueSkill implemen- number of players. By developing a Bayesian model and taking the
tations that we tried, including on Microsoft’s popular Infer.NET limit as the number of participants goes to infinity, we obtained sim-
framework [30]. To our knowledge, we are the first to present exper- ple, human-interpretable rating update formulas. Furthermore, we
iments with TrueSkill on contests where the number of participants saw that the algorithm is incentive-compatible, robust to extreme
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu

Codeforces Topcoder TrueSkill Elo-MM𝝌 Elo-MMR(𝝆)

Dataset
pair inv. rank dev. pair inv. rank dev. pair inv. rank dev. pair inv. rank dev. pair inv. rank dev.
Codeforces 78.9% 14.5% 79.0% 14.4% 70.5% 19.8% 79.0% 14.4% 79.0% 14.4%
Topcoder 72.8% 18.4% 72.5% 18.5% 70.2% 20.0% 73.3% 18.1% 73.2% 18.1%
Reddit 61.5% 27.3% 61.5% 27.3% 61.3% 27.3% 61.6% 27.2% 61.6% 27.2%
CTF 71.1% 20.0% 71.0% 20.1% 70.9% 20.2% 70.6% 20.4% 71.1% 20.0%
DanceSport 71.0% 26.0% 70.9% 26.2% 73.0% 24.5% 72.0% 25.0% 71.8% 25.6%
Synth-large 84.0% 11.1% 84.1% 11.0% 83.3% 11.6% 84.0% 11.1% 84.0% 11.1%
Synth-small 83.4% 15.2% 83.4% 15.2% 83.3% 15.3% 83.6% 15.0% 83.7% 15.0%
Table 2: Performance of each rating system on the pairwise inversion and rank deviation metrics. Bolded entries denote the
best performances (highest pair inv. or lowest rank dev.) on each metric and dataset.

Dataset CF TC TS Elo-MM𝝌 Elo-MMR(𝝆) APPENDIX

Codeforces 1298.3 455.8 260.7 39.4 47.6 Lemma 3.1. If 𝑓𝑖 is continuously differentiable and log-concave,
Topcoder 26.8 13.8 61.0 13.9 15.3 then the functions 𝑙𝑖 , 𝑑𝑖 , 𝑣𝑖 are continuous, strictly decreasing, and
Reddit 4.6 4.7 4.2 4.7 4.7
CTF 20.1 8.13 39.2 7.4 7.7 𝑙𝑖 (𝑝) < 𝑑𝑖 (𝑝) < 𝑣𝑖 (𝑝) for all 𝑝.
DanceSport 74.4 71.0 66.6 73.5 73.8 Proof. Continuity of 𝐹𝑖 , 𝑓𝑖 , 𝑓𝑖′ implies that of 𝑙𝑖 , 𝑑𝑖 , 𝑣𝑖 . It’s known [13]
Synth-Large 10442.0 3024.0 320.3 42.6 37.1 that log-concavity of 𝑓𝑖 implies log-concavity of both 𝐹𝑖 and 1 − 𝐹𝑖 .
Synth-Small 62.7 60.6 56.0 62.0 61.7 As a result, 𝑙𝑖 , 𝑑𝑖 , and 𝑣𝑖 are derivatives of strictly concave functions;
Table 3: Total compute time over entire dataset, in seconds. therefore, they are strictly decreasing. In particular, each of

performances, asymptotically fast, and embarrassingly parallel. To 𝑓 ′ (𝑝) 𝑓𝑖 (𝑝) 2 −𝑓𝑖′ (𝑝) 𝑓𝑖 (𝑝) 2
our knowledge, our system is the first to rigorously prove all these 𝑣𝑖′ (𝑝) = 𝑖 − , 𝑙𝑖′ (𝑝) = − ,
𝐹𝑖 (𝑝) 𝐹𝑖 (𝑝) 2 1 − 𝐹𝑖 (𝑝) (1 − 𝐹𝑖 (𝑝)) 2
properties in a setting with more than two individually ranked
are negative for all 𝑝, so we conclude that
players. In terms of practical performance, we saw that it outper-
forms existing industry systems in both prediction accuracy and
𝑓 ′ (𝑝) 𝑓𝑖 (𝑝) 𝐹𝑖 (𝑝) ′
computation speed. 𝑑𝑖 (𝑝) − 𝑣𝑖 (𝑝) = 𝑖 − = 𝑣 (𝑝) < 0,
This work can be extended in several directions. First, the choices 𝑓𝑖 (𝑝) 𝐹𝑖 (𝑝) 𝑓𝑖 (𝑝) 𝑖
we made in modeling ties, pseudodiffusions, teams, and opponent 𝑓 ′ (𝑝) 𝑓𝑖 (𝑝) 1 − 𝐹𝑖 (𝑝) ′
𝑙𝑖 (𝑝) − 𝑑𝑖 (𝑝) = − 𝑖 − = 𝑙 (𝑝) < 0.
subsampling are by no means the only possibilities consistent with 𝑓𝑖 (𝑝) 1 − 𝐹𝑖 (𝑝) 𝑓𝑖 (𝑝) 𝑖
our Bayesian model of skills and performances. Second, it may □
be possible to further improve accuracy by fitting more flexible
performance and skill evolution models to domain-specific data. Theorem 3.2. Suppose that for all 𝑗, 𝑓 𝑗 is continuously differen-
Third, it would be useful to analyze convergence in realistic settings, tiable and log-concave. Then the unique maximizer of Pr(𝑃𝑖 = 𝑝 |
where the Bayesian model is not completely accurate. In particular, 𝐸𝑖𝐿 , 𝐸𝑊
𝑖 ) is given by the unique zero of
Õ Õ Õ
controlling long-term rating inflation or deflation is a challenge, 𝑄𝑖 (𝑝) = 𝑙 𝑗 (𝑝) + 𝑑 𝑗 (𝑝) + 𝑣 𝑗 (𝑝).
since we can’t directly compare players at different times. 𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖
Over the past decade, online competition communities such as
Codeforces have grown exponentially. As such, considerable work Proof. First, we rank the players by their buckets according to
has gone into engineering scalable and reliable rating systems. ⌊𝑃 𝑗 /𝜖⌋, and take the limiting probabilities as 𝜖 → 0:
Unfortunately, many of these systems have not been rigorously 𝑃𝑗 𝑝 𝑝
Pr(⌊ ⌋ > ⌊ ⌋) = Pr(𝑝 𝑗 ≥ 𝜖 ⌊ ⌋ + 𝜖)
analyzed in the academic community. We hope that our paper and 𝜖 𝜖 𝜖
open-source release will open new explorations in this area. 𝑝
= 1 − 𝐹 𝑗 (𝜖 ⌊ ⌋ + 𝜖) → 1 − 𝐹 𝑗 (𝑝),
𝜖
ACKNOWLEDGEMENTS 𝑃𝑗 𝑝 𝑝
Pr(⌊ ⌋ < ⌊ ⌋) = Pr(𝑝 𝑗 < 𝜖 ⌊ ⌋)
The authors are indebted to Daniel Sleator and Danica J. Sutherland 𝜖 𝜖 𝜖
𝑝
for initial discussions that helped inspire this work, and to Nikita = 𝐹 𝑗 (𝜖 ⌊ ⌋) → 𝐹 𝑗 (𝑝),
Gaevoy for the open-source improved TrueSkill upon which our 𝜖
implementation is based. Experiments in this paper are funded by 1 𝑃𝑗 𝑝 1 𝑝 𝑝
Pr(⌊ ⌋ = ⌊ ⌋) = Pr(𝜖 ⌊ ⌋ ≤ 𝑃 𝑗 < 𝜖 ⌊ ⌋ + 𝜖)
a Google Cloud Research Grant. The second author is supported by 𝜖 𝜖 𝜖 𝜖 𝜖 𝜖
a VMWare Fellowship and the Natural Sciences and Engineering 1 𝑝 𝑝
= 𝐹 𝑗 (𝜖 ⌊ ⌋ + 𝜖) − 𝐹 𝑗 (𝜖 ⌊ ⌋) → 𝑓 𝑗 (𝑝).
Research Council of Canada. 𝜖 𝜖 𝜖
𝑃
𝜖 , and 𝐷 𝜖 be shorthand for the events ⌊ 𝑗 ⌋ > ⌊ ⌋, 𝑝
Let 𝐿𝜖𝑗𝑝 , 𝑊 𝑗𝑝 𝑗𝑝 𝜖 𝜖
𝑃 𝑝 𝑃 𝑝
⌊ 𝜖𝑗 ⌋ < ⌊ 𝜖 ⌋, and ⌊ 𝜖𝑗 ⌋ = ⌊ 𝜖 ⌋. respectively. These correspond to a
An Elo-like System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

player who performs at 𝑝 losing, winning, and drawing against 𝑗, [7] LeetCode New Contest Rating Algorithm. leetcode.com/discuss/general-
respectively, when outcomes are determined by 𝜖-buckets. Then, discussion/468851/New-Contest-Rating-Algorithm-(Coming-Soon)
[8] Open Codeforces Rating System. codeforces.com/blog/entry/20762
Ö Ö Ö Pr(𝐷 𝜖𝑗𝑝 ) [9] Ratings migrated to Elo-MMR. https://dmoj.ca/post/206-ratings-migrated-to-
Pr(𝐸𝑊
𝑖 , 𝐸 𝐿
𝑖 | 𝑃𝑖 = 𝑝) = lim Pr(𝐿 𝜖
𝑗𝑝 ) Pr(𝑊 𝜖
𝑗𝑝 ) elo-mmr
𝜖→0
𝑗 ≻𝑖 𝑗 ≺𝑖 𝑗∼𝑖,𝑗≠𝑖
𝜖 [10] Topcoder Algorithm Competition Rating System. topcoder.com/community/
Ö Ö Ö competitive-programming/how-to-compete/ratings
= (1 − 𝐹 𝑗 (𝑝)) 𝐹 𝑗 (𝑝) 𝑓 𝑗 (𝑝), [11] Why Are Obstacle-Course Races So Popular? theatlantic.com/health/archive/
2018/07/why-are-obstacle-course-races-so-popular/565130/
𝑗 ≻𝑖 𝑗 ≺𝑖 𝑗∼𝑖,𝑗≠𝑖 [12] Sharad Agarwal and Jacob R. Lorch. 2009. Matchmaking for online games and
other latency-sensitive P2P systems. In SIGCOMM 2009. 315–326.
Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 , 𝐸𝑊
𝑖 ) ∝ 𝐿 𝑊
𝑓𝑖 (𝑝) Pr(𝐸𝑖 , 𝐸𝑖 | 𝑃𝑖 = 𝑝) [13] Mark Yuying An. 1997. Log-concave probability distributions: Theory and statis-
Ö Ö Ö tical testing. (1997).
= (1 − 𝐹 𝑗 (𝑝)) 𝐹 𝑗 (𝑝) 𝑓 𝑗 (𝑝), [14] Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block
𝑗 ≻𝑖 𝑗 ≺𝑖 𝑗∼𝑖 designs: I. The method of paired comparisons. Biometrika (1952), 324–345.
[15] Shuo Chen and Thorsten Joachims. 2016. Modeling Intransitivity in Matchup
d Õ Õ Õ
and Comparison Data. In WSDM 2016. 227–236.
ln Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 ,𝐸𝑊 𝑖 )= 𝑙 𝑗 (𝑝) + 𝑣 𝑗 (𝑝) + 𝑑 𝑗 (𝑝) = 𝑄𝑖 (𝑝). [16] Rémi Coulom. [n.d.]. Whole-history rating: A Bayesian rating system for players
d𝑝 𝑗 ≻𝑖 𝑗 ≺𝑖 𝑗∼𝑖 of time-varying strength. In CG 2008. Springer, 113–124.
[17] Pierre Dangauthier, Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. TrueSkill
Since Lemma 3.1 tells us that 𝑄𝑖 is strictly decreasing, it only Through Time: Revisiting the History of Chess. In NeurIPS 2007. 337–344.
remains to show that it has a zero. If the zero exists, it must be [18] Arpad E. Elo. 1961. New USCF rating system. Chess Life (1961), 160–161.
unique and it will be the unique maximum of Pr(𝑃𝑖 = 𝑝 | 𝐸𝑖𝐿 , 𝐸𝑊 𝑖 ).
[19] RNDr Michal Forišek. 2009. Theoretical and Practical Aspects of Programming
To start, we want to prove the existence of 𝑝 ∗ such that 𝑄𝑖 (𝑝 ∗ ) <
Contest Ratings. (2009).
[20] David A Freedman. 1963. On the asymptotic behavior of Bayes’ estimates in the
0. Note that it’s not possible to have 𝑓 𝑗′ (𝑝) ≥ 0 for all 𝑝, as in that discrete case. The Annals of Mathematical Statistics (1963), 1386–1403.
[21] Mark E Glickman. 1995. A comprehensive guide to chess ratings. American Chess
case the density would integrate to either zero or infinity. Thus, for Journal (1995), 59–102.
each 𝑗 such that 𝑗 ∼ 𝑖, we can choose 𝑝 𝑗 such that 𝑓 𝑗′ (𝑝 𝑗 ) < 0, and [22] Mark E Glickman. 1999. Parameter estimation in large dynamic paired compari-
son experiments. Applied Statistics (1999), 377–394.
Í
so 𝑑 𝑗 (𝑝 𝑗 ) < 0. Let 𝛼 = − 𝑗∼𝑖 𝑑 𝑗 (𝑝 𝑗 ) > 0.
[23] Mark E Glickman. 2012. Example of the Glicko-2 system. Boston University
Let 𝑛 = |{ 𝑗 : 𝑗 ≺ 𝑖}|. For each 𝑗 such that 𝑗 ≺ 𝑖, since (2012), 1–6.
lim𝑝→∞ 𝑣 𝑗 (𝑝) = 0/1 = 0, we can choose 𝑝 𝑗 such that 𝑣 𝑗 (𝑝 𝑗 ) < 𝛼/𝑛. [24] Linxia Gong, Xiaochuan Feng, Dezhi Ye, Hao Li, Runze Wu, Jianrong Tao,
Let 𝑝 ∗ = max 𝑗 ⪯𝑖 𝑝 𝑗 . Then, Changjie Fan, and Peng Cui. 2020. OptMatch: Optimized Matchmaking via
Modeling the High-Order Interactions on the Arena. In KDD 2020. 2300–2310.
Õ Õ Õ
𝑙 𝑗 (𝑝 ∗ ) ≤ 0, 𝑑 𝑗 (𝑝 ∗ ) ≤ −𝛼, 𝑣 𝑗 (𝑝 ∗ ) < 𝛼 . [25] Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkillTM : A Bayesian
Skill Rating System. In NeurIPS 2006. 569–576.
𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖 [26] Tzu-Kuo Huang, Chih-Jen Lin, and Ruby C. Weng. 2006. Ranking individuals by
Therefore, group comparisons. In ICML 2006. 425–432.
Õ Õ Õ [27] Stephanie Kovalchik. 2020. Extension of the Elo rating system to margin of
𝑄𝑖 (𝑝 ∗ ) = 𝑙 𝑗 (𝑝 ∗ ) + 𝑑 𝑗 (𝑝 ∗ ) + 𝑣 𝑗 (𝑝 ∗ ) victory. Int. J. Forecast. (2020).
[28] Yao Li, Minhao Cheng, Kevin Fujii, Fushing Hsieh, and Cho-Jui Hsieh. 2018.
𝑗 ≻𝑖 𝑗∼𝑖 𝑗 ≺𝑖 Learning from Group Comparisons: Exploiting Higher Order Interactions. In
< 0 − 𝛼 + 𝛼 = 0. NeurIPS 2018. 4986–4995.
[29] Tom Minka, Ryan Cleven, and Yordan Zaykov. 2018. TrueSkill 2: An improved
By a symmetric argument, there also exists some 𝑞 ∗ for which Bayesian skill rating system. Technical Report MSR-TR-2018-8. Microsoft.
𝑄𝑖 (𝑞 ∗ ) > 0. By the intermediate value theorem with 𝑄𝑖 continuous, [30] T. Minka, J.M. Winn, J.P. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill. /Infer.NET
0.3. Microsoft Research Cambridge. http://dotnet.github.io/infer.
there exists 𝑝 ∈ (𝑞 ∗, 𝑝 ∗ ) such that 𝑄𝑖 (𝑝) = 0, as desired. □ [31] Sergey I. Nikolenko, Alexander, and V. Sirotkin. 2010. Extensions of the TrueSkill
TM rating system. In In Proceedings of the 9th International Conference on Appli-
REFERENCES cations of Fuzzy Systems and Soft Computing. 151–160.
[32] Jerneja Premelč, Goran Vučković, Nic James, and Bojan Leskošek. 2019. Reliability
[1] CodeChef Rating Mechanism. codechef.com/ratings of judging in DanceSport. Front. Psychol. (2019), 1001.
[2] Codeforces: Results of 2019. codeforces.com/blog/entry/73683 [33] Josh Stone and Nicholas D Matsakis. The Rayon library (Rust Crate). crates.io/
[3] Farming Volatility: How a major flaw in a well-known rating system takes over crates/rayon
the GBL leaderboard. reddit.com/r/TheSilphRoad/comments/hwff2d/farming_ [34] Ruby C. Weng and Chih-Jen Lin. 2011. A Bayesian Approximation Method for
volatility_how_a_major_flaw_in_a/ Online Ranking. J. Mach. Learn. Res. (2011), 267–300.
[4] Halo Xbox video game franchise: in numbers. telegraph.co.uk/technology/video- [35] John Michael Winn. 2019. Model-based machine learning.
games/11223730/Halo-in-numbers.html [36] Lin Yang, Stanko Dimitrov, and Benny Mantin. 2014. Forecasting sales of new
[5] Kaggle milestone: 5 million registered users! kaggle.com/general/164795 virtual goods with the Elo rating system. RPM (2014), 457–469.
[6] Kaggle Progression System. kaggle.com/progression

Morse 4400 Manual
100% (1)
Morse 4400 Manual
18 pages
Rating Scale
100% (6)
Rating Scale
10 pages
Create-a-Game Assessment Rubric: Category 4 3 2 1 0 Creativity
100% (2)
Create-a-Game Assessment Rubric: Category 4 3 2 1 0 Creativity
4 pages
IP-III Tutorial
No ratings yet
IP-III Tutorial
174 pages
Generalizing Elo Arxiv
No ratings yet
Generalizing Elo Arxiv
25 pages
A Bayesian Approximation Method For Online Ranking: Ruby C. Weng
No ratings yet
A Bayesian Approximation Method For Online Ranking: Ruby C. Weng
34 pages
Cloud Ight Coding Contest 30 October 2020: Event Organizer
No ratings yet
Cloud Ight Coding Contest 30 October 2020: Event Organizer
9 pages
Justin Dastous Research Paper
No ratings yet
Justin Dastous Research Paper
53 pages
Nolan - Dominance Matrices Project
No ratings yet
Nolan - Dominance Matrices Project
16 pages
Adaptivity Challenges in Games and Simulations A Survey
No ratings yet
Adaptivity Challenges in Games and Simulations A Survey
15 pages
True Skill 2
No ratings yet
True Skill 2
24 pages
3 NIPS-2007-trueskill-through-time-revisiting-the-history-of-chess-Paper
No ratings yet
3 NIPS-2007-trueskill-through-time-revisiting-the-history-of-chess-Paper
8 pages
05 Handout 1
No ratings yet
05 Handout 1
3 pages
Chance PDF
No ratings yet
Chance PDF
17 pages
Ilya O. Ryzhov Awais Tariq Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA
No ratings yet
Ilya O. Ryzhov Awais Tariq Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA
12 pages
Ranking Practices and Distinction in League of Legends: Yubo Kou Xinning Gui Yong Ming Kow
No ratings yet
Ranking Practices and Distinction in League of Legends: Yubo Kou Xinning Gui Yong Ming Kow
6 pages
Mark
No ratings yet
Mark
3 pages
LNAI 2903 Dynamic Games to Assess Network Value and Performance 1st Edition by Gregory Calbert, Peter Smet, Jason Scholz , Hing Wah Kwok ISBN 9783540206460 354020646X pdf download
100% (1)
LNAI 2903 Dynamic Games to Assess Network Value and Performance 1st Edition by Gregory Calbert, Peter Smet, Jason Scholz , Hing Wah Kwok ISBN 9783540206460 354020646X pdf download
54 pages
Level 4
No ratings yet
Level 4
7 pages
MEEGA-Scale-instructions
No ratings yet
MEEGA-Scale-instructions
2 pages
Gameful Experience Questionnaire (GAMEFULQUEST) : An Instrument For Measuring The Perceived Gamefulness of System Use
No ratings yet
Gameful Experience Questionnaire (GAMEFULQUEST) : An Instrument For Measuring The Perceived Gamefulness of System Use
42 pages
IB High Level Math A&A Exploration
No ratings yet
IB High Level Math A&A Exploration
19 pages
Leaderboard Relative Weighted Ranking Algorithm
No ratings yet
Leaderboard Relative Weighted Ranking Algorithm
3 pages
Level 5
No ratings yet
Level 5
7 pages
Williams-GBL-160 (1)
No ratings yet
Williams-GBL-160 (1)
9 pages
2410.19006v2
No ratings yet
2410.19006v2
9 pages
Player Ratings
No ratings yet
Player Ratings
26 pages
Objective-Based Reward System
No ratings yet
Objective-Based Reward System
9 pages
308-Article Text-1644-1-10-20190920 PDF
No ratings yet
308-Article Text-1644-1-10-20190920 PDF
14 pages
Predicting Round and Game Winners in CSGO
No ratings yet
Predicting Round and Game Winners in CSGO
10 pages
Daang Jover Thesis 8
No ratings yet
Daang Jover Thesis 8
15 pages
Regan 2012
No ratings yet
Regan 2012
14 pages
From Points To Progression: A Scoping Review of Game Elements in Gamification Research With A Content Analysis of 280 Research Papers
No ratings yet
From Points To Progression: A Scoping Review of Game Elements in Gamification Research With A Content Analysis of 280 Research Papers
21 pages
FGTG9.8
No ratings yet
FGTG9.8
94 pages
GCN Open Play Rating System: ELO System Mathematical Details
No ratings yet
GCN Open Play Rating System: ELO System Mathematical Details
3 pages
Score Design For Meaningful Gamification
No ratings yet
Score Design For Meaningful Gamification
4 pages
Saint Benilde: International School (Calamba), Inc
No ratings yet
Saint Benilde: International School (Calamba), Inc
5 pages
A Computer-in-the-Loop Approach For Detecting Bullies in The Classroom
No ratings yet
A Computer-in-the-Loop Approach For Detecting Bullies in The Classroom
8 pages
game design_course_pdf
No ratings yet
game design_course_pdf
10 pages
Quiz Game Project
No ratings yet
Quiz Game Project
14 pages
Paper 05
No ratings yet
Paper 05
17 pages
Huynh and Lida (2018) Otro Muy Similar
No ratings yet
Huynh and Lida (2018) Otro Muy Similar
4 pages
AI L6 - Game Theory
No ratings yet
AI L6 - Game Theory
56 pages
The Glicko System
No ratings yet
The Glicko System
6 pages
7588 Re Evaluating Evaluation
No ratings yet
7588 Re Evaluating Evaluation
12 pages
Biggs Et Al. (2022) - 209-221
No ratings yet
Biggs Et Al. (2022) - 209-221
13 pages
1
No ratings yet
1
24 pages
Landfriend and Mocskos - TrueSkill Through Time: reliable initial skill estimates and historical comparability with Julia, Python, and R
No ratings yet
Landfriend and Mocskos - TrueSkill Through Time: reliable initial skill estimates and historical comparability with Julia, Python, and R
43 pages
development-of-a-scoring-system-for-the-team-effectiveness-questionnaire-teq
No ratings yet
development-of-a-scoring-system-for-the-team-effectiveness-questionnaire-teq
8 pages
SIDO Performance Model 2024
No ratings yet
SIDO Performance Model 2024
56 pages
AI Unit 3 PDF
No ratings yet
AI Unit 3 PDF
12 pages
MyNotes Unit 1
No ratings yet
MyNotes Unit 1
5 pages
Game Structure
No ratings yet
Game Structure
17 pages
Elo Rating System
No ratings yet
Elo Rating System
19 pages
3 CSE3013 Adversarial Search
No ratings yet
3 CSE3013 Adversarial Search
48 pages
RATING SCALE Complete
No ratings yet
RATING SCALE Complete
15 pages
Chapter Five: Elo's System
No ratings yet
Chapter Five: Elo's System
14 pages
Comparing Fair Ranking Metrics
No ratings yet
Comparing Fair Ranking Metrics
9 pages
(A) What Would Be A Reasonable Definition of Fairness in Games?
No ratings yet
(A) What Would Be A Reasonable Definition of Fairness in Games?
2 pages
Advanced LibGDX: Engineering Complex Java Games: LibGDX series
From Everand
Advanced LibGDX: Engineering Complex Java Games: LibGDX series
Kameron Hussain
No ratings yet
Grinding for Gold: Play-to-Earn Gaming from Ponzi to Profession
From Everand
Grinding for Gold: Play-to-Earn Gaming from Ponzi to Profession
Alexander J. Clarke
No ratings yet
Enacting Platforms: Feminist Technoscience and the Unreal Engine
From Everand
Enacting Platforms: Feminist Technoscience and the Unreal Engine
James Malazita
No ratings yet
Civilsyll
No ratings yet
Civilsyll
32 pages
Earth Science: Quarter 1 - Module 5
67% (6)
Earth Science: Quarter 1 - Module 5
41 pages
Container Load Plan As Mate Receipt - China Po# 4001377580 - NGC Id#5011959816
No ratings yet
Container Load Plan As Mate Receipt - China Po# 4001377580 - NGC Id#5011959816
2 pages
Alyssa Milano Meal Plan
No ratings yet
Alyssa Milano Meal Plan
6 pages
BCA C Programming
No ratings yet
BCA C Programming
103 pages
TLS06F006-C Covidien PB540 PB560 Spec 2982400 Rev 2 - 7
No ratings yet
TLS06F006-C Covidien PB540 PB560 Spec 2982400 Rev 2 - 7
24 pages
Serbian Prepositions
No ratings yet
Serbian Prepositions
3 pages
Flame Tests and Spectroscopy Lab. C.S.
No ratings yet
Flame Tests and Spectroscopy Lab. C.S.
3 pages
3rd Grade
No ratings yet
3rd Grade
2 pages
3 Quarter Enrichment Activity:: NAME: - SECTION
No ratings yet
3 Quarter Enrichment Activity:: NAME: - SECTION
2 pages
Garden Rail - Issue 306 - February 2020 PDF
No ratings yet
Garden Rail - Issue 306 - February 2020 PDF
66 pages
Discussions On IE Irodov's Problems in General Physics Arihant Books ArihantBooks
No ratings yet
Discussions On IE Irodov's Problems in General Physics Arihant Books ArihantBooks
2 pages
Experiment 22 Deflection of An Electron Beam With An Electric Field
No ratings yet
Experiment 22 Deflection of An Electron Beam With An Electric Field
4 pages
Hindi Atoms & Molecules in One Shot Anubha
100% (1)
Hindi Atoms & Molecules in One Shot Anubha
112 pages
Viessmann, SCU224-Solar Control Unit For Multi-Load Solar Systems Installationand Operating Manual
No ratings yet
Viessmann, SCU224-Solar Control Unit For Multi-Load Solar Systems Installationand Operating Manual
24 pages
Letter To The Editor Example - Google Search
No ratings yet
Letter To The Editor Example - Google Search
1 page
A Study of Dairy Industry in India PDF
100% (1)
A Study of Dairy Industry in India PDF
12 pages
A Semi-Detailed Lesson Plan in Science For Grade-5 LEARNING COMPETENCIES: Investigate Extent of Soil Erosion in The
No ratings yet
A Semi-Detailed Lesson Plan in Science For Grade-5 LEARNING COMPETENCIES: Investigate Extent of Soil Erosion in The
6 pages
Divyanshu Sharma - Christ University
No ratings yet
Divyanshu Sharma - Christ University
18 pages
Science
No ratings yet
Science
3 pages
SSC CHSL 2023 August 4 Shift 1
No ratings yet
SSC CHSL 2023 August 4 Shift 1
29 pages
Paediatric Optometry Part 1
100% (1)
Paediatric Optometry Part 1
5 pages
Online Bus Ticket Booking
50% (2)
Online Bus Ticket Booking
6 pages
Lesson Plan in Relative and Absolute Dating
No ratings yet
Lesson Plan in Relative and Absolute Dating
8 pages
EFFC Tremie Concrete Guide Final
No ratings yet
EFFC Tremie Concrete Guide Final
52 pages
List of Government Officials 2017
No ratings yet
List of Government Officials 2017
42 pages
The Impacts of Plastic Pollution On Sea Turtles
No ratings yet
The Impacts of Plastic Pollution On Sea Turtles
17 pages
Thesis Topics Civil Engineering
100% (3)
Thesis Topics Civil Engineering
5 pages

EloMMR

Uploaded by

EloMMR

Uploaded by

An Elo-like System for Massive Multiplayer Competitions

Aram Ebtekar Paul Liu

4.1 Desirable properties of a “pseudodiffusion” Algorithm 1 Elo-MMR(𝜌, 𝛽, 𝛾, 𝜇𝑖𝑛𝑖𝑡 , 𝜎𝑖𝑛𝑖𝑡 )

Algorithm 2 diffuse(𝑖) 5 THEORETICAL PROPERTIES

3750 Proof. If 𝑖 ≻ 𝑗 with 𝑖, 𝑗 adjacent in the rankings, then

This complexity is more than adequate for Codeforces-style com- 6 EXPERIMENTS

6.2 Rating systems

Codeforces Topcoder TrueSkill Elo-MM𝝌 Elo-MMR(𝝆)

Dataset CF TC TS Elo-MM𝝌 Elo-MMR(𝝆) APPENDIX

You might also like